Overview

This tutorial walks you through creating a dataset in Dataframer. You’ll learn how to prepare your data, choose the right dataset type, and upload files.

What You’ll Learn

  • How to prepare your data files
  • How to choose the correct dataset type
  • How to upload files via the API or UI
  • How to verify dataset creation

Prerequisites

  • API key (see Authentication)
  • Sample data files in supported formats (CSV, JSON, JSONL, TXT, PDF, or MD)

Step 1: Prepare Your Data

Choose Your Dataset Type

SINGLE_FILE: One file containing multiple records
  • Example: customers.csv with 100 customer records
MULTI_FILE: Multiple independent files
  • Example: 50 customer review text files
MULTI_FOLDER: Multiple folders, each containing related files
  • Example: Patient records where each folder = one patient
Not sure which type? See the Datasets guide for a detailed comparison.
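As a rough illustration of the three types above, the sketch below suggests a dataset type from how your data is laid out on disk. The helper name `suggest_dataset_type` is hypothetical and not part of the Dataframer SDK, and the rule it applies (a file, a folder of files, or a folder of folders) is an assumption based on the examples in this section.

```python
from pathlib import Path


def suggest_dataset_type(path: str) -> str:
    """Suggest a dataset type from the local layout (hypothetical helper)."""
    p = Path(path)
    if p.is_file():
        # One file containing multiple records
        return "SINGLE_FILE"
    if not p.is_dir():
        raise FileNotFoundError(f"{path} is neither a file nor a directory")
    # Ignore hidden entries such as .DS_Store
    entries = [e for e in p.iterdir() if not e.name.startswith(".")]
    if any(e.is_dir() for e in entries):
        # Folders that each group related files
        return "MULTI_FOLDER"
    # A flat folder of independent files
    return "MULTI_FILE"
```

Treat the result as a starting point only; the Datasets guide remains the reference for choosing a type.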

File Requirements

Ensure your files meet these requirements:
  • Encoding: UTF-8
  • Size: < 100 MB per file
  • Format: Valid file format (no corruption)
  • Naming: Use alphanumeric characters and underscores
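The requirements above can be checked locally before uploading. The following is a minimal sketch, not an official validator: `check_file` is a hypothetical helper, the 100 MB limit is assumed to mean 100 * 1024 * 1024 bytes, and the UTF-8 check is assumed to apply only to the text formats (not PDF).

```python
import re
from pathlib import Path

MAX_BYTES = 100 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
SUPPORTED = {".csv", ".json", ".jsonl", ".txt", ".pdf", ".md"}
TEXT_FORMATS = SUPPORTED - {".pdf"}  # assumption: UTF-8 rule targets text formats
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")


def check_file(path: str) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    p = Path(path)
    suffix = p.suffix.lower()
    problems = []
    if suffix not in SUPPORTED:
        problems.append(f"unsupported extension: {p.suffix or '(none)'}")
    if not NAME_PATTERN.match(p.stem):
        problems.append("name should use only letters, digits, and underscores")
    if p.stat().st_size >= MAX_BYTES:
        problems.append("file is 100 MB or larger")
    if suffix in TEXT_FORMATS:
        try:
            # Decode the whole file line by line; invalid bytes raise an error
            with open(p, encoding="utf-8") as fh:
                for _ in fh:
                    pass
        except UnicodeDecodeError:
            problems.append("file is not valid UTF-8")
    return problems
```

This sketch does not detect corrupted files; opening the file in its usual application is still the simplest check for that.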

Step 2: Create a Single-File Dataset

Via API

curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/create/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'name=Customer Database' \
  -F 'dataset_type=SINGLE_FILE' \
  -F 'description=Main customer database export' \
  -F 'file=@customers.csv'
Response:
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "Customer Database",
  "dataset_type": "SINGLE_FILE",
  "description": "Main customer database export",
  "created_at": "2025-11-26T10:00:00Z",
  "file_count": 1
}

Via Python

from dataframer import Dataframer

# Initialize client (reads DATAFRAMER_API_KEY from environment)
# Or explicitly: client = Dataframer(api_key="your_api_key")
client = Dataframer()

# Create dataset with file; the with block closes the handle after the upload
with open("customers.csv", "rb") as f:
    dataset = client.dataframer.datasets.create_with_files(
        name="Customer Database",
        dataset_type="SINGLE_FILE",
        description="Main customer database export",
        file=f
    )

print(f"Created dataset: {dataset.id}")

Via UI

  1. Log in to https://app.aimon.ai
  2. Navigate to Datasets → Create New
  3. Enter dataset name: “Customer Database”
  4. Select type: Single File
  5. Add description (optional)
  6. Click Choose File and select customers.csv
  7. Click Create Dataset

Step 3: Create a Multi-File Dataset

When you have multiple independent files:
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/create/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'name=Customer Reviews' \
  -F 'dataset_type=MULTI_FILE' \
  -F 'description=Product review collection' \
  -F '[email protected]' \
  -F '[email protected]' \
  -F '[email protected]' \
  -F '[email protected]' \
  -F '[email protected]'

Python Example

from contextlib import ExitStack
from pathlib import Path
from dataframer import Dataframer

# Initialize client (reads DATAFRAMER_API_KEY from environment)
# Or explicitly: client = Dataframer(api_key="your_api_key")
client = Dataframer()

# Collect the review files in a stable order
review_files = sorted(Path("./reviews").glob("*.txt"))

# ExitStack closes every opened file, even if the upload raises an error
with ExitStack() as stack:
    files = [stack.enter_context(open(f, "rb")) for f in review_files]

    # Create dataset with multiple files
    dataset = client.dataframer.datasets.create_with_files(
        name="Customer Reviews",
        dataset_type="MULTI_FILE",
        description="Product review collection",
        files=files
    )

print(f"Created dataset with {dataset.file_count} files")

Step 4: Create a Multi-Folder Dataset

For related files grouped in folders:

Prepare Folder Structure

patient_records/
├── patient_001/
│   ├── demographics.json
│   ├── lab_results.csv
│   └── doctor_notes.txt
├── patient_002/
│   ├── demographics.json
│   ├── lab_results.csv
│   └── doctor_notes.txt
└── patient_003/
    ├── demographics.json
    ├── lab_results.csv
    └── doctor_notes.txt

Create ZIP File

# Create ZIP of folder structure
cd patient_records
zip -r ../patient_records.zip .
cd ..
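If the `zip` command is not available (for example on Windows), a similar archive can be built with Python's standard `zipfile` module. This is a sketch under the assumption that only files need to be stored; the helper name `zip_folder_contents` is hypothetical. Unlike `zip -r`, it writes no entries for directories themselves, so empty folders are left out.

```python
import zipfile
from pathlib import Path


def zip_folder_contents(source_dir: str, zip_path: str) -> int:
    """Write every file under source_dir into zip_path.

    Paths are stored relative to source_dir, so the per-patient folders
    sit at the ZIP root. Keep zip_path outside source_dir so the archive
    does not include itself. Returns the number of files written.
    """
    root = Path(source_dir)
    count = 0
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(root.rglob("*")):
            if file.is_file():
                zf.write(file, arcname=file.relative_to(root).as_posix())
                count += 1
    return count
```

For the folder structure above, `zip_folder_contents("patient_records", "patient_records.zip")` would store entries such as `patient_001/demographics.json`.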

Upload ZIP

curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/create/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'name=Patient Records' \
  -F 'dataset_type=MULTI_FOLDER' \
  -F 'description=Anonymized patient medical records' \
  -F 'file=@patient_records.zip'

Python Example

from dataframer import Dataframer

# Initialize client (reads DATAFRAMER_API_KEY from environment)
# Or explicitly: client = Dataframer(api_key="your_api_key")
client = Dataframer()

# Upload ZIP file - backend auto-detects MULTI_FOLDER structure
with open("patient_records.zip", "rb") as zip_file:
    dataset = client.dataframer.datasets.create_from_zip(
        name="Patient Records",
        description="Anonymized patient medical records",
        zip_file=zip_file
    )

print(f"Created dataset: {dataset.id}")
print(f"Type: {dataset.dataset_type} (auto-detected)")
print(f"Files: {dataset.file_count} | Folders: {dataset.folder_count}")
For MULTI_FOLDER datasets, upload a single ZIP file containing the folder structure.

Step 5: Verify Dataset Creation

Check that your dataset was created successfully:
curl -X GET 'https://df-api.dataframer.ai/api/dataframer/datasets/550e8400-e29b-41d4-a716-446655440000/' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Response:
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "Customer Database",
  "dataset_type": "SINGLE_FILE",
  "description": "Main customer database export",
  "created_at": "2025-11-26T10:00:00Z",
  "updated_at": "2025-11-26T10:00:00Z",
  "file_count": 1,
  "files": [
    {
      "id": "file_abc123",
      "name": "customers.csv",
      "file_type": "CSV",
      "size": 1048576
    }
  ]
}
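When verifying from a script, the response body can be compared with what you uploaded. The sketch below uses only the field names shown in the response above; `check_dataset` is a hypothetical helper, and the check that `file_count` equals the number of entries in `files` is an assumption about the API, not documented behavior.

```python
import json


def check_dataset(payload: str, expected_type: str, expected_files: int) -> list[str]:
    """Compare a dataset detail response (JSON text) with what was uploaded."""
    data = json.loads(payload)
    problems = []
    if data.get("dataset_type") != expected_type:
        problems.append(
            f"dataset_type is {data.get('dataset_type')!r}, expected {expected_type!r}"
        )
    if data.get("file_count") != expected_files:
        problems.append(
            f"file_count is {data.get('file_count')!r}, expected {expected_files}"
        )
    # Assumption: the files array lists every file counted in file_count
    if len(data.get("files", [])) != data.get("file_count"):
        problems.append("file_count does not match the number of entries in files")
    return problems
```

An empty list means the response matches your expectations; any other result lists what differs.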

Step 6: Add More Files (Optional)

Add additional files to an existing dataset:
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/550e8400-e29b-41d4-a716-446655440000/add_files/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'file=@additional_data.csv'
You cannot add folders to MULTI_FILE datasets. Create a new MULTI_FOLDER dataset instead.
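This page does not show an SDK method for adding files, so the sketch below mirrors the curl command with Python's standard library instead. It only builds the request; `build_add_files_request` is a hypothetical helper, the form field name `file` is taken from the curl example, and the `application/octet-stream` part type is an assumption (curl may send a different type).

```python
import urllib.request
import uuid
from pathlib import Path

BASE_URL = "https://df-api.dataframer.ai/api/dataframer/datasets"


def build_add_files_request(dataset_id: str, api_key: str, path: str) -> urllib.request.Request:
    """Build (but do not send) a multipart POST mirroring the curl command."""
    file_path = Path(path)
    boundary = uuid.uuid4().hex  # random separator between multipart sections
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_path.name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = head + file_path.read_bytes() + tail
    return urllib.request.Request(
        f"{BASE_URL}/{dataset_id}/add_files/",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send it, pass the returned object to `urllib.request.urlopen(...)` and read the response body.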

Common Issues

Problem: File upload fails
Possible causes:
  • File exceeds size limit (100 MB)
  • File is corrupted
  • Incorrect file format
  • Not UTF-8 encoded
Solution:
  • Check file size: ls -lh yourfile.csv
  • Verify file opens correctly
  • Ensure proper file extension
  • Convert to UTF-8: iconv -f ISO-8859-1 -t UTF-8 input.csv > output.csv
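Where `iconv` is not installed, the same conversion can be sketched in Python. The helper name `convert_to_utf8` is hypothetical, and the source encoding is something you must know in advance: decoding as ISO-8859-1 never raises an error, so a wrong guess produces garbled characters silently.

```python
def convert_to_utf8(src: str, dst: str, source_encoding: str = "iso-8859-1") -> None:
    """Re-encode a text file to UTF-8, streaming line by line."""
    # newline="" keeps the original line endings unchanged
    with open(src, encoding=source_encoding, newline="") as fin, \
            open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
```

For example, `convert_to_utf8("input.csv", "output.csv")` matches the `iconv` command above.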
Problem: Created SINGLE_FILE but need MULTI_FILE
Solution:
  • Delete the dataset
  • Create new dataset with correct type
  • Re-upload files
Dataset type cannot be changed after creation.
Problem: ZIP upload for a MULTI_FOLDER dataset fails
Possible causes:
  • ZIP doesn’t contain folders at root level
  • Empty folders in ZIP
  • Incorrect folder structure
Solution:
  • Ensure ZIP root contains folders (not files)
  • Remove empty folders
  • Verify structure: unzip -l yourfile.zip
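The ZIP checks above can also be scripted. This sketch uses Python's standard `zipfile` module; `check_zip_layout` is a hypothetical helper, and the rules it enforces (only folders at the root, no empty folders) are taken from the list above rather than from the backend's actual validation.

```python
import zipfile


def check_zip_layout(zip_path) -> list[str]:
    """Return layout problems found in a ZIP; an empty list means none."""
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    # Directory entries end with "/"; everything else is a file
    folders = {n.rstrip("/") for n in names if n.endswith("/")}
    files = [n for n in names if not n.endswith("/")]
    problems = []
    if not any("/" in n for n in names):
        problems.append("ZIP root contains no folders")
    if any("/" not in f for f in files):
        problems.append("ZIP root contains files; expected only folders")
    for folder in sorted(folders):
        if not any(f.startswith(folder + "/") for f in files):
            problems.append(f"empty folder: {folder}/")
    return problems
```

`check_zip_layout` accepts a file path or an open binary file object, as `zipfile.ZipFile` does.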

Best Practices

  • Name datasets descriptively: Use clear names that indicate content
  • Add descriptions: Include purpose, date range, or other context
  • Verify file quality: Check files open and display correctly
  • Use consistent formats: Keep file formats consistent within a dataset
  • Test with small datasets: Start with 5-10 samples for initial testing
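For the last practice, a small test file can be cut from a larger CSV before the first upload. This is a minimal sketch that assumes the first row is a header and the file is UTF-8; `write_sample` is a hypothetical helper name.

```python
import csv


def write_sample(src: str, dst: str, rows: int = 10) -> int:
    """Copy the header plus the first `rows` records of a CSV file.

    Returns the number of data records written (header not counted).
    """
    written = 0
    with open(src, newline="", encoding="utf-8") as fin, \
            open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is None:
            return 0  # empty input file
        writer.writerow(header)
        for record in reader:
            if written >= rows:
                break
            writer.writerow(record)
            written += 1
    return written
```

For example, `write_sample("customers.csv", "customers_sample.csv", rows=10)` produces a file you can upload as a SINGLE_FILE dataset for a first test.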

Next Steps

Now that you’ve created a dataset, you can: