Creating a Dataset

Overview

This tutorial walks you through creating a dataset in Dataframer. You’ll learn how to prepare your data, choose the right dataset type, and upload files.

What You’ll Learn

Preparing your data files
Choosing the correct dataset type
Uploading files via API or UI
Verifying dataset creation

Prerequisites

API key (see Authentication)
Sample data files in supported formats (CSV, JSON, JSONL, TXT, PDF, or MD)

Step 1: Prepare Your Data

Choose Your Dataset Type

SINGLE_FILE: One file containing multiple records

Example: customers.csv with 100 customer records

MULTI_FILE: Multiple independent files

Example: 50 customer review text files

MULTI_FOLDER: Multiple folders, each containing related files

Example: Patient records where each folder = one patient

Not sure which type? See the Datasets guide for detailed comparison.
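
If your data already sits on disk, the three types above map loosely onto how it is laid out. The sketch below is a heuristic only, not part of the Dataframer SDK; the function name suggest_dataset_type is hypothetical, and the rule it applies (file, flat directory, directory with subfolders) is an assumption drawn from the examples above.

```python
from pathlib import Path


def suggest_dataset_type(path: str) -> str:
    """Suggest a dataset type from a local layout (heuristic, not an official rule)."""
    p = Path(path)
    if p.is_file():
        # One file holding many records, e.g. customers.csv
        return "SINGLE_FILE"
    if not p.is_dir():
        raise FileNotFoundError(f"No such file or directory: {path}")
    if any(entry.is_dir() for entry in p.iterdir()):
        # Subfolders group related files, e.g. one folder per patient
        return "MULTI_FOLDER"
    # A flat directory of independent files, e.g. review text files
    return "MULTI_FILE"
```

Treat the result as a starting point and confirm it against the Datasets guide before creating the dataset, since the type cannot be changed afterwards.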

File Requirements

Ensure your files meet these requirements:

Encoding: UTF-8
Size: < 100 MB per file
Format: Valid file format (no corruption)
Naming: Use alphanumeric characters and underscores
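
These requirements can be checked locally before uploading. The following is a minimal sketch, not an official validator: check_file is a hypothetical helper, "100 MB" is assumed to mean 100 × 1024 × 1024 bytes, and the UTF-8 check is skipped for PDF because it is a binary format.

```python
import re
from pathlib import Path

MAX_BYTES = 100 * 1024 * 1024  # assumption: "100 MB" read as 100 MiB
SUPPORTED = {".csv", ".json", ".jsonl", ".txt", ".pdf", ".md"}
TEXT_FORMATS = SUPPORTED - {".pdf"}  # PDF is binary, so no UTF-8 check
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")


def check_file(path: str) -> list[str]:
    """Return the problems found; an empty list means the file passed."""
    p = Path(path)
    problems = []
    suffix = p.suffix.lower()
    if suffix not in SUPPORTED:
        problems.append(f"unsupported extension: {suffix or '(none)'}")
    if not NAME_PATTERN.match(p.stem):
        problems.append("name should use only letters, digits, and underscores")
    if p.stat().st_size >= MAX_BYTES:
        problems.append("file is 100 MB or larger")
    if suffix in TEXT_FORMATS:
        try:
            # Decoding the whole file is the simplest reliable UTF-8 test
            p.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append("file is not valid UTF-8")
    return problems
```

Passing these checks does not guarantee the server will accept the file; it only catches the issues listed above early.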

Step 2: Create a Single-File Dataset

Via API

curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/create/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'name=Customer Database' \
  -F 'dataset_type=SINGLE_FILE' \
  -F 'description=Main customer database export' \
  -F 'file=@customers.csv'

Response:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "Customer Database",
  "dataset_type": "SINGLE_FILE",
  "description": "Main customer database export",
  "created_at": "2025-11-26T10:00:00Z",
  "file_count": 1
}

Via Python

from dataframer import Dataframer

# Initialize client (reads DATAFRAMER_API_KEY from environment)
# Or explicitly: client = Dataframer(api_key="your_api_key")
client = Dataframer()

# Create dataset with file; the with block closes the file even if the upload fails
with open("customers.csv", "rb") as f:
    dataset = client.dataframer.datasets.create_with_files(
        name="Customer Database",
        dataset_type="SINGLE_FILE",
        description="Main customer database export",
        file=f
    )

print(f"Created dataset: {dataset.id}")

Via UI

Log in to https://app.aimon.ai
Navigate to Datasets → Create New
Enter dataset name: “Customer Database”
Select type: Single File
Add description (optional)
Click Choose File and select customers.csv
Click Create Dataset

Step 3: Create a Multi-File Dataset

When you have multiple independent files:

curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/create/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'name=Customer Reviews' \
  -F 'dataset_type=MULTI_FILE' \
  -F 'description=Product review collection' \
  -F '[email protected]' \
  -F '[email protected]' \
  -F '[email protected]' \
  -F '[email protected]' \
  -F '[email protected]'

Python Example

from pathlib import Path
from dataframer import Dataframer

# Initialize client (reads DATAFRAMER_API_KEY from environment)
# Or explicitly: client = Dataframer(api_key="your_api_key")
client = Dataframer()

# Prepare files
review_files = list(Path("./reviews").glob("*.txt"))
files = [open(f, "rb") for f in review_files]

try:
    # Create dataset with multiple files
    dataset = client.dataframer.datasets.create_with_files(
        name="Customer Reviews",
        dataset_type="MULTI_FILE",
        description="Product review collection",
        files=files
    )
finally:
    # Close files even if the upload raises
    for f in files:
        f.close()

print(f"Created dataset with {dataset.file_count} files")

Step 4: Create a Multi-Folder Dataset

For related files grouped in folders:

Prepare Folder Structure

patient_records/
├── patient_001/
│   ├── demographics.json
│   ├── lab_results.csv
│   └── doctor_notes.txt
├── patient_002/
│   ├── demographics.json
│   ├── lab_results.csv
│   └── doctor_notes.txt
└── patient_003/
    ├── demographics.json
    ├── lab_results.csv
    └── doctor_notes.txt

Create ZIP File

# Create ZIP of folder structure
cd patient_records
zip -r ../patient_records.zip .
cd ..
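
Where the zip command is not available, the same archive can be built with Python's standard zipfile module. This is a sketch under one assumption: zip_folder_contents is a hypothetical helper, not part of the Dataframer SDK. Unlike zip -r, it stores file entries only, so empty folders are left out of the archive.

```python
import zipfile
from pathlib import Path


def zip_folder_contents(source_dir: str, zip_path: str) -> list[str]:
    """Zip everything inside source_dir so its subfolders sit at the ZIP root."""
    source = Path(source_dir)
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(source.rglob("*")):
            if file.is_file():
                # Path relative to source_dir, e.g. patient_001/demographics.json
                arcname = file.relative_to(source).as_posix()
                zf.write(file, arcname)
                names.append(arcname)
    return names
```

Write the archive outside the source folder, for example zip_folder_contents("patient_records", "patient_records.zip"), so the ZIP does not end up containing itself.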

Upload ZIP

curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/create/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'name=Patient Records' \
  -F 'dataset_type=MULTI_FOLDER' \
  -F 'description=Anonymized patient medical records' \
  -F 'file=@patient_records.zip'

Python Example

from dataframer import Dataframer

# Initialize client (reads DATAFRAMER_API_KEY from environment)
# Or explicitly: client = Dataframer(api_key="your_api_key")
client = Dataframer()

# Upload ZIP file - backend auto-detects MULTI_FOLDER structure
with open("patient_records.zip", "rb") as zip_file:
    dataset = client.dataframer.datasets.create_from_zip(
        name="Patient Records",
        description="Anonymized patient medical records",
        zip_file=zip_file
    )

print(f"Created dataset: {dataset.id}")
print(f"Type: {dataset.dataset_type} (auto-detected)")
print(f"Files: {dataset.file_count} | Folders: {dataset.folder_count}")

For MULTI_FOLDER datasets, upload a single ZIP file containing the folder structure.

Step 5: Verify Dataset Creation

Check that your dataset was created successfully:

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/datasets/550e8400-e29b-41d4-a716-446655440000/' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Response:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "Customer Database",
  "dataset_type": "SINGLE_FILE",
  "description": "Main customer database export",
  "created_at": "2025-11-26T10:00:00Z",
  "updated_at": "2025-11-26T10:00:00Z",
  "file_count": 1,
  "files": [
    {
      "id": "file_abc123",
      "name": "customers.csv",
      "file_type": "CSV",
      "size": 1048576
    }
  ]
}
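
The same check can be scripted. The sketch below uses only the Python standard library rather than the Dataframer SDK; fetch_dataset and verify_dataset are hypothetical helper names, and the field names (name, dataset_type, file_count, files) are taken from the sample response above.

```python
import json
import urllib.request

BASE_URL = "https://df-api.dataframer.ai/api/dataframer/datasets"


def fetch_dataset(dataset_id: str, api_key: str) -> dict:
    """GET the dataset record, mirroring the curl call above."""
    request = urllib.request.Request(
        f"{BASE_URL}/{dataset_id}/",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def verify_dataset(payload: dict, name: str, dataset_type: str,
                   filenames: list[str]) -> list[str]:
    """Compare a dataset record with what was uploaded; return any mismatches."""
    mismatches = []
    if payload.get("name") != name:
        mismatches.append(f"name is {payload.get('name')!r}, expected {name!r}")
    if payload.get("dataset_type") != dataset_type:
        mismatches.append(
            f"dataset_type is {payload.get('dataset_type')!r}, expected {dataset_type!r}"
        )
    if payload.get("file_count") != len(filenames):
        mismatches.append(
            f"file_count is {payload.get('file_count')!r}, expected {len(filenames)}"
        )
    returned = sorted(f.get("name", "") for f in payload.get("files", []))
    if returned != sorted(filenames):
        mismatches.append(f"files are {returned}, expected {sorted(filenames)}")
    return mismatches
```

A typical call would be verify_dataset(fetch_dataset(dataset_id, api_key), "Customer Database", "SINGLE_FILE", ["customers.csv"]); an empty list means the record matches what you uploaded.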

Step 6: Add More Files (Optional)

Add additional files to an existing dataset:

curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/550e8400-e29b-41d4-a716-446655440000/add_files/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'file=@additional_data.csv'

You cannot add folders to MULTI_FILE datasets. Create a new MULTI_FOLDER dataset instead.

Common Issues

File upload fails

Possible causes:

File exceeds size limit (100 MB)
File is corrupted
Incorrect file format
Not UTF-8 encoded

Solution:

Check file size: ls -lh yourfile.csv
Verify file opens correctly
Ensure proper file extension
Convert to UTF-8: iconv -f ISO-8859-1 -t UTF-8 input.csv > output.csv
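
On systems without iconv, the conversion can be done in Python. This is a minimal sketch: convert_to_utf8 is a hypothetical helper, and the source encoding is assumed to be ISO-8859-1 as in the command above, so pass a different source_encoding if your file uses another one.

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_encoding: str = "iso-8859-1") -> None:
    """Re-encode a text file as UTF-8, like the iconv command above."""
    # Work on bytes so line endings are carried over unchanged
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Decoding with the wrong source encoding may succeed silently and produce garbled characters, so open the output file and spot-check it before uploading.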

Wrong dataset type chosen

Problem: Created SINGLE_FILE but need MULTI_FILE

Solution:

Delete the dataset
Create new dataset with correct type
Re-upload files

Dataset type cannot be changed after creation.

ZIP file rejected for MULTI_FOLDER

Possible causes:

ZIP doesn’t contain folders at root level
Empty folders in ZIP
Incorrect folder structure

Solution:

Ensure ZIP root contains folders (not files)
Remove empty folders
Verify structure: unzip -l yourfile.zip
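
The causes above can also be checked programmatically before uploading. The sketch below mirrors only the listed causes (files at the ZIP root, empty folders); check_zip_structure is a hypothetical helper, and the server's actual validation rules may differ.

```python
import zipfile


def check_zip_structure(zip_path) -> list[str]:
    """Return structural problems in a MULTI_FOLDER ZIP; empty list means none found."""
    with zipfile.ZipFile(zip_path) as zf:
        infos = zf.infolist()
    files = [i.filename for i in infos if not i.is_dir()]
    folders = [i.filename.rstrip("/") for i in infos if i.is_dir()]

    problems = []
    if not files:
        problems.append("ZIP contains no files")
    # A file with no "/" in its name sits at the ZIP root, outside any folder
    for name in files:
        if "/" not in name:
            problems.append(f"file at ZIP root: {name}")
    # A folder entry with no file beneath it is empty
    for folder in folders:
        if not any(name.startswith(folder + "/") for name in files):
            problems.append(f"empty folder: {folder}")
    return problems
```

The function accepts a path or a file-like object, so it can run on the same patient_records.zip you are about to upload.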

Best Practices

✅ Name datasets descriptively: Use clear names that indicate content
✅ Add descriptions: Include purpose, date range, or other context
✅ Verify file quality: Check files open and display correctly
✅ Use consistent formats: Keep file formats consistent within a dataset
✅ Test with small datasets: Start with 5-10 samples for initial testing

Next Steps

Now that you’ve created a dataset, you can:

Generate a Specification

Create a specification from your dataset.

Dataset Management

Learn more about managing datasets.
