Datasets

What is a Dataset?

A dataset in Dataframer is a collection of seed data files that serve as the foundation for generating synthetic samples. Datasets provide the examples and patterns that Dataframer analyzes to understand the structure and requirements of your data.

Dataset Types

Dataframer supports three types of datasets, each suited for different use cases:

SINGLE_FILE

A dataset containing one file with structured or unstructured data.

Use Cases:

Single CSV file with tabular data
JSON/JSONL with consistent schema
Individual documents (TXT, PDF, MD)

Example:

# Create a single-file dataset
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/create/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'name=Product Catalog' \
  -F 'dataset_type=SINGLE_FILE' \
  -F 'file=@YOUR_FILE.csv'

For structured files (CSV, JSON, JSONL), short samples are not supported. Use long samples to preserve the data structure.

MULTI_FILE

A dataset containing multiple independent files of the same or different types.

Use Cases:

Collection of customer reviews
Multiple conversation transcripts
Set of documents with similar structure

Example:

# Create multi-file dataset
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/create/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'name=Customer Feedback' \
  -F 'dataset_type=MULTI_FILE' \
  -F 'file=@YOUR_FILE_1.txt' \
  -F 'file=@YOUR_FILE_2.txt' \
  -F 'file=@YOUR_FILE_3.txt'

Characteristics:

Each file is treated as an independent sample
Files can have different formats
No folder structure supported

MULTI_FOLDER

A dataset containing multiple folders, where each folder represents a complete sample with multiple related files.

Use Cases:

Patient medical records (each folder = one patient)
Project documentation sets
Multi-file test cases

Example Structure:

dataset/
├── patient_001/
│   ├── chest_xray_report.txt
│   └── discharge_summary.md
├── patient_002/
│   ├── blood_work.txt
│   └── clinical_notes.md

Example:

# Upload as ZIP file
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/create/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'name=Medical Records' \
  -F 'dataset_type=MULTI_FOLDER' \
  -F 'file=@YOUR_ARCHIVE.zip'

Characteristics:

Each folder is treated as one complete sample
Maintains relationships between files in each folder
Generated samples preserve folder structure

For detailed constraints, file requirements, and step-by-step instructions, see the Multi-Folder Workflow Tutorial.
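
Before zipping a MULTI_FOLDER dataset, it can help to check the layout locally. The following is a minimal sketch and not part of Dataframer: the function name `check_multi_folder_layout` is hypothetical, and the two rules it enforces (no loose files at the top level, no empty sample folders) are assumptions inferred from "each folder is treated as one complete sample". The tutorial remains the authority on the actual constraints.

```shell
# Hypothetical helper: sanity-check a MULTI_FOLDER layout before zipping it.
# Assumed rules (not confirmed by the Dataframer docs):
#   1. every top-level entry is a folder (one folder = one sample)
#   2. every sample folder contains at least one file
check_multi_folder_layout() {
  local root="$1"
  local status=0
  local count=0
  local entry
  for entry in "$root"/*; do
    [ -e "$entry" ] || continue            # glob matched nothing
    count=$((count + 1))
    if [ ! -d "$entry" ]; then
      echo "loose file at top level: ${entry##*/}"
      status=1
    elif [ -z "$(find "$entry" -type f | head -n 1)" ]; then
      echo "empty sample folder: ${entry##*/}"
      status=1
    fi
  done
  if [ "$count" -eq 0 ]; then
    echo "no sample folders found in: $root"
    status=1
  fi
  return "$status"
}
```

Run against a directory shaped like the example structure above, the function prints nothing and returns 0; any printed line names an entry worth fixing before upload.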

Supported File Formats

Dataframer supports the following file formats, with availability depending on the dataset type:

Format	Extension	Description	SINGLE_FILE	MULTI_FILE	MULTI_FOLDER
CSV	`.csv`	Comma-separated values	✅	✅	✅
JSON	`.json`	JSON object or array	✅	✅	✅
JSONL	`.jsonl`	JSON Lines (one object per line)	✅	✅	✅
Text	`.txt`	Plain text	❌	✅	✅
Markdown	`.md`	Markdown formatted text	❌	✅	✅

Dataset Type Constraints:

SINGLE_FILE: Only structured formats (.csv, .json, .jsonl)
MULTI_FILE: All text formats (.txt, .md, .json, .csv, .jsonl)
MULTI_FOLDER: All text formats (.md, .txt, .json, .csv, .jsonl)
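
The constraints above can be checked locally before an upload. This is a rough sketch rather than anything Dataframer ships: `is_format_allowed` is a hypothetical name, and matching extensions exactly (lowercase only) is an assumption, since the docs do not say how the service treats `.CSV` or similar.

```shell
# Hypothetical helper: succeed (exit 0) if the file's extension is allowed
# for the given dataset type, following the constraint list above.
# Assumption: extensions are compared exactly, in lowercase.
is_format_allowed() {
  local dataset_type="$1"
  local filename="$2"
  # A name with no dot has no extension at all.
  case "$filename" in
    *.*) ;;
    *) return 1 ;;
  esac
  local ext="${filename##*.}"
  case "$dataset_type" in
    SINGLE_FILE)
      case "$ext" in
        csv|json|jsonl) return 0 ;;
      esac ;;
    MULTI_FILE|MULTI_FOLDER)
      case "$ext" in
        csv|json|jsonl|txt|md) return 0 ;;
      esac ;;
  esac
  return 1
}
```

For example, `is_format_allowed SINGLE_FILE notes.txt` fails, while `is_format_allowed MULTI_FILE notes.txt` succeeds.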

For detailed MULTI_FOLDER constraints, see the Multi-Folder Workflow Guide.

Files must be UTF-8 encoded. Other encodings may cause processing errors.
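
One way to catch encoding problems before uploading is to round-trip each file through `iconv`, which exits with a non-zero status when it meets a byte sequence that is not valid UTF-8. The sketch below assumes `iconv` is installed; the function name `list_non_utf8_files` is hypothetical.

```shell
# Hypothetical helper: print the name of every argument file that is not valid UTF-8.
# Relies on iconv returning a non-zero exit status on an invalid byte sequence.
list_non_utf8_files() {
  local f
  for f in "$@"; do
    if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
      echo "$f"
    fi
  done
}
```

An empty result means every file passed. A flagged file can usually be converted with `iconv -f SOURCE_ENCODING -t UTF-8`, where the source encoding is something you have to know or detect yourself.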

Dataset Properties

When creating a dataset, you can specify:

name (required): A descriptive name for your dataset
dataset_type (required): One of SINGLE_FILE, MULTI_FILE, or MULTI_FOLDER
description (optional): Additional context about the dataset
file(s) (required): The actual data files to upload
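
A small pre-flight check can catch a missing required property before a request is sent. This is a sketch under stated assumptions: `validate_create_fields` is a hypothetical local helper, it only checks the required properties named above, and it does not verify that the files exist or call the API.

```shell
# Hypothetical pre-flight check for the required dataset properties.
# Usage: validate_create_fields NAME DATASET_TYPE FILE...
validate_create_fields() {
  if [ "$#" -lt 2 ]; then
    echo "name and dataset_type are required"
    return 1
  fi
  local name="$1"
  local dataset_type="$2"
  shift 2                                   # remaining arguments are files
  if [ -z "$name" ]; then
    echo "name is required"
    return 1
  fi
  case "$dataset_type" in
    SINGLE_FILE|MULTI_FILE|MULTI_FOLDER) ;;
    *)
      echo "dataset_type must be SINGLE_FILE, MULTI_FILE, or MULTI_FOLDER"
      return 1 ;;
  esac
  if [ "$#" -eq 0 ]; then
    echo "at least one file is required"
    return 1
  fi
  return 0
}
```

The optional description is deliberately not checked, since a request is valid without it.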

Best Practices

Dataset Size

Minimum: 3-5 samples for meaningful analysis
Recommended: 10-50 samples for best results
Maximum: 1000 samples per dataset

Larger datasets provide more patterns but take longer to analyze.
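
The size guidance above can be expressed as a simple lookup. In this sketch the bucket names and the function `classify_sample_count` are hypothetical, and two boundaries are assumptions: 3 is taken as the lower bound of the "3-5" minimum, and counts from 6 to 9 are grouped with the minimum range because the guidance does not mention them.

```shell
# Hypothetical helper: map a sample count to the size guidance above.
# Assumed boundaries: fewer than 3 is too few; 3-9 is minimal;
# 10-50 is recommended; 51-1000 is large; above 1000 exceeds the maximum.
classify_sample_count() {
  local n="$1"
  if [ "$n" -lt 3 ]; then
    echo "too_few"
  elif [ "$n" -lt 10 ]; then
    echo "minimal"
  elif [ "$n" -le 50 ]; then
    echo "recommended"
  elif [ "$n" -le 1000 ]; then
    echo "large"
  else
    echo "over_limit"
  fi
}
```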

File Quality

Ensure files are properly formatted
Remove corrupted or incomplete samples
Maintain consistent structure across files
Verify encoding is UTF-8

Choosing Dataset Type

Use SINGLE_FILE for: One CSV/JSON file with multiple rows/objects
Use MULTI_FILE for: Multiple independent documents or conversations
Use MULTI_FOLDER for: Related files that belong together (e.g., medical records, test cases)
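
As a rough illustration of the guidance above, the dataset type can often be guessed from how the seed data is laid out on disk. The function `suggest_dataset_type` is hypothetical, and the rule it applies (a single file, a flat folder of files, or a folder of folders) is a heuristic assumed for this sketch, not Dataframer behavior.

```shell
# Hypothetical heuristic: suggest a dataset type from the layout at a path.
#   a single file                     -> SINGLE_FILE
#   a folder containing subfolders    -> MULTI_FOLDER
#   a folder containing only files    -> MULTI_FILE
suggest_dataset_type() {
  local path="$1"
  if [ -f "$path" ]; then
    echo "SINGLE_FILE"
  elif [ -d "$path" ]; then
    if [ -n "$(find "$path" -mindepth 1 -maxdepth 1 -type d | head -n 1)" ]; then
      echo "MULTI_FOLDER"
    else
      echo "MULTI_FILE"
    fi
  else
    echo "unknown"
    return 1
  fi
}
```

The suggestion is only a starting point: whether the files in a folder are independent or belong together is a judgment about the data that a directory listing cannot make.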

File Naming

Use descriptive, consistent filenames
Avoid special characters (stick to alphanumeric and underscores)
Include file extensions
For MULTI_FOLDER, use meaningful folder names
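
The naming advice above can be turned into a quick local check. This sketch reads "alphanumeric and underscores" as ASCII only and requires exactly one extension, both of which are assumptions; the function name `is_safe_filename` is hypothetical.

```shell
# Hypothetical helper: succeed if a filename uses only ASCII letters, digits,
# and underscores, followed by a single extension (for example report_01.txt).
is_safe_filename() {
  # LC_ALL=C keeps the bracket ranges limited to ASCII characters.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[A-Za-z0-9_]+\.[A-Za-z0-9]+$'
}
```

For example, `is_safe_filename chest_xray_report.txt` succeeds, while a name containing spaces or lacking an extension fails.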

Dataset Management

Viewing Datasets

List all your datasets:

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/datasets/' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Get details of a specific dataset:

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/datasets/{dataset_id}/' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Updating Datasets

Add files to an existing dataset:

curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/{dataset_id}/add_files/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'file=@new_file.txt'

You cannot add folders to MULTI_FILE datasets. Create a new MULTI_FOLDER dataset instead.

Deleting Datasets

curl -X DELETE 'https://df-api.dataframer.ai/api/dataframer/datasets/{dataset_id}/' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Deleting a dataset also deletes all associated specifications and generated samples. This action cannot be undone.

Sample Compatibility

Different dataset types support different sample generation modes:

Dataset Type	Short Samples	Long Samples
SINGLE_FILE (structured)	❌	✅
SINGLE_FILE (unstructured)	✅	✅
MULTI_FILE	✅	✅
MULTI_FOLDER	✅	✅

Learn more about sample types in the Generation guide.
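
The compatibility table can also be encoded as a small check. In this sketch, `supports_short_samples` is a hypothetical name, and "structured" is taken to mean the `.csv`, `.json`, and `.jsonl` extensions; long samples are supported in every row of the table, so only the short-sample case needs a test.

```shell
# Hypothetical helper: succeed if short samples are supported for the given
# dataset type and seed filename, following the compatibility table.
# Exit status: 0 = supported, 1 = not supported, 2 = unknown dataset type.
supports_short_samples() {
  local dataset_type="$1"
  local filename="${2:-}"
  case "$dataset_type" in
    SINGLE_FILE)
      case "$filename" in
        *.csv|*.json|*.jsonl) return 1 ;;   # structured: long samples only
        *) return 0 ;;
      esac ;;
    MULTI_FILE|MULTI_FOLDER)
      return 0 ;;
    *)
      return 2 ;;
  esac
}
```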

Next Steps

Create Specifications

Learn how to generate specifications from your datasets.

Dataset Tutorial

Follow a step-by-step tutorial for creating datasets.
