What is a Dataset?

A dataset in Dataframer is a collection of seed data files that serve as the foundation for generating synthetic samples. Datasets provide the examples and patterns that Dataframer analyzes to understand the structure and requirements of your data.

Dataset Types

Dataframer supports three types of datasets, each suited for different use cases:

SINGLE_FILE

A dataset containing one file with structured or unstructured data.
Use Cases:
  • Single CSV file with tabular data
  • JSON/JSONL with consistent schema
  • Individual documents (TXT, PDF, MD)
Example:
# Create a single-file dataset
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/create/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'name=Product Catalog' \
  -F 'dataset_type=SINGLE_FILE' \
  -F 'file=@your_file.csv'
For structured files (CSV, JSON, JSONL), short samples are not supported. Use long samples to preserve the data structure.

MULTI_FILE

A dataset containing multiple independent files of the same or different types.
Use Cases:
  • Collection of customer reviews
  • Multiple conversation transcripts
  • Set of documents with similar structure
Example:
# Create multi-file dataset
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/create/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'name=Customer Feedback' \
  -F 'dataset_type=MULTI_FILE' \
  -F 'file=@your_file_1.txt' \
  -F 'file=@your_file_2.txt' \
  -F 'file=@your_file_3.txt'
Characteristics:
  • Each file is treated as an independent sample
  • Files can have different formats
  • No folder structure supported

MULTI_FOLDER

A dataset containing multiple folders, where each folder represents a complete sample with multiple related files.
Use Cases:
  • Patient medical records (each folder = one patient)
  • Project documentation sets
  • Multi-file test cases
Example Structure:
dataset/
├── patient_001/
│   ├── chest_xray_report.txt
│   └── discharge_summary.md
├── patient_002/
│   ├── blood_work.txt
│   └── clinical_notes.md
Example:
# Upload as ZIP file
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/create/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'name=Medical Records' \
  -F 'dataset_type=MULTI_FOLDER' \
  -F 'file=@your_dataset.zip'
Characteristics:
  • Each folder is treated as one complete sample
  • Maintains relationships between files in each folder
  • Generated samples preserve folder structure
For detailed constraints, file requirements, and step-by-step instructions, see the Multi-Folder Workflow Tutorial.
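The folder layout shown above can be packaged into a ZIP file with a short script. This is a minimal sketch using Python's standard zipfile module. The helper name zip_dataset_folders is hypothetical, and placing each sample folder at the root of the archive is an assumption; the Multi-Folder Workflow Tutorial is the authority on the layout Dataframer expects.

```python
import zipfile
from pathlib import Path


def zip_dataset_folders(dataset_dir: str, zip_path: str) -> list[str]:
    """Zip every file under dataset_dir, keeping <sample_folder>/<file> paths.

    Hypothetical helper: the archive layout (sample folders at the archive
    root) is an assumption, not a documented requirement.
    Returns the archive member names in sorted order.
    """
    root = Path(dataset_dir)
    # Collect files before creating the archive so the ZIP never includes itself.
    files = sorted(p for p in root.rglob("*") if p.is_file())
    members = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store paths relative to the dataset root, e.g. "patient_001/notes.md".
            arcname = path.relative_to(root).as_posix()
            archive.write(path, arcname)
            members.append(arcname)
    return members
```

The returned list makes it easy to confirm, before uploading, that every file sits inside exactly one sample folder.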

Supported File Formats

Dataframer supports the following file formats, with availability depending on the dataset type:
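Format   | Extension | Description                      | SINGLE_FILE | MULTI_FILE | MULTI_FOLDER
CSV      | .csv      | Comma-separated values           | Yes         | Yes        | Yes
JSON     | .json     | JSON object or array             | Yes         | Yes        | Yes
JSONL    | .jsonl    | JSON Lines (one object per line) | Yes         | Yes        | Yes
Text     | .txt      | Plain text                       | No          | Yes        | Yes
Markdown | .md       | Markdown formatted text          | No          | Yes        | Yes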
Dataset Type Constraints:
  • SINGLE_FILE: Only structured formats (.csv, .json, .jsonl)
  • MULTI_FILE: All text formats (.txt, .md, .json, .csv, .jsonl)
  • MULTI_FOLDER: All text formats (.md, .txt, .json, .csv, .jsonl)
For detailed MULTI_FOLDER constraints, see the Multi-Folder Workflow Guide.
Files must be UTF-8 encoded. Other encodings may cause processing errors.
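Both the extension constraints and the UTF-8 requirement can be checked locally before uploading. The sketch below is one way to do that with the Python standard library. The name check_file is hypothetical, the extension sets are copied from the constraints listed above, and treating extensions as case-insensitive is an assumption.

```python
from pathlib import Path

# Allowed extensions per dataset type, copied from the constraints listed above.
ALLOWED_EXTENSIONS = {
    "SINGLE_FILE": {".csv", ".json", ".jsonl"},
    "MULTI_FILE": {".txt", ".md", ".json", ".csv", ".jsonl"},
    "MULTI_FOLDER": {".txt", ".md", ".json", ".csv", ".jsonl"},
}


def check_file(path: str, dataset_type: str) -> list[str]:
    """Return a list of problems found; an empty list means the file passed."""
    problems = []
    file_path = Path(path)
    # Assumption: extensions are compared case-insensitively.
    if file_path.suffix.lower() not in ALLOWED_EXTENSIONS[dataset_type]:
        problems.append(f"extension {file_path.suffix!r} not allowed for {dataset_type}")
    try:
        # Decoding the raw bytes is the most direct test for valid UTF-8.
        file_path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
    return problems
```

A file that fails the encoding check can usually be re-saved as UTF-8 from the tool that produced it.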

Dataset Properties

When creating a dataset, you can specify:
  • name (required): A descriptive name for your dataset
  • dataset_type (required): One of SINGLE_FILE, MULTI_FILE, or MULTI_FOLDER
  • description (optional): Additional context about the dataset
  • file(s) (required): The actual data files to upload
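To show how these properties map onto a create request, here is a minimal sketch that validates them and assembles a curl argument list. The function build_create_command is hypothetical. It assumes the upload form field is named file (as in the add_files example under Updating Datasets) and that description is sent as an ordinary form field.

```python
CREATE_URL = "https://df-api.dataframer.ai/api/dataframer/datasets/create/"
DATASET_TYPES = ("SINGLE_FILE", "MULTI_FILE", "MULTI_FOLDER")


def build_create_command(api_key, name, dataset_type, files, description=None):
    """Return a curl argument list for the create endpoint shown in this guide."""
    if not name:
        raise ValueError("name is required")
    if dataset_type not in DATASET_TYPES:
        raise ValueError(f"dataset_type must be one of {DATASET_TYPES}")
    if not files:
        raise ValueError("at least one file is required")
    if dataset_type == "SINGLE_FILE" and len(files) != 1:
        raise ValueError("SINGLE_FILE datasets contain exactly one file")
    command = [
        "curl", "-X", "POST", CREATE_URL,
        "-H", f"Authorization: Bearer {api_key}",
        "-F", f"name={name}",
        "-F", f"dataset_type={dataset_type}",
    ]
    if description:
        # Assumption: description is passed as a plain form field.
        command += ["-F", f"description={description}"]
    for path in files:
        # Assumption: uploads use the form field "file", as in the add_files example.
        command += ["-F", f"file=@{path}"]
    return command
```

The resulting list can be passed to subprocess.run(command, check=True). Building a list instead of a shell string avoids quoting problems with names that contain spaces.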

Best Practices

  • Minimum: 3-5 samples for meaningful analysis
  • Recommended: 10-50 samples for best results
  • Maximum: 1000 samples per dataset
Larger datasets provide more patterns but take longer to analyze.
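The thresholds above can be turned into a quick pre-upload check. This is an illustrative sketch only: the function name and the labels it returns are hypothetical, and reading "3-5" as a hard floor of 3 is an assumption.

```python
def assess_sample_count(count: int) -> str:
    """Classify a sample count against the guidance in this section."""
    if count > 1000:
        return "over_limit"          # exceeds the 1000-sample maximum
    if count < 3:
        return "too_few"             # assumption: 3 is the practical floor
    if count < 10:
        return "below_recommended"   # workable, but under the 10-50 range
    if count <= 50:
        return "recommended"
    return "above_recommended"       # allowed, but analysis takes longer
```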
  • Ensure files are properly formatted
  • Remove corrupted or incomplete samples
  • Maintain consistent structure across files
  • Verify encoding is UTF-8
  • Use SINGLE_FILE for one CSV/JSON file with multiple rows or objects
  • Use MULTI_FILE for multiple independent documents or conversations
  • Use MULTI_FOLDER for related files that belong together (e.g., medical records, test cases)
  • Use descriptive, consistent filenames
  • Avoid special characters (stick to alphanumeric and underscores)
  • Include file extensions
  • For MULTI_FOLDER, use meaningful folder names
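The file-naming advice above can be enforced with a small check before upload. The sketch below takes a deliberately strict reading (ASCII letters, digits, and underscores, followed by a single extension). The pattern and the name is_safe_filename are assumptions for illustration, not rules enforced by Dataframer.

```python
import re

# One or more ASCII letters, digits, or underscores, then a single extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+\.[A-Za-z0-9]+")


def is_safe_filename(filename: str) -> bool:
    """Return True if the name follows the strict naming convention above."""
    return SAFE_NAME.fullmatch(filename) is not None
```

For example, chest_xray_report.txt passes, while "my file.txt" (contains a space) and "notes" (no extension) do not.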

Dataset Management

Viewing Datasets

List all your datasets:
curl -X GET 'https://df-api.dataframer.ai/api/dataframer/datasets/' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Get details of a specific dataset:
curl -X GET 'https://df-api.dataframer.ai/api/dataframer/datasets/{dataset_id}/' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Updating Datasets

Add files to an existing dataset:
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/datasets/{dataset_id}/add_files/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'file=@new_file.txt'
You cannot add folders to MULTI_FILE datasets. Create a new MULTI_FOLDER dataset instead.

Deleting Datasets

curl -X DELETE 'https://df-api.dataframer.ai/api/dataframer/datasets/{dataset_id}/' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Deleting a dataset also deletes all associated specifications and generated samples. This action cannot be undone.
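The list, detail, and delete calls in this section can also be built from Python. The sketch below builds the request object without sending it, because the response format is not described on this page. The helper dataset_request is hypothetical; the URLs and the Authorization header are taken from the curl examples above.

```python
import urllib.request

BASE_URL = "https://df-api.dataframer.ai/api/dataframer/datasets/"


def dataset_request(api_key: str, method: str = "GET", dataset_id: str = "") -> urllib.request.Request:
    """Build (but do not send) a request for the list, detail, or delete endpoint."""
    if method not in ("GET", "DELETE"):
        raise ValueError("method must be GET or DELETE")
    if method == "DELETE" and not dataset_id:
        # Guard against issuing a DELETE without naming a dataset.
        raise ValueError("DELETE requires a dataset_id")
    url = BASE_URL + (f"{dataset_id}/" if dataset_id else "")
    return urllib.request.Request(
        url, method=method, headers={"Authorization": f"Bearer {api_key}"}
    )
```

Passing the result to urllib.request.urlopen sends the request. Because deletion cannot be undone, keeping the build step separate from the send step leaves room for a confirmation prompt in between.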

Sample Compatibility

Different dataset types support different sample generation modes:
Dataset Type | Short Samples | Long Samples
SINGLE_FILE (structured) | Not supported | Supported
SINGLE_FILE (unstructured)
MULTI_FILE
MULTI_FOLDER
Learn more about sample types in the Generation guide.

Next Steps