What is a Dataset?
A dataset in Dataframer is a collection of seed data files that serve as the foundation for generating synthetic samples. Datasets provide the examples and patterns that Dataframer analyzes to understand the structure and requirements of your data.
Dataset Types
Dataframer supports three types of datasets, each suited for different use cases:
SINGLE_FILE
A dataset containing one file with structured or unstructured data.
Use Cases:
- Single CSV file with tabular data
- JSON/JSONL with consistent schema
- Individual documents (TXT, PDF, MD)
For structured files (CSV, JSON, JSONL), short samples are not supported. Use long samples to preserve the data structure.
MULTI_FILE
A dataset containing multiple independent files of the same or different types.
Use Cases:
- Collection of customer reviews
- Multiple conversation transcripts
- Set of documents with similar structure
Characteristics:
- Each file is treated as an independent sample
- Files can have different formats
- Folder structure is not supported
MULTI_FOLDER
A dataset containing multiple folders, where each folder represents a complete sample made up of multiple related files.
Use Cases:
- Patient medical records (each folder = one patient)
- Project documentation sets
- Multi-file test cases
Characteristics:
- Each folder is treated as one complete sample
- Relationships between the files in each folder are maintained
- Generated samples preserve the folder structure
For detailed constraints, file requirements, and step-by-step instructions, see the Multi-Folder Workflow Tutorial.
Supported File Formats
Dataframer supports the following file formats, with availability depending on the dataset type:
| Format | Extension | Description | SINGLE_FILE | MULTI_FILE | MULTI_FOLDER |
|---|---|---|---|---|---|
| CSV | .csv | Comma-separated values | ✅ | ✅ | ✅ |
| JSON | .json | JSON object or array | ✅ | ✅ | ✅ |
| JSONL | .jsonl | JSON Lines (one object per line) | ✅ | ✅ | ✅ |
| Text | .txt | Plain text | ❌ | ✅ | ✅ |
| Markdown | .md | Markdown formatted text | ❌ | ✅ | ✅ |
Dataset Type Constraints:
- SINGLE_FILE: Only structured formats (.csv, .json, .jsonl)
- MULTI_FILE: All text formats (.txt, .md, .json, .csv, .jsonl)
- MULTI_FOLDER: All text formats (.md, .txt, .json, .csv, .jsonl)
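The constraints above can be checked locally before uploading. The following is a minimal sketch, not part of Dataframer itself: the names `ALLOWED_EXTENSIONS` and `unsupported_files` are hypothetical, and the extension sets are simply copied from the table.

```python
from pathlib import Path

# Allowed extensions per dataset type, copied from the format table.
# This mapping is an illustration, not an official Dataframer constant.
ALLOWED_EXTENSIONS = {
    "SINGLE_FILE": {".csv", ".json", ".jsonl"},
    "MULTI_FILE": {".txt", ".md", ".json", ".csv", ".jsonl"},
    "MULTI_FOLDER": {".txt", ".md", ".json", ".csv", ".jsonl"},
}


def unsupported_files(dataset_type: str, filenames: list[str]) -> list[str]:
    """Return the filenames whose extension is not allowed for dataset_type."""
    allowed = ALLOWED_EXTENSIONS[dataset_type]
    return [name for name in filenames if Path(name).suffix.lower() not in allowed]
```

An empty result means every file has an extension the chosen dataset type accepts.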
Dataset Properties
When creating a dataset, you can specify:
- name (required): A descriptive name for your dataset
- dataset_type (required): One of SINGLE_FILE, MULTI_FILE, or MULTI_FOLDER
- description (optional): Additional context about the dataset
- file(s) (required): The actual data files to upload
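As a rough illustration of how these properties fit together, the sketch below models them as a small Python dataclass with a completeness check. `DatasetSpec` and its `validate` method are hypothetical helpers for local use; they are not Dataframer's API, and the field names only mirror the property names listed above.

```python
from dataclasses import dataclass
from typing import Optional

DATASET_TYPES = ("SINGLE_FILE", "MULTI_FILE", "MULTI_FOLDER")


@dataclass
class DatasetSpec:
    """Hypothetical container for the dataset properties (not an official type)."""

    name: str
    dataset_type: str
    files: list[str]
    description: Optional[str] = None

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the spec looks complete."""
        problems = []
        if not self.name.strip():
            problems.append("name is required")
        if self.dataset_type not in DATASET_TYPES:
            problems.append(f"dataset_type must be one of {DATASET_TYPES}")
        if not self.files:
            problems.append("at least one file is required")
        # A SINGLE_FILE dataset is described as containing one file.
        if self.dataset_type == "SINGLE_FILE" and len(self.files) > 1:
            problems.append("SINGLE_FILE datasets contain exactly one file")
        return problems
```

For example, `DatasetSpec("reviews", "MULTI_FILE", ["a.txt", "b.txt"]).validate()` returns an empty list.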
Best Practices
Dataset Size
- Minimum: 3-5 samples for meaningful analysis
- Recommended: 10-50 samples for best results
- Maximum: 1000 samples per dataset
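The size guidance above can be turned into a quick pre-upload check. This is a sketch under stated assumptions: `assess_sample_count` is a hypothetical helper, and it treats 3 as the hard lower bound of the "3-5" minimum, which is one possible reading of that range.

```python
def assess_sample_count(n: int) -> str:
    """Classify a sample count against the size guidance.

    Assumption: 3 is used as the lower bound of the stated 3-5 minimum.
    """
    if n > 1000:
        return "too_many"      # above the per-dataset maximum
    if n < 3:
        return "too_few"       # below the minimum for meaningful analysis
    if 10 <= n <= 50:
        return "recommended"   # inside the recommended range
    return "acceptable"        # allowed, but outside the recommended range
```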
File Quality
- Ensure files are properly formatted
- Remove corrupted or incomplete samples
- Maintain consistent structure across files
- Verify encoding is UTF-8
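Several of these checks can be automated with the Python standard library. The sketch below is one possible approach, not a Dataframer feature: `check_seed_file` is a hypothetical helper that verifies UTF-8 encoding, parses JSON and JSONL, and flags CSV rows with differing column counts.

```python
import csv
import io
import json
from pathlib import PurePath


def check_seed_file(filename: str, data: bytes) -> list[str]:
    """Return a list of problems found in one seed file (empty if none)."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return ["not valid UTF-8"]
    if not text.strip():
        return ["file is empty"]

    problems = []
    suffix = PurePath(filename).suffix.lower()
    if suffix == ".json":
        try:
            json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append(f"invalid JSON: {exc.msg}")
    elif suffix == ".jsonl":
        # JSON Lines: every non-blank line must be a standalone JSON value.
        for lineno, line in enumerate(text.splitlines(), start=1):
            if not line.strip():
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON: {exc.msg}")
    elif suffix == ".csv":
        rows = csv.reader(io.StringIO(text, newline=""))
        widths = {len(row) for row in rows if row}
        if len(widths) > 1:
            problems.append("rows have inconsistent column counts")
    return problems
```

To check a file on disk, pass its name and raw bytes, for example `check_seed_file(path.name, path.read_bytes())` for a `pathlib.Path` object.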
Choosing Dataset Type
- Use SINGLE_FILE for: One CSV/JSON file with multiple rows/objects
- Use MULTI_FILE for: Multiple independent documents or conversations
- Use MULTI_FOLDER for: Related files that belong together (e.g., medical records, test cases)
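The guidance above can be approximated by looking at how your seed files are laid out. The following heuristic is only a sketch: `suggest_dataset_type` is a hypothetical function, and it looks at relative paths alone, so it does not check file formats or whether the files in a folder are actually related.

```python
from pathlib import PurePosixPath


def suggest_dataset_type(relative_paths: list[str]) -> str:
    """Suggest a dataset type from relative, forward-slash file paths."""
    if not relative_paths:
        raise ValueError("at least one file is required")
    # Any file nested in a folder suggests folder-per-sample data.
    if any(len(PurePosixPath(p).parts) > 1 for p in relative_paths):
        return "MULTI_FOLDER"
    if len(relative_paths) == 1:
        return "SINGLE_FILE"
    return "MULTI_FILE"
```

For example, `["patient_1/notes.txt", "patient_1/labs.csv"]` yields `"MULTI_FOLDER"`, while a lone `["reviews.csv"]` yields `"SINGLE_FILE"`.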
File Naming
- Use descriptive, consistent filenames
- Avoid special characters (stick to alphanumeric and underscores)
- Include file extensions
- For MULTI_FOLDER, use meaningful folder names
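The naming rules above can be expressed as a simple pattern check. This is an illustrative sketch: `is_safe_filename` is a hypothetical helper, and the pattern is a strict reading of "alphanumeric and underscores" plus a required extension, so names containing hyphens or spaces are rejected.

```python
import re

# Letters, digits, and underscores, followed by one extension such as ".csv".
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+\.[A-Za-z0-9]+")


def is_safe_filename(name: str) -> bool:
    """Return True if name uses only safe characters and has an extension."""
    return SAFE_NAME.fullmatch(name) is not None
```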
Dataset Management
Viewing Datasets
List all your datasets:
Updating Datasets
Add files to an existing dataset:
You cannot add folders to MULTI_FILE datasets. Create a new MULTI_FOLDER dataset instead.
Deleting Datasets
Sample Compatibility
Different dataset types support different sample generation modes:
| Dataset Type | Short Samples | Long Samples |
|---|---|---|
| SINGLE_FILE (structured) | ❌ | ✅ |
| SINGLE_FILE (unstructured) | ✅ | ✅ |
| MULTI_FILE | ✅ | ✅ |
| MULTI_FOLDER | ✅ | ✅ |
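The compatibility table can also be encoded as a small lookup. In this sketch, `supported_sample_modes` and the mode labels `"short"` and `"long"` are hypothetical names chosen for illustration; "structured" is taken to mean the CSV, JSON, and JSONL extensions.

```python
STRUCTURED_EXTENSIONS = {".csv", ".json", ".jsonl"}
DATASET_TYPES = {"SINGLE_FILE", "MULTI_FILE", "MULTI_FOLDER"}


def supported_sample_modes(dataset_type: str, extension: str = "") -> set[str]:
    """Return the sample modes that the compatibility table allows."""
    if dataset_type not in DATASET_TYPES:
        raise ValueError(f"unknown dataset type: {dataset_type}")
    # Structured single files support long samples only.
    if dataset_type == "SINGLE_FILE" and extension.lower() in STRUCTURED_EXTENSIONS:
        return {"long"}
    return {"short", "long"}
```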

