Prerequisites
- Access to the Dataframer platform
- Sample data (a CSV or JSONL file, or a set of text files) with at least 2 examples
Step 1: Upload seed data
Navigate to Seed Datasets and click Upload Dataset. Choose your upload mode:
Single File
Upload one CSV, JSON, or JSONL file containing structured data.
- Example: CSV with product reviews, JSONL with chat messages
- Max file size: 50MB
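The Single File limits above can be checked locally before uploading. The following is a minimal sketch, not part of Dataframer itself: the function name `check_single_file` is hypothetical, and treating 1MB as 1024 * 1024 bytes is an assumption, since the text does not say how the limit is measured.

```python
from pathlib import Path

# Limits taken from the Single File description above.
# Assumption: 1MB is counted as 1024 * 1024 bytes.
MAX_SINGLE_FILE_BYTES = 50 * 1024 * 1024
SINGLE_FILE_EXTENSIONS = {".csv", ".json", ".jsonl"}


def check_single_file(path):
    """Return a list of problems found; an empty list means no problem was detected."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} is not a file"]

    problems = []
    if path.suffix.lower() not in SINGLE_FILE_EXTENSIONS:
        problems.append(f"unsupported extension: {path.suffix or '(none)'}")

    size = path.stat().st_size
    if size > MAX_SINGLE_FILE_BYTES:
        problems.append(f"file is {size} bytes, over the 50MB limit")
    return problems
```

Calling `check_single_file("reviews.csv")` on a small CSV file should return an empty list; any returned strings describe what to fix before uploading.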
Multiple Files
Upload a folder of independent text files (each file = one sample).
- Example: Collection of documents, code snippets
- Max 1,000 files
- 1MB per file, 50MB total
- Supported: TXT, MD, JSON, CSV, JSONL
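A folder can be checked against the Multiple Files limits before upload. This is a rough sketch under stated assumptions: `check_multiple_files` is a hypothetical helper, 1MB is assumed to mean 1024 * 1024 bytes, and the 2-file minimum comes from the prerequisite of at least 2 examples.

```python
from pathlib import Path

# Limits taken from the Multiple Files description above.
# Assumption: 1MB is counted as 1024 * 1024 bytes.
MAX_FILES = 1000
MAX_FILE_BYTES = 1 * 1024 * 1024
MAX_TOTAL_BYTES = 50 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".txt", ".md", ".json", ".csv", ".jsonl"}


def check_multiple_files(folder):
    """Return a list of problems found in a flat folder of sample files."""
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    problems = []

    if len(files) < 2:
        problems.append(f"found {len(files)} file(s), need at least 2 samples")
    if len(files) > MAX_FILES:
        problems.append(f"found {len(files)} files, limit is {MAX_FILES}")

    total_bytes = 0
    for f in files:
        size = f.stat().st_size
        total_bytes += size
        if f.suffix.lower() not in SUPPORTED_EXTENSIONS:
            problems.append(f"{f.name}: unsupported extension")
        if size > MAX_FILE_BYTES:
            problems.append(f"{f.name}: larger than 1MB")

    if total_bytes > MAX_TOTAL_BYTES:
        problems.append(f"total size is {total_bytes} bytes, over the 50MB limit")
    return problems
```

Only files directly inside the folder are inspected, which matches the "each file = one sample" layout described above.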
Multiple Folders
Upload a parent folder containing subfolders (each subfolder = one multi-file sample).
- Example: Code repositories with multiple files per project
- Min 2 folders required
- Max 20 files per folder
- Max depth: parent/subfolder/file.txt
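The Multiple Folders layout rules can also be verified locally. The sketch below is illustrative only: `check_multiple_folders` is a hypothetical name, and it reads "Max depth: parent/subfolder/file.txt" as meaning that subfolders must not contain further folders.

```python
from pathlib import Path

# Limits taken from the Multiple Folders description above.
MIN_FOLDERS = 2
MAX_FILES_PER_FOLDER = 20


def check_multiple_folders(parent):
    """Return a list of problems found in a parent/subfolder/file layout."""
    subfolders = sorted(p for p in Path(parent).iterdir() if p.is_dir())
    problems = []

    if len(subfolders) < MIN_FOLDERS:
        problems.append(
            f"found {len(subfolders)} subfolder(s), need at least {MIN_FOLDERS}"
        )

    for sub in subfolders:
        entries = list(sub.iterdir())
        nested = sorted(e.name for e in entries if e.is_dir())
        file_count = sum(1 for e in entries if e.is_file())
        # Assumption: anything deeper than parent/subfolder/file is not allowed.
        if nested:
            problems.append(f"{sub.name}: contains nested folders {nested}")
        if file_count > MAX_FILES_PER_FOLDER:
            problems.append(
                f"{sub.name}: {file_count} files, limit is {MAX_FILES_PER_FOLDER}"
            )
    return problems
```

Each subfolder is treated as one sample, so a parent folder with two project subfolders and no deeper nesting should produce an empty list.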
Step 2: Create specification
Once your dataset is uploaded, click Create Spec.
- Name: Give your spec a descriptive name
- Generation objectives (optional but recommended): Guide the analysis
- Example: “Include writing style and formality as properties”
- Example: “Don’t treat length as a variable”
The specification captures data structure, discovered properties, and probability distributions from your seeds. You can edit these to make generated data deviate from the seed patterns.
Step 3: Configure generation run
Once your spec shows “Ready” status, click Create Run. Choose Long Samples mode and set Number of samples to 10 for a quick test. Use default settings for everything else.
Step 4: Monitor progress
Your run starts immediately. Watch real-time status (Pending → Running → Finished), progress percentage, and elapsed time on the Runs page. On our SaaS platform, generating 10 samples can take anywhere from 1 minute (for short text samples) to 1 hour (for structured 100K-token samples). We place a strong emphasis on quality: a single very large, complex sample can require 100+ diverse LLM calls to generate properly.
Step 5: Review results
Once finished, click your run to view results.
Generated Dataset tab:
- Browse generated samples
- Preview samples inline
- See property tags for each sample (formality, complexity, domain, etc.)
- Download individual samples or entire dataset as ZIP
- Distributions: Compare what you specified in the spec vs what our evaluation classifiers measured in generated samples
- Classification: View property assignments per sample
- Chat: Ask questions about your dataset
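To make the Distributions comparison concrete, here is a small sketch of how a specified distribution could be compared with the values measured in generated samples. It is not Dataframer's evaluation code: the helper names are hypothetical, and total variation distance is just one reasonable way to summarize the gap.

```python
from collections import Counter


def measured_distribution(labels):
    """Turn a list of per-sample property labels into relative frequencies."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    return {value: n / len(labels) for value, n in counts.items()}


def total_variation_distance(specified, measured):
    """Half the sum of absolute differences: 0 means identical, 1 means disjoint."""
    values = set(specified) | set(measured)
    return 0.5 * sum(
        abs(specified.get(v, 0.0) - measured.get(v, 0.0)) for v in values
    )


# Hypothetical example: a spec asking for an even split on a "formality"
# property, and labels exported for 10 generated samples.
specified = {"formal": 0.5, "casual": 0.5}
labels = ["formal"] * 7 + ["casual"] * 3
gap = total_variation_distance(specified, measured_distribution(labels))
```

In this made-up example the measured split is 70/30 against a specified 50/50, so `gap` comes out at about 0.2.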
Next steps
Now that you’ve generated your first dataset:
- Core Concepts: Learn about data properties, distributions, and generation modes
- Complete Workflow: Explore all features and configuration options
Common issues
Can't upload a dataset
Make sure you’ve selected the correct upload type (Single File / Multiple Files / Multiple Folders) for your data structure, and that you’ve specified a dataset name.
Confused about samples vs datasets
A sample is the unit that gets imitated:
- CSV/JSONL: One row = one sample
- Multiple Files: One file = one sample
- Multiple Folders: One folder = one sample
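The mapping above can be turned into a quick local count of how many samples a dataset contains. This is a sketch under assumptions: `count_samples` and its mode strings are hypothetical names, the CSV is assumed to have a header row that is not itself a sample, and blank JSONL lines are ignored.

```python
import csv
from pathlib import Path


def count_samples(path, mode):
    """Count samples using the rules above: row, file, or folder per sample."""
    path = Path(path)
    if mode == "single_file":
        suffix = path.suffix.lower()
        if suffix == ".csv":
            # Assumption: the first row is a header, not a sample.
            with path.open(newline="", encoding="utf-8") as f:
                return sum(1 for _ in csv.DictReader(f))
        if suffix == ".jsonl":
            with path.open(encoding="utf-8") as f:
                return sum(1 for line in f if line.strip())
        raise ValueError(f"no row-counting rule sketched for {suffix!r}")
    if mode == "multiple_files":
        return sum(1 for p in path.iterdir() if p.is_file())
    if mode == "multiple_folders":
        return sum(1 for p in path.iterdir() if p.is_dir())
    raise ValueError(f"unknown mode: {mode!r}")
```

A result below 2 means the dataset does not yet meet the minimum number of examples.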
I don't have 2+ examples
We’re working on seedless generation (generating from 0-1 examples), but it’s not yet in production. For now, a minimum of 2 examples is required, so you can either:
- Find at least one more similar example
- Generate a second example with ChatGPT to use as a starting point
I don't see the variation in data that I specified
Check these:
- Verify variation is specified in Variable data properties, NOT Shared data properties. Variable properties change across samples; shared properties stay constant.
- Ensure you have enough seed examples. The empirical distribution of properties in your generated data will only approximate your specified distribution if you have sufficient examples.
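The point about sufficient examples is, at least in part, a sampling-noise effect. As a rough illustration, assume each sample's value for a property is an independent draw with probability p. This is a simplifying assumption, not a description of how Dataframer generates data, and `proportion_standard_error` is a hypothetical helper. Under that assumption, the observed share of a value typically deviates from p by about the standard error below.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of an observed proportion for n independent draws."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


# Specified share of 30% for one property value, at different sample counts.
for n in (10, 100, 1000):
    print(n, round(proportion_standard_error(0.3, n), 3))
```

With only 10 samples the typical deviation is around 14 percentage points, so a specified 30% share can easily show up as 10% or 50%. With 1,000 samples it shrinks to roughly 1.4 points.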
Complex structure or features not reproduced correctly
If your data has a very complex structure and the generated samples don’t capture it:
- Enable revisions if not already enabled
- Increase max revision cycles to 5

