Overview
Dataframer generates synthetic data using distribution-based generation. Instead of templates or rules, it learns patterns and probability distributions from your seed data, then generates new samples that match those distributions. A sample is the unit that gets imitated: in CSV/JSONL it’s a row; with multiple files it’s a file; with multiple folders it’s a folder.
Three-stage pipeline
1. Seed Data
You provide examples that define what “good” looks like. Minimum 2 examples required. These seeds should be clean, consistent, and representative of your target data.
2. Specification
AI analyzes seeds to extract:
- Data structure and format
- Properties (tone, sentiment, domain, length, etc.)
- Probability distributions for each property
- Dependencies between properties (optional)
3. Generation
Generate new samples at scale:
- Sample property values from distributions
- Generate content matching those properties
- Validate and refine for quality
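To make the pipeline concrete, here is a minimal Python sketch of the generation stage, assuming a toy spec with two properties; `generate_content` is a hypothetical stand-in for the model-backed generation step, not Dataframer's API.

```python
import random

# Toy "spec" learned from seeds: each property maps its
# possible values to probabilities (illustrative numbers).
spec = {
    "sentiment": {"positive": 0.5, "negative": 0.3, "neutral": 0.2},
    "formality": {"formal": 0.4, "casual": 0.6},
}

def sample_properties(spec):
    # Step 1: draw one value per property from its distribution.
    return {
        prop: random.choices(list(dist), weights=list(dist.values()))[0]
        for prop, dist in spec.items()
    }

def generate_content(props):
    # Hypothetical stand-in for the model-backed generation call.
    return f"<a {props['formality']}, {props['sentiment']} sample>"

def generate_sample(spec):
    # Step 2: generate content matching the sampled properties.
    props = sample_properties(spec)
    return {"properties": props, "text": generate_content(props)}

print(generate_sample(spec))
```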
Data properties and distributions
What are data properties?
Data properties (also called “axes”) are the dimensions of variation in your data. Dataframer automatically discovers these from seeds. Examples:
- Product reviews: Formality (formal/casual), Sentiment (positive/negative/neutral), Product category (electronics/clothing/food)
- Code samples: Language (Python/JavaScript/Java), Framework (Django/Flask/React), Complexity (simple/moderate/complex)
- Legal documents: Document type (contract/brief/motion), Jurisdiction (federal/state), Complexity level (basic/intermediate/advanced)
Base distributions
Each property has a base distribution: the default probability for each value. These are automatically generated during spec creation and can be edited manually in the spec editor.
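As an illustration of what a base distribution expresses (the structure below is a sketch, not the spec editor's actual format), each property carries a per-value probability table that the generator samples from:

```python
import random

# Illustrative base distribution for a "formality" property;
# the values and weights are examples, not Dataframer defaults.
formality_dist = {"formal": 0.3, "casual": 0.6, "mixed": 0.1}

assert abs(sum(formality_dist.values()) - 1.0) < 1e-9  # probabilities sum to 1

# Every generated sample draws its formality from this table.
value = random.choices(list(formality_dist), weights=list(formality_dist.values()))[0]
print(value)
```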
Conditional distributions
Advanced feature: model dependencies between properties. These are automatically generated when enabled during spec creation and can be edited manually in the spec editor. Example: if the language is Python, the framework distribution shifts toward Python frameworks such as Django or Flask. Conditional distributions capture:
- Compatibility constraints (Python → .py extension)
- Domain dependencies (Medical records → medical terminology)
- Realistic co-occurrence patterns
Conditional distributions are optional. Enable “Include conditional distributions” during spec creation to use this feature.
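Structurally, a conditional distribution replaces a single probability table with one table per parent value. The sketch below (illustrative structure, not Dataframer's stored format) shows the Python/framework dependency from the example above:

```python
import random

# Base distribution used when no condition applies.
framework_base = {"Django": 0.25, "Flask": 0.25, "React": 0.5}

# Conditional distributions keyed on the "language" property.
framework_given_language = {
    "Python": {"Django": 0.5, "Flask": 0.5, "React": 0.0},
    "JavaScript": {"Django": 0.0, "Flask": 0.0, "React": 1.0},
}

def sample_framework(language):
    # Respect the language dependency if one exists, else fall back.
    dist = framework_given_language.get(language, framework_base)
    return random.choices(list(dist), weights=list(dist.values()))[0]

print(sample_framework("Python"))  # always Django or Flask
print(sample_framework("Java"))    # falls back to the base distribution
```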
Generation mode
We recommend using Long Samples mode for all workloads. It handles both short and long content with advanced features like revisions, outlines, and validation. You can select this when creating a run.
Long Samples Mode
How it works (a code sketch follows this list):
- Outline model creates a document blueprint with parts (200-1000 tokens each)
- Generation model creates each part independently
- Parts are concatenated with intelligent boundaries
- Revision model performs quality improvement cycles
- Final output after all revisions complete
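Schematically, the mode behaves like the loop below. The three `call_*` functions are hypothetical stand-ins for the configured outline, generation, and revision models:

```python
# Hypothetical stand-ins for the configured models (not a real API).
def call_outline_model(properties):
    return ["intro", "body", "conclusion"]  # blueprint: one spec per part

def call_generation_model(part_spec, properties):
    return f"[{part_spec} written to match {properties}]"

def call_revision_model(draft, properties):
    return draft  # a real pass would fix coherence, consistency, etc.

def long_samples_generate(properties, max_revision_cycles=1):
    """Sketch of Long Samples mode: outline -> parts -> concatenate -> revise."""
    outline = call_outline_model(properties)           # 1. blueprint of parts
    parts = [call_generation_model(p, properties)      # 2. each part independently
             for p in outline]
    draft = "\n\n".join(parts)                         # 3. concatenate parts
    for _ in range(max_revision_cycles):               # 4. revision cycles
        draft = call_revision_model(draft, properties)
    return draft

print(long_samples_generate({"tone": "formal"}))
```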
Configuration options:
- Outline model: Creates structure (can be a faster/cheaper model)
- Generation model: Produces content
- Enable revisions: Turn on quality improvement (recommended)
- Revision model: Performs refinements
- Max revision cycles: 1-5 passes (1 = good balance, 3+ = high quality)
- SQL validation level: For SQL generation (syntax / syntax+schema / syntax+schema+execute)
- Seed shuffling: Control randomization (sample/field/prompt levels)
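Collected into a single configuration, the options above might look like the following sketch; every key name and model label is illustrative, since the real options are set in the run UI rather than through this schema:

```python
# Illustrative run configuration for Long Samples mode.
# All key names and model labels are hypothetical.
run_config = {
    "outline_model": "claude-haiku-4.5",      # structure; faster/cheaper is fine
    "generation_model": "claude-sonnet-4.5",  # produces the content
    "enable_revisions": True,                  # recommended
    "revision_model": "claude-sonnet-4.5",
    "max_revision_cycles": 1,                  # 1 = good balance, 3+ = high quality
    "sql_validation": "syntax+schema",         # only relevant for SQL generation
    "seed_shuffling": "sample",                # sample / field / prompt levels
}
```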
Revision passes improve:
- Part concatenation - smooth boundaries
- Coherence & flow - logical progression
- Consistency - fix contradictions
- Distinguishability - reduce AI “tells”
- Conformance - match sampled properties
Long samples mode works for all content lengths and provides better quality through revisions and validation.
Generation objectives
During spec creation, you can provide natural language guidance to influence property discovery (for example, asking the spec to emphasize a particular tone, domain, or complexity range).
Extrapolation settings
Control how creative spec creation should be:
Generate new data properties (default: OFF):
- When enabled: Discovers properties not explicit in seeds
- Example: Seeds have colors → adds “brightness” or “saturation”
Generate new property values (default: OFF):
- When enabled: Suggests values not present in seeds
- Example: Seeds have red/blue → adds green, yellow, purple
Extrapolation is OFF by default for conservative generation. Turn it ON if you want the system to discover implicit properties and suggest new values beyond what’s in your seeds.
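To make the two toggles concrete, here is an illustrative before/after for the color example above (the exact output depends on your seeds; this is not guaranteed behavior):

```python
# Extrapolation OFF (default): the spec stays within what the seeds show.
spec_conservative = {"color": ["red", "blue"]}

# Both extrapolation toggles ON: the spec may add unseen values and
# implicit properties (illustrative result, not guaranteed output).
spec_extrapolated = {
    "color": ["red", "blue", "green", "yellow", "purple"],  # new values
    "brightness": ["light", "dark"],                        # new property
}
```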
Model selection
Dataframer supports multiple LLMs for different tasks:

| Model | Speed | Cost | Quality | Best For |
|---|---|---|---|---|
| Claude Sonnet 4.5 | Medium | Medium | Excellent | Default choice (recommended) |
| Claude Haiku 4.5 | Fast | Low | Good | High volume, testing, evaluation |
| Claude Sonnet 4.5 Thinking | Slow | High | Outstanding | Spec creation, complex analysis |
| DeepSeek V3.1 | Medium | Very Low | Good | Generating finetuning data w/o license restrictions |
| Kimi K2 Instruct | Fast | Low | Good | Generating finetuning data w/o license restrictions |
| Qwen 2.5 72B | Fast | Low | Good | Generating finetuning data w/o license restrictions |
Quality, cost, and generation-time trade-offs (Long Samples mode)
You can mix models within the same run to optimize cost. Recommendations:
- Use Sonnet Thinking for spec creation, outlines, and revisions
- Use Sonnet for generation in most cases (balanced quality/cost)
- Use Haiku for quick experiments
- Revision cycles: 0 (cheap, fast) → 5 (expensive, slow)
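Following those recommendations, a per-task model mix for one run could look like this sketch (task keys are illustrative; models are drawn from the table above):

```python
# Illustrative per-task assignment following the recommendations above.
model_mix = {
    "spec_creation": "Claude Sonnet 4.5 Thinking",
    "outline": "Claude Sonnet 4.5 Thinking",
    "generation": "Claude Sonnet 4.5",
    "revision": "Claude Sonnet 4.5 Thinking",
}

# For quick experiments, swap everything to Haiku.
quick_experiment_mix = {task: "Claude Haiku 4.5" for task in model_mix}
```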

