Overview
DataFramer generates synthetic data using distribution-based generation. Instead of templates or rules, it infers patterns and probability distributions from your seed data (or from your description in seedless mode), then generates new samples that match those distributions. A sample is the unit that gets imitated: in CSV/JSONL it’s a row, in multiple files it’s a file, in multiple folders it’s a folder.
Three-stage pipeline
Seed Data and Generation Objectives
Provide examples that define what “good” looks like, or describe in text what dataset you want.
Specification
AI analyzes your seeds and objectives to infer:
- Data structure and format
- Properties (tone, sentiment, domain, length, etc.)
- Probability distributions for each property (optional)
- Dependencies between properties (optional)
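To make the four inferred items concrete, here is a minimal sketch of how such a specification could be represented in Python. The class names (`Property`, `Conditional`, `Spec`), the field names, and all probabilities are hypothetical and chosen only for illustration; this is not DataFramer's actual spec format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Property:
    """One data property (axis) with an optional base distribution."""
    name: str
    values: list[str]
    base: dict[str, float] | None = None  # value -> probability (optional)


@dataclass
class Conditional:
    """Optional dependency: distribution of `target` when `given` == `given_value`."""
    given: str
    given_value: str
    target: str
    distribution: dict[str, float]


@dataclass
class Spec:
    """Hypothetical container for what the specification stage infers."""
    data_format: str
    properties: list[Property] = field(default_factory=list)
    conditionals: list[Conditional] = field(default_factory=list)

    def validate(self) -> None:
        """Check that every distribution uses known values and sums to 1."""
        known = {p.name: set(p.values) for p in self.properties}
        dists = [(p.name, p.base) for p in self.properties if p.base is not None]
        dists += [(c.target, c.distribution) for c in self.conditionals]
        for name, dist in dists:
            if name not in known:
                raise ValueError(f"unknown property: {name}")
            unknown = set(dist) - known[name]
            if unknown:
                raise ValueError(f"{name}: unknown values {sorted(unknown)}")
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                raise ValueError(f"{name}: probabilities must sum to 1")


# Illustrative spec for code samples; the numbers are made up.
spec = Spec(
    data_format="jsonl",
    properties=[
        Property("language", ["Python", "JavaScript", "Java"],
                 base={"Python": 0.5, "JavaScript": 0.3, "Java": 0.2}),
        Property("framework", ["Django", "Flask", "React"]),  # no base distribution
    ],
    conditionals=[
        Conditional("language", "Python", "framework",
                    {"Django": 0.6, "Flask": 0.4}),
    ],
)
spec.validate()
```

Keeping distributions and dependencies as optional fields mirrors the "(optional)" markers in the list above: a spec with only structure and properties is still valid.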
Data properties and distributions
What are data properties?
Data properties (also called “axes”) are the dimensions of variation in your data. DataFramer automatically discovers these from seeds or your generation objectives. Examples:
- Product reviews: Formality (formal/casual), Sentiment (positive/negative/neutral), Product category (electronics/clothing/food)
- Code samples: Language (Python/JavaScript/Java), Framework (Django/Flask/React), Complexity (simple/moderate/complex)
- Legal documents: Document type (contract/brief/motion), Jurisdiction (federal/state), Complexity level (basic/intermediate/advanced)
Base distributions
Each property has a base distribution: the default probability for each value. Base distributions are generated automatically during spec creation and can be edited manually in the spec editor.
Conditional distributions
Conditional distributions model dependencies between properties. By default they are generated automatically during spec creation, and they can be edited manually in the spec editor. Example: if the language is Python, the framework distribution changes to match. Conditional distributions can capture:
- Compatibility constraints (Python → .py extension)
- Domain dependencies (Medical records → medical terminology)
- Realistic co-occurrence patterns
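The interplay between base and conditional distributions can be sketched with a small sampler. Everything here is an assumption made for illustration: the probabilities are invented, the dictionary layout is hypothetical, and the code shows the general technique (draw the parent property first, then draw the dependent property from the distribution conditioned on it) rather than DataFramer's internal implementation.

```python
import random

# Base distribution for the parent property (illustrative numbers).
BASE_LANGUAGE = {"Python": 0.6, "JavaScript": 0.4}

# Conditional distribution: framework probabilities depend on the language.
FRAMEWORK_GIVEN_LANGUAGE = {
    "Python": {"Django": 0.6, "Flask": 0.4},
    "JavaScript": {"React": 1.0},
}

# A compatibility constraint is a conditional with a single certain outcome.
EXTENSION_GIVEN_LANGUAGE = {"Python": ".py", "JavaScript": ".js"}


def draw(dist, rng):
    """Draw one value from a {value: probability} mapping."""
    values = list(dist)
    weights = list(dist.values())
    return rng.choices(values, weights=weights, k=1)[0]


def sample_combination(rng):
    """Sample the parent first, then each dependent property given the parent."""
    language = draw(BASE_LANGUAGE, rng)
    framework = draw(FRAMEWORK_GIVEN_LANGUAGE[language], rng)
    return {
        "language": language,
        "framework": framework,
        "extension": EXTENSION_GIVEN_LANGUAGE[language],
    }


rng = random.Random(0)  # seeded so runs are repeatable
samples = [sample_combination(rng) for _ in range(1000)]
```

Because the framework is always drawn from the distribution for the sampled language, incompatible combinations such as Python with React never appear in `samples`.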
Generation objectives
During spec creation, you can provide natural-language guidance to influence property discovery.
Extrapolation settings
Control how creative spec creation should be.
Generate new data properties (default: OFF):
- When enabled: Discovers properties not explicit in seeds
- Example: Seeds have colors → adds “brightness” or “saturation”
- When enabled: Suggests values not present in seeds
- Example: Seeds have red/blue → adds green, yellow, purple
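The effect of value extrapolation on a single property can be pictured with a short sketch. The function name `candidate_values`, the flag name `extrapolate_values`, and the list of suggested values are all hypothetical; the code only illustrates the behavior described in the bullets above (seed values alone when the setting is off, seed values plus suggestions when it is on), not how DataFramer implements it.

```python
def candidate_values(seed_values, suggested_values, *, extrapolate_values):
    """Return the values a property may take.

    Seed values always come first. Suggested values are appended only when
    extrapolation is enabled, and duplicates are skipped.
    """
    allowed = list(dict.fromkeys(seed_values))  # de-duplicate, keep order
    if extrapolate_values:
        for value in suggested_values:
            if value not in allowed:
                allowed.append(value)
    return allowed


seeds = ["red", "blue"]
suggestions = ["green", "yellow", "purple"]  # stand-in for model suggestions

off = candidate_values(seeds, suggestions, extrapolate_values=False)
on = candidate_values(seeds, suggestions, extrapolate_values=True)
```

With the colors from the example above, `off` stays at the two seed colors while `on` also includes green, yellow, and purple.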
Model Selection
DataFramer typically offers the latest versions of Gemini and Claude models, as well as a selection of open-source models (DeepSeek, etc.). The “thinking” versions of the models have reasoning enabled, and you should typically enable it when the option is available. You can assign different models to different roles within the same run to maximize quality and minimize cost. Read more on model choice in the Complete Workflow article.
Next steps
Complete Workflow
Step-by-step guide for all features

