Upload seed data (optional)
You can upload data serving as examples of what a “good” generated sample means for you. If you don’t have such data, skip to the Create Specifications section.
Choosing upload mode
DataFramer supports three upload modes depending on your data structure:
Single File Mode
When to use: Structured datasets in tabular or line-delimited format
Supported formats:
- CSV: Tabular data with optional headers
- JSONL: One flat JSON object per line (no nesting)
- JSON: Array of flat objects only (no nesting)
Constraints:
- Max file size: 50MB
- Max columns/fields: 40
Examples:
- Product reviews CSV with columns: review_text, rating, date
- Chat conversations JSONL with fields: user_message, assistant_response, tone
- API responses JSON with array of objects
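The Single File Mode limits above can be checked locally before uploading. The following is a minimal sketch, not part of DataFramer: the function names are hypothetical, "50MB" is assumed to mean 50 MiB, and for CSV only the first row is counted as columns.

```python
import csv
import io
import json

MAX_BYTES = 50 * 1024 * 1024  # assumption: "50MB" is interpreted as 50 MiB
MAX_FIELDS = 40


def is_flat(obj):
    """True if obj is a JSON object whose values are all scalars (no nesting)."""
    return isinstance(obj, dict) and not any(
        isinstance(v, (dict, list)) for v in obj.values()
    )


def check_single_file(name, data):
    """Return a list of problems found in `data` (bytes) for the file `name`."""
    problems = []
    if len(data) > MAX_BYTES:
        problems.append("file exceeds 50MB")
    text = data.decode("utf-8")
    suffix = name.rsplit(".", 1)[-1].lower()
    if suffix == "csv":
        # Count columns from the first row only.
        first_row = next(csv.reader(io.StringIO(text)), [])
        field_counts = [len(first_row)]
    elif suffix == "jsonl":
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
        if not all(is_flat(r) for r in rows):
            problems.append("JSONL rows must be flat objects")
        field_counts = [len(r) for r in rows if isinstance(r, dict)]
    elif suffix == "json":
        parsed = json.loads(text)
        rows = parsed if isinstance(parsed, list) else []
        if not isinstance(parsed, list) or not all(is_flat(r) for r in rows):
            problems.append("JSON must be an array of flat objects")
        field_counts = [len(r) for r in rows if isinstance(r, dict)]
    else:
        return problems + [f"unsupported format: .{suffix}"]
    if any(n > MAX_FIELDS for n in field_counts):
        problems.append("more than 40 columns/fields")
    return problems
```

An empty list means the file passes these local checks; the upload itself may still apply additional validation.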
Multiple Files Mode
When to use: Collection of independent samples, each in its own file
Supported formats: TXT, MD, JSON, CSV, JSONL, PDF
Constraints:
- Min 2 files required
- Max 1,000 files
- 1MB per file
- 50MB total
- All files must use the same format
Examples:
- Folder of 100 product descriptions (100 .txt files)
- Collection of Python functions (100 .py files)
- Set of news articles (100 .md files)
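The Multiple Files Mode constraints can likewise be verified before upload. This is a rough sketch under stated assumptions: `check_multiple_files` is a hypothetical helper, sizes are interpreted as MiB, and "same format" is approximated by comparing file extensions.

```python
from pathlib import PurePath

MB = 1024 * 1024  # assumption: limits are interpreted as MiB


def check_multiple_files(files):
    """files: list of (filename, size_in_bytes). Returns a list of problems."""
    problems = []
    if len(files) < 2:
        problems.append("at least 2 files are required")
    if len(files) > 1000:
        problems.append("more than 1,000 files")
    if any(size > 1 * MB for _, size in files):
        problems.append("a file exceeds 1MB")
    if sum(size for _, size in files) > 50 * MB:
        problems.append("total size exceeds 50MB")
    # Approximate "same format" by comparing lowercase file extensions.
    extensions = {PurePath(name).suffix.lower() for name, _ in files}
    if len(extensions) > 1:
        problems.append("files use mixed formats: " + ", ".join(sorted(extensions)))
    return problems


# Example input; for a real folder the list could be built with
# [(p.name, p.stat().st_size) for p in pathlib.Path("seeds").iterdir() if p.is_file()]
# where "seeds" is a placeholder directory name.
print(check_multiple_files([("a.txt", 10), ("b.txt", 20)]))
```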
Multiple Folders Mode
When to use: Multi-file samples where each sample consists of related files
Supported formats: MD, TXT, JSON, CSV, JSONL, PDF
Constraints:
- Minimum 2 folders required
- Max 20 files per folder
- Max 1,000 files total
- 1MB per file
- 50MB total across all folders
- Max depth: parent/subfolder/file.txt (3 levels)
Examples:
- Code repositories (homogeneous): Each folder contains main.py + utils.py + config.json
- Mixed document sets (heterogeneous): Folder 1 has report.pdf, Folder 2 has article.md + references.json, Folder 3 has presentation.pptx + notes.txt
- Flexible data samples (heterogeneous): Folder 1 has data.csv, Folder 2 has schema.sql + queries.sql, Folder 3 has output.json
Seed data best practices: quality over quantity
Generation heavily depends on the quality of the seeds.
- Of course, “quality” is defined relative to what you want to achieve: if you want to generate data with imperfections, your seed examples should display those imperfections.
- In some cases you can fix seed issues by using the “Generation Objectives” feature, but in general it is recommended to have very high-quality seed examples.
- The minimum is 2 samples, and even 2 samples is often enough, but loading a more substantial number of seed examples is encouraged.
- More samples improve inference of data properties and distributions.
Create specifications
Specification creation workflow
- Navigate to your uploaded dataset
- Click Create Spec
- Configure spec settings
- Submit and wait for spec generation to finish (1-5 minutes)
- Review generated spec
- Edit if needed
Configuration options
Generation Objectives
Natural language guidance to influence property discovery.
Purpose: Help the analyzer understand what matters in your data
Examples:
(Last example only works if “Generate probability distributions” is enabled)
(This example works if “Include conditional distributions” is enabled - creates dependencies between properties)
How objectives flow into generation:
Generation objectives influence the specification contents (both shared properties and variable properties). The spec in turn influences the distribution of generated data. Objectives only indirectly affect generation through the spec.
Important: After spec creation completes, always review the generated spec to verify your objectives were correctly captured. If the spec doesn’t match your intent, edit it manually or regenerate with refined objectives.
Best practices:
- Be specific about properties you want captured
- Explicitly exclude properties you don’t want
- Feel free to list specific values, or give instructions for generating those values
- Suggest probability adjustments if you have strong preferences
- Review the generated spec to ensure objectives were met
Spec Generation Model
Model used to analyze seeds (if present) and create the specification.
The default is Claude Sonnet (thinking variant). Claude Opus or Gemini 3 are also good choices. Gemini 3 has a tendency to generate fewer data properties.
Generate Probability Distributions
Create explicit probability distributions for each property value.
Default: ON
When enabled:
- Each property gets probabilities, for example formal: 0.6, casual: 0.4
- Properties are sampled independently using these probabilities (unless conditional distributions are enabled)
- Enables more controlled generation
When disabled:
- Properties are discovered, but no explicit probabilities are assigned
- Generation samples uniformly from observed values, unless you manually add probabilities in the spec
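To illustrate what independent sampling from per-property probabilities means, here is a small sketch. It is not DataFramer's implementation: the `spec` dictionary layout, the `length` property, and the function names are assumptions made for illustration; only the formal/casual probabilities come from the text above.

```python
import random

# Hypothetical spec fragment: property name -> {value: probability}.
spec = {
    "tone": {"formal": 0.6, "casual": 0.4},
    "length": {"short": 0.5, "long": 0.5},  # invented property for illustration
}


def sample_independently(spec, rng):
    """Draw one value per property, each from its own distribution."""
    sample = {}
    for prop, dist in spec.items():
        values = list(dist)
        weights = list(dist.values())
        sample[prop] = rng.choices(values, weights=weights, k=1)[0]
    return sample


def uniform_distribution(observed_values):
    """Fallback when no probabilities are given: equal weight per observed value."""
    return {value: 1 / len(observed_values) for value in observed_values}


rng = random.Random(0)  # seeded only to make the sketch reproducible
print(sample_independently(spec, rng))
print(uniform_distribution(["red", "blue"]))
```

Because each property is drawn on its own, the chance of a particular combination is simply the product of the individual probabilities.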
Include Conditional Distributions
Model dependencies between properties.
Default: ON
Requires: “Generate probability distributions” must be ON
What are conditional distributions:
Conditional distributions override the default probabilities based on previously selected property values. Each property has:
- Base probabilities: Default probabilities used when no condition matches
- Conditional probabilities: Alternative probabilities used when specific conditions are met
How sampling works:
Properties are sampled sequentially in the order they appear in the spec. For each property:
- Check if any conditional rule applies (based on already-selected property values)
- If a matching conditional rule is found, use those probabilities
- If no conditional rule matches, fall back to base probabilities
- If multiple conditional rules could apply, the first matching one is used (order defined by spec)
Example: when the condition Language: Python applies, Framework uses [0.4, 0.4, 0.1, 0.1, 0.0] instead of its base probabilities.
When to enable:
- Dataset has obvious correlations
- You need compatibility constraints (file extension matches language)
Trade-offs:
- More complex to edit in the UI: deleting a property requires deleting all other properties that depend on it, and there are more values to view and edit
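The sequential, first-match sampling procedure described above can be sketched as follows. This is an illustrative approximation, not DataFramer's code: the `SPEC` structure, the framework names, and the base probabilities are invented. Only the conditional vector [0.4, 0.4, 0.1, 0.1, 0.0] for Language: Python comes from the text.

```python
import random

# Hypothetical spec: properties listed in sampling order. Each has base
# probabilities and an ordered list of (condition, probabilities) rules.
SPEC = [
    {
        "name": "Language",
        "values": ["Python", "JavaScript"],
        "base": [0.5, 0.5],
        "conditional": [],
    },
    {
        "name": "Framework",
        "values": ["Django", "Flask", "FastAPI", "None", "React"],  # invented
        "base": [0.1, 0.1, 0.1, 0.2, 0.5],  # invented base probabilities
        "conditional": [
            ({"Language": "Python"}, [0.4, 0.4, 0.1, 0.1, 0.0]),
        ],
    },
]


def pick_probabilities(prop, selected):
    """Return the first matching conditional rule, else the base probabilities."""
    for condition, probs in prop["conditional"]:
        if all(selected.get(key) == value for key, value in condition.items()):
            return probs  # first matching rule wins
    return prop["base"]


def sample(spec, rng):
    """Sample properties in spec order; later ones can depend on earlier ones."""
    selected = {}
    for prop in spec:
        probs = pick_probabilities(prop, selected)
        selected[prop["name"]] = rng.choices(prop["values"], weights=probs, k=1)[0]
    return selected


print(sample(SPEC, random.Random(0)))
```

In this sketch a probability of 0.0 acts as a compatibility constraint: once Python is selected, the last framework value can never be drawn.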
Generate New Data Properties
Discover properties not explicitly present in seeds.
Default: OFF
When enabled:
- Example: Seeds have colors but don’t differ in brightness → spec still contains a new brightness property to create a different type of variation
When disabled:
- Only the types of variation explicitly present in seeds are included
- Example: Seeds have colors but don’t differ in brightness → spec will not contain a brightness property, only a color property.
Generate New Property Values
Suggest values not present in seeds for each property.
Default: ON
When enabled:
- Expands possible values beyond seeds
- Example: Seeds have red/blue → suggests green/yellow/purple
- Example: Seeds have Python/JavaScript → suggests Java/Go/Rust
When disabled:
- Use when you want generation limited to observed values
- Example: Seeds have Python/JavaScript → spec programming language property only contains Python/JavaScript
Editing specifications
Once created, specs can be edited:
- Add new properties and remove existing ones
- Add/remove property values
- Edit the probability distributions
- Create and edit conditional probability distributions
Create runs
Navigate to Runs → Create Run → Select spec and version.
Run configuration
Number of Samples
Range: 1-20,000
Generation Model
Model used to generate document parts.
This model role is less sensitive to model intelligence than the others. Feel free to pick a cheaper model such as Claude Haiku here if your data is very simple and you want to minimize cost.
Outline Model
Model used to create the document blueprint.
Default: Claude Sonnet (thinking variant). Claude Opus or Gemini 3 are also good choices.
Enable Revisions
Turn on quality improvement cycles after generation. When enabled, you can select which revision types to apply:
- Coherence & Flow — reviews for formatting issues, artifacts, and global coherence. Ensures smooth transitions between sections.
- Consistency — checks facts, names, dates, numbers, terminology, and style consistency. Fixes fake or broken references.
- Distinguishability — ensures generated content matches the style and formatting patterns of the example data, making it blend in naturally with seed examples. Only available for specs based on seed datasets.
- Conformance — verifies the document satisfies shared requirements from properties and does not contradict the sampled attribute values.
Enable Filtering
Quality gates that reject and regenerate samples with severe issues. When enabled, you can select which filtering types to apply:
- Structural Filtering — checks for severe structural issues like invalid formatting, generation artifacts, or major structural problems. Documents that fail are regenerated.
- Conformance Filtering — verifies generated documents conform to the target properties and requirements. Documents with clear property violations are regenerated.
Revision & Filtering Model
Model used to perform quality improvements, filtering, and part concatenation.
Recommended: Claude Sonnet, Claude Opus, or Gemini 3. Weak models are not recommended, as revision is a complex task.
Only used if revisions or filtering are enabled.
Max Revision Cycles
Number of quality improvement passes.
Range: 1-5
For the highest quality, or whenever you see issues with generated data, increase this setting to 3 or more. Note that this increases cost and generation time.
Thinking Budget
Token budget for extended thinking mode. Only appears when at least one selected model is a thinking-capable variant (e.g. models ending in -thinking).
Minimum: 1024 tokens. Default: 1024 tokens.
Higher budgets allow the model to reason more before generating, which can improve quality for complex data at the expense of higher latency and cost.
Skip Outline and Part-by-Part Generation
Generate a document draft directly instead of the usual outline → parts → concatenation pipeline.
Default: OFF
Faster and cheaper, but only suitable for shorter documents. For long or complex documents, keep this off to maintain quality.
Unified Multifield Generation
For multi-field and multi-folder datasets, generate all fields/files in a single pass instead of sequentially.
Default: ON
Faster and cheaper, but may produce worse results for very long or very complex fields.
Enable Calculator Tool
Give the LLM access to a sandboxed Python calculator during generation.
Default: OFF
Useful when data contains numerical calculations that need to be internally consistent (e.g. financial reports, invoices, scientific data). Roughly doubles cost and generation time.
Advanced: Seed Shuffling
Control how seed examples are shuffled when composing generation prompts. A slider with four levels:
- Level 0 — No shuffling (default, strongly recommended): Maximum prompt caching efficiency, minimum diversity.
- Level 1 — Shuffle across samples: Seeds are shuffled between different samples.
- Level 2 — Shuffle across fields: Seeds are also shuffled between fields (multi-field datasets only).
- Level 3 — Shuffle in all prompts: Maximum diversity, minimum caching efficiency.
Advanced: Max Examples in Prompt
Limit the number of seed examples shown to the model.
Default: As many seeds as possible are packed into a 10K-token allocation.
The 10K-token default exists to limit generation cost. You can override it with an integer X that caps the number of examples supplied to every generation prompt. This completely overrides the 10K cap in all cases: the X examples you allow determine the allocation of tokens for seeds, which may be more or less than 10K.
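The interaction between the default token allocation and an explicit example cap might be pictured like this. It is a simplified guess at the behavior described above, not the actual algorithm: `estimate_tokens` is a crude stand-in for a real tokenizer, and the greedy in-order packing is an assumption.

```python
def estimate_tokens(text):
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def select_seeds(seeds, max_examples=None, token_budget=10_000):
    """Pick the seed examples to place in a generation prompt."""
    if max_examples is not None:
        # An explicit cap replaces the token budget entirely, so the
        # selected seeds may use more or fewer than 10K tokens.
        return seeds[:max_examples]
    chosen, used = [], 0
    for seed in seeds:
        cost = estimate_tokens(seed)
        if used + cost > token_budget:
            break  # assumption: stop at the first seed that does not fit
        chosen.append(seed)
        used += cost
    return chosen


seeds = ["x" * 20_000] * 5  # five seeds of about 5,000 estimated tokens each
print(len(select_seeds(seeds)))                  # default budget
print(len(select_seeds(seeds, max_examples=4)))  # explicit cap
```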
Monitor runs and access generated data
Click a run to see detailed information:
Overview Tab:
- Configuration parameters (models, settings used)
- Spec and dataset references
- Metrics: success rate, failure rate, duration, cost
- File explorer
- LLM-as-a-judge assigned labels for samples
- Manually label samples with key-value annotation tags: these tags are saved and downloaded together with the data
- Download individual files or all files together
- Distribution Analysis: Expected vs observed distributions (bar charts) for different properties
- Chat Interface: Ask questions about generated dataset
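If you download the generated samples with their labels, an expected-versus-observed comparison similar to the Distribution Analysis view can be reproduced offline. The sketch below assumes you already have a list of label values for one property; the function names and the example numbers are hypothetical.

```python
from collections import Counter


def observed_distribution(labels):
    """Relative frequency of each label value in the generated samples."""
    counts = Counter(labels)
    total = len(labels)
    return {value: count / total for value, count in counts.items()}


def compare(expected, observed):
    """Return {value: observed - expected} for every value in either dict."""
    values = set(expected) | set(observed)
    return {v: observed.get(v, 0.0) - expected.get(v, 0.0) for v in sorted(values)}


expected = {"formal": 0.6, "casual": 0.4}    # probabilities taken from a spec
labels = ["formal"] * 7 + ["casual"] * 3     # invented labels for 10 samples
print(compare(expected, observed_distribution(labels)))
```

Positive differences mark values that were generated more often than the spec requested, negative differences mark values that were under-represented.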

