Core Concepts

Overview

Dataframer generates synthetic data using distribution-based generation. Instead of templates or rules, it infers patterns and probability distributions from your seed data (or from your description in seedless mode), then generates new samples that match those distributions. A sample is the unit that gets imitated: in CSV/JSONL it’s a row, in multiple files it’s a file, in multiple folders it’s a folder.

Three-stage pipeline

Seed Data and Generation Objectives

Provide examples that define what “good” looks like, or describe in text what dataset you want.

Specification

AI analyzes your seeds and objectives to infer:

Data structure and format
Properties (tone, sentiment, domain, length, etc.)
Probability distributions for each property (optional)
Dependencies between properties (optional)

Generation

Generate new samples at scale:

Sample property values from distributions
Generate content matching those properties
Validate and refine for quality

Data properties and distributions

What are data properties?

Data properties (also called “axes”) are the dimensions of variation in your data. Dataframer automatically discovers these from seeds or your generation objectives. Examples:

Product reviews: Formality (formal/casual), Sentiment (positive/negative/neutral), Product category (electronics/clothing/food)
Code samples: Language (Python/JavaScript/Java), Framework (Django/Flask/React), Complexity (simple/moderate/complex)
Legal documents: Document type (contract/brief/motion), Jurisdiction (federal/state), Complexity level (basic/intermediate/advanced)

Base distributions

Each property has a base distribution - the default probability for each value. These are automatically generated during spec creation and can be edited manually in the spec editor. Example:

property: formality
possible_values: [formal, casual, technical]
base_probabilities: [0.5, 0.3, 0.2]

This means: 50% formal, 30% casual, 20% technical

You can edit these distributions in the spec editor after creation to adjust the mix of generated samples.

Conditional distributions

Advanced feature: Model dependencies between properties. These are automatically generated when enabled during spec creation and can be edited manually in the spec editor. Example: If language is Python, framework distribution changes:

property: framework
possible_values: [Django, Flask, FastAPI, Express, Spring]
base_probabilities: [0.2, 0.2, 0.2, 0.2, 0.2]
conditional_probabilities:
  language:
    Python: [0.35, 0.35, 0.30, 0.0, 0.0]  # Only Python frameworks
    JavaScript: [0.0, 0.0, 0.0, 0.7, 0.3]  # Only JS frameworks

Use cases:

Compatibility constraints (Python → .py extension)
Domain dependencies (Medical records → medical terminology)
Realistic co-occurrence patterns

Conditional distributions are optional. Enable “Include conditional distributions” during spec creation to use this feature.

Generation objectives

During spec creation, you can provide natural language guidance to influence property discovery. Examples:

Include writing style as a property

Tells the analyzer to explicitly capture writing style variations.

Don't consider length as a variable

Prevents length from being treated as a property (all outputs will vary naturally in length).

Add neutral sentiment alongside positive/negative

Suggests including neutral as a third sentiment value.

Make formal writing 80% likely, casual 20%

Suggests a specific probability distribution (only works if “Generate probability distributions” is enabled).

Generation objectives are optional but highly recommended. They help the AI understand what matters in your data.

Extrapolation settings

Control how creative the spec creation should be: Generate new data properties (default: OFF):

When enabled: Discovers properties not explicit in seeds
Example: Seeds have colors → adds “brightness” or “saturation”

Generate new property values (default: ON):

When enabled: Suggests values not present in seeds
Example: Seeds have red/blue → adds green, yellow, purple

Model Selection

We usually have latest versions of Gemini and Claude models, as well as a selection of open-source models (Deepseek, etc). The “thinking” versions of the models have reasoning enabled—you should typically enable it if there’s an option. You can assign different models for different roles within the same run to maximize quality and minimize cost. Read more on model choice in the Complete Workflow article.

Next steps

Complete Workflow

Step-by-step guide for all features

Main Docs

Tutorials

Integrations

Release Notes

Overview

Three-stage pipeline

Data properties and distributions

What are data properties?

Base distributions

Conditional distributions

Generation objectives

Extrapolation settings

Model Selection

Next steps

Complete Workflow

Main Docs

Tutorials

Integrations

Release Notes

​Overview

​Three-stage pipeline

​Data properties and distributions

​What are data properties?

​Base distributions

​Conditional distributions

​Generation objectives

​Extrapolation settings

​Model Selection

​Next steps

Complete Workflow

Overview

Three-stage pipeline

Data properties and distributions

What are data properties?

Base distributions

Conditional distributions

Generation objectives

Extrapolation settings

Model Selection

Next steps