Complete Workflow Guide

This comprehensive guide covers the complete workflow from seed data upload through result analysis.

Upload seed data (optional)

You can upload data serving as examples of what a “good” generated sample means for you. If you don’t have such data, skip to the Create Specifications section.

Choosing upload mode

Dataframer supports three upload modes depending on your data structure:

Single File Mode

When to use: Structured datasets in tabular or line-delimited formatSupported formats:

CSV: Tabular data with optional headers
JSONL: One flat JSON object per line (no nesting)
JSON: Array of flat objects only (no nesting)

Constraints:

Max file size: 50MB
Max columns/fields: 40

Example use cases:

Product reviews CSV with columns: review_text, rating, date
Chat conversations JSONL with fields: user_message, assistant_response, tone
API responses JSON with array of objects

Multiple Files Mode

When to use: Collection of independent samples, each in its own fileSupported formats: TXT, MD, JSON, CSV, JSONLConstraints:

Max 1,000 files
1MB per file
50MB total
All files must use same format

Structure: Flat folder of files, each file = one sampleExample use cases:

Folder of 100 product descriptions (100 .txt files)
Collection of Python functions (100 .py files)
Set of news articles (100 .md files)

Multiple Folders Mode

When to use: Multi-file samples where each sample consists of related filesSupported formats: MD, TXT, JSON, CSV, JSONLConstraints:

Minimum 2 folders required
Max 20 files per folder
Max 1,000 files total
1MB per file
50MB total across all folders
Max depth: parent/subfolder/file.txt (3 levels)

Structure: Parent folder → subfolders → files (each subfolder = one sample)Example use cases:

Code repositories (homogeneous): Each folder contains main.py + utils.py + config.json
Mixed document sets (heterogeneous): Folder 1 has report.pdf, Folder 2 has article.md + references.json, Folder 3 has presentation.pptx + notes.txt
Flexible data samples (heterogeneous): Folder 1 has data.csv, Folder 2 has schema.sql + queries.sql, Folder 3 has output.json

Seed data best practices: quality over quantity

Generation heavily depends on the quality of the seeds.

Of course, “quality” is defined relative to what you want to achieve—you may want to generate data with imperfections, then your seed examples should display those imperfections.
In some cases you can fix seed issues by using the “Generation Objectives” feature, but in general it is recommended to have very high-quality seed examples.

Number of samples:

2 samples is minimum, even 2 samples is often enough, but loading a more substantial number of seed examples is encouraged.
More samples improve inference of data properties and distributions.

Create specifications

Specification creation workflow

Navigate to your uploaded dataset
Click Create Spec
Configure spec settings
Submit and wait for spec generation to finish (1-5 minutes)
Review generated spec
Edit if needed

Configuration options

Generation Objectives

Natural language guidance to influence property discovery.Purpose: Help the analyzer understand what matters in your dataExamples:

Include writing style and formality as separate properties

Don't consider text length as a variable - let it vary naturally

Add neutral sentiment alongside positive/negative

Make formal tone 80% likely, casual 20%

(Last example only works if “Generate probability distributions” is enabled)

For Python code, use Django/Flask frameworks; for JavaScript code, use Express/React frameworks

(This example works if “Include conditional distributions” is enabled - creates dependencies between properties)How objectives flow into generation:Generation objectives influence the specification contents (both shared properties and variable properties). The spec in turn influences the distribution of generated data. Objectives only indirectly affect generation through the spec.Important: After spec creation completes, always review the generated spec to verify your objectives were correctly captured. If the spec doesn’t match your intent, edit it manually or regenerate with refined objectives.Best practices:

Be specific about properties you want captured
Explicitly exclude properties you don’t want
Feel free to list specific values, or give instructions for generating those values
Suggest probability adjustments if you have strong preferences
Review the generated spec to ensure objectives were met

Spec Generation Model

Model used to analyze seeds (if present) and create the specification.A powerful model such as latest Claude Opus is recommended here. You can also use Gemini Pro, however, it has the behavior of generating very few data properties.

Generate Probability Distributions

Create explicit probability distributions for each property value.Default: ONWhen enabled:

Each property gets probabilities: formal: 0.6, casual: 0.4
Properties are sampled independently using these probabilities (unless conditional distributions are enabled)
Enables more controlled generation

When disabled:

Properties discovered but no explicit probabilities
Generation samples uniformly from observed values, unless you manually add probabilities in the spec

Include Conditional Distributions

Model dependencies between properties.Default: OFFRequires: “Generate probability distributions” must be ONWhat are conditional distributions:Conditional distributions override the default probabilities based on previously selected property values. Each property has:

Base probabilities: Default probabilities used when no condition matches
Conditional probabilities: Alternative probabilities used when specific conditions are met

Example YAML structure:

axis: Framework
possible_values: [Django, Flask, Express, React, Spring]
base_probabilities: [0.2, 0.2, 0.2, 0.2, 0.2]  # Default: equal probability
conditional_probabilities:
  Language:
    Python: [0.4, 0.4, 0.1, 0.1, 0.0]  # If Language=Python: favor Django/Flask
    JavaScript: [0.0, 0.0, 0.45, 0.45, 0.1]  # If Language=JavaScript: favor Express/React
    Java: [0.0, 0.0, 0.0, 0.0, 1.0]  # If Language=Java: only Spring

How sampling works:Properties are sampled sequentially in the order they appear in the spec. For each property:

Check if any conditional rule applies (based on already-selected property values)
If a matching conditional rule is found, use those probabilities
If no conditional rule matches, fall back to base probabilities
If multiple conditional rules could apply, the first matching one is used (order defined by spec)

Example: If Language is sampled first as “Python”, then when Framework is sampled, the conditional rule Language: Python applies, so Framework uses [0.4, 0.4, 0.1, 0.1, 0.0] instead of base probabilities.When to enable:

Dataset has obvious correlations
You need compatibility constraints (file extension matches language)

Trade-offs:

More complex to edit in the UI: deleting a properties requires deleting all other properties that depend on it; there are more values to view and edit

Generate New Data Properties

Discover properties not explicitly present in seeds.Default: OFFWhen enabled:

Example: Seeds have colors but don’t differ in brightness → spec still contains a new brightness property to create a different type of variation

When disabled:

Only the types of variation explicitly present in seeds are included
Example: Seeds have colors but don’t differ in brightness → spec will not contain a brightness property, only a color property.

Generate New Property Values

Suggest values not present in seeds for each property.Default: ONWhen enabled:

Expands possible values beyond seeds
Example: Seeds have red/blue → suggests green/yellow/purple
Example: Seeds have Python/JavaScript → suggests Java/Go/Rust

When disabled:

Use when you want generation limited to observed values
Example: Seeds have Python/JavaScript → spec programming language property only contains Python/JavaScript

Editing specifications

Once created, specs can be edited:

Add new properties and remove existing ones
Add/remove property values
Edit the probability distributions
Create and edit conditional probability distributions

It only saves the spec once you click “Save”: this creates a new numbered version of the spec. You can select any version when configuring a run.

For single-file datasets containing SQL schema and query columns, these columns will be automatically recognized. This works if there is one SQL schema column and one or more query columns corresponding to the schema. Verify SQL column detection in Specs -> click on Spec -> “Advanced Settings” at the bottom of the page. If correct columns are selected there, all generated schemas and queries are guaranteed to be valid in MySQL, SQLite, and PostgreSQL.

Create runs

Navigate to Runs → Create Run → Select spec and version.

Run configuration

Number of Samples

Range: 1-20,000

Generation Model

Model used to generate document parts.This model role is less sensitive to model intelligence than others. Feel free to pick a cheaper model such as Claude Haiku here if data is very simple and you want to minimize cost.

Outline Model

Model used to create document blueprint.Recommended: Claude Opus or Gemini Pro.

Enable Revisions

Turn on quality improvement cycles after generation.Default: ON (recommended)It’s recommended to keep revisions ON when generating complex unstructured documents. However, for structured documents (json/csv) in single-file or multi-folder seed dataset types the recommendation is to turn it OFF due to higher cost and generation slowdown.When enabled:

Revision model performs quality passes
Multiple revision types including for coherence and for conformance to specified properties

When disabled:

Raw generated output without refinement
Faster and cheaper but works worse with very dense, complex or structured documents

Revision Model

Model used to perform quality improvements.Recommended: Claude Sonnet/Opus or Gemini 3 are all good choices. Weak models are not recommended as revision is a complex task.Only used if “Enable revisions” is ON

Max Revision Cycles

Number of quality improvement passes.Range: 1-5For highest quality or whenever you see any issues with generated data, increase this setting to 3+. Note this increases cost and generation time.

Advanced: Seed Shuffling

Whether and how much to shuffle seeds when composing various generation prompts.Strongly recommended: No shuffling (default)Introducing shuffling might increase diversity in some cases, but generation time and cost increase steeply.

Advanced: Max Examples in Prompt

Limit number of seed examples shown to model.Default: As many seeds as possible are packed into 10K tokens allocation.10K tokens is the default that limits generation cost. You can override it with an integer to cap examples supplied to every generation prompt. It completely overrides the 10K cap in all cases; the value X that you supply determines the allocation of tokens for seeds, which may be more or less than 10K.

Monitor runs and access generated data

Click a run to see detailed information: Overview Tab:

Configuration parameters (models, settings used)
Spec and dataset references
Metrics: success rate, failure rate, duration, iterations

Generated Dataset Tab:

File explorer
LLM-as-a-judge assigned labels for samples
Manually label samples with key-value annotation tags: these tags are saved and downloaded together with the data
Download individual files or all files together

Evaluation Tab (appears after completion):

Distribution Analysis: Expected vs observed distributions (bar charts) for different properties
Chat Interface: Ask questions about generated dataset

Main Docs

Tutorials

Integrations

Release Notes

Complete Workflow Guide

Upload seed data (optional)

Choosing upload mode

Seed data best practices: quality over quantity

Create specifications

Specification creation workflow

Configuration options

Editing specifications

Create runs

Run configuration

Monitor runs and access generated data

Main Docs

Tutorials

Integrations

Release Notes

​Upload seed data (optional)

​Choosing upload mode

​Seed data best practices: quality over quantity

​Create specifications

​Specification creation workflow

​Configuration options

​Editing specifications

​Create runs

​Run configuration

​Monitor runs and access generated data

Upload seed data (optional)

Choosing upload mode

Seed data best practices: quality over quantity

Create specifications

Specification creation workflow

Configuration options

Editing specifications

Create runs

Run configuration

Monitor runs and access generated data