Dataframer is a platform for generating high-quality synthetic datasets at scale. Give it sample data, and it will generate thousands of new samples that match the same patterns, distributions, and structure.

How it works

Dataframer uses a three-stage pipeline:

1. Upload Seed Data: Upload sample data that represents what you want to generate. A sample is the unit that gets imitated: in CSV/JSONL it’s a row, with multiple files it’s a file, with multiple folders it’s a folder. Seeds can be text documents, code files, SQL queries, or multi-file structures.
2. Create Specifications: AI analyzes your seeds to create a specification, a blueprint capturing data structure, properties, distributions, and patterns.
3. Run Generation: Generate thousands of new samples based on your specification, configuring quality settings, validation, and model selection.
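
Concretely, the end-to-end flow might look like the sketch below. It assumes a hypothetical Python client: the dataframer_client package and its class and method names are illustrative assumptions, not Dataframer’s actual API.

```python
# Minimal sketch of the three-stage flow, assuming a hypothetical Python
# client. The package, class, and method names below are illustrative,
# not Dataframer's actual API.
from dataframer_client import Client  # hypothetical package

client = Client(api_key="YOUR_API_KEY")

# 1. Upload seed data: each JSONL line is one sample to imitate.
seeds = client.upload_seeds("seed_samples.jsonl")

# 2. Create a specification: the blueprint of structure, properties,
#    distributions, and patterns inferred from the seeds.
spec = client.create_spec(seeds)

# 3. Run generation with quality and validation settings.
run = client.generate(spec, num_samples=5000, validate=True)
run.download("generated_samples.jsonl")
```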

Use cases

  • ML training data: Generate balanced, diverse datasets for model training
  • Testing & QA: Create realistic test sets and simulate edge cases
  • Text-to-SQL: Build training datasets with schema-aware SQL queries (see the example seed rows after this list)
  • Code generation: Produce code samples across languages and frameworks
  • Privacy compliance: Create synthetic datasets that preserve statistical properties without exposing PII
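
For the text-to-SQL case, a JSONL seed file might look like the following, where each line is one sample (one row) that generation will imitate. The field names here are illustrative, not a required schema.

```jsonl
{"question": "How many customers signed up in 2023?", "schema": "customers(id, name, signup_date)", "sql": "SELECT COUNT(*) FROM customers WHERE signup_date >= '2023-01-01' AND signup_date < '2024-01-01';"}
{"question": "List the five most recent orders.", "schema": "orders(id, customer_id, created_at)", "sql": "SELECT * FROM orders ORDER BY created_at DESC LIMIT 5;"}
```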

Key features

  • Distribution-based generation: Captures probability distributions and property dependencies across seed samples (see the first sketch after this list)
  • Quality validation: Iterative refinement with evaluation loops and revision cycles
  • SQL validation: Syntax, schema, and execution validation for generated queries (see the second sketch after this list)
  • Multi-format support: CSV, JSON, JSONL, text files, multi-file structures
  • Flexible model selection: Choose from multiple LLMs optimized for different tasks
  • Scalable: Generate many thousands of samples per run with distributed processing
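
To make the distribution idea concrete, here is a minimal sketch, independent of Dataframer’s internals: a specification can record the empirical distribution of a categorical property across the seeds and then sample new values with the same proportions. The property name and values are invented for illustration.

```python
import random
from collections import Counter

# Illustrative sketch, not Dataframer's implementation: record the
# empirical distribution of a property across seed samples, then draw
# new values with the same proportions.
seed_difficulties = ["easy", "easy", "medium", "hard", "medium", "easy"]

counts = Counter(seed_difficulties)
values, weights = zip(*counts.items())

# Property values for 1000 new samples; proportions converge to the
# seed distribution (~50% easy, ~33% medium, ~17% hard).
new_difficulties = random.choices(values, weights=weights, k=1000)
```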

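The second sketch shows what syntax, schema, and execution checks for generated SQL can look like using SQLite’s in-memory engine. It illustrates the validation stages in general, not Dataframer’s validator.

```python
import sqlite3

def validate_sql(sql: str, schema_ddl: str) -> bool:
    """Validate a generated query against a schema in an in-memory SQLite
    database: the DDL must apply (schema check), EXPLAIN must succeed
    (syntax and table/column resolution), and the query must execute."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)   # schema: DDL must apply cleanly
        conn.execute(f"EXPLAIN {sql}")   # syntax + name resolution
        conn.execute(sql).fetchall()     # execution against the empty schema
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

print(validate_sql(
    "SELECT COUNT(*) FROM customers WHERE signup_date >= '2023-01-01'",
    "CREATE TABLE customers (id INTEGER, name TEXT, signup_date TEXT);",
))  # True
print(validate_sql("SELECT * FROM missing_table", ""))  # False
```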
Next steps

  • Get started: Follow the quickstart guide to generate your first dataset.