Quickstart
Generate your first dataset in 5 minutes
Core Concepts
Understand how Dataframer works
Complete Guide
Step-by-step walkthrough of all features
API Reference
Programmatic access documentation
How it works
Dataframer uses a three-stage pipeline:
1. Upload Seed Data: Upload sample data that represents what you want to generate. A sample is the unit that gets imitated: in CSV/JSONL it is a row, in a multi-file upload it is a file, and in a multi-folder upload it is a folder. Samples can be text documents, code files, SQL queries, or multi-file structures.
2. Create Specifications: AI analyzes your seeds to create a specification, a blueprint that captures data structure, properties, distributions, and patterns.
3. Run Generation: Generate thousands of new samples based on your specification. Configure quality settings, validation, and model selection.
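The same three-stage flow can be driven programmatically. The sketch below is a minimal, illustrative example assuming a REST-style HTTP API; the base URL, endpoint paths, payload fields, and authentication header are assumptions made for illustration, not the documented interface, so consult the API Reference for the actual calls.

```python
# Illustrative sketch only: the base URL, endpoint paths, payload fields, and
# auth header are assumptions, not the documented Dataframer API (see API Reference).
import time
import requests

BASE_URL = "https://api.dataframer.example/v1"   # placeholder URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# 1. Upload seed data: each JSONL row is one sample to be imitated.
with open("seeds.jsonl", "rb") as f:
    seed = requests.post(f"{BASE_URL}/seeds", headers=HEADERS,
                         files={"file": ("seeds.jsonl", f)}).json()

# 2. Create a specification: a blueprint of structure, properties,
#    distributions, and patterns inferred from the seeds.
spec = requests.post(f"{BASE_URL}/specifications", headers=HEADERS,
                     json={"seed_id": seed["id"]}).json()

# 3. Run generation: request new samples based on the specification.
run = requests.post(f"{BASE_URL}/generations", headers=HEADERS,
                    json={"specification_id": spec["id"], "num_samples": 1000}).json()

# Poll until the run finishes, then inspect its final state.
while True:
    status = requests.get(f"{BASE_URL}/generations/{run['id']}", headers=HEADERS).json()
    if status["state"] in ("completed", "failed"):
        break
    time.sleep(10)

print(status["state"])
```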
Use cases
- ML training data: Generate balanced, diverse datasets for model training
- Testing & QA: Create realistic test sets and simulate edge cases
- Text-to-SQL: Build training datasets with schema-aware SQL queries
- Code generation: Produce code samples across languages and frameworks
- Privacy compliance: Create synthetic datasets that preserve statistical properties without exposing PII
Key features
- Distribution-based generation: Captures probability distributions and property dependencies
- Quality validation: Iterative refinement with evaluation loops and revision cycles
- SQL validation: Syntax, schema, and execution validation for generated queries (a conceptual sketch follows this list)
- Multi-format support: CSV, JSON, JSONL, text files, multi-file structures
- Flexible model selection: Choose from multiple LLMs optimized for different tasks
- Scalable: Generate many thousands of samples per run with distributed processing
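To make the SQL validation layers concrete, the sketch below checks a generated query in three passes: syntax, schema, and execution. It runs against an in-memory SQLite database and is a conceptual stand-in rather than Dataframer's validator; the schema and query are invented for the example.

```python
# Conceptual sketch of three-layer SQL validation (syntax, schema, execution)
# against an in-memory SQLite database. This illustrates the idea only; it is
# not Dataframer's validator, and the schema/query below are invented examples.
import sqlite3

SCHEMA = "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);"
QUERY = "SELECT customer, SUM(total) FROM orders GROUP BY customer;"

def validate_sql(schema: str, query: str) -> dict:
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)

        # Lightweight syntax check: does the string form a complete SQL statement?
        if not sqlite3.complete_statement(query):
            return {"ok": False, "stage": "syntax", "error": "incomplete statement"}

        # Schema check: preparing a query plan fails on unknown tables or columns
        # (and on any remaining syntax errors).
        try:
            conn.execute(f"EXPLAIN QUERY PLAN {query}")
        except sqlite3.Error as e:
            return {"ok": False, "stage": "schema", "error": str(e)}

        # Execution check: actually run the query against the (empty) schema.
        try:
            conn.execute(query).fetchall()
        except sqlite3.Error as e:
            return {"ok": False, "stage": "execution", "error": str(e)}

        return {"ok": True, "stage": "execution", "error": None}
    finally:
        conn.close()

print(validate_sql(SCHEMA, QUERY))  # {'ok': True, 'stage': 'execution', 'error': None}
```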
Next steps
Get started
Follow the quickstart guide to generate your first dataset

