# What is DataFramer?

> Generate realistic, diverse synthetic datasets at scale — from example data or a text description

DataFramer is a platform for generating high-quality synthetic datasets at scale. Give it example data or describe what you need, and it generates thousands of new samples that match your target patterns, distributions, and structure — across documents, spreadsheets, multi-file packages, and more.

<CardGroup cols={2}>
  <Card title="Quickstart" icon="rocket" href="/quickstart">
    Generate your first dataset in 5 minutes
  </Card>

  <Card title="Core Concepts" icon="book" href="/concepts">
    Understand how DataFramer works
  </Card>

  <Card title="Complete Guide" icon="map" href="/workflow">
    Step-by-step walkthrough of all features
  </Card>

  <Card title="API & MCP" icon="code" href="/api-and-mcp">
    Programmatic access via Python SDK or MCP
  </Card>
</CardGroup>

## How it works

DataFramer uses a three-stage pipeline:

**1. Upload Seed Data** (optional)

Upload sample data that represents what you want to generate. A sample is the unit that gets imitated: in a CSV or JSONL file it is a row, in a multi-file upload it is a file, and in a multi-folder upload it is a folder. Seeds can be text documents, code files, SQL queries, or multi-file structures.
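
To make the notion of a sample concrete, here is a minimal sketch that counts samples for each seed layout described above. The function `count_samples` and the `layout` labels are hypothetical names chosen for illustration; they are not part of the DataFramer SDK.

```python
import csv
from pathlib import Path


def count_samples(path: Path, layout: str) -> int:
    """Count samples under the rule: one row, one file, or one folder per sample.

    `layout` is a hypothetical label for this sketch, not a DataFramer parameter.
    """
    if layout == "csv":
        # One sample per data row; the header row is not a sample.
        with path.open(newline="", encoding="utf-8") as f:
            return sum(1 for _ in csv.DictReader(f))
    if layout == "jsonl":
        # One sample per non-empty line.
        with path.open(encoding="utf-8") as f:
            return sum(1 for line in f if line.strip())
    if layout == "multi_file":
        # One sample per file directly inside the directory.
        return sum(1 for p in path.iterdir() if p.is_file())
    if layout == "multi_folder":
        # One sample per subfolder, whatever files it contains.
        return sum(1 for p in path.iterdir() if p.is_dir())
    raise ValueError(f"unknown layout: {layout}")
```

Counting seeds this way before uploading is a quick check that the unit you expect to be imitated matches how your data is laid out.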

**2. Create Specifications**

AI analyzes your seeds to create a specification: an editable blueprint capturing data structure, properties, distributions, and patterns. For seedless generation, a natural language description takes the place of the seeds.

**3. Run Generation**

Generate thousands of new samples based on your specification. Configure quality settings, validation, and model selection.
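
This page does not describe how validation is implemented. One common pattern for this kind of quality control is a generate, check, revise loop, sketched below with plain callables. Every name here (`generate_with_revisions`, `conforms`, `revise`, `max_revisions`) is an assumption made for illustration, not a DataFramer setting or API.

```python
from typing import Callable, Tuple


def generate_with_revisions(
    generate: Callable[[], str],
    conforms: Callable[[str], bool],
    revise: Callable[[str], str],
    max_revisions: int = 2,
) -> Tuple[str, bool]:
    """Produce a draft, then revise it until it passes the check or the budget runs out.

    Returns the final draft and whether it passed the conformance check.
    """
    draft = generate()
    for _ in range(max_revisions):
        if conforms(draft):
            return draft, True
        # The draft failed the check, so spend one revision on it.
        draft = revise(draft)
    # Budget exhausted: report whether the last revision passes.
    return draft, conforms(draft)
```

A larger revision budget trades generation cost for a higher pass rate, which is the kind of trade-off a quality setting typically controls.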

## Use cases

* **LLM evaluation & benchmarking**: Generate diverse test datasets to evaluate and stress-test AI models
* **Training data**: Create balanced, labeled datasets for model training and fine-tuning
* **Fraud detection**: Synthesize rare fraud scenarios for pre-production testing of detection systems
* **Insurance & healthcare**: Generate multi-file application packages, EHR datasets, and claims data
* **Privacy & compliance**: Produce synthetic datasets that preserve statistical properties without exposing PII
* **Testing & QA**: Create realistic test sets, edge cases, and adversarial scenarios

## Key features

* **Long-form & complex documents**: Generate documents of 50K+ tokens with consistent structure, style, and formatting
* **Multi-format support**: CSV, JSON, JSONL, PDF, DOCX, text files, and multi-file/multi-folder structures
* **Seeded or seedless**: Learn from example data, or generate from a natural language description alone
* **Distribution control**: Define and enforce probability distributions, property dependencies, and conditional relationships
* **Quality validation**: Iterative refinement with evaluation loops, revision cycles, and built-in conformance checks
* **Flexible model selection**: Choose from multiple LLMs optimized for different tasks and budgets
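
As an illustration of what distribution control with property dependencies means, the sketch below samples two properties where the second is conditioned on the first. The property names, the probabilities, and the helpers `draw` and `sample_properties` are invented for this example and do not reflect DataFramer's specification format.

```python
import random
from collections import Counter

# Hypothetical marginal distribution for one property.
CLAIM_TYPE = {"auto": 0.6, "home": 0.3, "health": 0.1}

# Hypothetical conditional distribution: P(fraud | claim_type).
FRAUD_GIVEN_TYPE = {
    "auto": {"yes": 0.05, "no": 0.95},
    "home": {"yes": 0.02, "no": 0.98},
    "health": {"yes": 0.10, "no": 0.90},
}


def draw(dist: dict, rng: random.Random) -> str:
    """Draw one value from a {value: probability} mapping."""
    values, weights = zip(*dist.items())
    return rng.choices(values, weights=weights, k=1)[0]


def sample_properties(rng: random.Random) -> dict:
    """Sample claim_type first, then fraud conditioned on the claim type."""
    claim_type = draw(CLAIM_TYPE, rng)
    fraud = draw(FRAUD_GIVEN_TYPE[claim_type], rng)
    return {"claim_type": claim_type, "fraud": fraud}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    counts = Counter(sample_properties(rng)["claim_type"] for _ in range(10_000))
    print(counts)  # shares should land near 0.6 / 0.3 / 0.1
```

Sampling the parent property first and the dependent property second is what keeps rare combinations, such as fraudulent health claims, at a controlled rate instead of leaving them to chance.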

## Next steps

<Card title="Get started" icon="play" href="/quickstart">
  Follow the quickstart guide to generate your first dataset
</Card>
