This guide walks you through creating your first synthetic dataset. You’ll upload sample data, create a specification, and generate new samples.

Prerequisites

  • Access to the Dataframer platform
  • Sample data (a CSV, JSON, or JSONL file, or a folder of text files) containing at least 2 examples

Step 1: Upload seed data

Navigate to Seed Datasets and click Upload Dataset. Choose your upload mode:
Single File: upload one CSV, JSON, or JSONL file containing structured data. Example: a CSV of product reviews, or a JSONL file of chat messages.
  • Max file size: 50MB
Multiple Files: upload a folder of independent text files (each file = one sample). Example: a collection of documents or code snippets.
  • Max 1,000 files
  • 1MB per file, 50MB total
  • Supported: TXT, MD, JSON, CSV, JSONL
Multiple Folders: upload a parent folder containing subfolders (each subfolder = one multi-file sample). Example: code repositories with multiple files per project.
  • Min 2 folders required
  • Max 20 files per folder
  • Max depth: parent/subfolder/file.txt
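If you want to sanity-check a folder against the Multiple Files limits before uploading, a quick local script helps. This is an illustrative sketch, not part of the platform; the limits are the ones listed above:

```python
from pathlib import Path

# Limits for the Multiple Files upload mode (from the list above)
MAX_FILES = 1_000
MAX_FILE_BYTES = 1 * 1024 * 1024      # 1MB per file
MAX_TOTAL_BYTES = 50 * 1024 * 1024    # 50MB total
ALLOWED_SUFFIXES = {".txt", ".md", ".json", ".csv", ".jsonl"}

def check_seed_folder(folder: Path) -> list[str]:
    """Return a list of problems that would block an upload (empty = OK)."""
    problems = []
    files = [p for p in folder.iterdir() if p.is_file()]
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} > {MAX_FILES}")
    total = 0
    for p in files:
        size = p.stat().st_size
        total += size
        if size > MAX_FILE_BYTES:
            problems.append(f"{p.name}: {size} bytes exceeds the 1MB per-file limit")
        if p.suffix.lower() not in ALLOWED_SUFFIXES:
            problems.append(f"{p.name}: unsupported file type {p.suffix!r}")
    if total > MAX_TOTAL_BYTES:
        problems.append(f"total size {total} bytes exceeds the 50MB limit")
    return problems
```

An empty return value means the folder should pass the upload checks.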

Step 2: Create specification

Once your dataset is uploaded, click Create Spec.
  1. Name: Give your spec a descriptive name
  2. Generation objectives (optional but recommended): Guide the analysis
    • Example: “Include writing style and formality as properties”
    • Example: “Don’t treat length as a variable”
Use the default settings for your first run. Click Create Spec and wait 1-5 minutes for analysis to complete. Once ready, you can view the generated spec and manually edit properties, adjust probability distributions, add or remove values, or configure conditional relationships.
The specification captures data structure, discovered properties, and probability distributions from your seeds. You can edit these to make generated data deviate from the seed patterns.
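To build intuition for what the probability distributions in a spec do, you can think of each generated sample as drawing its property values from categorical distributions. The sketch below is purely illustrative (the property names and weights are invented, and the platform's actual generation is far more involved):

```python
import random

# Hypothetical variable properties with the probability distributions
# a spec might assign (names and weights invented for illustration).
spec_properties = {
    "formality": {"casual": 0.5, "neutral": 0.3, "formal": 0.2},
    "domain":    {"electronics": 0.6, "apparel": 0.4},
}

def sample_profile(rng: random.Random) -> dict[str, str]:
    """Draw one property profile; each generated sample gets such a profile."""
    profile = {}
    for prop, dist in spec_properties.items():
        values, weights = zip(*dist.items())
        profile[prop] = rng.choices(values, weights=weights, k=1)[0]
    return profile

rng = random.Random(0)
profiles = [sample_profile(rng) for _ in range(1000)]
casual_share = sum(p["formality"] == "casual" for p in profiles) / len(profiles)
```

Editing a distribution in the spec shifts these weights, which is how you make generated data deviate from the seed patterns.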

Step 3: Configure generation run

Once your spec shows “Ready” status, click Create Run. Choose Long Samples mode and set Number of samples to 10 for a quick test. Use default settings for everything else.

Step 4: Monitor progress

Your run starts immediately. Watch the real-time status (Pending → Running → Finished), progress percentage, and elapsed time on the Runs page. On our SaaS platform, generating 10 samples can take anywhere from 1 minute (for short text samples) to 1 hour (for structured 100K-token samples). We place a strong emphasis on quality: a single very large, complex sample can require 100+ diverse LLM calls to generate properly.

Step 5: Review results

Once the run finishes, click it to view results.
Generated Dataset tab:
  • Browse generated samples
  • Preview samples inline
  • See property tags for each sample (formality, complexity, domain, etc.)
  • Download individual samples or entire dataset as ZIP
Analysis tab:
  • Distributions: Compare the distributions you specified in the spec with what our evaluation classifiers measured in the generated samples
  • Classification: View property assignments per sample
  • Chat: Ask questions about your dataset
Review at least 10 random samples manually to verify quality before scaling to larger runs.
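The Distributions view essentially puts your specified frequencies next to the frequencies the classifiers measured. A rough sketch of that comparison, using invented numbers and total-variation distance as the gap metric (not necessarily the platform's metric):

```python
from collections import Counter

def measured_distribution(labels: list[str]) -> dict[str, float]:
    """Turn per-sample classifier labels into relative frequencies."""
    counts = Counter(labels)
    total = len(labels)
    return {value: n / total for value, n in counts.items()}

def total_variation(spec: dict[str, float], measured: dict[str, float]) -> float:
    """Total-variation distance: 0 = identical distributions, 1 = disjoint."""
    values = set(spec) | set(measured)
    return 0.5 * sum(abs(spec.get(v, 0) - measured.get(v, 0)) for v in values)

# Invented example: formality specified as 50/30/20, with 10 classifier labels
specified = {"casual": 0.5, "neutral": 0.3, "formal": 0.2}
labels = ["casual"] * 6 + ["neutral"] * 3 + ["formal"] * 1
gap = total_variation(specified, measured_distribution(labels))
```

A small gap means the generated data tracks the spec; a large gap is a cue to review the spec or add more samples to the run.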

Next steps

Now that you’ve generated your first dataset, try editing the spec’s properties and probability distributions to steer the output, then scale up to a larger run.

Common issues

My upload isn’t working
Make sure you’ve selected the correct upload type (Single File / Multiple Files / Multiple Folders) for your data structure, and that you’ve specified a dataset name.
What’s the difference between a sample and a dataset?
A sample is the unit that gets imitated:
  • CSV/JSONL: one row = one sample
  • Multiple Files: one file = one sample
  • Multiple Folders: one folder = one sample
A dataset is the collection of sample examples you upload together.
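To make the sample-counting rules concrete, here’s a hypothetical helper (not platform code) that counts samples the same way for each upload mode:

```python
from pathlib import Path

def count_samples(path: Path) -> int:
    """Count samples under the rules above (hypothetical helper)."""
    if path.is_file() and path.suffix in {".csv", ".jsonl"}:
        # CSV/JSONL: one row = one sample (skip the CSV header row)
        lines = [l for l in path.read_text().splitlines() if l.strip()]
        return len(lines) - 1 if path.suffix == ".csv" else len(lines)
    if path.is_dir():
        entries = list(path.iterdir())
        subdirs = [e for e in entries if e.is_dir()]
        if subdirs:
            # Multiple Folders: one subfolder = one sample
            return len(subdirs)
        # Multiple Files: one file = one sample
        return sum(e.is_file() for e in entries)
    raise ValueError(f"unsupported path: {path}")
```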
I only have one seed example
We’re working on seedless generation (generating from 0-1 examples), but it’s not yet in production. For now, a minimum of 2 examples is required; you can:
  • Find at least one more similar example
  • Generate a second example with ChatGPT to use as a starting point
Generated samples don’t show enough variation
Check these:
  1. Verify that the variation is specified in Variable data properties, NOT Shared data properties. Variable properties change across samples; shared properties stay constant.
  2. Ensure you have enough samples: the empirical distribution of properties in your generated data will only approximate your specified distribution if you have sufficient examples.
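The second point can be seen with a quick simulation: with only a handful of samples, empirical frequencies can wander far from the specified distribution, and they tighten as the sample count grows. An illustrative sketch with an invented distribution:

```python
import random

# Invented specified distribution for a single property
specified = {"casual": 0.5, "neutral": 0.3, "formal": 0.2}

def empirical_gap(n_samples: int, rng: random.Random) -> float:
    """Max absolute gap between specified and empirical frequencies."""
    values, weights = zip(*specified.items())
    draws = rng.choices(values, weights=weights, k=n_samples)
    return max(abs(draws.count(v) / n_samples - p) for v, p in specified.items())

rng = random.Random(0)
small = empirical_gap(10, rng)       # 10 samples: gap can be large
large = empirical_gap(10_000, rng)   # 10,000 samples: gap shrinks
```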
Generated samples don’t capture my data’s structure
If your data has a very complex structure and generated samples don’t capture it:
  • Enable revisions if not already enabled
  • Increase max revision cycles to 5
This allows more quality-improvement passes to refine the output.