This guide walks you through creating your first synthetic dataset. You’ll upload sample data, create a specification, and generate new samples.

Prerequisites

  • Access to the Dataframer platform
  • Sample data (a CSV, JSON, or JSONL file, or a folder of text files) containing at least 2 examples

Step 1: Upload seed data

Navigate to Seed Datasets and click Upload Dataset. Choose your upload mode:
Single File: upload one CSV, JSON, or JSONL file containing structured data. Example: a CSV of product reviews, or a JSONL file of chat messages.
  • Max file size: 50MB
Multiple Files: upload a folder of independent text files (each file = one sample). Example: a collection of documents or code snippets.
  • Max 1,000 files
  • 1MB per file, 50MB total
  • Supported: TXT, MD, JSON, CSV, JSONL
Multiple Folders: upload a parent folder containing subfolders (each subfolder = one multi-file sample). Example: code repositories with multiple files per project.
  • Min 2 folders required
  • Max 20 files per folder
  • Max depth: parent/subfolder/file.txt
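If you want to sanity-check a folder against the Multiple Files limits before uploading, a quick local script helps. This is an illustrative sketch, not part of the platform; the limits are the ones listed above:

```python
from pathlib import Path

# Limits for the Multiple Files upload mode (from the list above)
MAX_FILES = 1_000
MAX_FILE_BYTES = 1 * 1024 * 1024      # 1MB per file
MAX_TOTAL_BYTES = 50 * 1024 * 1024    # 50MB total
ALLOWED_SUFFIXES = {".txt", ".md", ".json", ".csv", ".jsonl"}

def check_seed_folder(folder: Path) -> list[str]:
    """Return a list of problems that would block an upload (empty = OK)."""
    problems = []
    files = [p for p in folder.iterdir() if p.is_file()]
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} > {MAX_FILES}")
    total = 0
    for p in files:
        size = p.stat().st_size
        total += size
        if size > MAX_FILE_BYTES:
            problems.append(f"{p.name}: {size} bytes exceeds the 1MB per-file limit")
        if p.suffix.lower() not in ALLOWED_SUFFIXES:
            problems.append(f"{p.name}: unsupported file type {p.suffix!r}")
    if total > MAX_TOTAL_BYTES:
        problems.append(f"total size {total} bytes exceeds the 50MB limit")
    return problems
```

An empty return value means the folder should pass the upload checks.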

Step 2: Create specification

Once your dataset is uploaded, click Create Spec.
  1. Name: Give your spec a descriptive name
  2. Generation objectives (optional but recommended): Guide the analysis
    • Example: “Include writing style and formality as properties”
    • Example: “Don’t treat length as a variable”
Use the default settings for your first run. Click Create Spec and wait 1-5 minutes for analysis to complete. Once ready, you can view the generated spec and manually edit properties, adjust probability distributions, add or remove values, or configure conditional relationships.
The specification captures data structure, discovered properties, and probability distributions from your seeds. You can edit these to make generated data deviate from the seed patterns.
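To build intuition for what the probability distributions in a spec do, you can think of each generated sample as drawing its property values from categorical distributions. The sketch below is purely illustrative (the property names and weights are invented, and the platform's actual generation is far more involved):

```python
import random

# Hypothetical variable properties with the probability distributions
# a spec might assign (names and weights invented for illustration).
spec_properties = {
    "formality": {"casual": 0.5, "neutral": 0.3, "formal": 0.2},
    "domain":    {"electronics": 0.6, "apparel": 0.4},
}

def sample_profile(rng: random.Random) -> dict[str, str]:
    """Draw one property profile; each generated sample gets such a profile."""
    profile = {}
    for prop, dist in spec_properties.items():
        values, weights = zip(*dist.items())
        profile[prop] = rng.choices(values, weights=weights, k=1)[0]
    return profile

rng = random.Random(0)
profiles = [sample_profile(rng) for _ in range(1000)]
casual_share = sum(p["formality"] == "casual" for p in profiles) / len(profiles)
```

Editing a distribution in the spec shifts these weights, which is how you make generated data deviate from the seed patterns.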

Step 3: Configure generation run

Once your spec shows “Ready” status, click Create Run. Choose Long Samples mode and set Number of samples to 10 for a quick test. Use default settings for everything else.

Step 4: Monitor progress

Your run starts immediately. Watch the real-time status (Pending → Running → Finished), progress percentage, and elapsed time on the Runs page. On our SaaS platform, generating 10 samples can take anywhere from 1 minute (for short text samples) to 1 hour (for structured 100K-token samples). We place a strong emphasis on quality: a single very large, complex sample can require 100+ diverse LLM calls to generate properly.

Step 5: Review results

Once the run finishes, click it to view results.
Generated Dataset tab:
  • Browse generated samples
  • Preview samples inline
  • See property tags for each sample (formality, complexity, domain, etc.)
  • Download individual samples or entire dataset as ZIP
Analysis tab:
  • Distributions: Compare the distributions you specified in the spec with what our evaluation classifiers measured in the generated samples
  • Classification: View property assignments per sample
  • Chat: Ask questions about your dataset
Review at least 10 random samples manually to verify quality before scaling to larger runs.
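The Distributions view essentially puts your specified frequencies next to the frequencies the classifiers measured. A rough sketch of that comparison, using invented numbers and total-variation distance as the gap metric (not necessarily the platform's metric):

```python
from collections import Counter

def measured_distribution(labels: list[str]) -> dict[str, float]:
    """Turn per-sample classifier labels into relative frequencies."""
    counts = Counter(labels)
    total = len(labels)
    return {value: n / total for value, n in counts.items()}

def total_variation(spec: dict[str, float], measured: dict[str, float]) -> float:
    """Total-variation distance: 0 = identical distributions, 1 = disjoint."""
    values = set(spec) | set(measured)
    return 0.5 * sum(abs(spec.get(v, 0) - measured.get(v, 0)) for v in values)

# Invented example: formality specified as 50/30/20, with 10 classifier labels
specified = {"casual": 0.5, "neutral": 0.3, "formal": 0.2}
labels = ["casual"] * 6 + ["neutral"] * 3 + ["formal"] * 1
gap = total_variation(specified, measured_distribution(labels))
```

A small gap means the generated data tracks the spec; a large gap is a cue to review the spec or add more samples to the run.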

Next steps

Now that you’ve generated your first dataset, try editing the spec’s properties and probability distributions to steer the output, then scale up to a larger run.

Common issues

My upload isn’t working
Make sure you’ve selected the correct upload type (Single File / Multiple Files / Multiple Folders) for your data structure, and that you’ve specified a dataset name.
What’s the difference between a sample and a dataset?
A sample is the unit that gets imitated:
  • CSV/JSONL: one row = one sample
  • Multiple Files: one file = one sample
  • Multiple Folders: one folder = one sample
A dataset is the collection of sample examples you upload together.
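To make the sample-counting rules concrete, here’s a hypothetical helper (not platform code) that counts samples the same way for each upload mode:

```python
from pathlib import Path

def count_samples(path: Path) -> int:
    """Count samples under the rules above (hypothetical helper)."""
    if path.is_file() and path.suffix in {".csv", ".jsonl"}:
        # CSV/JSONL: one row = one sample (skip the CSV header row)
        lines = [l for l in path.read_text().splitlines() if l.strip()]
        return len(lines) - 1 if path.suffix == ".csv" else len(lines)
    if path.is_dir():
        entries = list(path.iterdir())
        subdirs = [e for e in entries if e.is_dir()]
        if subdirs:
            # Multiple Folders: one subfolder = one sample
            return len(subdirs)
        # Multiple Files: one file = one sample
        return sum(e.is_file() for e in entries)
    raise ValueError(f"unsupported path: {path}")
```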
I only have one seed example
We’re working on seedless generation (generating from 0-1 examples), but it’s not yet in production. For now, a minimum of 2 examples is required; you can:
  • Find at least one more similar example
  • Generate a second example with ChatGPT to use as a starting point
Generated samples don’t show enough variation
Check these:
  1. Verify that the variation is specified in Variable data properties, NOT Shared data properties. Variable properties change across samples; shared properties stay constant.
  2. Ensure you have enough samples: the empirical distribution of properties in your generated data will only approximate your specified distribution if you have sufficient examples.
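The second point can be seen with a quick simulation: with only a handful of samples, empirical frequencies can wander far from the specified distribution, and they tighten as the sample count grows. An illustrative sketch with an invented distribution:

```python
import random

# Invented specified distribution for a single property
specified = {"casual": 0.5, "neutral": 0.3, "formal": 0.2}

def empirical_gap(n_samples: int, rng: random.Random) -> float:
    """Max absolute gap between specified and empirical frequencies."""
    values, weights = zip(*specified.items())
    draws = rng.choices(values, weights=weights, k=n_samples)
    return max(abs(draws.count(v) / n_samples - p) for v, p in specified.items())

rng = random.Random(0)
small = empirical_gap(10, rng)       # 10 samples: gap can be large
large = empirical_gap(10_000, rng)   # 10,000 samples: gap shrinks
```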
Generated samples don’t capture my data’s structure
If your data has a very complex structure and generated samples don’t capture it:
  • Enable revisions if not already enabled
  • Increase max revision cycles to 5
This allows more quality-improvement passes to refine the output.