# Quickstart

> Generate your first synthetic dataset in 5 minutes

This guide walks you through creating your first synthetic dataset. You'll upload sample data, create a specification, and generate new samples.

## Prerequisites

* Access to DataFramer (request at [info@dataframer.ai](mailto:info@dataframer.ai))
* A sample data file (CSV, JSONL, or text files) with at least 2 examples, or just a description of what you want to generate (seedless mode)

## Step 1: Upload seed data

Navigate to **Seed Datasets** and click **+ Upload**. Choose your upload mode:

**Single File**

Upload one CSV, JSON, or JSONL file containing structured data.

**Example**: CSV with product reviews, JSONL with chat messages

* Max file size: 50MB

**Multiple Files**

Upload a folder of independent text files (each file = one sample).

**Example**: Collection of documents, code snippets

* Max 1,000 files
* 1MB per text file (up to 25MB for PDF/XLSX), 50MB total
* Supported: TXT, MD, JSON, CSV, JSONL, PDF, XLSX

**Multiple Folders**

Upload a parent folder containing subfolders (each subfolder = one multi-file sample).

**Example**: Code repositories with multiple files per project

* Min 2 folders required
* Max 20 files per folder
* Max depth: parent/subfolder/file.txt

## Step 2: Create specification

Once your dataset is uploaded, click **Create Spec**. Fill in the form:

1. **Spec name**: Give your spec a descriptive name
2. **Spec generation objectives** (optional but encouraged): Guide the analysis
   * Example: "Include writing style and formality as properties"
   * Example: "Don't treat length as a variable"

Leave the model and other settings at their defaults for your first run.

Click **+ Create Spec** and wait 1-5 minutes for analysis to complete.
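If you used the Multiple Folders upload mode in Step 1, a local check of the folder layout can catch problems before the upload. The sketch below is not part of DataFramer; it is a hypothetical helper (`check_multi_folder_layout` is an illustrative name) that only encodes the three limits listed above: at least 2 subfolders, at most 20 files per subfolder, and no nesting deeper than parent/subfolder/file.

```python
from pathlib import Path

# Limits copied from the Multiple Folders upload mode in Step 1.
MIN_FOLDERS = 2
MAX_FILES_PER_FOLDER = 20


def check_multi_folder_layout(parent: Path) -> list[str]:
    """Return a list of problems; an empty list means no issues were found.

    Hypothetical local helper, not a DataFramer API.
    """
    problems = []
    subfolders = sorted(p for p in parent.iterdir() if p.is_dir())

    if len(subfolders) < MIN_FOLDERS:
        problems.append(
            f"found {len(subfolders)} subfolder(s), need at least {MIN_FOLDERS}"
        )

    for folder in subfolders:
        entries = list(folder.iterdir())
        files = [e for e in entries if e.is_file()]
        nested = sorted(e.name for e in entries if e.is_dir())

        if len(files) > MAX_FILES_PER_FOLDER:
            problems.append(
                f"{folder.name}: {len(files)} files, max is {MAX_FILES_PER_FOLDER}"
            )
        if nested:
            # Anything below parent/subfolder/ exceeds the documented depth.
            problems.append(f"{folder.name}: nested folders not allowed: {nested}")

    return problems
```

Calling `check_multi_folder_layout(Path("my_seed_folder"))` on your own parent folder (the path here is a placeholder) returns human-readable messages you can fix before uploading. File size and file type limits are not checked in this sketch.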
Once the spec is ready, you can view it and manually edit properties, adjust probability distributions, add or remove values, or configure conditional relationships.

The specification captures data structure, discovered properties, and probability distributions from your seeds. You can edit these to make generated data deviate from the seed patterns.

## Step 3: Configure generation run

Once your spec shows "Ready" status, click **Create Run**.

Set **Number of samples** to 10 for a quick test. Use default settings for everything else.

## Step 4: Monitor progress

Your run starts immediately. Watch real-time status (Pending → Running → Succeeded), progress percentage, and elapsed time on the Generation Runs page.

On our SaaS platform, generating 10 samples can take anywhere from 1 minute (for short text samples) to 1 hour (for structured 100K token samples). We place heavy emphasis on quality: one large, complex sample can require 30+ diverse LLM calls to generate properly.

## Step 5: Review results

Once the run finishes, click it to view results. The run detail view has three tabs:

**Generated Dataset tab**:

* Browse generated samples
* Preview samples inline
* See property tags for each sample (formality, complexity, domain, etc.)
* Download individual samples or the entire dataset as a ZIP

**Distribution Analysis tab**:

* Compare what you specified in the spec with what our evaluation classifiers measured in the generated samples

**Chat tab**:

* Ask questions about your generated dataset

Review at least 10 random samples manually to verify quality and cost per sample before scaling to larger runs.

## Next steps

Now that you've generated your first dataset:

* Learn about data properties, distributions, and generation modes
* Explore all features and configuration options

## Common issues

**No sample data**

Not a problem. Use **Seedless (prompt-based) Generation** to create specs without uploading samples:

1. Go to **Generation Specs** → **+ Create**
2. Select the **Seedless** tab
3. Provide a spec name and generation objectives describing your desired data
4. The system will analyze your objectives and create a spec from scratch

**Upload problems**

Make sure you've selected the correct upload type (Single File / Multiple Files / Multiple Folders) for your data structure, and that you've specified a dataset name.

**Sample vs. dataset**

A **sample** is the unit that gets imitated:

* **CSV/JSONL**: One row = one sample
* **Multiple Files**: One file = one sample
* **Multiple Folders**: One folder = one sample

A **dataset** is the collection of samples you upload together.

**Generated data doesn't show the expected variation**

**Check these**:

1. Verify variation is specified in **Variable data properties**, NOT Shared data properties. Variable properties change across samples; shared properties stay constant.
2. Ensure you have enough seed examples. The empirical distribution of properties in your generated data will only approximate your specified distribution if you have sufficient examples.

**Complex structure not captured**

If your data has very complex structure and generated samples don't capture it:

* Enable revisions if not already enabled
* Increase max revision cycles to 3-5

This allows more quality improvement passes to refine the output.
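The point above about needing sufficient examples before an empirical distribution approximates the specified one is a general property of sampling, and a small simulation makes it concrete. The sketch below is an assumption-laden illustration, not DataFramer's generation logic: it draws values with Python's standard `random.choices`, and the property values and probabilities in `spec` are hypothetical.

```python
import random
from collections import Counter


def empirical_distribution(spec: dict[str, float], n_samples: int, seed: int = 0) -> dict[str, float]:
    """Draw n_samples values according to spec and return observed frequencies."""
    rng = random.Random(seed)  # fixed seed so reruns give the same draws
    values = list(spec)
    weights = [spec[v] for v in values]
    draws = rng.choices(values, weights=weights, k=n_samples)
    counts = Counter(draws)
    return {v: counts[v] / n_samples for v in values}


def max_abs_gap(spec: dict[str, float], observed: dict[str, float]) -> float:
    """Largest absolute difference between specified and observed probability."""
    return max(abs(spec[v] - observed[v]) for v in spec)


# Hypothetical property distribution, for illustration only.
spec = {"formal": 0.5, "neutral": 0.3, "casual": 0.2}

for n in (10, 100, 1000):
    observed = empirical_distribution(spec, n)
    print(n, observed, round(max_abs_gap(spec, observed), 3))
```

With only 10 samples, each observed frequency can move only in steps of 0.1, so a visible gap from the specified distribution in a quick test run is expected rather than a sign of a misconfigured spec. Larger sample counts tend to shrink that gap.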