Overview

This tutorial walks you through generating synthetic data samples using your specification. You’ll learn how to create generation requests, monitor progress, and retrieve results.

What You’ll Learn

  • How to request sample generation
  • Choosing between short and long samples
  • Monitoring generation progress
  • Retrieving and downloading samples
  • Handling generation errors

Prerequisites

Step 1: Choose Sample Type

Decide between short and long samples:

Short Samples

Use for:
  • Quick text (reviews, messages, posts)
  • Simple documents
  • Fast iteration and testing
Characteristics:
  • Fast generation (seconds per sample)
  • 50-500 words typically
  • Lower cost
Short samples are not supported for structured single-file datasets (CSV, JSON, JSONL); use long samples for those.

Long Samples

Use for:
  • Complex documents
  • Detailed records
  • Structured data (CSV, JSON, JSONL)
  • Multi-file samples
Characteristics:
  • Slower generation (minutes per sample)
  • Can be several pages
  • Higher quality and detail
When in doubt, start with short samples for faster feedback!
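
The guidance above can be condensed into a small decision helper. This is only a sketch: `choose_sample_type` and its arguments are hypothetical names, not part of the Dataframer API, and the rules encode only what this section states (structured single-file formats and multi-file samples call for long samples; otherwise start short).

```python
# Hypothetical helper; it encodes only the rules stated in this section.
STRUCTURED_FORMATS = {"csv", "json", "jsonl"}


def choose_sample_type(file_format: str, multi_file: bool = False) -> str:
    """Return "short" or "long" for a dataset described by its file format."""
    fmt = file_format.lower().lstrip(".")
    if multi_file or fmt in STRUCTURED_FORMATS:
        # Short samples are not supported for structured single-file datasets.
        return "long"
    # When in doubt, start with short samples for faster feedback.
    return "short"


print(choose_sample_type("csv"))
print(choose_sample_type("txt"))
```

The returned string can be passed directly as the `sample_type` field of a generation request.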

Step 2: Submit Generation Request

Generate Short Samples

curl -X POST 'https://df-api.dataframer.ai/api/dataframer/generate/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "spec_id": "spec_xyz789",
    "number_of_samples": 10,
    "sample_type": "short"
  }'
Response:
{
  "task_id": "gen_abc123",
  "run_id": "run_xyz456",
  "status": "PENDING",
  "total_samples": 10,
  "message": "Generation task created successfully"
}

Generate Long Samples

curl -X POST 'https://df-api.dataframer.ai/api/dataframer/generate/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "spec_id": "spec_xyz789",
    "number_of_samples": 5,
    "sample_type": "long"
  }'

With Advanced Options

curl -X POST 'https://df-api.dataframer.ai/api/dataframer/generate/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "spec_id": "spec_xyz789",
    "number_of_samples": 20,
    "sample_type": "short",
    "model": "anthropic/claude-sonnet-4-5",
    "temperature": 0.8
  }'
Parameters:
Parameter          | Required | Default           | Description
spec_id            | Yes      | -                 | Specification to use
number_of_samples  | Yes      | -                 | How many samples (1-1000)
sample_type        | Yes      | -                 | "short" or "long"
model              | No       | claude-sonnet-4-5 | LLM model to use
temperature        | No       | 0.7               | Creativity (0.0-1.0)
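
Checking a request body locally before sending it can catch invalid parameters early. The sketch below validates against the ranges listed in the parameter table; `build_generate_payload` is a hypothetical helper, and whether the server enforces exactly the same limits is an assumption.

```python
import json


def build_generate_payload(spec_id, number_of_samples, sample_type,
                           model=None, temperature=None):
    """Validate arguments against the parameter table and return the JSON body."""
    if not spec_id:
        raise ValueError("spec_id is required")
    if not 1 <= number_of_samples <= 1000:
        raise ValueError("number_of_samples must be between 1 and 1000")
    if sample_type not in ("short", "long"):
        raise ValueError('sample_type must be "short" or "long"')
    payload = {
        "spec_id": spec_id,
        "number_of_samples": number_of_samples,
        "sample_type": sample_type,
    }
    # Optional fields are left out so the documented defaults apply.
    if model is not None:
        payload["model"] = model
    if temperature is not None:
        if not 0.0 <= temperature <= 1.0:
            raise ValueError("temperature must be between 0.0 and 1.0")
        payload["temperature"] = temperature
    return payload


print(json.dumps(build_generate_payload("spec_xyz789", 20, "short", temperature=0.8), indent=2))
```

The resulting dictionary can be sent as the JSON body of the POST request shown in the curl examples.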

Step 3: Monitor Generation Progress

Check Status

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/generate/status/gen_abc123' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Response (In Progress):
{
  "task_id": "gen_abc123",
  "run_id": "run_xyz456",
  "status": "RUNNING",
  "progress": 60,
  "completed_samples": 6,
  "total_samples": 10,
  "estimated_time_remaining": 120
}
Response (Completed):
{
  "task_id": "gen_abc123",
  "run_id": "run_xyz456",
  "status": "SUCCEEDED",
  "progress": 100,
  "completed_samples": 10,
  "total_samples": 10,
  "message": "Generation completed successfully"
}

Python Monitoring Script

import requests
import time

def monitor_generation(task_id, api_key):
    """Monitor generation progress with progress bar"""
    headers = {"Authorization": f"Bearer {api_key}"}
    base_url = "https://df-api.dataframer.ai/api/dataframer"
    
    while True:
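        response = requests.get(
            f"{base_url}/generate/status/{task_id}",
            headers=headers,
            timeout=30,  # avoid hanging forever on a stalled connection
        )
        response.raise_for_status()  # surface HTTP errors (401, 404, 5xx) early
        result = response.json()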
        
        status = result["status"]
        progress = result.get("progress", 0)
        completed = result.get("completed_samples", 0)
        total = result.get("total_samples", 0)
        
        # Display progress
        bar = "=" * (progress // 2) + " " * (50 - progress // 2)
        print(f"\r[{bar}] {progress}% ({completed}/{total})", end="")
        
        if status == "SUCCEEDED":
            print("\n✓ Generation completed!")
            return result
        elif status == "FAILED":
            error = result.get("error", "unknown error")
            print(f"\n✗ Generation failed: {error}")
            raise RuntimeError(error)
        
        time.sleep(10)  # Poll every 10 seconds

# Use the function
API_KEY = "your_api_key"
result = monitor_generation("gen_abc123", API_KEY)
Generation time varies:
  • Short samples: 5-30 seconds per sample
  • Long samples: 2-10 minutes per sample
Total time depends on available workers.
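
For planning, the per-sample figures above can be turned into a rough batch estimate. This is a sketch under a loud assumption: samples are spread evenly over a known number of parallel workers. The worker count is not something this page shows the API exposing, so `workers` and `estimate_total_seconds` are illustrative only.

```python
import math

# Per-sample ranges in seconds, taken from the figures above.
PER_SAMPLE_SECONDS = {"short": (5, 30), "long": (120, 600)}


def estimate_total_seconds(number_of_samples, sample_type, workers=1):
    """Return a (low, high) estimate in seconds for a whole batch."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    low, high = PER_SAMPLE_SECONDS[sample_type]
    # Assumption: samples are spread evenly over workers running in parallel.
    rounds = math.ceil(number_of_samples / workers)
    return rounds * low, rounds * high


low, high = estimate_total_seconds(10, "short")
print(f"10 short samples, 1 worker: {low}-{high} s")
```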

Step 4: Retrieve Generated Samples

Download All Samples as ZIP

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/generate/retrieve/gen_abc123' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  --output generated_samples.zip
The ZIP file contains:
  • All generated samples
  • Metadata about the generation
  • Folder structure (for MULTI_FOLDER datasets)

Extract and View

# Extract ZIP
unzip generated_samples.zip -d samples/

# View contents
ls -la samples/

# Read a sample
cat samples/sample_001.txt
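
The same extraction can be done from Python with the standard library, which is convenient when the samples feed straight into a script. A minimal sketch: `extract_samples` is a hypothetical helper, and the archive layout (file names such as `sample_001.txt`) is assumed from the example above.

```python
import zipfile
from pathlib import Path


def extract_samples(zip_path, dest_dir):
    """Extract the archive and return the extracted file paths, sorted."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # rglob also finds files in subfolders (e.g. MULTI_FOLDER datasets).
    return sorted(p for p in dest.rglob("*") if p.is_file())


# Example usage, once generated_samples.zip has been downloaded:
# for path in extract_samples("generated_samples.zip", "samples"):
#     print(path)
```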

Retrieve Specific Samples

Get individual samples without downloading everything:
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/generate/retrieve/gen_abc123' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "indices": [0, 5, 9]
  }'
Response:
{
  "samples": [
    {
      "index": 0,
      "content": "Sample content here...",
      "metadata": {
        "generated_at": "2025-11-26T10:30:00Z",
        "model": "anthropic/claude-sonnet-4-5"
      }
    },
    {
      "index": 5,
      "content": "Another sample...",
      "metadata": {...}
    }
  ]
}
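
Once parsed, a response of this shape is easy to index by sample number. The sketch below assumes only the `samples`, `index`, and `content` fields shown above; `samples_by_index` and `missing_indices` are hypothetical helpers, and how the API treats indices it cannot return is not documented here, so the check simply reports what is absent.

```python
def samples_by_index(response_body):
    """Map each returned sample's index to its content."""
    return {s["index"]: s["content"] for s in response_body.get("samples", [])}


def missing_indices(requested, response_body):
    """Return the requested indices that are absent from the response."""
    found = samples_by_index(response_body)
    return [i for i in requested if i not in found]


# Body mirroring the example response above (metadata left empty).
body = {
    "samples": [
        {"index": 0, "content": "Sample content here...", "metadata": {}},
        {"index": 5, "content": "Another sample...", "metadata": {}},
    ]
}
print(missing_indices([0, 5, 9], body))  # → [9]
```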

Step 5: Review Sample Quality

Python Quality Check

import requests

API_KEY = "your_api_key"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Retrieve all samples
response = requests.post(
    "https://df-api.dataframer.ai/api/dataframer/generate/retrieve/gen_abc123",
    headers=headers,
    json={"indices": list(range(10))}
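)
response.raise_for_status()  # stop on HTTP errors before reading the body

samples = response.json()["samples"]
if not samples:
    raise SystemExit("No samples returned; check the task ID and its status")

# Quick quality checks
print(f"Generated {len(samples)} samples")
print(f"Average length: {sum(len(s['content']) for s in samples) / len(samples):.0f} chars")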

# Check for duplicates
contents = [s['content'] for s in samples]
unique = len(set(contents))
print(f"Unique samples: {unique}/{len(samples)}")

# Review first sample
print("\n--- Sample 0 ---")
print(samples[0]['content'][:200] + "...")

Manual Review Checklist

  • Samples follow specification requirements
  • Appropriate length and format
  • No exact duplicates
  • Realistic and coherent
  • Proper structure (for structured data)
  • Diverse across variation axes
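
Two items on this checklist, exact duplicates and appropriate length, can be checked automatically before a manual pass. The following is a sketch: `check_samples` is a hypothetical helper, and the default word limits come from the "50-500 words typically" figure for short samples, so adjust them for long samples.

```python
from collections import Counter


def check_samples(samples, min_words=50, max_words=500):
    """Return automated findings for a list of {"index", "content"} samples."""
    counts = Counter(s["content"].strip() for s in samples)
    # Indices whose content appears more than once (exact match after trimming).
    duplicates = sorted(
        s["index"] for s in samples if counts[s["content"].strip()] > 1
    )
    # Indices whose word count falls outside the expected range.
    out_of_range = sorted(
        s["index"] for s in samples
        if not min_words <= len(s["content"].split()) <= max_words
    )
    return {"duplicates": duplicates, "out_of_range": out_of_range}


samples = [
    {"index": 0, "content": "great product " * 30},
    {"index": 1, "content": "great product " * 30},
    {"index": 2, "content": "too short"},
]
print(check_samples(samples))
```

Realism, coherence, and diversity across variation axes still need a human reviewer; this only narrows down which samples to look at first.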

Step 6: Regenerate if Needed

If samples don’t meet expectations:

Option 1: Adjust Temperature

# More consistent (lower temperature)
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/generate/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "spec_id": "spec_xyz789",
    "number_of_samples": 10,
    "sample_type": "short",
    "temperature": 0.3
  }'

Option 2: Update Specification

  1. Review problematic samples
  2. Identify what’s wrong
  3. Update specification with clearer requirements
  4. Generate new batch

Option 3: Try Different Model

# Try Claude Haiku for speed
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/generate/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "spec_id": "spec_xyz789",
    "number_of_samples": 10,
    "sample_type": "short",
    "model": "anthropic/claude-haiku-4-5"
  }'

Common Issues

Problem: Generation is slow
Normal times:
  • Short: 5-30 seconds per sample
  • Long: 2-10 minutes per sample
If unusually slow:
  • Check status for progress updates
  • High load may slow processing
  • Large batches take longer
  • Contact support if no progress for 30+ minutes
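
The 30-minute rule above can be built into a polling loop so that a stalled task raises an error instead of looping forever. This is a sketch, not the official client: `wait_with_stall_check` is a hypothetical function, `fetch_status` stands for any callable that returns the status JSON as a dict, and time spent in a pending state before the first sample counts toward the stall timer.

```python
import time

STALL_LIMIT_SECONDS = 30 * 60  # "no progress for 30+ minutes"


def wait_with_stall_check(fetch_status, poll_seconds=10,
                          stall_limit=STALL_LIMIT_SECONDS,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll until SUCCEEDED or FAILED; raise TimeoutError if progress stalls."""
    last_completed = None
    last_change = clock()
    while True:
        result = fetch_status()
        if result["status"] in ("SUCCEEDED", "FAILED"):
            return result
        completed = result.get("completed_samples", 0)
        if completed != last_completed:
            # Progress was made, so restart the stall timer.
            last_completed = completed
            last_change = clock()
        elif clock() - last_change >= stall_limit:
            raise TimeoutError(
                f"No progress for {stall_limit} seconds at {completed} samples"
            )
        sleep(poll_seconds)
```

The `clock` and `sleep` parameters exist so the loop can be exercised in tests without waiting in real time.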
Problem: Generation fails
Common causes:
  • Specification format errors
  • Unsupported sample type for dataset
  • Model timeout or error
  • Invalid parameters
Solutions:
  • Check error message in status
  • Verify spec is READY status
  • Use long samples for structured data
  • Try with smaller batch first
Problem: Samples don’t match requirements
Solutions:
  1. Review and clarify specification
  2. Add more specific requirements
  3. Provide better seed data
  4. Try different model
  5. Adjust temperature
Problem: Not enough variation
Solutions:
  • Increase temperature (0.8-0.9)
  • Add more variation axes to spec
  • Ensure variation axes are clear
  • Generate larger batch
Problem: ZIP download fails
Solutions:
  • Verify generation completed (SUCCEEDED status)
  • Check disk space
  • Try retrieving specific samples instead
  • Use --output flag with curl
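
When a download looks wrong, it helps to check whether the saved file is a ZIP archive at all, since `curl --output` writes whatever body the server returned, including an error response. A minimal sketch using the standard library; `verify_zip` is a hypothetical helper.

```python
import zipfile


def verify_zip(path):
    """Return (ok, detail) describing whether path is a readable ZIP archive."""
    if not zipfile.is_zipfile(path):
        # The file may hold an error response instead of the archive.
        return False, "not a ZIP archive"
    with zipfile.ZipFile(path) as archive:
        bad_member = archive.testzip()  # first corrupt member, or None
        if bad_member is not None:
            return False, f"corrupt member: {bad_member}"
        return True, f"{len(archive.namelist())} entries"


# Example usage after downloading:
# print(verify_zip("generated_samples.zip"))
```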

Generation Best Practices

  • Start small: Test with 5-10 samples first
  • Review iteratively: Generate → Review → Refine → Repeat
  • Use appropriate sample type: Short for text, long for structured
  • Monitor progress: Check status periodically
  • Save run IDs: Keep track of successful generations
  • Test different settings: Try various temperatures and models

Batch Size Recommendations

Use Case         | Recommended Size | Sample Type
Quick test       | 5-10             | Short
Spec validation  | 20-50            | Short
Small dataset    | 50-100           | Either
Production data  | 100-500          | Either
Large dataset    | 500-1000         | Long
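
A target larger than the per-request limit of 1000 samples (see the parameter table in Step 2) has to be split across several requests. The sketch below shows one way to plan the split; `plan_batches` is a hypothetical helper, and issuing one generation request per batch is an assumption about how you would use it.

```python
MAX_PER_REQUEST = 1000  # upper bound of number_of_samples in the parameter table


def plan_batches(total_samples, batch_size=MAX_PER_REQUEST):
    """Split a target sample count into per-request batch sizes."""
    if total_samples < 1:
        raise ValueError("total_samples must be at least 1")
    if not 1 <= batch_size <= MAX_PER_REQUEST:
        raise ValueError(f"batch_size must be between 1 and {MAX_PER_REQUEST}")
    full, remainder = divmod(total_samples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


print(plan_batches(2500))     # → [1000, 1000, 500]
print(plan_batches(120, 50))  # → [50, 50, 20]
```

Smaller batch sizes also fit the "start small" advice: a failed or poor batch costs less to regenerate.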

Next Steps