Overview
This tutorial walks you through generating synthetic data samples using your specification. You’ll learn how to create generation requests, monitor progress, and retrieve results.
What You’ll Learn
- How to request sample generation
- Choosing between short and long samples
- Monitoring generation progress
- Retrieving and downloading samples
- Handling generation errors
Prerequisites
- A ready specification (see Generating a Specification)
- The specification ID from the previous tutorial
- API key for authentication
Step 1: Choose Sample Type
Decide between short and long samples:
Short Samples
Use for:
- Quick text (reviews, messages, posts)
- Simple documents
- Fast iteration and testing
Characteristics:
- Fast generation (seconds per sample)
- Typically 50-500 words
- Lower cost
Long Samples
Use for:
- Complex documents
- Detailed records
- Structured data (CSV, JSON)
- Multi-file samples
Characteristics:
- Slower generation (minutes per sample)
- Can run to several pages
- Higher quality and detail
Step 2: Submit Generation Request
Generate Short Samples
Generate Long Samples
With Advanced Options
| Parameter | Required | Default | Description |
|---|---|---|---|
| spec_id | Yes | - | Specification to use |
| number_of_samples | Yes | - | How many samples (1-1000) |
| sample_type | Yes | - | "short" or "long" |
| model | No | claude-sonnet-4-5 | LLM model to use |
| temperature | No | 0.7 | Creativity (0.0-1.0) |
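The parameter table can be turned into a small client-side check before anything is sent. The sketch below is illustrative only: the function name `build_generation_request` is hypothetical, and it assumes the request body is a flat JSON object keyed by the parameter names in the table. The ranges and defaults are the ones listed there.

```python
from typing import Any

VALID_SAMPLE_TYPES = {"short", "long"}


def build_generation_request(
    spec_id: str,
    number_of_samples: int,
    sample_type: str,
    model: str = "claude-sonnet-4-5",
    temperature: float = 0.7,
) -> dict[str, Any]:
    """Validate the documented parameters and return a payload dict.

    Hypothetical helper: the payload shape (a flat object) is an assumption.
    """
    if not spec_id:
        raise ValueError("spec_id is required")
    if not 1 <= number_of_samples <= 1000:
        raise ValueError("number_of_samples must be between 1 and 1000")
    if sample_type not in VALID_SAMPLE_TYPES:
        raise ValueError('sample_type must be "short" or "long"')
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    return {
        "spec_id": spec_id,
        "number_of_samples": number_of_samples,
        "sample_type": sample_type,
        "model": model,
        "temperature": temperature,
    }
```

Calling `build_generation_request("spec_123", 10, "short")` returns a payload with the default model and temperature filled in, while an out-of-range value raises `ValueError` before any request is made.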
Step 3: Monitor Generation Progress
Check Status
Python Monitoring Script
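A minimal polling sketch, assuming a caller-supplied `fetch_status` function (hypothetical) that returns the run's current status as a string. `SUCCEEDED` is the only terminal status named in this tutorial; `FAILED` is an assumed name for the failure case. The 30-minute default timeout echoes the support threshold listed under Common Issues and is not an API limit.

```python
import time
from typing import Callable

# SUCCEEDED appears in this tutorial; FAILED is an assumed status name.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED"}


def wait_for_generation(
    fetch_status: Callable[[], str],
    poll_interval: float = 10.0,
    timeout: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the run reaches a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(poll_interval)
```

Passing `fetch_status` in as a callable keeps the loop independent of how the status is actually fetched, so the same function works with any HTTP client and is easy to test with a fake.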
Generation time varies:
- Short samples: 5-30 seconds per sample
- Long samples: 2-10 minutes per sample
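For planning, the per-sample figures above can be scaled to a whole batch. This is a rough sketch that assumes samples are generated one after another; if the service generates samples in parallel, the real time will be shorter. The function name is hypothetical.

```python
# Per-sample time ranges in seconds, taken from the figures above.
PER_SAMPLE_SECONDS = {"short": (5, 30), "long": (120, 600)}


def estimate_generation_seconds(number_of_samples: int, sample_type: str) -> tuple[int, int]:
    """Return (best case, worst case) seconds, assuming sequential generation."""
    low, high = PER_SAMPLE_SECONDS[sample_type]
    return number_of_samples * low, number_of_samples * high
```

For example, 10 short samples come out at 50 to 300 seconds, and 5 long samples at 600 to 3000 seconds (10 to 50 minutes).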
Step 4: Retrieve Generated Samples
Download All Samples as ZIP
- All generated samples
- Metadata about the generation
- Folder structure (for MULTI_FOLDER datasets)
Extract and View
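One way to extract the archive and list its contents, using Python's standard `zipfile` module. The function name and the paths in the usage note are placeholders; the actual file and folder names inside the ZIP depend on your dataset.

```python
import zipfile
from pathlib import Path


def extract_samples(zip_path, dest_dir) -> list[Path]:
    """Extract the downloaded ZIP and return every extracted file path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # rglob walks subfolders too, which matters for multi-folder datasets.
    return sorted(p for p in dest.rglob("*") if p.is_file())
```

Calling `extract_samples("samples.zip", "samples")` and printing the returned paths gives a quick view of what was generated.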
Retrieve Specific Samples
Get individual samples without downloading everything:
Step 5: Review Sample Quality
Python Quality Check
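A sketch of an automated first pass, assuming the samples have already been loaded as a list of strings. It covers only the mechanical checks (exact duplicates and word count); the default 50-500 word range is the typical length quoted for short samples and should be adjusted for long ones. The function name and report keys are hypothetical.

```python
from collections import Counter


def quality_report(samples: list[str], min_words: int = 50, max_words: int = 500) -> dict:
    """Flag exact duplicates and samples outside the expected word range."""
    counts = Counter(s.strip() for s in samples)
    duplicates = [text for text, n in counts.items() if n > 1]
    out_of_range = [
        i for i, s in enumerate(samples)
        if not min_words <= len(s.split()) <= max_words
    ]
    return {
        "total": len(samples),
        "duplicate_texts": duplicates,
        "out_of_range_indices": out_of_range,
    }
```

Realism, coherence, and adherence to the specification still need a human reader; this only narrows down which samples to look at first.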
Manual Review Checklist
- Samples follow specification requirements
- Appropriate length and format
- No exact duplicates
- Realistic and coherent
- Proper structure (for structured data)
- Diverse across variation axes
Step 6: Regenerate if Needed
If samples don’t meet expectations:
Option 1: Adjust Temperature
Option 2: Update Specification
- Review problematic samples
- Identify what’s wrong
- Update specification with clearer requirements
- Generate new batch
Option 3: Try Different Model
Common Issues
Generation stuck or slow
Normal times:
- Short: 5-30 seconds per sample
- Long: 2-10 minutes per sample
Solutions:
- Check status for progress updates
- Expect slower processing under high load
- Allow more time for large batches
- Contact support if there is no progress for 30+ minutes
Generation fails
Common causes:
- Specification format errors
- Unsupported sample type for dataset
- Model timeout or error
- Invalid parameters
Solutions:
- Check the error message in the status
- Verify the spec is in READY status
- Use long samples for structured data
- Try a smaller batch first
Poor sample quality
Problem: Samples don’t match requirements
Solutions:
- Review and clarify the specification
- Add more specific requirements
- Provide better seed data
- Try a different model
- Adjust the temperature
Samples too similar
Problem: Not enough variation
Solutions:
- Increase temperature (0.8-0.9)
- Add more variation axes to the spec
- Ensure variation axes are clear
- Generate a larger batch
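To judge whether a change actually increased variation, it helps to put a number on similarity. The sketch below is one illustrative diagnostic and not part of the service: it averages the pairwise similarity ratio from Python's standard `difflib.SequenceMatcher`, where 1.0 means identical text and 0.0 means nothing in common. What counts as "too similar" is left to you.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(samples: list[str]) -> float:
    """Average SequenceMatcher ratio over every pair of samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 0.0  # Fewer than two samples: nothing to compare.
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)
```

Comparing the score for a batch generated at the default temperature with one generated at 0.8-0.9 shows whether the change had an effect. The number of comparisons grows quadratically with batch size, so this suits small test batches.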
Cannot download ZIP
Problem: ZIP download fails
Solutions:
- Verify generation completed (SUCCEEDED status)
- Check disk space
- Try retrieving specific samples instead
- Use the --output flag with curl
Generation Best Practices
- ✅ Start small: Test with 5-10 samples first
- ✅ Review iteratively: Generate → Review → Refine → Repeat
- ✅ Use appropriate sample type: Short for text, long for structured
- ✅ Monitor progress: Check status periodically
- ✅ Save run IDs: Keep track of successful generations
- ✅ Test different settings: Try various temperatures and models
Batch Size Recommendations
| Use Case | Recommended Size | Sample Type |
|---|---|---|
| Quick test | 5-10 | Short |
| Spec validation | 20-50 | Short |
| Small dataset | 50-100 | Either |
| Production data | 100-500 | Either |
| Large dataset | 500-1000 | Long |
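Because number_of_samples is capped at 1000 per request, a dataset larger than that has to be spread across several requests. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def split_into_requests(total_samples: int, max_per_request: int = 1000) -> list[int]:
    """Split a target sample count into per-request sizes within the limit."""
    if total_samples < 1:
        raise ValueError("total_samples must be at least 1")
    full, remainder = divmod(total_samples, max_per_request)
    sizes = [max_per_request] * full
    if remainder:
        sizes.append(remainder)
    return sizes
```

For a target of 2500 samples this yields three requests of 1000, 1000, and 500.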

