Overview
Sample generation is the process of creating new synthetic data based on your specification. Dataframer uses LLMs to generate samples that match the patterns, structure, and requirements defined in your spec while introducing controlled variation.

Sample Types
Dataframer supports two types of sample generation:

Short Samples
Generate brief, focused samples quickly. Characteristics:
- Faster generation (seconds per sample)
- Suitable for text snippets and short documents
- Typically 50-500 words
- Lower token usage

Use cases:
- Customer reviews
- Social media posts
- Short messages or emails
- Quick prototyping
Long Samples
Generate comprehensive, detailed samples. Characteristics:
- Slower generation (minutes per sample)
- Suitable for complex documents
- Can be several pages long
- Higher quality and detail
- Required for structured data

Use cases:
- Medical records
- Legal documents
- Technical reports
- Structured data (CSV, JSON)
- Multi-file samples
Generation Process
1. Submit Generation Request: specify your spec ID, number of samples, and sample type.
2. Task Queued: your request is queued for distributed processing.
3. Parallel Generation: samples are generated in parallel across multiple workers for speed.
4. Quality Check: generated samples are validated against requirements.
5. Results Available: download completed samples as a ZIP file.
Making a Generation Request
Basic Request
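A minimal request can be sketched in Python. The base URL, the `/api/generations` path, and bearer-token auth are assumptions (check your instance's API reference); the body fields follow the parameter table below.

```python
import json
import urllib.request

API_BASE = "https://api.dataframer.example"  # assumption: replace with your instance URL


def build_generation_request(spec_id: str, number_of_samples: int,
                             sample_type: str) -> dict:
    """Build the minimal request body: the three required parameters."""
    if sample_type not in ("short", "long"):
        raise ValueError('sample_type must be "short" or "long"')
    if not 1 <= number_of_samples <= 1000:
        raise ValueError("number_of_samples must be between 1 and 1000")
    return {
        "spec_id": spec_id,
        "number_of_samples": number_of_samples,
        "sample_type": sample_type,
    }


def submit_generation(api_key: str, body: dict) -> bytes:
    """POST the request (hypothetical endpoint path and auth scheme)."""
    req = urllib.request.Request(
        f"{API_BASE}/api/generations",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

The validation mirrors the constraints in the parameter table, so malformed requests fail locally before any network call.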
Advanced Options
| Parameter | Type | Required | Description |
|---|---|---|---|
| `spec_id` | string | Yes | ID of the specification to use |
| `number_of_samples` | integer | Yes | Number of samples to generate (1-1000) |
| `sample_type` | string | Yes | `"short"` or `"long"` |
| `model` | string | No | LLM model to use (default: `claude-sonnet-4-5`) |
| `temperature` | float | No | Creativity level 0.0-1.0 (default: 0.7) |
| `spec_version_id` | string | No | Specific version of the spec to use |
Monitoring Generation
Check Status
Poll the status endpoint to monitor progress:

- `PENDING`: Request queued, not yet started
- `RUNNING`: Generation in progress
- `SUCCEEDED`: All samples generated successfully
- `FAILED`: Generation encountered an error
- `CANCELED`: User canceled the generation
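A simple polling loop over these states might look like the following; the status fetch is abstracted as a callable because the exact endpoint path is not shown here.

```python
import time

# The three terminal states; PENDING and RUNNING mean "keep polling".
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def poll_until_done(get_status, interval_s: float = 5.0,
                    timeout_s: float = 3600.0) -> dict:
    """Poll a status-fetching callable until a terminal state is reached.

    `get_status` is any zero-argument callable returning the status dict
    (e.g. a wrapper around your GET-status request).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status["status"] in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("generation did not finish within timeout")
        time.sleep(interval_s)
```

Because the callable is injected, the loop works unchanged with any HTTP client.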
Progress Tracking
The status endpoint provides real-time progress:

- `progress`: Percentage complete (0-100)
- `completed_samples`: Number of samples finished
- `estimated_time_remaining`: Seconds until completion (approximate)

Retrieving Generated Samples
Download All Samples
Once status is `SUCCEEDED`, download the complete results. The ZIP file contains:
- All generated sample files
- Metadata about the generation run
- Folder structure (for MULTI_FOLDER datasets)
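Once the ZIP bytes are downloaded, unpacking them is standard-library work; this sketch assumes nothing about the file names inside the archive.

```python
import io
import zipfile


def extract_samples(zip_bytes: bytes, dest_dir: str) -> list[str]:
    """Unpack the downloaded results ZIP and return the archive's file names.

    The archive layout (sample files, metadata, optional folder structure)
    follows the description above.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()
```

Returning `namelist()` lets callers iterate the extracted samples without re-reading the directory.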
Retrieve Specific Samples
Get individual samples by index:

Distributed Generation
Dataframer uses distributed processing for efficient generation.

How it works:
- Generation request is split into individual tasks (one per sample)
- Tasks are distributed across multiple workers via SQS
- Workers generate samples in parallel
- Results aggregate in real-time
- Complete dataset available when all samples finish

Benefits:
- Fast generation (10 workers can be up to 10x faster)
- Fault tolerance (failed samples automatically retry)
- Scalability (add more workers for more throughput)
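Locally, the same fan-out-with-retry pattern can be sketched with a thread pool standing in for the SQS workers (names illustrative):

```python
from concurrent.futures import ThreadPoolExecutor


def generate_all(sample_ids, generate_one, max_workers: int = 10,
                 max_retries: int = 2) -> list:
    """Fan individual sample tasks out to a worker pool, retrying failures.

    `generate_one` is any callable that produces one sample; a failed task
    is retried up to `max_retries` times, mirroring the automatic retry
    behavior described above.
    """
    def attempt(sample_id):
        for attempt_no in range(max_retries + 1):
            try:
                return generate_one(sample_id)
            except Exception:
                if attempt_no == max_retries:
                    raise  # exhausted retries: surface the failure
    # map() preserves input order, so results line up with sample_ids.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(attempt, sample_ids))
```

In the hosted service the queue and workers are managed for you; this local version is only a model of the flow.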
Generation time depends on available workers and sample complexity. Large batches may take several hours.
Generation Models
Choose from multiple LLM models based on your needs:

| Model | Speed | Quality | Cost | Best For |
|---|---|---|---|---|
| `anthropic/claude-haiku-4-5` | Fast | Good | Low | Quick iterations, simple data |
| `anthropic/claude-sonnet-4-5` | Medium | Excellent | Medium | Production quality (default) |
| `gemini/gemini-2.5-pro` | Medium | Excellent | Medium | Complex reasoning |
| `openai/gpt-4.1` | Slow | Excellent | High | Maximum quality |
Temperature Setting
Control generation creativity with the `temperature` parameter:

- 0.0 - 0.3: Deterministic, consistent, conservative. Best for: structured data, factual content
- 0.4 - 0.7: Best for: most use cases, general purpose
- 0.8 - 1.0: Best for: creative writing, diverse examples
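The bands above can be captured in a small helper; the representative value within each band is an illustrative choice, with 0.7 matching the API default.

```python
def temperature_for(use_case: str) -> float:
    """Suggest a temperature from the bands described above."""
    bands = {
        "structured": 0.2,  # low band: deterministic, consistent output
        "general": 0.7,     # middle band: the API default
        "creative": 0.9,    # high band: diverse, creative output
    }
    try:
        return bands[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case}") from None
```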
Automatic Evaluation
After generation completes, Dataframer automatically triggers evaluation to assess sample quality. Learn more in the Evaluation guide.

Best Practices
Batch Size
- Small batches (5-10): Quick turnaround for testing
- Medium batches (10-50): Balance of speed and volume
- Large batches (50-1000): Production data generation
Sample Type Selection
- Use short samples for: Quick text, reviews, messages
- Use long samples for: Documents, structured data, complex formats
- When unsure: Try short samples first (faster feedback)
Model Selection
- Start with the default model (claude-sonnet-4-5)
- Try Haiku for faster prototyping
- Use Gemini or GPT-4.1 for complex requirements
- Consider cost vs. quality tradeoffs
Handling Failures
- Check the error message in the status response
- Verify specification is valid
- Ensure model supports your data format
- Try reducing batch size
- Contact support if issues persist
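One way to act on "try reducing batch size" is to halve the batch after each failure; `submit` here is a placeholder for your own submit-and-wait wrapper.

```python
def generate_with_backoff(submit, number_of_samples: int,
                          min_batch: int = 5) -> int:
    """Retry a failed generation with progressively smaller batch sizes.

    `submit` takes a batch size and returns the final status string
    ("SUCCEEDED" or "FAILED"); on failure the batch is halved until it
    drops below `min_batch`.
    """
    n = number_of_samples
    while n >= min_batch:
        if submit(n) == "SUCCEEDED":
            return n  # the batch size that worked
        n //= 2
    raise RuntimeError("generation kept failing; contact support")
```

Halving converges quickly: a 1,000-sample request reaches the minimum batch in at most eight attempts.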
Generation Limits
Current limits per request:

- Maximum samples: 1,000 per generation request
- Maximum request size: 10 MB
- Concurrent generations: 5 per user
- Timeout: 24 hours per generation task
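These limits can be checked client-side before submitting, failing fast instead of waiting for a server rejection; the constant and function names are illustrative.

```python
# Per-user limits, as documented above.
LIMITS = {
    "max_samples": 1000,
    "max_request_bytes": 10 * 1024 * 1024,  # 10 MB
    "max_concurrent": 5,
}


def check_request_limits(number_of_samples: int, request_bytes: int,
                         active_generations: int) -> None:
    """Raise ValueError if a request would exceed the documented limits."""
    if number_of_samples > LIMITS["max_samples"]:
        raise ValueError("at most 1,000 samples per generation request")
    if request_bytes > LIMITS["max_request_bytes"]:
        raise ValueError("request body exceeds 10 MB")
    if active_generations >= LIMITS["max_concurrent"]:
        raise ValueError("at most 5 concurrent generations per user")
```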
Need higher limits? Contact support for enterprise options.

