Overview
Sample generation is the process of creating new synthetic data based on your specification. Dataframer uses LLMs to generate samples that match the patterns, structure, and requirements defined in your spec while introducing controlled variation.

Sample Types
Dataframer supports two types of sample generation:

Short Samples
Generate brief, focused samples quickly. Characteristics:
- Faster generation (seconds per sample)
- Suitable for text snippets and short documents
- Typically 50-500 words
- Lower token usage

Use cases:
- Customer reviews
- Social media posts
- Short messages or emails
- Quick prototyping
Long Samples
Generate comprehensive, detailed samples. Characteristics:
- Slower generation (minutes per sample)
- Suitable for complex documents
- Can be several pages long
- Higher quality and detail
- Required for structured data

Use cases:
- Medical records
- Legal documents
- Technical reports
- Structured data (CSV, JSON)
- Multi-file samples
Generation Process
1. Submit Generation Request: specify your spec ID, number of samples, and sample type.
2. Task Queued: your request is queued for distributed processing.
3. Parallel Generation: samples are generated in parallel across multiple workers for speed.
4. Quality Check: generated samples are validated against requirements.
5. Results Available: download completed samples as a ZIP file.
Making a Generation Request
Basic Request
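A minimal request can be sketched in Python. The base URL, the `/api/generations` path, and bearer-token auth are assumptions (check your instance's API reference); the body fields follow the parameter table below.

```python
import json
import urllib.request

API_BASE = "https://api.dataframer.example"  # assumption: replace with your instance URL


def build_generation_request(spec_id: str, number_of_samples: int,
                             sample_type: str) -> dict:
    """Build the minimal request body: the three required parameters."""
    if sample_type not in ("short", "long"):
        raise ValueError('sample_type must be "short" or "long"')
    if not 1 <= number_of_samples <= 1000:
        raise ValueError("number_of_samples must be between 1 and 1000")
    return {
        "spec_id": spec_id,
        "number_of_samples": number_of_samples,
        "sample_type": sample_type,
    }


def submit_generation(api_key: str, body: dict) -> bytes:
    """POST the request (hypothetical endpoint path and auth scheme)."""
    req = urllib.request.Request(
        f"{API_BASE}/api/generations",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

The validation mirrors the constraints in the parameter table, so malformed requests fail locally before any network call.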
Advanced Options
| Parameter | Type | Required | Description |
|---|---|---|---|
| `spec_id` | string | Yes | ID of the specification to use |
| `number_of_samples` | integer | Yes | Number of samples to generate (1-1000) |
| `sample_type` | string | Yes | `"short"` or `"long"` |
| `model` | string | No | LLM model to use (default: `claude-sonnet-4-5`) |
| `temperature` | float | No | Creativity level 0.0-1.0 (default: 0.7) |
| `spec_version_id` | string | No | Specific version of the spec to use |
Monitoring Generation
Check Status
Poll the status endpoint to monitor progress:

- `PENDING`: Request queued, not yet started
- `RUNNING`: Generation in progress
- `SUCCEEDED`: All samples generated successfully
- `FAILED`: Generation encountered an error
- `CANCELED`: User canceled the generation
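A simple polling loop over these states might look like the following; the status fetch is abstracted as a callable because the exact endpoint path is not shown here.

```python
import time

# The three terminal states; PENDING and RUNNING mean "keep polling".
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELED"}


def poll_until_done(get_status, interval_s: float = 5.0,
                    timeout_s: float = 3600.0) -> dict:
    """Poll a status-fetching callable until a terminal state is reached.

    `get_status` is any zero-argument callable returning the status dict
    (e.g. a wrapper around your GET-status request).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status["status"] in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("generation did not finish within timeout")
        time.sleep(interval_s)
```

Because the callable is injected, the loop works unchanged with any HTTP client.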
Progress Tracking
The status endpoint provides real-time progress:

- `progress`: Percentage complete (0-100)
- `completed_samples`: Number of samples finished
- `estimated_time_remaining`: Seconds until completion (approximate)

Retrieving Generated Samples
Download All Samples
Once status is `SUCCEEDED`, download the complete results. The ZIP file contains:
- All generated sample files
- Metadata about the generation run
- Folder structure (for MULTI_FOLDER datasets)
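Once the ZIP bytes are downloaded, unpacking them is standard-library work; this sketch assumes nothing about the file names inside the archive.

```python
import io
import zipfile


def extract_samples(zip_bytes: bytes, dest_dir: str) -> list[str]:
    """Unpack the downloaded results ZIP and return the archive's file names.

    The archive layout (sample files, metadata, optional folder structure)
    follows the description above.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()
```

Returning `namelist()` lets callers iterate the extracted samples without re-reading the directory.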
Retrieve Specific Samples
Get individual samples by index:

Distributed Generation
Dataframer uses distributed processing for efficient generation.

How it works:
- Generation request is split into individual tasks (one per sample)
- Tasks are distributed across multiple workers via SQS
- Workers generate samples in parallel
- Results aggregate in real-time
- Complete dataset available when all samples finish

Benefits:
- Fast generation (10 workers can be up to 10x faster)
- Fault tolerance (failed samples automatically retry)
- Scalability (add more workers for more throughput)
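Locally, the same fan-out-with-retry pattern can be sketched with a thread pool standing in for the SQS workers (names illustrative):

```python
from concurrent.futures import ThreadPoolExecutor


def generate_all(sample_ids, generate_one, max_workers: int = 10,
                 max_retries: int = 2) -> list:
    """Fan individual sample tasks out to a worker pool, retrying failures.

    `generate_one` is any callable that produces one sample; a failed task
    is retried up to `max_retries` times, mirroring the automatic retry
    behavior described above.
    """
    def attempt(sample_id):
        for attempt_no in range(max_retries + 1):
            try:
                return generate_one(sample_id)
            except Exception:
                if attempt_no == max_retries:
                    raise  # exhausted retries: surface the failure
    # map() preserves input order, so results line up with sample_ids.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(attempt, sample_ids))
```

In the hosted service the queue and workers are managed for you; this local version is only a model of the flow.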
Generation time depends on available workers and sample complexity. Large batches may take several hours.
Generation Models
Choose from multiple LLM models based on your needs:

| Model | Speed | Quality | Cost | Best For |
|---|---|---|---|---|
| `anthropic/claude-haiku-4-5` | Fast | Good | Low | Quick iterations, simple data |
| `anthropic/claude-sonnet-4-5` | Medium | Excellent | Medium | Production quality (default) |
| `gemini/gemini-2.5-pro` | Medium | Excellent | Medium | Complex reasoning |
| `openai/gpt-4.1` | Slow | Excellent | High | Maximum quality |
Temperature Setting
Control generation creativity with the `temperature` parameter:

- 0.0 - 0.3: Deterministic, consistent, conservative. Best for: structured data, factual content
- 0.4 - 0.7: Best for: most use cases, general purpose
- 0.8 - 1.0: Best for: creative writing, diverse examples
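The bands above can be captured in a small helper; the representative value within each band is an illustrative choice, with 0.7 matching the API default.

```python
def temperature_for(use_case: str) -> float:
    """Suggest a temperature from the bands described above."""
    bands = {
        "structured": 0.2,  # low band: deterministic, consistent output
        "general": 0.7,     # middle band: the API default
        "creative": 0.9,    # high band: diverse, creative output
    }
    try:
        return bands[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case}") from None
```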
Automatic Evaluation
After generation completes, Dataframer automatically triggers evaluation to assess sample quality. Learn more in the Evaluation guide.

Best Practices
Batch Size
- Small batches (5-10): Quick turnaround for testing
- Medium batches (10-50): Balance of speed and volume
- Large batches (50-1000): Production data generation
Sample Type Selection
- Use short samples for: Quick text, reviews, messages
- Use long samples for: Documents, structured data, complex formats
- When unsure: Try short samples first (faster feedback)
Model Selection
- Start with the default model (claude-sonnet-4-5)
- Try Haiku for faster prototyping
- Use Gemini or GPT-4.1 for complex requirements
- Consider cost vs. quality tradeoffs
Handling Failures
- Check the error message in the status response
- Verify specification is valid
- Ensure model supports your data format
- Try reducing batch size
- Contact support if issues persist
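One way to act on "try reducing batch size" is to halve the batch after each failure; `submit` here is a placeholder for your own submit-and-wait wrapper.

```python
def generate_with_backoff(submit, number_of_samples: int,
                          min_batch: int = 5) -> int:
    """Retry a failed generation with progressively smaller batch sizes.

    `submit` takes a batch size and returns the final status string
    ("SUCCEEDED" or "FAILED"); on failure the batch is halved until it
    drops below `min_batch`.
    """
    n = number_of_samples
    while n >= min_batch:
        if submit(n) == "SUCCEEDED":
            return n  # the batch size that worked
        n //= 2
    raise RuntimeError("generation kept failing; contact support")
```

Halving converges quickly: a 1,000-sample request reaches the minimum batch in at most eight attempts.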
Generation Limits
Current limits per request:

- Maximum samples: 1,000 per generation request
- Maximum request size: 10 MB
- Concurrent generations: 5 per user
- Timeout: 24 hours per generation task
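These limits can be checked client-side before submitting, failing fast instead of waiting for a server rejection; the constant and function names are illustrative.

```python
# Per-user limits, as documented above.
LIMITS = {
    "max_samples": 1000,
    "max_request_bytes": 10 * 1024 * 1024,  # 10 MB
    "max_concurrent": 5,
}


def check_request_limits(number_of_samples: int, request_bytes: int,
                         active_generations: int) -> None:
    """Raise ValueError if a request would exceed the documented limits."""
    if number_of_samples > LIMITS["max_samples"]:
        raise ValueError("at most 1,000 samples per generation request")
    if request_bytes > LIMITS["max_request_bytes"]:
        raise ValueError("request body exceeds 10 MB")
    if active_generations >= LIMITS["max_concurrent"]:
        raise ValueError("at most 5 concurrent generations per user")
```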
Need higher limits? Contact support for enterprise options.

