What is a Specification?
A specification (or “spec”) is a detailed description that captures the structure, patterns, and requirements of your data. Dataframer automatically generates specifications by analyzing your seed data, identifying key characteristics like:
- Data structure and schema
- Field types and formats
- Content patterns and styles
- Constraints and relationships
- Required properties
How Specifications Work
1. Analysis Phase
   Dataframer analyzes your dataset using LLMs to understand patterns, structure, and requirements.
2. Specification Generation
   The analysis produces a YAML specification that captures all identified patterns and requirements.
3. Review & Customize
   You can review and edit the specification to refine requirements or add constraints.
4. Sample Generation
   The specification guides the LLM in generating new samples that conform to your requirements.
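To make the hand-off between specification generation and sample generation more concrete, here is a minimal sketch that models a spec as a plain Python dictionary and turns it into prompt text for one sample. The keys `requirements` and `variation_axes`, the example values, and the `render_prompt` helper are all hypothetical and chosen only for illustration; they are not Dataframer's actual YAML schema or API.

```python
# Hypothetical in-memory view of a specification. The keys and values are
# illustrative assumptions, not Dataframer's actual YAML schema.
spec = {
    "requirements": [
        "Each sample is a customer support email",
        "Each sample includes a greeting and a sign-off",
    ],
    "variation_axes": {
        "tone": ["formal", "casual"],
        "length": ["short", "long"],
    },
}


def render_prompt(spec: dict, choices: dict) -> str:
    """Combine the spec with one chosen value per variation axis."""
    for axis, value in choices.items():
        if value not in spec["variation_axes"].get(axis, []):
            raise ValueError(f"{value!r} is not a valid value for axis {axis!r}")
    lines = ["Requirements:"]
    lines += [f"- {req}" for req in spec["requirements"]]
    lines.append("Variation for this sample:")
    lines += [f"- {axis}: {value}" for axis, value in choices.items()]
    return "\n".join(lines)


prompt = render_prompt(spec, {"tone": "formal", "length": "short"})
print(prompt)
```

The point of the sketch is only that requirements apply to every sample, while each sample picks one value per variation axis.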
Specification Structure
A typical specification includes:
- Requirements: Core requirements that all generated samples must meet
- Data Property Variations: Axes of variation that define how samples should differ
Creating Specifications
Automatic Analysis
Generate a specification from an existing dataset. Approximate analysis time by dataset size:
- Small datasets (< 10 samples): 2-3 minutes
- Medium datasets (10-50 samples): 3-5 minutes
- Large datasets (> 50 samples): 5-10 minutes
Check Analysis Status
- PENDING: Analysis queued
- RUNNING: Analysis in progress
- COMPLETED: Specification ready
- FAILED: Analysis encountered an error
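Because analysis can take several minutes, a client would typically poll until one of the terminal statuses above is reached. The following is a minimal sketch of such a loop. The `fetch_status` callable is a hypothetical stand-in for whatever call returns the current status string, and the polling interval and timeout are assumed values, not documented defaults.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def wait_for_analysis(
    fetch_status: Callable[[], str],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 900.0,
) -> str:
    """Poll until the analysis reaches COMPLETED or FAILED, or time out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"analysis still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Stand-in for a real status lookup: replays a fixed sequence of statuses.
fake_statuses = iter(["PENDING", "RUNNING", "RUNNING", "COMPLETED"])
final_status = wait_for_analysis(lambda: next(fake_statuses), poll_seconds=0.0)
print(final_status)
```

Passing the status lookup in as a callable keeps the loop independent of any particular client library and makes it easy to test with a fake.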
Specification Versions
Dataframer maintains version history for specifications:
- Automatic Versioning: Each edit creates a new version
- Version Tracking: View and compare different versions
- Rollback: Revert to previous versions if needed
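The versioning behavior listed above can be pictured with a toy in-memory model. This is only a sketch: the `SpecHistory` class and its methods are hypothetical, and treating a rollback as "re-save the old content as the newest version" is an assumption made here for illustration, not a statement about how Dataframer stores versions.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class SpecHistory:
    """Toy in-memory model of a versioned specification (illustrative only)."""

    versions: list = field(default_factory=list)

    def edit(self, content: str) -> int:
        """Store an edit as a new version and return its 1-based number."""
        self.versions.append(content)
        return len(self.versions)

    def get(self, number: int) -> str:
        if not 1 <= number <= len(self.versions):
            raise IndexError(f"no version {number}")
        return self.versions[number - 1]

    def diff(self, old: int, new: int) -> list:
        """Compare two versions line by line in unified diff format."""
        return list(
            difflib.unified_diff(
                self.get(old).splitlines(), self.get(new).splitlines(), lineterm=""
            )
        )

    def rollback(self, number: int) -> int:
        """Assumed behavior: re-save an old version as the newest one."""
        return self.edit(self.get(number))


history = SpecHistory()
history.edit("tone: formal")
history.edit("tone: casual")
changes = history.diff(1, 2)
latest = history.rollback(1)
print(latest, history.get(latest))
```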
Creating a New Version
Edit a specification to create a new version.
List All Versions
Customizing Specifications
You can customize specifications to refine generation behavior.
Modify Requirements
Add or update requirements to enforce specific constraints.
Add Variation Axes
Create more diverse samples by adding variation axes.
Specify Output Format
For structured data, define the exact output format.
Supported Models
Dataframer supports multiple LLMs for specification generation:

| Model | Provider | Best For |
|---|---|---|
| anthropic/claude-sonnet-4-5 | Anthropic | Balanced quality and speed (default) |
| anthropic/claude-haiku-4-5 | Anthropic | Fast generation |
| gemini/gemini-2.5-pro | Google | Complex reasoning |
| openai/gpt-4.1 | OpenAI | Alternative quality option |
Different models may produce different specification styles. Choose based on your quality and speed requirements.
Best Practices
Clear Requirements
- Be specific about mandatory fields and formats
- Define constraints explicitly
- Include examples for complex requirements
- Specify acceptable ranges for numeric values
Effective Variations
- Define 3-5 meaningful variation axes
- Keep variation values distinct and clear
- Cover important dimensions of diversity
- Avoid too many axes (causes combinatorial explosion)
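To see why too many axes become a problem, it helps to count the distinct combinations, which is the product of the number of values on each axis. The sketch below uses made-up axis names and values; whether Dataframer samples the full cross-product of axes is not stated here, so treat the count as an upper bound on the distinct combinations a spec can describe.

```python
import math
from itertools import product


def count_combinations(axes: dict) -> int:
    """Number of distinct ways to pick one value from every axis."""
    return math.prod(len(values) for values in axes.values())


# Hypothetical variation axes, for illustration only.
axes = {
    "tone": ["formal", "casual", "urgent"],
    "length": ["short", "medium", "long"],
    "audience": ["novice", "expert"],
}

total = count_combinations(axes)
combinations = [dict(zip(axes, combo)) for combo in product(*axes.values())]
print(total, len(combinations))

# Adding two more three-valued axes multiplies the count by nine.
bigger = dict(axes, domain=["billing", "shipping", "returns"], channel=["email", "chat", "phone"])
print(count_combinations(bigger))
```

With three axes there are 18 combinations; with five there are 162, which a small batch of samples cannot cover evenly.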
Iterative Refinement
- Start with automatic analysis
- Generate a small batch (5-10 samples)
- Review results and identify issues
- Update specification and regenerate
- Repeat until satisfied
Structured vs Unstructured
- For structured data (CSV, JSON): Define exact schema
- For unstructured data (text, documents): Focus on content patterns and style
- For mixed datasets: Specify requirements for each file type
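For the structured case, one way to think about an "exact schema" is as a mapping from field names to expected types that every generated record must match. The sketch below checks a JSON record against such a mapping. The `SCHEMA` contents and the `schema_errors` helper are hypothetical, and the type check is deliberately simple (for example, it does not accept an integer where a float is expected).

```python
import json

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"order_id": int, "customer": str, "total": float}


def schema_errors(record: dict, schema: dict) -> list:
    """Return a list of human-readable mismatches between record and schema."""
    errors = []
    for name, expected in schema.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {actual}")
    for name in sorted(record.keys() - schema.keys()):
        errors.append(f"unexpected field: {name}")
    return errors


good = json.loads('{"order_id": 17, "customer": "Ada", "total": 9.5}')
bad = json.loads('{"order_id": 17, "total": "9.50", "note": "rush"}')
print(schema_errors(good, SCHEMA))
print(schema_errors(bad, SCHEMA))
```

A check like this is useful when reviewing a small generated batch, since schema violations point to requirements that should be stated more explicitly in the specification.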
Specification Status
Specifications have three possible statuses:
- PROCESSING: Analysis is in progress
- READY: Specification is complete and can be used for generation
- FAILED: Analysis encountered an error
Only specifications with READY status can be used to generate samples.
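A client can guard sample generation with a small status check. The following sketch models the three statuses as an enum; the `SpecStatus` and `can_generate` names are hypothetical, and the status is assumed to arrive as one of the plain strings named above.

```python
from enum import Enum


class SpecStatus(Enum):
    PROCESSING = "PROCESSING"
    READY = "READY"
    FAILED = "FAILED"


def can_generate(status: str) -> bool:
    """True only for READY; unknown status strings raise ValueError."""
    return SpecStatus(status) is SpecStatus.READY


for name in ("PROCESSING", "READY", "FAILED"):
    print(name, can_generate(name))
```

Parsing the string into an enum first means a misspelled or unexpected status fails loudly instead of being treated as "not ready".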

