Specifications

What is a Specification?

A specification (or “spec”) is a detailed description that captures the structure, patterns, and requirements of your data. Dataframer automatically generates specifications by analyzing your seed data, identifying key characteristics like:

Data structure and schema
Field types and formats
Content patterns and styles
Constraints and relationships
Required properties

Specifications serve as blueprints for generating new synthetic samples that match your original data’s characteristics.

How Specifications Work

Analysis Phase

Dataframer analyzes your dataset using LLMs to understand patterns, structure, and requirements.

Specification Generation

The analysis produces a YAML specification that captures all identified patterns and requirements.

Review & Customize

You can review and edit the specification to refine requirements or add constraints.

Sample Generation

The specification guides the LLM in generating new samples that conform to your requirements.

Specification Structure

A typical specification includes:

Requirements

Core requirements that all generated samples must meet:

specification:
  requirements: |
    - Customer reviews for a product or service
    - Include rating (1-5 stars)
    - Include review text (50-200 words)
    - Include customer name and date
    - Maintain authentic customer voice
    - Mix of positive, neutral, and negative sentiment

Data Property Variations

Axes of variation that define how samples should differ:

specification:
  data_property_variations:
    - axis_name: "Sentiment"
      axis_description: "Emotional tone of the review"
      possible_values:
        - "Highly positive"
        - "Slightly positive"
        - "Neutral"
        - "Slightly negative"
        - "Highly negative"
    
    - axis_name: "Product Category"
      axis_description: "Type of product being reviewed"
      possible_values:
        - "Electronics"
        - "Clothing"
        - "Home goods"
        - "Food and beverage"

Data property variations ensure generated samples cover diverse scenarios rather than being repetitive.
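To make the coverage idea concrete, the following Python sketch enumerates every combination of the two axes from the example above. It only illustrates the cross product of axis values. How Dataframer actually samples from the axes is not described here, and `enumerate_combinations` is a hypothetical helper, not part of the product.

```python
from itertools import product

# Axes copied from the example specification above.
axes = {
    "Sentiment": [
        "Highly positive", "Slightly positive", "Neutral",
        "Slightly negative", "Highly negative",
    ],
    "Product Category": [
        "Electronics", "Clothing", "Home goods", "Food and beverage",
    ],
}

def enumerate_combinations(axes):
    """Return one dict per combination of axis values (the cross product)."""
    names = list(axes)
    return [dict(zip(names, values)) for values in product(*axes.values())]

combinations = enumerate_combinations(axes)
print(len(combinations))  # 5 sentiments x 4 categories = 20
print(combinations[0])
```

Each dict describes one scenario a generated sample could cover, for example a highly positive review of an electronics product.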

Creating Specifications

Automatic Analysis

Generate a specification from an existing dataset:

curl -X POST 'https://df-api.dataframer.ai/api/dataframer/analyze/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
    "name": "Customer Review Spec",
    "model": "anthropic/claude-sonnet-4-5"
  }'

Analysis Time:

Small datasets (< 10 samples): 2-3 minutes
Medium datasets (10-50 samples): 3-5 minutes
Large datasets (> 50 samples): 5-10 minutes
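The same request can be built from Python. This sketch uses only the standard library and mirrors the curl example field for field. `build_analyze_request` and `API_BASE` are illustrative names, not part of any Dataframer SDK, and the request is constructed but not sent.

```python
import json
import urllib.request

API_BASE = "https://df-api.dataframer.ai/api/dataframer"

def build_analyze_request(api_key, dataset_id, name, model):
    """Build (but do not send) the POST request from the curl example."""
    payload = {"dataset_id": dataset_id, "name": name, "model": model}
    return urllib.request.Request(
        url=f"{API_BASE}/analyze/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_analyze_request(
    api_key="YOUR_API_KEY",
    dataset_id="550e8400-e29b-41d4-a716-446655440000",
    name="Customer Review Spec",
    model="anthropic/claude-sonnet-4-5",
)
# To send it: urllib.request.urlopen(request). The response body of this
# endpoint is not documented in this section, so it is not parsed here.
```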

Seedless Specifications

Create specifications without any seed data by describing what you want to generate:

Navigate to Specifications → Create Spec
Select the Seedless tab
Enter a Spec Name
Write Generation Objectives describing the data you want (required)
Click Create Spec

The generation objectives should clearly describe:

What type of data to generate (e.g., “customer support conversations”, “legal contracts”)
Key characteristics and structure
Any specific requirements or constraints

curl -X POST 'https://df-api.dataframer.ai/api/dataframer/analyze/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Customer Support Conversations",
    "model": "anthropic/claude-sonnet-4-5",
    "analysis_objectives": "Multi-turn customer support conversations about software products. Include technical troubleshooting, billing inquiries, and feature requests. Conversations should have 4-8 turns with realistic back-and-forth dialogue."
  }'

Note: Omit dataset_id to create a seedless specification.

Seedless generation works best when your objectives are detailed and specific. The more context you provide about the desired data structure and content, the better the resulting specification.
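A small Python helper can enforce the two rules stated above before the request is sent: the objectives must be present, and `dataset_id` must be absent. `build_seedless_payload` is a hypothetical name, and the empty-string check is a client-side assumption rather than documented server behavior.

```python
def build_seedless_payload(name, model, analysis_objectives):
    """Build the JSON body for a seedless /analyze/ request."""
    objectives = analysis_objectives.strip()
    if not objectives:
        raise ValueError("analysis_objectives is required for a seedless spec")
    # No "dataset_id" key: omitting it is what makes the request seedless.
    return {"name": name, "model": model, "analysis_objectives": objectives}

payload = build_seedless_payload(
    name="Customer Support Conversations",
    model="anthropic/claude-sonnet-4-5",
    analysis_objectives=(
        "Multi-turn customer support conversations about software products. "
        "Conversations should have 4-8 turns with realistic dialogue."
    ),
)
print(sorted(payload))  # ['analysis_objectives', 'model', 'name']
```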

Check Analysis Status

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/analyze/status/TASK_ID' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Response:

{
  "task_id": "analyze_abc123",
  "status": "COMPLETED",
  "spec_id": "spec_xyz789",
  "progress": 100,
  "error": null
}

Possible Statuses:

PENDING: Analysis queued
RUNNING: Analysis in progress
COMPLETED: Specification ready
FAILED: Analysis encountered an error
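Because analysis takes minutes, a client typically polls the status endpoint until it reports a terminal status. The sketch below is one way to do that in Python. The function name, the polling interval, and the injected `fetch_status` callable are all assumptions made for illustration, and the stubbed responses stand in for real HTTP calls so the example runs offline.

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}

def wait_for_analysis(fetch_status, task_id, interval_s=10.0, max_polls=90,
                      sleep=time.sleep):
    """Poll until the task is COMPLETED or FAILED, then return that response.

    fetch_status is any callable that takes a task ID and returns the parsed
    JSON response of the status endpoint as a dict.
    """
    for attempt in range(max_polls):
        response = fetch_status(task_id)
        if response["status"] in TERMINAL_STATUSES:
            return response
        if attempt < max_polls - 1:
            sleep(interval_s)
    raise TimeoutError(f"task {task_id} not finished after {max_polls} polls")

# Stubbed responses in place of the real HTTP call.
fake_responses = iter([
    {"status": "PENDING"},
    {"status": "RUNNING"},
    {"status": "COMPLETED", "spec_id": "spec_xyz789"},
])
result = wait_for_analysis(
    lambda task_id: next(fake_responses),
    "analyze_abc123",
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(result["spec_id"])  # spec_xyz789
```

The defaults (90 polls, 10 seconds apart) allow about 15 minutes, which leaves headroom over the 5-10 minute estimate for large datasets. A caller should still check whether the returned status is `FAILED` and read the `error` field.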

Specification Versions

Dataframer maintains version history for specifications:

Automatic Versioning: Each edit creates a new version
Version Tracking: View and compare different versions
Rollback: Revert to previous versions if needed

Creating a New Version

Edit a specification to create a new version:

curl -X PUT 'https://df-api.dataframer.ai/api/dataframer/specs/{spec_id}/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "config_yaml": "updated YAML content"
  }'

List All Versions

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/specs/{spec_id}/versions/' \
  -H 'Authorization: Bearer YOUR_API_KEY'
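When the update is sent from code, the YAML is a multi-line string that must be embedded in a JSON body, so its newlines need escaping. The Python sketch below lets `json.dumps` handle that. The helper names and the `spec_xyz789` ID are illustrative, and the requests are built but not sent, since the response format of these endpoints is not shown in this section.

```python
import json
import urllib.request

API_BASE = "https://df-api.dataframer.ai/api/dataframer"

def build_update_request(api_key, spec_id, config_yaml):
    """Build the PUT request that submits edited YAML for a specification."""
    return urllib.request.Request(
        url=f"{API_BASE}/specs/{spec_id}/",
        # json.dumps escapes the newlines inside the YAML text.
        data=json.dumps({"config_yaml": config_yaml}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="PUT",
    )

def build_list_versions_request(api_key, spec_id):
    """Build the GET request that lists a specification's versions."""
    return urllib.request.Request(
        url=f"{API_BASE}/specs/{spec_id}/versions/",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )

new_yaml = (
    "specification:\n"
    "  requirements: |\n"
    "    - Customer reviews with ratings\n"
    "    - Include product name and date\n"
)
update = build_update_request("YOUR_API_KEY", "spec_xyz789", new_yaml)
versions = build_list_versions_request("YOUR_API_KEY", "spec_xyz789")
```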

Customizing Specifications

You can customize specifications to refine generation behavior:

Modify Requirements

Add or update requirements to enforce specific constraints:

specification:
  requirements: |
    # Original requirements
    - Customer reviews with ratings
    - Include product name and date
    
    # Added constraints
    - Reviews must be between 100 and 150 words
    - Must include at least one specific product feature
    - Date must be within last 6 months
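Tight constraints like these are easy to spot-check on generated samples. The following Python sketch checks the word-count and date constraints. It is a hypothetical client-side check, not a Dataframer feature. It approximates "6 months" as 183 days, and it does not attempt the product-feature requirement, which needs human or model judgment.

```python
from datetime import date, timedelta

def check_review(review_text, review_date, today,
                 min_words=100, max_words=150, max_age_days=183):
    """Return a list of constraint violations (empty if the review passes)."""
    problems = []
    word_count = len(review_text.split())
    if not min_words <= word_count <= max_words:
        problems.append(f"word count {word_count} outside {min_words}-{max_words}")
    age = today - review_date
    if not timedelta(0) <= age <= timedelta(days=max_age_days):
        problems.append(f"date {review_date.isoformat()} not within last {max_age_days} days")
    return problems

sample = " ".join(["word"] * 120)
print(check_review(sample, date(2025, 1, 10), today=date(2025, 3, 1)))  # []
print(check_review("too short", date(2024, 1, 10), today=date(2025, 3, 1)))
```

The second call reports both a word-count and a date violation.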

Add Variation Axes

Create more diverse samples by adding variation axes:

specification:
  data_property_variations:
    - axis_name: "Review Length"
      axis_description: "Verbosity of the review"
      possible_values:
        - "Concise (50-100 words)"
        - "Moderate (100-150 words)"
        - "Detailed (150-200 words)"
    
    - axis_name: "Customer Type"
      axis_description: "Type of customer writing review"
      possible_values:
        - "First-time buyer"
        - "Repeat customer"
        - "Professional reviewer"

Specify Output Format

For structured data, define the exact output format:

specification:
  requirements: |
    Output format must be JSON with this exact schema:
    {
      "review_id": "string (UUID)",
      "customer_name": "string",
      "rating": "integer (1-5)",
      "review_text": "string",
      "product": "string",
      "date": "string (ISO 8601)"
    }
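Once a schema is pinned down like this, generated samples can be validated mechanically. The Python sketch below checks one sample against the schema above using only the standard library. `validate_review` is a hypothetical helper, and it assumes the date is a plain calendar date (`YYYY-MM-DD`), which is only one of the forms ISO 8601 allows.

```python
import json
import uuid
from datetime import date

EXPECTED_FIELDS = {"review_id", "customer_name", "rating",
                   "review_text", "product", "date"}

def validate_review(raw_json):
    """Return a list of schema problems for one sample (empty if valid)."""
    record = json.loads(raw_json)
    if not isinstance(record, dict) or set(record) != EXPECTED_FIELDS:
        return ["top level must be an object with exactly the expected fields"]
    problems = []
    try:
        uuid.UUID(record["review_id"])
    except (ValueError, TypeError, AttributeError):
        problems.append("review_id is not a UUID")
    rating = record["rating"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, int) or not 1 <= rating <= 5:
        problems.append("rating is not an integer from 1 to 5")
    for field in ("customer_name", "review_text", "product"):
        if not isinstance(record[field], str):
            problems.append(f"{field} is not a string")
    try:
        date.fromisoformat(record["date"])
    except (ValueError, TypeError):
        problems.append("date is not a YYYY-MM-DD calendar date")
    return problems

good = json.dumps({
    "review_id": "550e8400-e29b-41d4-a716-446655440000",
    "customer_name": "A. Customer",
    "rating": 4,
    "review_text": "Works well.",
    "product": "Kettle",
    "date": "2025-01-10",
})
print(validate_review(good))  # []
```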

Supported Models

Dataframer supports multiple LLMs for specification generation:

Model	Provider	Best For
`anthropic/claude-opus-4-5`	Anthropic	Highest quality, complex specifications
`anthropic/claude-opus-4-5-thinking`	Anthropic	Extended reasoning for complex analysis
`anthropic/claude-sonnet-4-5`	Anthropic	Balanced quality and speed (default)
`anthropic/claude-sonnet-4-5-thinking`	Anthropic	Extended reasoning mode
`anthropic/claude-haiku-4-5`	Anthropic	Fast generation
`gemini/gemini-2.5-pro`	Google	Complex reasoning
`openai/gpt-4.1`	OpenAI	Alternative quality option

Different models may produce different specification styles. Choose based on your quality and speed requirements.

Best Practices

Clear Requirements

Be specific about mandatory fields and formats
Define constraints explicitly
Include examples for complex requirements
Specify acceptable ranges for numeric values

Effective Variations

Define 3-5 meaningful variation axes
Keep variation values distinct and clear
Cover important dimensions of diversity
Avoid too many axes: the number of value combinations multiplies with every axis you add
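The growth is easy to quantify: the number of distinct combinations is the product of the number of values on each axis. A short Python illustration follows. The axis sizes are made-up examples, not recommendations.

```python
from math import prod

def combination_count(values_per_axis):
    """Number of distinct value combinations across all axes."""
    return prod(values_per_axis)

print(combination_count([5, 4, 3]))           # 60
print(combination_count([5, 4, 3, 3, 4, 5]))  # 3600
```

Doubling the number of axes from three to six here multiplies the combinations by 60, so a small batch of samples can no longer cover more than a sliver of them.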

Iterative Refinement

Start with automatic analysis
Generate a small batch (5-10 samples)
Review results and identify issues
Update specification and regenerate
Repeat until satisfied

Structured vs Unstructured

For structured data (CSV, JSON): Define exact schema
For unstructured data (text, documents): Focus on content patterns and style
For mixed datasets: Specify requirements for each file type

Specification Status

Specifications have three possible statuses:

PROCESSING: Analysis is in progress
READY: Specification is complete and can be used for generation
FAILED: Analysis encountered an error

Only specifications with READY status can be used to generate samples.
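Client code can guard against generating from an unfinished specification with a check like the following Python sketch. `SpecStatus` and `can_generate` are hypothetical names. Only the three status strings come from this page.

```python
from enum import Enum

class SpecStatus(str, Enum):
    PROCESSING = "PROCESSING"
    READY = "READY"
    FAILED = "FAILED"

def can_generate(status):
    """True only for READY; raises ValueError on an unknown status string."""
    return SpecStatus(status) is SpecStatus.READY

print(can_generate("READY"))       # True
print(can_generate("PROCESSING"))  # False
```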

Next Steps

Generate Samples

Specification Tutorial
