
What is a Specification?

A specification (or “spec”) is a detailed description that captures the structure, patterns, and requirements of your data. Dataframer automatically generates specifications by analyzing your seed data, identifying key characteristics like:
  • Data structure and schema
  • Field types and formats
  • Content patterns and styles
  • Constraints and relationships
  • Required properties
Specifications serve as blueprints for generating new synthetic samples that match your original data’s characteristics.

How Specifications Work

  1. Analysis Phase: Dataframer analyzes your dataset using LLMs to understand patterns, structure, and requirements.
  2. Specification Generation: The analysis produces a YAML specification that captures all identified patterns and requirements.
  3. Review & Customize: You can review and edit the specification to refine requirements or add constraints.
  4. Sample Generation: The specification guides the LLM in generating new samples that conform to your requirements.

Specification Structure

A typical specification includes:

Requirements

Core requirements that all generated samples must meet:
specification:
  requirements: |
    - Customer reviews for a product or service
    - Include rating (1-5 stars)
    - Include review text (50-200 words)
    - Include customer name and date
    - Maintain authentic customer voice
    - Mix of positive, neutral, and negative sentiment

Data Property Variations

Axes of variation that define how samples should differ:
specification:
  data_property_variations:
    - axis_name: "Sentiment"
      axis_description: "Emotional tone of the review"
      possible_values:
        - "Highly positive"
        - "Slightly positive"
        - "Neutral"
        - "Slightly negative"
        - "Highly negative"
    
    - axis_name: "Product Category"
      axis_description: "Type of product being reviewed"
      possible_values:
        - "Electronics"
        - "Clothing"
        - "Home goods"
        - "Food and beverage"
Data property variations ensure generated samples cover diverse scenarios rather than being repetitive.
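
One way to picture how axes combine: if each sample takes one value per axis, the two axes above define 5 × 4 = 20 distinct combinations. The Python sketch below enumerates them. How Dataframer actually samples across axes is not described here, so the full cross-product is an assumption made for illustration.

```python
from itertools import product

# Axes copied from the example specification above.
axes = {
    "Sentiment": [
        "Highly positive", "Slightly positive", "Neutral",
        "Slightly negative", "Highly negative",
    ],
    "Product Category": [
        "Electronics", "Clothing", "Home goods", "Food and beverage",
    ],
}

def enumerate_combinations(axes):
    """Return every combination of one value per axis as a list of dicts."""
    names = list(axes)
    return [dict(zip(names, values)) for values in product(*axes.values())]

combinations = enumerate_combinations(axes)
print(len(combinations))  # 20
print(combinations[0])
```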

Creating Specifications

Automatic Analysis

Generate a specification from an existing dataset:
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/analyze/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
    "name": "Customer Review Spec",
    "model": "anthropic/claude-sonnet-4-5"
  }'
Analysis Time:
  • Small datasets (< 10 samples): 2-3 minutes
  • Medium datasets (10-50 samples): 3-5 minutes
  • Large datasets (> 50 samples): 5-10 minutes
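
For scripted use, the same request can be assembled in Python. This is a minimal sketch using only the standard library; it mirrors the curl call above and builds the request without sending it. The helper name `build_analyze_request` is hypothetical, and the response body is not documented in this section, so the sketch makes no assumption about it.

```python
import json
import urllib.request

API_BASE = "https://df-api.dataframer.ai/api/dataframer"

def build_analyze_request(api_key, dataset_id, name, model):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {"dataset_id": dataset_id, "name": name, "model": model}
    return urllib.request.Request(
        url=f"{API_BASE}/analyze/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_analyze_request(
    api_key="YOUR_API_KEY",
    dataset_id="550e8400-e29b-41d4-a716-446655440000",
    name="Customer Review Spec",
    model="anthropic/claude-sonnet-4-5",
)
print(request.get_method(), request.full_url)
# To send it, pass the request to urllib.request.urlopen() and parse the JSON reply.
```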

Check Analysis Status

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/analyze/status/TASK_ID' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Response:
{
  "task_id": "analyze_abc123",
  "status": "COMPLETED",
  "spec_id": "spec_xyz789",
  "progress": 100,
  "error": null
}
Possible Statuses:
  • PENDING: Analysis queued
  • RUNNING: Analysis in progress
  • COMPLETED: Specification ready
  • FAILED: Analysis encountered an error
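
Because analysis can take several minutes, a client will usually poll the status endpoint until it reports COMPLETED or FAILED. The sketch below shows one possible polling loop. `fetch_status` is a hypothetical caller-supplied function that performs the GET request above and returns the parsed JSON; the interval and timeout values, and the shape of the intermediate responses, are assumptions.

```python
import time

def wait_for_analysis(fetch_status, task_id, interval_s=10.0, timeout_s=900.0,
                      sleep=time.sleep):
    """Poll until the task finishes, then return its spec_id.

    fetch_status(task_id) must return the status response as a dict.
    Raises RuntimeError on FAILED and TimeoutError if the task never finishes.
    """
    waited = 0.0
    while True:
        response = fetch_status(task_id)
        status = response["status"]
        if status == "COMPLETED":
            return response["spec_id"]
        if status == "FAILED":
            raise RuntimeError(f"Analysis failed: {response.get('error')}")
        if waited >= timeout_s:
            raise TimeoutError(f"Task {task_id} still {status} after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s

# Stand-in fetcher that replays canned responses instead of calling the API.
canned = iter([
    {"task_id": "analyze_abc123", "status": "PENDING", "progress": 0, "error": None},
    {"task_id": "analyze_abc123", "status": "RUNNING", "progress": 40, "error": None},
    {"task_id": "analyze_abc123", "status": "COMPLETED", "spec_id": "spec_xyz789",
     "progress": 100, "error": None},
])
spec_id = wait_for_analysis(lambda _task_id: next(canned), "analyze_abc123",
                            sleep=lambda _seconds: None)
print(spec_id)  # spec_xyz789
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in production the default `time.sleep` applies.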

Specification Versions

Dataframer maintains version history for specifications:
  • Automatic Versioning: Each edit creates a new version
  • Version Tracking: View and compare different versions
  • Rollback: Revert to previous versions if needed

Creating a New Version

Edit a specification to create a new version:
curl -X PATCH 'https://df-api.dataframer.ai/api/dataframer/specs/{spec_id}/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "config_yaml": "updated YAML content"
  }'
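
A real specification spans multiple lines, so the YAML has to be escaped before it can sit inside the JSON `config_yaml` string. Letting a JSON serializer do this avoids hand-escaping newlines and quotes. The following Python sketch builds the request body only; the helper name `build_spec_update_body` is hypothetical.

```python
import json

def build_spec_update_body(config_yaml):
    """Return the JSON body for the PATCH call with the YAML safely escaped."""
    return json.dumps({"config_yaml": config_yaml})

updated_yaml = """specification:
  requirements: |
    - Customer reviews with ratings
    - Must include at least one specific product feature
"""

body = build_spec_update_body(updated_yaml)
print(body)  # a single line; newlines in the YAML appear as \n escapes
```

The resulting string can be written to a file and passed to curl with `-d @body.json`.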

List All Versions

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/specs/{spec_id}/versions/' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Customizing Specifications

You can customize specifications to refine generation behavior:

Modify Requirements

Add or update requirements to enforce specific constraints:
specification:
  requirements: |
    # Original requirements
    - Customer reviews with ratings
    - Include product name and date
    
    # Added constraints
    - Reviews must be between 100 and 150 words
    - Must include at least one specific product feature
    - Date must be within last 6 months

Add Variation Axes

Create more diverse samples by adding variation axes:
specification:
  data_property_variations:
    - axis_name: "Review Length"
      axis_description: "Verbosity of the review"
      possible_values:
        - "Concise (50-100 words)"
        - "Moderate (100-150 words)"
        - "Detailed (150-200 words)"
    
    - axis_name: "Customer Type"
      axis_description: "Type of customer writing review"
      possible_values:
        - "First-time buyer"
        - "Repeat customer"
        - "Professional reviewer"

Specify Output Format

For structured data, define the exact output format:
specification:
  requirements: |
    Output format must be JSON with this exact schema:
    {
      "review_id": "string (UUID)",
      "customer_name": "string",
      "rating": "integer (1-5)",
      "review_text": "string",
      "product": "string",
      "date": "string (ISO 8601)"
    }
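
When a schema like this is part of the requirements, generated samples can be checked against it before use. The Python sketch below is one possible validator for the review schema above, using only the standard library. The function name and error messages are illustrative, and `datetime.fromisoformat` accepts only a subset of ISO 8601, so treat the date check as approximate.

```python
import json
import uuid
from datetime import datetime

REQUIRED_FIELDS = {"review_id", "customer_name", "rating",
                   "review_text", "product", "date"}
STRING_FIELDS = ("customer_name", "review_text", "product")

def validate_review(raw):
    """Return a list of problems found in one generated sample (empty if valid)."""
    try:
        sample = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(sample, dict):
        return ["top-level value is not an object"]

    problems = []
    missing = REQUIRED_FIELDS - sample.keys()
    extra = sample.keys() - REQUIRED_FIELDS
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")

    for field in STRING_FIELDS:
        if field in sample and not isinstance(sample[field], str):
            problems.append(f"{field} must be a string")

    if "rating" in sample:
        rating = sample["rating"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(rating, int) and not isinstance(rating, bool)
        if not (is_int and 1 <= rating <= 5):
            problems.append("rating must be an integer from 1 to 5")

    if "review_id" in sample:
        try:
            uuid.UUID(str(sample["review_id"]))
        except ValueError:
            problems.append("review_id is not a UUID")

    if "date" in sample:
        try:
            datetime.fromisoformat(str(sample["date"]))
        except ValueError:
            problems.append("date is not in ISO 8601 format")

    return problems

sample = json.dumps({
    "review_id": "550e8400-e29b-41d4-a716-446655440000",
    "customer_name": "A. Customer",
    "rating": 4,
    "review_text": "Solid battery life and a clear screen.",
    "product": "Example Phone",
    "date": "2024-05-01",
})
print(validate_review(sample))  # []
```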

Supported Models

Dataframer supports several LLMs for specification generation:
Model                          Provider     Best For
anthropic/claude-sonnet-4-5    Anthropic    Balanced quality and speed (default)
anthropic/claude-haiku-4-5     Anthropic    Fast generation
gemini/gemini-2.5-pro          Google       Complex reasoning
openai/gpt-4.1                 OpenAI       Alternative quality option
Different models may produce different specification styles. Choose based on your quality and speed requirements.

Best Practices

Writing requirements:
  • Be specific about mandatory fields and formats
  • Define constraints explicitly
  • Include examples for complex requirements
  • Specify acceptable ranges for numeric values
Variation axes:
  • Define 3-5 meaningful variation axes
  • Keep variation values distinct and clear
  • Cover important dimensions of diversity
  • Avoid too many axes (causes combinatorial explosion)
Iterative refinement:
  • Start with automatic analysis
  • Generate a small batch (5-10 samples)
  • Review results and identify issues
  • Update specification and regenerate
  • Repeat until satisfied
By data type:
  • For structured data (CSV, JSON): Define exact schema
  • For unstructured data (text, documents): Focus on content patterns and style
  • For mixed datasets: Specify requirements for each file type
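
The guidance on variation axes can be checked mechanically before a specification is submitted. The sketch below is a hypothetical lint helper, not part of Dataframer: it warns when the number of axes falls outside the suggested 3-5 range or when an axis repeats a value, and it reports how many value combinations the axes imply (the product of the per-axis value counts), which is where the combinatorial explosion comes from.

```python
from math import prod

def lint_axes(axes, min_axes=3, max_axes=5):
    """Return (warnings, combination_count) for a list of variation axes.

    Each axis is a dict with "axis_name" and "possible_values", mirroring
    the data_property_variations entries in a specification.
    """
    warnings = []
    if not min_axes <= len(axes) <= max_axes:
        warnings.append(f"{len(axes)} axes defined; {min_axes}-{max_axes} recommended")
    for axis in axes:
        # Compare case-insensitively so "Neutral" and "neutral" count as duplicates.
        values = [value.strip().lower() for value in axis["possible_values"]]
        if len(set(values)) != len(values):
            warnings.append(f"axis '{axis['axis_name']}' has duplicate values")
    combination_count = prod(len(axis["possible_values"]) for axis in axes)
    return warnings, combination_count

axes = [
    {"axis_name": "Sentiment",
     "possible_values": ["Highly positive", "Slightly positive", "Neutral",
                         "Slightly negative", "Highly negative"]},
    {"axis_name": "Product Category",
     "possible_values": ["Electronics", "Clothing", "Home goods", "Food and beverage"]},
    {"axis_name": "Review Length",
     "possible_values": ["Concise (50-100 words)", "Moderate (100-150 words)",
                         "Detailed (150-200 words)"]},
    {"axis_name": "Customer Type",
     "possible_values": ["First-time buyer", "Repeat customer", "Professional reviewer"]},
]
warnings, count = lint_axes(axes)
print(warnings, count)  # [] 180
```

Four modest axes already imply 5 × 4 × 3 × 3 = 180 combinations, so each additional axis multiplies the number of samples needed for full coverage.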

Specification Status

Specifications have three possible statuses:
  • PROCESSING: Analysis is in progress
  • READY: Specification is complete and can be used for generation
  • FAILED: Analysis encountered an error
Only specifications with READY status can be used to generate samples.

Next Steps