
What is a Specification?

A specification (or “spec”) is a detailed description that captures the structure, patterns, and requirements of your data. Dataframer automatically generates specifications by analyzing your seed data, identifying key characteristics like:
  • Data structure and schema
  • Field types and formats
  • Content patterns and styles
  • Constraints and relationships
  • Required properties
Specifications serve as blueprints for generating new synthetic samples that match your original data’s characteristics.

How Specifications Work

  1. Analysis Phase: Dataframer analyzes your dataset using LLMs to understand patterns, structure, and requirements.
  2. Specification Generation: The analysis produces a YAML specification that captures all identified patterns and requirements.
  3. Review & Customize: You can review and edit the specification to refine requirements or add constraints.
  4. Sample Generation: The specification guides the LLM in generating new samples that conform to your requirements.

Specification Structure

A typical specification includes:

Requirements

Core requirements that all generated samples must meet:
specification:
  requirements: |
    - Customer reviews for a product or service
    - Include rating (1-5 stars)
    - Include review text (50-200 words)
    - Include customer name and date
    - Maintain authentic customer voice
    - Mix of positive, neutral, and negative sentiment

Data Property Variations

Axes of variation that define how samples should differ:
specification:
  data_property_variations:
    - axis_name: "Sentiment"
      axis_description: "Emotional tone of the review"
      possible_values:
        - "Highly positive"
        - "Slightly positive"
        - "Neutral"
        - "Slightly negative"
        - "Highly negative"
    
    - axis_name: "Product Category"
      axis_description: "Type of product being reviewed"
      possible_values:
        - "Electronics"
        - "Clothing"
        - "Home goods"
        - "Food and beverage"
Data property variations ensure generated samples cover diverse scenarios rather than being repetitive.
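
To get a feel for how much ground a set of axes covers, the following minimal sketch enumerates every combination of the two example axes above. It only illustrates the size of the variation space; how Dataframer actually samples across axes is not described here, and the helper name enumerate_combinations is hypothetical.

```python
from itertools import product

# Axes copied from the example specification above.
axes = {
    "Sentiment": [
        "Highly positive",
        "Slightly positive",
        "Neutral",
        "Slightly negative",
        "Highly negative",
    ],
    "Product Category": [
        "Electronics",
        "Clothing",
        "Home goods",
        "Food and beverage",
    ],
}


def enumerate_combinations(axes):
    """Return every combination of axis values as a list of dicts."""
    names = list(axes)
    return [dict(zip(names, values)) for values in product(*axes.values())]


combinations = enumerate_combinations(axes)
print(len(combinations))  # 5 sentiments x 4 categories = 20
print(combinations[0])
```

Each added axis multiplies the number of combinations, which is worth keeping in mind when deciding how many axes to define.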

Creating Specifications

Automatic Analysis

Generate a specification from an existing dataset:
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/analyze/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
    "name": "Customer Review Spec",
    "model": "anthropic/claude-sonnet-4-5"
  }'
Analysis Time:
  • Small datasets (< 10 samples): 2-3 minutes
  • Medium datasets (10-50 samples): 3-5 minutes
  • Large datasets (> 50 samples): 5-10 minutes
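
The same request can be assembled from Python using only the standard library. This is a sketch that mirrors the curl example; the function name build_analyze_request is hypothetical, and the request is built but not sent, since the shape of the response body is not shown in this section.

```python
import json
import urllib.request

API_URL = "https://df-api.dataframer.ai/api/dataframer/analyze/"


def build_analyze_request(api_key, dataset_id, name, model):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {"dataset_id": dataset_id, "name": name, "model": model}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_analyze_request(
    api_key="YOUR_API_KEY",
    dataset_id="550e8400-e29b-41d4-a716-446655440000",
    name="Customer Review Spec",
    model="anthropic/claude-sonnet-4-5",
)

# To submit it (requires a valid key and network access), assuming the
# endpoint answers with a JSON body:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```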

Seedless Specifications

Create specifications without any seed data by describing what you want to generate:
  1. Navigate to Specifications > Create Spec
  2. Select the Seedless tab
  3. Enter a Spec Name
  4. Write Generation Objectives describing the data you want (required)
  5. Click Create Spec
The generation objectives should clearly describe:
  • What type of data to generate (e.g., “customer support conversations”, “legal contracts”)
  • Key characteristics and structure
  • Any specific requirements or constraints
Seedless generation works best when your objectives are detailed and specific. The more context you provide about the desired data structure and content, the better the resulting specification.

Check Analysis Status

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/analyze/status/TASK_ID' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Response:
{
  "task_id": "analyze_abc123",
  "status": "COMPLETED",
  "spec_id": "spec_xyz789",
  "progress": 100,
  "error": null
}
Possible Statuses:
  • PENDING: Analysis queued
  • RUNNING: Analysis in progress
  • COMPLETED: Specification ready
  • FAILED: Analysis encountered an error
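
Because analysis takes several minutes, a client will usually poll this endpoint until a terminal status appears. The sketch below shows one way to do that. The fetch_status callable is hypothetical (it stands in for whatever HTTP call returns the parsed JSON body), and the polling interval and attempt limit are arbitrary defaults rather than documented values. The canned PENDING and RUNNING responses in the demo are illustrative only.

```python
import time


def wait_for_analysis(fetch_status, task_id, interval_seconds=10.0,
                      max_attempts=60, sleep=time.sleep):
    """Poll until the task completes or fails, then return the spec_id.

    fetch_status is a hypothetical callable: it takes a task_id and returns
    the parsed JSON body of the status endpoint as a dict.
    """
    for _ in range(max_attempts):
        body = fetch_status(task_id)
        status = body["status"]
        if status == "COMPLETED":
            return body["spec_id"]
        if status == "FAILED":
            raise RuntimeError(f"Analysis failed: {body.get('error')}")
        sleep(interval_seconds)  # PENDING or RUNNING: wait and retry
    raise TimeoutError(f"Task {task_id} did not finish after {max_attempts} polls")


# Demo with canned responses instead of real HTTP calls.
canned = iter([
    {"task_id": "analyze_abc123", "status": "PENDING", "error": None},
    {"task_id": "analyze_abc123", "status": "RUNNING", "error": None},
    {"task_id": "analyze_abc123", "status": "COMPLETED",
     "spec_id": "spec_xyz789", "progress": 100, "error": None},
])
spec_id = wait_for_analysis(
    lambda task_id: next(canned), "analyze_abc123", sleep=lambda seconds: None
)
print(spec_id)  # spec_xyz789
```

Passing the sleep function as a parameter keeps the loop easy to test without real delays.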

Specification Versions

Dataframer maintains version history for specifications:
  • Automatic Versioning: Each edit creates a new version
  • Version Tracking: View and compare different versions
  • Rollback: Revert to previous versions if needed

Creating a New Version

Edit a specification to create a new version:
curl -X PUT 'https://df-api.dataframer.ai/api/dataframer/specs/{spec_id}/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "config_yaml": "updated YAML content"
  }'
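
A real specification is multi-line YAML, so its newlines and quotes have to be escaped before it can sit inside the JSON string. The sketch below assumes, based on the field name in the example above, that config_yaml takes the full YAML document as one string; build_update_body is a hypothetical helper name.

```python
import json

# A shortened specification used only for illustration.
spec_yaml = """\
specification:
  requirements: |
    - Customer reviews with ratings
    - Include product name and date
"""


def build_update_body(config_yaml):
    """Serialize the update payload; json.dumps escapes newlines and quotes."""
    return json.dumps({"config_yaml": config_yaml})


body = build_update_body(spec_yaml)
print(body)
```

The resulting string can be sent as the request body of the PUT call in place of the placeholder.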

List All Versions

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/specs/{spec_id}/versions/' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Customizing Specifications

You can customize specifications to refine generation behavior:

Modify Requirements

Add or update requirements to enforce specific constraints:
specification:
  requirements: |
    # Original requirements
    - Customer reviews with ratings
    - Include product name and date
    
    # Added constraints
    - Reviews must be between 100 and 150 words
    - Must include at least one specific product feature
    - Date must be within last 6 months

Add Variation Axes

Create more diverse samples by adding variation axes:
specification:
  data_property_variations:
    - axis_name: "Review Length"
      axis_description: "Verbosity of the review"
      possible_values:
        - "Concise (50-100 words)"
        - "Moderate (100-150 words)"
        - "Detailed (150-200 words)"
    
    - axis_name: "Customer Type"
      axis_description: "Type of customer writing review"
      possible_values:
        - "First-time buyer"
        - "Repeat customer"
        - "Professional reviewer"

Specify Output Format

For structured data, define the exact output format:
specification:
  requirements: |
    Output format must be JSON with this exact schema:
    {
      "review_id": "string (UUID)",
      "customer_name": "string",
      "rating": "integer (1-5)",
      "review_text": "string",
      "product": "string",
      "date": "string (ISO 8601)"
    }
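
Once a schema like this is in the requirements, generated samples can be checked against it before use. The following is a minimal sketch of such a check, not part of Dataframer itself. It assumes the date field is a plain calendar date (YYYY-MM-DD), which is only one of the forms ISO 8601 allows, and the sample values are made up for illustration.

```python
import uuid
from datetime import date

EXPECTED_TYPES = {
    "review_id": str,
    "customer_name": str,
    "rating": int,
    "review_text": str,
    "product": str,
    "date": str,
}


def validate_review(sample):
    """Return a list of problems found in one generated review (empty if valid)."""
    problems = []
    for field, expected_type in EXPECTED_TYPES.items():
        if field not in sample:
            problems.append(f"missing field: {field}")
        elif type(sample[field]) is not expected_type:
            problems.append(f"{field}: expected {expected_type.__name__}")
    if problems:
        return problems  # skip value checks when the shape is already wrong

    try:
        uuid.UUID(sample["review_id"])
    except ValueError:
        problems.append("review_id: not a valid UUID")
    if not 1 <= sample["rating"] <= 5:
        problems.append("rating: must be between 1 and 5")
    try:
        date.fromisoformat(sample["date"])
    except ValueError:
        problems.append("date: not an ISO 8601 calendar date")
    return problems


good = {
    "review_id": "550e8400-e29b-41d4-a716-446655440000",
    "customer_name": "A. Customer",
    "rating": 4,
    "review_text": "Works as described.",
    "product": "Desk lamp",
    "date": "2025-01-15",
}
bad = dict(good, rating=7, date="15/01/2025")

print(validate_review(good))  # []
print(validate_review(bad))
```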

Supported Models

Dataframer supports multiple LLMs for specification generation:
Model                                   Provider    Best For
anthropic/claude-opus-4-5               Anthropic   Highest quality, complex specifications
anthropic/claude-opus-4-5-thinking      Anthropic   Extended reasoning for complex analysis
anthropic/claude-sonnet-4-5             Anthropic   Balanced quality and speed (default)
anthropic/claude-sonnet-4-5-thinking    Anthropic   Extended reasoning mode
anthropic/claude-haiku-4-5              Anthropic   Fast generation
gemini/gemini-2.5-pro                   Google      Complex reasoning
openai/gpt-4.1                          OpenAI      Alternative quality option
Different models may produce different specification styles. Choose based on your quality and speed requirements.

Best Practices

Writing requirements:
  • Be specific about mandatory fields and formats
  • Define constraints explicitly
  • Include examples for complex requirements
  • Specify acceptable ranges for numeric values
Designing variation axes:
  • Define 3-5 meaningful variation axes
  • Keep variation values distinct and clear
  • Cover important dimensions of diversity
  • Avoid too many axes (causes combinatorial explosion)
Iterating:
  • Start with automatic analysis
  • Generate a small batch (5-10 samples)
  • Review results and identify issues
  • Update specification and regenerate
  • Repeat until satisfied
Matching the data type:
  • For structured data (CSV, JSON): Define exact schema
  • For unstructured data (text, documents): Focus on content patterns and style
  • For mixed datasets: Specify requirements for each file type

Specification Status

Specifications have three possible statuses:
  • PROCESSING: Analysis is in progress
  • READY: Specification is complete and can be used for generation
  • FAILED: Analysis encountered an error
Only specifications with READY status can be used to generate samples.
