Generating a Specification

Overview

This tutorial shows you how to generate a specification from your dataset. Specifications define the structure and requirements that generated samples must follow.

What You’ll Learn

How to trigger specification analysis
How to monitor analysis progress
How to review generated specifications
How to customize specifications
How to use specifications for generation

Prerequisites

An existing dataset (see Creating a Dataset)
The dataset ID from the previous tutorial
An API key for authentication

Step 1: Start Specification Analysis

Trigger the analysis process:

curl -X POST 'https://df-api.dataframer.ai/api/dataframer/analyze/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
    "name": "Customer Review Specification",
    "model": "anthropic/claude-sonnet-4-5"
  }'

Response:

{
  "task_id": "analyze_abc123",
  "status": "PENDING",
  "message": "Analysis task created successfully"
}

Save the task_id - you’ll need it to check progress!

Python Example

from dataframer import Dataframer

# Initialize client (reads DATAFRAMER_API_KEY from environment)
# Or explicitly: client = Dataframer(api_key="your_api_key")
client = Dataframer()

# Start analysis (replace the placeholder with your own dataset ID)
result = client.dataframer.analyze.create(
    dataset_id="dataset_id",
    name="Customer Review Specification",
    analysis_model_name="anthropic/claude-sonnet-4-5"
)

task_id = result.task_id
print(f"Analysis started: {task_id}")

Step 2: Monitor Analysis Progress

Check the status periodically:

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/analyze/status/analyze_abc123' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Response (In Progress):

{
  "task_id": "analyze_abc123",
  "status": "RUNNING",
  "progress": 45,
  "message": "Analyzing dataset structure and patterns"
}

Response (Completed):

{
  "task_id": "analyze_abc123",
  "status": "COMPLETED",
  "progress": 100,
  "spec_id": "spec_xyz789",
  "message": "Analysis completed successfully"
}

Python Polling Script

from dataframer import Dataframer
import time

client = Dataframer()

# Poll until the analysis completes or fails
task_id = "analysis_task_id"  # placeholder: use the task_id returned in Step 1
start_time = time.time()

while True:
    status = client.dataframer.analyze.get_status(task_id=task_id)
    elapsed = int(time.time() - start_time)
    
    print(f"Status: {status['status']} (elapsed: {elapsed}s)")
    
    if status['status'] == "COMPLETED":
        spec_id = status.get('spec_id')
        print(f"✓ Specification ready: {spec_id}")
        break
    elif status['status'] == "FAILED":
        raise Exception(f"Analysis failed: {status.get('error')}")
    
    time.sleep(30)  # Check every 30 seconds

Analysis typically takes 2-5 minutes. Larger datasets may take up to 10 minutes.
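
The polling script above waits indefinitely if a task never reaches a terminal state. The following is a minimal sketch of the same loop with a deadline, assuming the status payload is a dict with the `status`, `spec_id`, and `error` keys shown in this tutorial. The names `wait_for_spec` and `fetch_status` and the 15-minute default are hypothetical choices for illustration and are not part of the Dataframer SDK.

```python
import time
from typing import Any, Callable, Dict


def wait_for_spec(
    fetch_status: Callable[[str], Dict[str, Any]],
    task_id: str,
    interval: float = 30.0,
    timeout: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the analysis completes and return its spec_id.

    fetch_status is any callable that returns the status payload for a
    task ID. sleep and clock are injectable so the loop can be tested
    without real waiting.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status(task_id)
        state = status["status"]
        if state == "COMPLETED":
            return status["spec_id"]
        if state == "FAILED":
            raise RuntimeError(f"Analysis failed: {status.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"Task {task_id} still {state} after {timeout}s")
        sleep(interval)
```

With the client from the script above, this could be called as `wait_for_spec(lambda t: client.dataframer.analyze.get_status(task_id=t), task_id)`.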

Step 3: Review Generated Specification

Once analysis completes, retrieve the specification:

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/specs/spec_xyz789/' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Response:

{
  "id": "spec_xyz789",
  "name": "Customer Review Specification",
  "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "READY",
  "config_yaml": "...",
  "created_at": "2025-11-26T10:05:00Z",
  "version": 1
}
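
Because `config_yaml` arrives as a string inside the JSON response, it can be convenient to write it to a local file before editing. The sketch below assumes the response has already been decoded into a dict with the fields shown above and that `READY` is the status of a usable specification. The helper name `save_config_yaml` and the file naming scheme are hypothetical.

```python
from pathlib import Path


def save_config_yaml(spec: dict, directory: str = ".") -> Path:
    """Write spec["config_yaml"] to <id>_v<version>.yaml and return the path."""
    if spec.get("status") != "READY":
        raise ValueError(
            f"Specification {spec.get('id')} is not ready: {spec.get('status')}"
        )
    path = Path(directory) / f"{spec['id']}_v{spec['version']}.yaml"
    path.write_text(spec["config_yaml"], encoding="utf-8")
    return path
```

Including the version number in the file name keeps local copies of different versions from overwriting each other.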

View the YAML Configuration

The config_yaml field contains the specification:

specification:
  requirements: |
    - Customer product reviews
    - Include 1-5 star rating
    - Review text 50-200 words
    - Include customer name
    - Include review date
    - Authentic customer voice
    - Mix of sentiments
  
  data_property_variations:
    - axis_name: "Sentiment"
      axis_description: "Overall emotional tone"
      possible_values:
        - "Very positive"
        - "Positive"
        - "Neutral"
        - "Negative"
        - "Very negative"
    
    - axis_name: "Product Category"
      axis_description: "Type of product reviewed"
      possible_values:
        - "Electronics"
        - "Clothing"
        - "Home & Garden"
        - "Food & Beverage"
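
To get a feel for how much variety the axes describe, it can help to count the combinations of one value per axis. The sketch below copies the two axes from the example specification into a Python dict and enumerates their cross product. Whether Dataframer samples every combination, or weights them evenly, is not stated in this tutorial, so treat the count only as the size of the space the axes span.

```python
from itertools import product

# Variation axes copied from the example specification above.
axes = {
    "Sentiment": ["Very positive", "Positive", "Neutral", "Negative", "Very negative"],
    "Product Category": ["Electronics", "Clothing", "Home & Garden", "Food & Beverage"],
}

# One dict per combination, with one value chosen from each axis.
combinations = [dict(zip(axes, values)) for values in product(*axes.values())]

print(len(combinations))  # 20 (5 sentiments x 4 categories)
print(combinations[0])    # {'Sentiment': 'Very positive', 'Product Category': 'Electronics'}
```

Adding a third axis with three values would raise the count to 60, which is worth keeping in mind when choosing how many samples to generate.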

Step 4: Customize the Specification

You can edit the specification to refine requirements:

Add More Requirements

specification:
  requirements: |
    # Existing requirements
    - Customer product reviews
    - Include 1-5 star rating
    
    # New requirements
    - Review must mention specific product features
    - Include verified purchase badge (Yes/No)
    - Review date within last 12 months
    - Minimum 3 sentences

Modify Variation Axes

specification:
  data_property_variations:
    # Add review length variation
    - axis_name: "Review Length"
      axis_description: "Verbosity of review"
      possible_values:
        - "Brief (50-75 words)"
        - "Medium (75-150 words)"
        - "Detailed (150-200 words)"
    
    # Add customer type
    - axis_name: "Customer Type"
      axis_description: "Type of customer"
      possible_values:
        - "First-time buyer"
        - "Repeat customer"
        - "Professional reviewer"

Update the Specification

curl -X PATCH 'https://df-api.dataframer.ai/api/dataframer/specs/spec_xyz789/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "config_yaml": "specification:\n  requirements: |...",
    "notes": "Added review length and customer type variations"
  }'

Updating a specification creates a new version. The previous version is preserved.
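
The PATCH body carries the whole YAML document as a single JSON string, so newlines and quotes must be escaped. Escaping by hand is error-prone. The minimal sketch below lets Python's `json` module do it. The helper name `build_patch_body` is hypothetical, and the YAML is a shortened excerpt used only for illustration.

```python
import json


def build_patch_body(config_yaml: str, notes: str) -> str:
    """Serialize the PATCH payload; json.dumps escapes newlines and quotes."""
    return json.dumps({"config_yaml": config_yaml, "notes": notes})


# A shortened spec for illustration, not the full YAML from Step 3.
config_yaml = (
    "specification:\n"
    "  requirements: |\n"
    "    - Customer product reviews\n"
    "    - Include 1-5 star rating\n"
)

body = build_patch_body(config_yaml, "Added review length and customer type variations")

# The body is a single line, and decoding it returns the YAML unchanged.
print("\n" in body)                                    # False
print(json.loads(body)["config_yaml"] == config_yaml)  # True
```

The resulting string can be sent as the request body by any HTTP client, or written to a file and passed to curl with `-d @file`.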

Step 5: Test with Small Generation

Before generating many samples, test with a small batch:

curl -X POST 'https://df-api.dataframer.ai/api/dataframer/generate/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "spec_id": "spec_xyz789",
    "number_of_samples": 5,
    "sample_type": "short"
  }'

Wait for generation to complete, then review the samples. If they don’t meet expectations, refine the specification and try again.
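
Some requirements can be spot-checked mechanically before reading the samples by hand. The sketch below checks the "Review text 50-200 words" requirement from the example specification. It assumes the review texts have already been extracted into a list of plain strings. The format in which the API returns samples is not shown in this tutorial, and `out_of_range` is a hypothetical helper.

```python
from typing import List, Tuple


def out_of_range(
    samples: List[str], min_words: int = 50, max_words: int = 200
) -> List[Tuple[int, int]]:
    """Return (index, word_count) for each sample outside the word range."""
    problems = []
    for index, text in enumerate(samples):
        count = len(text.split())  # whitespace-separated tokens
        if not min_words <= count <= max_words:
            problems.append((index, count))
    return problems
```

An empty result means every sample is within bounds. A check like this complements manual review and does not replace it, since it says nothing about tone or authenticity.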

Step 6: View Specification Versions

List all versions of a specification:

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/specs/spec_xyz789/versions/' \
  -H 'Authorization: Bearer YOUR_API_KEY'

Response:

{
  "versions": [
    {
      "version_id": "version_v1",
      "version_number": 1,
      "created_at": "2025-11-26T10:05:00Z",
      "notes": "Initial specification from analysis"
    },
    {
      "version_id": "version_v2",
      "version_number": 2,
      "created_at": "2025-11-26T10:15:00Z",
      "notes": "Added review length and customer type variations"
    }
  ]
}
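
As a small sketch of working with this response, the snippet below selects the most recent version by `version_number`. It assumes the payload has the shape shown above and parses a copy of that example response rather than calling the API.

```python
import json

# The example response above, pasted as text.
response_text = """
{
  "versions": [
    {"version_id": "version_v1", "version_number": 1,
     "created_at": "2025-11-26T10:05:00Z",
     "notes": "Initial specification from analysis"},
    {"version_id": "version_v2", "version_number": 2,
     "created_at": "2025-11-26T10:15:00Z",
     "notes": "Added review length and customer type variations"}
  ]
}
"""

versions = json.loads(response_text)["versions"]

# Pick the entry with the highest version_number instead of relying on list order.
latest = max(versions, key=lambda v: v["version_number"])

print(latest["version_id"], latest["notes"])
# version_v2 Added review length and customer type variations
```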

Common Issues

Analysis takes too long

Normal duration: 2-10 minutes

If longer than 15 minutes:

Check status for error messages
Verify dataset is not corrupt
Try with smaller dataset first
Contact support if persistent

Analysis fails

Common causes:

Dataset files corrupted
Unsupported file format
Files not UTF-8 encoded
Dataset empty or too small

Solution:

Check error message in status response
Verify dataset has at least 3-5 samples
Ensure files are valid and readable
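
Several of these causes can be caught locally before uploading. The sketch below checks that each file is non-empty and decodes as UTF-8, and that there are at least three files. It assumes one sample per file, which may not match how your dataset is organized, and `preflight` is a hypothetical helper rather than part of the Dataframer SDK.

```python
from pathlib import Path
from typing import Iterable, List

MIN_SAMPLES = 3  # lower bound of the "3-5 samples" guidance above


def preflight(paths: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    paths = list(paths)
    problems = []
    if len(paths) < MIN_SAMPLES:
        problems.append(f"only {len(paths)} file(s); expected at least {MIN_SAMPLES}")
    for path in map(Path, paths):
        data = path.read_bytes()
        if not data.strip():
            problems.append(f"{path.name}: empty")
            continue
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as exc:
            problems.append(f"{path.name}: not UTF-8 (byte offset {exc.start})")
    return problems
```

The check does not verify that a file format is supported. It only rules out empty files, encoding problems, and too few files.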

Specification quality issues

Problem: Generated spec doesn’t capture requirements

Solution:

Manually edit specification
Add specific requirements
Define clearer variation axes
Provide more diverse seed data

YAML format errors

Problem: YAML syntax error when updating

Solution:

Validate YAML syntax: https://www.yamllint.com/
Check indentation (use spaces, not tabs)
Escape special characters
Use multiline strings with |
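
Tab indentation is easy to miss by eye. The sketch below flags lines whose leading whitespace contains a tab, using only the standard library. It is not a YAML validator and catches only this one class of error. `tab_indented_lines` is a hypothetical helper.

```python
from typing import List


def tab_indented_lines(yaml_text: str) -> List[int]:
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(number)
    return bad


text = "specification:\n\trequirements: |\n    - ok\n  \t- mixed\n"
print(tab_indented_lines(text))  # [2, 4]
```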

Best Practices

✅ Review automatically generated specs: Always review before large generation runs
✅ Start with small tests: Generate 5-10 samples to validate spec quality
✅ Be specific in requirements: Clear requirements → better samples
✅ Use diverse seed data: More variety → better specification
✅ Iterate and refine: Test, review, update, repeat
✅ Document changes: Add notes when creating new versions

Specification Quality Checklist

Before using a specification for production:

Requirements are clear and specific
All mandatory fields/properties are listed
Data formats are defined
Variation axes cover important dimensions
Variation values are distinct and clear
Tested with small sample batch
Samples meet quality expectations

Next Steps

Generate Samples

Specification Guide
