
Overview

This tutorial shows you how to generate a specification from your dataset. Specifications define the structure and requirements that generated samples must follow.

What You’ll Learn

  • Triggering specification analysis
  • Monitoring analysis progress
  • Reviewing generated specifications
  • Customizing specifications
  • Using specifications for generation

Prerequisites

  • A created dataset (see Creating a Dataset)
  • Dataset ID from previous tutorial
  • API key for authentication

Step 1: Start Specification Analysis

Trigger the analysis process:
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/analyze/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
    "name": "Customer Review Specification",
    "model": "anthropic/claude-sonnet-4-5"
  }'
Response:
{
  "task_id": "analyze_abc123",
  "status": "PENDING",
  "message": "Analysis task created successfully"
}
Save the task_id. You need it to check progress in Step 2.

Python Example

from dataframer import Dataframer

# Initialize client (reads DATAFRAMER_API_KEY from environment)
# Or explicitly: client = Dataframer(api_key="your_api_key")
client = Dataframer()

# Start analysis
result = client.dataframer.analyze.create(
    dataset_id="550e8400-e29b-41d4-a716-446655440000",  # ID from the previous tutorial
    name="Customer Review Specification",
    analysis_model_name="anthropic/claude-sonnet-4-5"
)

task_id = result.task_id
print(f"Analysis started: {task_id}")

Step 2: Monitor Analysis Progress

Check the status periodically:
curl -X GET 'https://df-api.dataframer.ai/api/dataframer/analyze/status/analyze_abc123' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Response (In Progress):
{
  "task_id": "analyze_abc123",
  "status": "RUNNING",
  "progress": 45,
  "message": "Analyzing dataset structure and patterns"
}
Response (Completed):
{
  "task_id": "analyze_abc123",
  "status": "COMPLETED",
  "progress": 100,
  "spec_id": "spec_xyz789",
  "message": "Analysis completed successfully"
}

Python Polling Script

from dataframer import Dataframer
import time

client = Dataframer()

# Poll until the task completes or fails
task_id = "analyze_abc123"  # task_id returned in Step 1
start_time = time.time()

while True:
    status = client.dataframer.analyze.get_status(task_id=task_id)
    elapsed = int(time.time() - start_time)

    print(f"Status: {status['status']} (elapsed: {elapsed}s)")

    if status['status'] == "COMPLETED":
        spec_id = status.get('spec_id')
        print(f"✓ Specification ready: {spec_id}")
        break
    elif status['status'] == "FAILED":
        raise RuntimeError(f"Analysis failed: {status.get('error')}")

    time.sleep(30)  # Check every 30 seconds
Analysis typically takes 2-5 minutes. Larger datasets may take up to 10 minutes.
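A polling loop with no upper bound keeps running if the task never reaches COMPLETED or FAILED. The following is a minimal sketch of a bounded variant. It assumes the status payload is a dict with the `status`, `spec_id`, and `error` keys shown above. The helper name `wait_for_spec`, its parameters, and the 15-minute default are illustrative and not part of the SDK.

```python
import time
from typing import Callable, Mapping


def wait_for_spec(
    get_status: Callable[[], Mapping],
    timeout_s: float = 900.0,    # assumption: give up after 15 minutes
    interval_s: float = 30.0,    # same 30-second interval as the script above
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until COMPLETED, FAILED, or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        state = status["status"]
        if state == "COMPLETED":
            return status["spec_id"]
        if state == "FAILED":
            raise RuntimeError(f"Analysis failed: {status.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"Analysis still {state} after {timeout_s}s")
        sleep(interval_s)
```

With the client from the script above, this could be called as `wait_for_spec(lambda: client.dataframer.analyze.get_status(task_id=task_id))`. Passing `sleep` and `clock` as parameters keeps the helper testable without real waiting.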

Step 3: Review Generated Specification

Once analysis completes, retrieve the specification:
curl -X GET 'https://df-api.dataframer.ai/api/dataframer/specs/spec_xyz789/' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Response:
{
  "id": "spec_xyz789",
  "name": "Customer Review Specification",
  "dataset_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "READY",
  "config_yaml": "...",
  "created_at": "2025-11-26T10:05:00Z",
  "version": 1
}
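
The same request can be sent from Python without the SDK. This sketch uses only the standard library and mirrors the URL and Authorization header of the curl command above. The helper names `build_spec_request` and `fetch_spec` are hypothetical, and the 30-second timeout is an arbitrary choice.

```python
import json
import urllib.request

BASE_URL = "https://df-api.dataframer.ai/api/dataframer"


def build_spec_request(spec_id: str, api_key: str) -> urllib.request.Request:
    """Build the GET request for one specification (mirrors the curl call)."""
    return urllib.request.Request(
        f"{BASE_URL}/specs/{spec_id}/",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def fetch_spec(spec_id: str, api_key: str) -> dict:
    """Send the request and decode the JSON body; raises on HTTP errors."""
    request = build_spec_request(spec_id, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `fetch_spec("spec_xyz789", api_key)["config_yaml"]` would then return the YAML text as a string, assuming the response has the shape shown above.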

View the YAML Configuration

The config_yaml field contains the specification:
specification:
  requirements: |
    - Customer product reviews
    - Include 1-5 star rating
    - Review text 50-200 words
    - Include customer name
    - Include review date
    - Authentic customer voice
    - Mix of sentiments
  
  data_property_variations:
    - axis_name: "Sentiment"
      axis_description: "Overall emotional tone"
      possible_values:
        - "Very positive"
        - "Positive"
        - "Neutral"
        - "Negative"
        - "Very negative"
    
    - axis_name: "Product Category"
      axis_description: "Type of product reviewed"
      possible_values:
        - "Electronics"
        - "Clothing"
        - "Home & Garden"
        - "Food & Beverage"
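
To get a feel for how much ground the variation axes cover, it can help to enumerate their combinations. The sketch below copies the two axes from the YAML above into a plain Python dict, so no YAML parser is needed. How the service actually samples across axes is not stated here, so treat the full cross product as an illustration only.

```python
from itertools import product

# Axes copied by hand from the specification above.
axes = {
    "Sentiment": ["Very positive", "Positive", "Neutral", "Negative", "Very negative"],
    "Product Category": ["Electronics", "Clothing", "Home & Garden", "Food & Beverage"],
}

# One dict per combination, e.g. {"Sentiment": ..., "Product Category": ...}.
combinations = [dict(zip(axes, values)) for values in product(*axes.values())]

print(len(combinations))  # 5 sentiments x 4 categories = 20
print(combinations[0])
```

Each added axis multiplies the number of combinations, which is worth keeping in mind when choosing how many samples to generate.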

Step 4: Customize the Specification

You can edit the specification to refine requirements:

Add More Requirements

specification:
  requirements: |
    # Existing requirements
    - Customer product reviews
    - Include 1-5 star rating
    
    # New requirements
    - Review must mention specific product features
    - Include verified purchase badge (Yes/No)
    - Review date within last 12 months
    - Minimum 3 sentences

Modify Variation Axes

specification:
  data_property_variations:
    # Add review length variation
    - axis_name: "Review Length"
      axis_description: "Verbosity of review"
      possible_values:
        - "Brief (50-75 words)"
        - "Medium (75-150 words)"
        - "Detailed (150-200 words)"
    
    # Add customer type
    - axis_name: "Customer Type"
      axis_description: "Type of customer"
      possible_values:
        - "First-time buyer"
        - "Repeat customer"
        - "Professional reviewer"

Update the Specification

curl -X PATCH 'https://df-api.dataframer.ai/api/dataframer/specs/spec_xyz789/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "config_yaml": "specification:\n  requirements: |...",
    "notes": "Added review length and customer type variations"
  }'
Updating a specification creates a new version. The previous version is preserved.
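
Hand-escaping multi-line YAML inside a JSON string is error-prone. A minimal sketch of building the PATCH body in Python instead, using `json.dumps` to handle newlines and quotes. The field names `config_yaml` and `notes` come from the curl example above; the helper name `build_spec_update` and the sample YAML are illustrative.

```python
import json


def build_spec_update(config_yaml: str, notes: str) -> str:
    """Return the JSON request body for the PATCH call shown above."""
    return json.dumps({"config_yaml": config_yaml, "notes": notes})


config_yaml = """\
specification:
  requirements: |
    - Customer product reviews
    - Include 1-5 star rating
    - Review must mention specific product features
"""

body = build_spec_update(config_yaml, "Added product feature requirement")
print(body)
```

The resulting string can be written to a file and passed to curl with `-d @body.json`, which avoids shell quoting problems.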

Step 5: Test with Small Generation

Before generating many samples, test with a small batch:
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/generate/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "spec_id": "spec_xyz789",
    "number_of_samples": 5,
    "sample_type": "short"
  }'
Wait for generation to complete, then review the samples. If they don’t meet expectations, refine the specification and try again.
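
Part of the review can be automated. The sketch below checks samples against the "Review text 50-200 words" requirement from the specification. It assumes each generated sample is available as a plain string; the actual sample format is not shown in this tutorial, and the function name `word_count_violations` is hypothetical.

```python
def word_count_violations(samples, min_words=50, max_words=200):
    """Return (index, word_count) for samples outside the allowed range."""
    violations = []
    for index, text in enumerate(samples):
        count = len(text.split())  # whitespace-separated words
        if not min_words <= count <= max_words:
            violations.append((index, count))
    return violations


samples = ["word " * 60, "too short", "word " * 250]
print(word_count_violations(samples))
```

An empty result means every sample is within range. Other requirements, such as tone or authenticity, still need a manual read.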

Step 6: View Specification Versions

List all versions of a specification:
curl -X GET 'https://df-api.dataframer.ai/api/dataframer/specs/spec_xyz789/versions/' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Response:
{
  "versions": [
    {
      "version_id": "version_v1",
      "version_number": 1,
      "created_at": "2025-11-26T10:05:00Z",
      "notes": "Initial specification from analysis"
    },
    {
      "version_id": "version_v2",
      "version_number": 2,
      "created_at": "2025-11-26T10:15:00Z",
      "notes": "Added review length and customer type variations"
    }
  ]
}
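
To pick the most recent version programmatically, select the entry with the highest `version_number`. This sketch works on a dict shaped like the response above and does not assume the list is returned in any particular order.

```python
response = {
    "versions": [
        {"version_id": "version_v1", "version_number": 1,
         "created_at": "2025-11-26T10:05:00Z",
         "notes": "Initial specification from analysis"},
        {"version_id": "version_v2", "version_number": 2,
         "created_at": "2025-11-26T10:15:00Z",
         "notes": "Added review length and customer type variations"},
    ]
}

# Highest version_number wins, regardless of list order.
latest = max(response["versions"], key=lambda version: version["version_number"])
print(latest["version_id"], "-", latest["notes"])
```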

Common Issues

Problem: Analysis takes longer than expected
Normal duration: 2-10 minutes
If longer than 15 minutes:
  • Check status for error messages
  • Verify dataset is not corrupt
  • Try with smaller dataset first
  • Contact support if persistent
Problem: Analysis fails
Common causes:
  • Dataset files corrupted
  • Unsupported file format
  • Files not UTF-8 encoded
  • Dataset empty or too small
Solution:
  • Check error message in status response
  • Verify dataset has at least 3-5 samples
  • Ensure files are valid and readable
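
One of the causes listed above, files that are not UTF-8 encoded, is easy to check locally before uploading. A minimal sketch using only the standard library; the function name `non_utf8_files` is illustrative.

```python
from pathlib import Path


def non_utf8_files(paths):
    """Return the paths whose contents do not decode as UTF-8."""
    bad = []
    for path in paths:
        try:
            Path(path).read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad
```

For a local folder of seed files, `non_utf8_files(Path("my_dataset").glob("*.txt"))` would list the files to re-encode, where `my_dataset` is a placeholder directory name.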
Problem: Generated spec doesn’t capture requirements
Solution:
  • Manually edit specification
  • Add specific requirements
  • Define clearer variation axes
  • Provide more diverse seed data
Problem: YAML syntax error when updating
Solution:
  • Validate YAML syntax: https://www.yamllint.com/
  • Check indentation (use spaces, not tabs)
  • Escape special characters
  • Use multiline strings with |
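
Tab characters in indentation are a frequent cause of YAML syntax errors and are hard to spot by eye. The sketch below flags them with plain string handling, so it needs no YAML library. It is not a full validator, and the function name `lines_with_tab_indent` is hypothetical.

```python
def lines_with_tab_indent(yaml_text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            bad.append(lineno)
    return bad


text = "specification:\n\trequirements: |\n    - ok\n  \t- mixed\n"
print(lines_with_tab_indent(text))
```

Running this on the `config_yaml` string before sending the update catches the indentation problem early; other syntax errors still need a YAML linter.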

Best Practices

  • Review automatically generated specs: Always review before large generation runs
  • Start with small tests: Generate 5-10 samples to validate spec quality
  • Be specific in requirements: Clear requirements → better samples
  • Use diverse seed data: More variety → better specification
  • Iterate and refine: Test, review, update, repeat
  • Document changes: Add notes when creating new versions

Specification Quality Checklist

Before using a specification for production:
  • Requirements are clear and specific
  • All mandatory fields/properties are listed
  • Data formats are defined
  • Variation axes cover important dimensions
  • Variation values are distinct and clear
  • Tested with small sample batch
  • Samples meet quality expectations

Next Steps