Overview

This tutorial shows you how to evaluate generated samples to ensure they meet your quality standards. Learn how to use automatic evaluation, interpret results, and iterate for improvement.

What You’ll Learn

  • Understanding automatic evaluation
  • Checking conformance and quality
  • Interpreting evaluation metrics
  • Using chat to understand results
  • Adding human labels

Prerequisites

  • Completed sample generation (see Generating Samples)
  • Run ID from generation
  • API key for authentication

Step 1: Understanding Automatic Evaluation

After generation completes, Dataframer automatically evaluates your samples. No action needed!
What Gets Evaluated:
  • Conformance: Do samples meet requirements?
  • Quality: Are samples realistic and well-formed?
  • Classification: Which variation axes do samples fall into?
Automatic evaluation runs in the background after generation completes.

Step 2: Check Evaluation Status

Verify evaluation is complete:
curl -X GET 'https://df-api.dataframer.ai/api/dataframer/runs/run_xyz456/' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Response:
{
  "run_id": "run_xyz456",
  "spec_id": "spec_xyz789",
  "status": "SUCCEEDED",
  "total_samples": 10,
  "created_at": "2025-11-26T10:30:00Z",
  "evaluation": {
    "status": "COMPLETED",
    "evaluation_id": "eval_abc123"
  }
}
Look for evaluation.status == "COMPLETED".
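If you prefer to script this check, a minimal polling sketch is below. It assumes the run endpoint and response shape shown above; replace run_xyz456 and the API key with your own values, and adjust the polling interval as needed.
import time

import requests

API_KEY = "your_api_key"
headers = {"Authorization": f"Bearer {API_KEY}"}
run_url = "https://df-api.dataframer.ai/api/dataframer/runs/run_xyz456/"

# Poll until the automatic evaluation finishes (field names follow the
# example response above; adjust them if your payload differs).
while True:
    run = requests.get(run_url, headers=headers).json()
    status = run.get("evaluation", {}).get("status")
    if status == "COMPLETED":
        evaluation_id = run["evaluation"]["evaluation_id"]
        print(f"Evaluation ready: {evaluation_id}")
        break
    print(f"Evaluation status: {status}, waiting...")
    time.sleep(30)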

Step 3: View Evaluation Summary

Get high-level evaluation metrics:
curl -X GET 'https://df-api.dataframer.ai/api/dataframer/evaluations/eval_abc123/' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Response:
{
  "evaluation_id": "eval_abc123",
  "run_id": "run_xyz456",
  "status": "COMPLETED",
  "created_at": "2025-11-26T10:35:00Z",
  "metrics": {
    "conformance_rate": 0.90,
    "total_samples": 10,
    "passed_samples": 9,
    "failed_samples": 1,
    "average_quality_score": 8.2
  },
  "classifications": {
    "Sentiment": {
      "Very positive": 2,
      "Positive": 2,
      "Neutral": 3,
      "Negative": 2,
      "Very negative": 1
    },
    "Product Category": {
      "Electronics": 3,
      "Clothing": 2,
      "Home & Garden": 3,
      "Food & Beverage": 2
    }
  }
}

Interpreting Summary Metrics

Conformance Rate: 90%
  • 9 out of 10 samples meet all requirements
  • Strong result, just short of the 95% production target
Average Quality Score: 8.2/10
  • Overall quality is good
  • 8.0 or above is production-ready
Classifications:
  • Good distribution across variation axes
  • Balanced coverage of all categories
Target: 95%+ conformance and a quality score of 8+ for production use.
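If you want to gate on these targets in code, here is a minimal sketch that pulls the summary from the evaluation endpoint shown above and compares it against the production thresholds (95%+ conformance, 8.0+ quality). The field names follow the example response in Step 3.
import requests

API_KEY = "your_api_key"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Fetch the evaluation summary (same endpoint as the curl call above)
response = requests.get(
    "https://df-api.dataframer.ai/api/dataframer/evaluations/eval_abc123/",
    headers=headers
)
metrics = response.json()["metrics"]

conformance = metrics["conformance_rate"]
quality = metrics["average_quality_score"]

# Compare against the production targets above (95%+ conformance, 8+ quality)
print(f"Conformance: {conformance:.0%} (target: 95%+)")
print(f"Quality: {quality:.1f}/10 (target: 8.0+)")
if conformance >= 0.95 and quality >= 8.0:
    print("Meets production targets")
else:
    print("Below production targets; review failed samples in Step 4")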

Step 4: Review Failed Samples

Understand why samples failed:
curl -X GET 'https://df-api.dataframer.ai/api/dataframer/evaluations/eval_abc123/samples/' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Response:
{
  "samples": [
    {
      "sample_index": 0,
      "conformance": "PASS",
      "quality_score": 9.0,
      "classifications": {
        "Sentiment": "Positive",
        "Product Category": "Electronics"
      },
      "feedback": "Excellent sample with authentic voice."
    },
    {
      "sample_index": 7,
      "conformance": "FAIL",
      "quality_score": 5.0,
      "issues": [
        "Missing required field: customer_name",
        "Rating out of valid range (6/5)"
      ],
      "feedback": "Sample does not meet requirements."
    }
  ]
}

Python Analysis Script

import requests

API_KEY = "your_api_key"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Get evaluation details
response = requests.get(
    "https://df-api.dataframer.ai/api/dataframer/evaluations/eval_abc123/samples/",
    headers=headers
)
samples = response.json()["samples"]

# Analyze failures
failed = [s for s in samples if s["conformance"] == "FAIL"]
print(f"Failed samples: {len(failed)}")

# Common failure reasons
all_issues = []
for s in failed:
    all_issues.extend(s.get("issues", []))

from collections import Counter
issue_counts = Counter(all_issues)
print("\nMost common issues:")
for issue, count in issue_counts.most_common(3):
    print(f"  - {issue}: {count} times")

Step 5: Analyze Classification Distribution

Check if variation coverage is balanced:
import requests

API_KEY = "your_api_key"
headers = {"Authorization": f"Bearer {API_KEY}"}

response = requests.get(
    "https://df-api.dataframer.ai/api/dataframer/evaluations/eval_abc123/",
    headers=headers
)
classifications = response.json()["classifications"]

# Check balance for each axis
for axis, values in classifications.items():
    total = sum(values.values())
    print(f"\n{axis} Distribution:")
    for value, count in values.items():
        percentage = (count / total) * 100
        print(f"  {value}: {count} ({percentage:.1f}%)")
    
    # Check if balanced (within 10% of expected)
    expected = 100 / len(values)
    imbalanced = [v for v, c in values.items() 
                  if abs((c/total)*100 - expected) > 10]
    if imbalanced:
        print(f"  ⚠️ Imbalanced: {', '.join(imbalanced)}")
Good Distribution:
Sentiment Distribution:
  Positive: 3 (30%)
  Neutral: 4 (40%)
  Negative: 3 (30%)
Poor Distribution:
Sentiment Distribution:
  Positive: 8 (80%)
  Neutral: 1 (10%)
  Negative: 1 (10%)
  ⚠️ Imbalanced: Positive

Step 6: Use Chat to Understand Results

Ask questions about specific samples:
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/evaluations/eval_abc123/chat/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "Why did sample 7 fail conformance?",
    "sample_index": 7
  }'
Response:
{
  "response": "Sample 7 failed conformance because it was missing the required 'customer_name' field and the rating value (6) exceeded the valid range of 1-5 stars specified in the requirements.",
  "context": "sample_specific"
}

Example Queries

# Ask about overall quality
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/evaluations/eval_abc123/chat/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "What are the most common quality issues?"
  }'

# Ask about classification
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/evaluations/eval_abc123/chat/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "Why are most samples classified as positive?"
  }'
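If you are asking several questions, a small wrapper around the chat endpoint keeps things tidy. The sketch below assumes the request and response shapes shown above (a "query" field, an optional "sample_index", and a "response" string in the reply).
import requests

API_KEY = "your_api_key"
headers = {"Authorization": f"Bearer {API_KEY}"}
chat_url = "https://df-api.dataframer.ai/api/dataframer/evaluations/eval_abc123/chat/"

def ask(query, sample_index=None):
    # sample_index is optional; include it only for sample-specific questions
    payload = {"query": query}
    if sample_index is not None:
        payload["sample_index"] = sample_index
    response = requests.post(chat_url, headers=headers, json=payload)
    return response.json()["response"]

print(ask("What are the most common quality issues?"))
print(ask("Why did sample 7 fail conformance?", sample_index=7))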

Step 7: Add Human Labels

Mark samples for your own tracking:
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/samples/sample_xyz/labels/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "label": "approved",
    "notes": "Excellent quality, ready for production use"
  }'
Label Types:
  • approved: Ready for production
  • rejected: Does not meet standards
  • needs_review: Requires manual review
  • flagged: Potential issue
  • Custom labels as needed

Batch Labeling Script

import requests

API_KEY = "your_api_key"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Get high-quality samples (score > 8)
response = requests.get(
    "https://df-api.dataframer.ai/api/dataframer/evaluations/eval_abc123/samples/",
    headers=headers
)
samples = response.json()["samples"]

high_quality = [s for s in samples if s["quality_score"] > 8]

# Label them as approved
# Note: assumes each sample object carries a 'sample_id' that matches the
# labels endpoint; the Step 4 response shows 'sample_index', so use whichever
# identifier your evaluation samples actually return.
for sample in high_quality:
    requests.post(
        f"https://df-api.dataframer.ai/api/dataframer/samples/{sample['sample_id']}/labels/",
        headers=headers,
        json={
            "label": "approved",
            "notes": f"Quality score: {sample['quality_score']}"
        }
    )

print(f"Approved {len(high_quality)} samples")

Step 8: Decide Next Steps

Based on evaluation results:

Scenario 1: Good Results (>90% conformance, >8 quality)

Action: Use samples for production
# Download the generated samples (replace gen_abc123 with your generation ID from Generating Samples)
curl -X GET 'https://df-api.dataframer.ai/api/dataframer/generate/retrieve/gen_abc123' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  --output production_samples.zip
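A Python equivalent, assuming the retrieve endpoint returns a ZIP archive as in the curl call above, downloads the archive and unpacks it for downstream use:
import zipfile

import requests

API_KEY = "your_api_key"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Download the archive (same endpoint as the curl call above)
response = requests.get(
    "https://df-api.dataframer.ai/api/dataframer/generate/retrieve/gen_abc123",
    headers=headers
)
with open("production_samples.zip", "wb") as f:
    f.write(response.content)

# Unpack the samples for downstream use
with zipfile.ZipFile("production_samples.zip") as archive:
    archive.extractall("production_samples")
print("Samples extracted to ./production_samples")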

Scenario 2: Fair Results (70-90% conformance, 6-8 quality)

⚠️ Action: Refine specification and regenerate
  1. Identify common issues
  2. Update specification to address issues
  3. Generate new batch
  4. Re-evaluate

Scenario 3: Poor Results (<70% conformance, <6 quality)

Action: Major specification revision needed
  1. Review failed samples in detail
  2. Rewrite requirements more clearly
  3. Add specific examples to spec
  4. Consider different model
  5. Start with small test batch

Iterative Improvement Process

Workflow:
  1. Generate → 2. Evaluate → 3. Review results
  • If good → Use samples
  • If not good → Update spec → Repeat from step 1
Steps:
  1. Generate small batch (10-20 samples)
  2. Evaluate results
  3. Review failures and quality
  4. Update specification
  5. Repeat until satisfactory
  6. Scale to larger batch

Common Issues

Problem: Many samples fail requirements
Possible causes:
  • Vague requirements
  • Contradictory requirements
  • Model doesn’t understand format
Solutions:
  • Be more specific in requirements
  • Add format examples to spec
  • Simplify complex requirements
  • Try different model
Problem: Samples score below 6
Possible causes:
  • Poor seed data quality
  • Unclear specification
  • Wrong sample type (short vs long)
Solutions:
  • Improve seed data
  • Clarify specification requirements
  • Use long samples for complex data
  • Increase temperature for more diversity
Problem: Some variation values rarely appear
Causes:
  • Unclear variation descriptions
  • Model bias toward certain values
  • Too many variation axes
Solutions:
  • Clarify variation axis descriptions
  • Make all values equally specific
  • Reduce number of axes
  • Generate larger batch
Problem: Evaluation is taking a long time
Normal time: 5-10 minutes for 100 samples
If longer:
  • Wait up to 15 minutes
  • Check evaluation status
  • Large batches take longer
  • Contact support if stuck

Quality Benchmarks

By Use Case

Use Case        | Min Conformance | Min Quality | Notes
Production data | 95%             | 8.0         | Strict standards
Testing/QA      | 85%             | 7.0         | Moderate standards
Research        | 70%             | 6.0         | Exploratory
Prototyping     | 60%             | 5.0         | Rapid iteration
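If you want to apply these benchmarks programmatically, the sketch below encodes the table as thresholds and checks an evaluation summary against the chosen use case. It reuses the evaluation endpoint from Step 3; the threshold values come directly from the table above.
import requests

# Thresholds from the table above: (min conformance rate, min quality score)
BENCHMARKS = {
    "production": (0.95, 8.0),
    "testing": (0.85, 7.0),
    "research": (0.70, 6.0),
    "prototyping": (0.60, 5.0),
}

API_KEY = "your_api_key"
headers = {"Authorization": f"Bearer {API_KEY}"}

def meets_benchmark(evaluation_id, use_case="production"):
    # Fetch the summary metrics (same endpoint as Step 3)
    response = requests.get(
        f"https://df-api.dataframer.ai/api/dataframer/evaluations/{evaluation_id}/",
        headers=headers
    )
    metrics = response.json()["metrics"]
    min_conformance, min_quality = BENCHMARKS[use_case]
    return (metrics["conformance_rate"] >= min_conformance
            and metrics["average_quality_score"] >= min_quality)

if meets_benchmark("eval_abc123", "production"):
    print("Ready for production use")
else:
    print("Refine the specification and regenerate (see Step 8)")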

Next Steps