
Overview

Evaluation is the process of assessing whether generated samples meet the requirements defined in your specification. Dataframer automatically evaluates samples after generation, providing detailed feedback on conformance, quality, and potential issues.

Automatic Evaluation

After sample generation completes, Dataframer automatically:
  1. Checks Conformance: Verifies samples match specification requirements
  2. Assesses Quality: Evaluates overall sample quality
  3. Identifies Issues: Flags problems or deviations
  4. Provides Classifications: Categorizes samples along variation axes
Automatic evaluation runs in the background; no additional action is required.

Evaluation Components

Conformance Checking

Evaluates whether samples satisfy specification requirements.
What’s Checked:
  • Required fields are present
  • Data formats are correct
  • Constraints are satisfied
  • Structure matches specification
Results:
  • Pass/Fail per sample
  • Detailed explanation of failures
  • Specific requirement violations

Sample Classification

Classifies samples according to data property variations defined in your spec.
Example: If your spec defines a “Sentiment” axis with values [“Positive”, “Neutral”, “Negative”], evaluation classifies each sample into one of these categories.
Benefits:
  • Verify coverage across variation axes
  • Identify gaps in generated diversity
  • Ensure balanced distribution

Quality Assessment

Overall quality score and feedback.
Metrics:
  • Realism and authenticity
  • Consistency with specification
  • Coherence and readability
  • Appropriate complexity

Creating an Evaluation

While automatic evaluation runs after generation, you can also create custom evaluations:
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/evaluations/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "run_id": "run_xyz789",
    "name": "Custom Quality Check",
    "model": "anthropic/claude-sonnet-4-5"
  }'
Response:
{
  "task_id": "eval_abc123",
  "status": "PENDING",
  "evaluation_id": "eval_xyz789"
}
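The same request can be made from a script. Below is a minimal Python sketch using the requests library, assuming the endpoint and request fields shown above; replace YOUR_API_KEY and the run ID with your own values.
import requests

API_BASE = "https://df-api.dataframer.ai/api/dataframer"
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

# Create a custom evaluation for an existing generation run
resp = requests.post(
    f"{API_BASE}/evaluations/",
    headers=HEADERS,
    json={
        "run_id": "run_xyz789",
        "name": "Custom Quality Check",
        "model": "anthropic/claude-sonnet-4-5",
    },
)
resp.raise_for_status()
task = resp.json()
print(task["task_id"], task["status"])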

Monitoring Evaluation

Check Status

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/evaluations/status/eval_abc123' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Response:
{
  "task_id": "eval_abc123",
  "status": "COMPLETED",
  "progress": 100,
  "evaluation_id": "eval_xyz789"
}
Status Values:
  • PENDING: Evaluation queued
  • RUNNING: Evaluation in progress
  • COMPLETED: Evaluation finished
  • FAILED: Evaluation encountered an error
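Because evaluation is asynchronous, a client typically polls the status endpoint until it reaches a terminal state. A minimal Python sketch, assuming the status endpoint and status values shown above; the 30-second poll interval is only an example.
import time
import requests

API_BASE = "https://df-api.dataframer.ai/api/dataframer"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def wait_for_evaluation(task_id, poll_seconds=30):
    """Poll evaluation status until it reaches COMPLETED or FAILED."""
    while True:
        resp = requests.get(f"{API_BASE}/evaluations/status/{task_id}", headers=HEADERS)
        resp.raise_for_status()
        status = resp.json()
        if status["status"] in ("COMPLETED", "FAILED"):
            return status
        time.sleep(poll_seconds)

result = wait_for_evaluation("eval_abc123")
print(result["status"], result.get("evaluation_id"))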

Viewing Results

Get Evaluation Summary

curl -X GET 'https://df-api.dataframer.ai/api/dataframer/evaluations/eval_xyz789/' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Response:
{
  "evaluation_id": "eval_xyz789",
  "run_id": "run_xyz789",
  "status": "COMPLETED",
  "created_at": "2025-11-26T10:00:00Z",
  "metrics": {
    "conformance_rate": 0.95,
    "total_samples": 100,
    "passed_samples": 95,
    "failed_samples": 5,
    "average_quality_score": 8.5
  },
  "classifications": {
    "Sentiment": {
      "Positive": 35,
      "Neutral": 30,
      "Negative": 35
    },
    "Product Category": {
      "Electronics": 25,
      "Clothing": 25,
      "Home goods": 25,
      "Food": 25
    }
  }
}
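Once the evaluation completes, the summary metrics can be checked programmatically. A minimal Python sketch, assuming the response fields shown above; the 0.80 conformance threshold is only an example.
import requests

API_BASE = "https://df-api.dataframer.ai/api/dataframer"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

resp = requests.get(f"{API_BASE}/evaluations/eval_xyz789/", headers=HEADERS)
resp.raise_for_status()
summary = resp.json()

metrics = summary["metrics"]
print(f"Conformance: {metrics['conformance_rate']:.0%} "
      f"({metrics['passed_samples']}/{metrics['total_samples']} passed)")
print(f"Average quality: {metrics['average_quality_score']}")

# Flag the run if conformance falls below an example threshold
if metrics["conformance_rate"] < 0.80:
    print("Conformance is low; review failed samples and refine the spec.")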

Get Sample-Level Details

View individual sample evaluations:
curl -X GET 'https://df-api.dataframer.ai/api/dataframer/evaluations/eval_xyz789/samples/' \
  -H 'Authorization: Bearer YOUR_API_KEY'
Response:
{
  "samples": [
    {
      "sample_index": 0,
      "conformance": "PASS",
      "quality_score": 9.0,
      "classifications": {
        "Sentiment": "Positive",
        "Product Category": "Electronics"
      },
      "feedback": "Excellent sample with authentic voice and proper format."
    },
    {
      "sample_index": 1,
      "conformance": "FAIL",
      "quality_score": 4.0,
      "issues": [
        "Missing required field: product_name",
        "Rating value out of range (6/5)"
      ],
      "feedback": "Sample does not meet requirements."
    }
  ]
}
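Sample-level results make it easy to spot common failure patterns. A minimal Python sketch, assuming the fields shown above (the issues list appears only on failing samples).
from collections import Counter
import requests

API_BASE = "https://df-api.dataframer.ai/api/dataframer"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

resp = requests.get(f"{API_BASE}/evaluations/eval_xyz789/samples/", headers=HEADERS)
resp.raise_for_status()
samples = resp.json()["samples"]

failed = [s for s in samples if s["conformance"] == "FAIL"]
print(f"{len(failed)} of {len(samples)} samples failed conformance")

# Count how often each issue appears across failing samples
issue_counts = Counter(issue for s in failed for issue in s.get("issues", []))
for issue, count in issue_counts.most_common():
    print(f"{count}x  {issue}")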

Understanding Results

Conformance Rate

Percentage of samples that pass all requirements:
  • 95-100%: Excellent - specification is clear and well-followed
  • 80-95%: Good - minor issues, consider spec refinement
  • 60-80%: Fair - review specification for clarity
  • < 60%: Poor - specification needs significant revision
If conformance is low, review failed samples to identify common issues, then update your specification.

Classification Distribution

Check if samples cover all variation axes appropriately.
Balanced Distribution:
Sentiment:
  Positive: 33%
  Neutral: 34%
  Negative: 33%
✅ Good coverage across all values
Unbalanced Distribution:
Sentiment:
  Positive: 80%
  Neutral: 15%
  Negative: 5%
⚠️ Poor coverage - may need to adjust specification or regenerate
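You can compute these shares from the classifications block of the evaluation summary and flag values that drift too far from a uniform split. A minimal Python sketch; the 50% tolerance around the uniform share is only an example.
def check_balance(classifications, tolerance=0.5):
    """Flag axis values whose share deviates from a uniform split by more than the tolerance."""
    for axis, counts in classifications.items():
        total = sum(counts.values())
        expected = 1 / len(counts)  # uniform share per value
        for value, count in counts.items():
            share = count / total
            if abs(share - expected) > tolerance * expected:
                print(f"{axis} / {value}: {share:.0%} (expected ~{expected:.0%})")

# Example: the unbalanced distribution shown above gets flagged
classifications = {"Sentiment": {"Positive": 80, "Neutral": 15, "Negative": 5}}
check_balance(classifications)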

Quality Scores

Individual sample quality on a 0-10 scale:
  • 9-10: Exceptional quality
  • 7-8: Good quality, minor improvements possible
  • 5-6: Acceptable, but noticeable issues
  • < 5: Poor quality, regeneration recommended

Chat with Samples

Ask questions about specific samples or the entire evaluation:
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/evaluations/eval_xyz789/chat/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "Why did sample 5 fail conformance?",
    "sample_index": 5
  }'
Response:
{
  "response": "Sample 5 failed conformance because it was missing the required 'customer_name' field and the rating was formatted as text ('five stars') instead of a number (1-5).",
  "context": "sample_specific"
}
Chat functionality helps you understand evaluation results and identify patterns in failures.
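The chat endpoint can also be called from a script, for example to ask about each failed sample in turn. A minimal Python sketch, assuming the endpoint and request fields shown above.
import requests

API_BASE = "https://df-api.dataframer.ai/api/dataframer"
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

def ask_about_sample(evaluation_id, sample_index, query):
    """Send a sample-specific question to the evaluation chat endpoint."""
    resp = requests.post(
        f"{API_BASE}/evaluations/{evaluation_id}/chat/",
        headers=HEADERS,
        json={"query": query, "sample_index": sample_index},
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(ask_about_sample("eval_xyz789", 5, "Why did sample 5 fail conformance?"))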

Human Labeling

Add manual labels to samples for custom tracking:
curl -X POST 'https://df-api.dataframer.ai/api/dataframer/samples/sample_xyz/labels/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json' \
  -d '{
    "label": "approved",
    "notes": "Excellent quality, ready for production"
  }'
Use Cases:
  • Mark samples for production use
  • Track manual review progress
  • Flag samples for revision
  • Build training datasets
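Labeling can also be driven from a script once samples have been reviewed. A minimal Python sketch, assuming the labeling endpoint and fields shown above; the sample ID used here is the placeholder from the curl example, so substitute your own.
import requests

API_BASE = "https://df-api.dataframer.ai/api/dataframer"
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

def label_sample(sample_id, label, notes=""):
    """Attach a manual label to a single sample."""
    resp = requests.post(
        f"{API_BASE}/samples/{sample_id}/labels/",
        headers=HEADERS,
        json={"label": label, "notes": notes},
    )
    resp.raise_for_status()
    return resp.json()

# Hypothetical usage: approve a sample that passed manual review
label_sample("sample_xyz", "approved", "Excellent quality, ready for production")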

Evaluation Models

Different models may provide different evaluation perspectives:
Model                         Best For
anthropic/claude-sonnet-4-5   Balanced, general purpose (default)
anthropic/claude-haiku-4-5    Fast evaluation for quick checks
gemini/gemini-2.5-pro         Complex reasoning tasks
Use the same model for generation and evaluation for consistency.

Best Practices

When reviewing evaluation results:
  1. Check conformance rate first
  2. Review failed samples to identify patterns
  3. Examine classification distribution
  4. Sample high and low quality scores
  5. Update specification based on findings
When refining a new specification, iterate on small batches before scaling up:
  • Generate a small batch (10 samples)
  • Review evaluation results
  • Refine specification
  • Regenerate and re-evaluate
  • Repeat until quality is acceptable
  • Scale up to larger batches
Common issues:
Low Conformance:
  • Specification requirements too vague
  • Conflicting requirements
  • Model doesn’t understand format
Solution: Clarify requirements, add examples, simplify constraints
Unbalanced Classification:
  • Variation axes not clearly defined
  • Some values easier to generate than others
Solution: Refine variation descriptions, ensure all values are clear
Define acceptable thresholds for your use case:
  • Production data: 95%+ conformance, 8+ quality score
  • Testing data: 80%+ conformance, 6+ quality score
  • Prototyping: 60%+ conformance, any quality score
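These thresholds can be encoded as a simple acceptance check against the evaluation summary metrics. A minimal Python sketch; the threshold values mirror the guidance above and should be tuned to your use case.
# Thresholds per use case: (minimum conformance rate, minimum average quality score)
THRESHOLDS = {
    "production": (0.95, 8.0),
    "testing": (0.80, 6.0),
    "prototyping": (0.60, 0.0),
}

def meets_threshold(metrics, use_case):
    """Return True if evaluation metrics satisfy the thresholds for the given use case."""
    min_conformance, min_quality = THRESHOLDS[use_case]
    return (metrics["conformance_rate"] >= min_conformance
            and metrics["average_quality_score"] >= min_quality)

metrics = {"conformance_rate": 0.95, "average_quality_score": 8.5}
print(meets_threshold(metrics, "production"))  # True for the example metrics above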

Evaluation Metrics Reference

Metric             Range    Description
Conformance Rate   0-100%   Percentage passing all requirements
Quality Score      0-10     Overall sample quality assessment
Coverage           0-100%   Distribution across variation axes
Pass Count         0-N      Number of samples passing conformance
Fail Count         0-N      Number of samples failing conformance

Troubleshooting

Evaluation not completing:
  • Wait at least 5 minutes (evaluation takes time)
  • Check status again after 10-15 minutes
  • For large batches (>100 samples), wait up to 30 minutes
  • Contact support if stuck for >1 hour
Many samples failing conformance:
  • Review specification for clarity
  • Check if requirements are too strict
  • Verify specification format is correct
  • Try regenerating with a different model
Quality scores seem inconsistent:
  • Different models may score differently
  • Quality is subjective for some content types
  • Review specific feedback for context
  • Consider human labeling for final decisions

Next Steps