Overview
Evaluation is the process of assessing whether generated samples meet the requirements defined in your specification. Dataframer automatically evaluates samples after generation, providing detailed feedback on conformance, quality, and potential issues.
Automatic Evaluation
After sample generation completes, Dataframer automatically:
- Checks Conformance: Verifies samples match specification requirements
- Assesses Quality: Evaluates overall sample quality
- Identifies Issues: Flags problems or deviations
- Provides Classifications: Categorizes samples along variation axes
Automatic evaluation runs in the background. No additional action required.
Evaluation Components
Conformance Checking
Evaluates whether samples satisfy specification requirements.
What’s Checked:
- Required fields are present
- Data formats are correct
- Constraints are satisfied
- Structure matches specification
Results:
- Pass/Fail per sample
- Detailed explanation of failures
- Specific requirement violations
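If you pull per-sample conformance results via the API, the response shape depends on your deployment; the following is a minimal sketch, assuming a hypothetical record with `passed`, `explanation`, and `violations` fields, that computes a conformance rate and prints the failures:

```python
# Hypothetical per-sample conformance records; field names are illustrative,
# not the exact Dataframer response schema.
results = [
    {"sample_id": "s-001", "passed": True, "explanation": "", "violations": []},
    {"sample_id": "s-002", "passed": False,
     "explanation": "Missing required field 'timestamp'",
     "violations": ["required_field:timestamp"]},
]

failed = [r for r in results if not r["passed"]]
conformance_rate = 100 * (len(results) - len(failed)) / len(results)

print(f"Conformance rate: {conformance_rate:.0f}%")
for r in failed:
    print(f"{r['sample_id']}: {r['explanation']} ({', '.join(r['violations'])})")
```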
Sample Classification
Classifies samples according to data property variations defined in your spec.
Example: If your spec defines a “Sentiment” axis with values [“Positive”, “Neutral”, “Negative”], evaluation classifies each sample into one of these categories.
Benefits:
- Verify coverage across variation axes
- Identify gaps in generated diversity
- Ensure balanced distribution
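A quick way to verify coverage is to tally classifications per axis value. The sketch below uses plain Python on an illustrative list of classifications for the “Sentiment” axis; it is not tied to a specific Dataframer response format:

```python
from collections import Counter

# Illustrative classifications for a "Sentiment" axis; in practice these
# would come from the evaluation results rather than being hard-coded.
classifications = ["Positive", "Positive", "Neutral", "Positive", "Negative"]
expected_values = ["Positive", "Neutral", "Negative"]

counts = Counter(classifications)
total = len(classifications)

for value in expected_values:
    share = 100 * counts.get(value, 0) / total
    print(f"{value}: {counts.get(value, 0)} samples ({share:.0f}%)")

# Flag axis values that received no samples at all.
missing = [v for v in expected_values if counts.get(v, 0) == 0]
if missing:
    print("Coverage gap - no samples for:", ", ".join(missing))
```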
Quality Assessment
Overall quality score and feedback.
Metrics:
- Realism and authenticity
- Consistency with specification
- Coherence and readability
- Appropriate complexity
Creating an Evaluation
While automatic evaluation runs after generation, you can also create custom evaluations; a creation-and-polling sketch follows the status list below.
Monitoring Evaluation
Check Status
- PENDING: Evaluation queued
- RUNNING: Evaluation in progress
- COMPLETED: Evaluation finished
- FAILED: Evaluation encountered an error
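The endpoints, payload fields, and authentication below are assumptions for illustration (a hypothetical `/evaluations` resource and an `API_KEY` environment variable); the sketch creates an evaluation for an existing batch and polls until the status leaves PENDING/RUNNING:

```python
import os
import time
import requests

BASE_URL = "https://api.dataframer.example.com/v1"  # hypothetical base URL
HEADERS = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

# Create an evaluation for an existing sample batch (illustrative payload).
resp = requests.post(
    f"{BASE_URL}/evaluations",
    headers=HEADERS,
    json={"batch_id": "batch-123", "model": "anthropic/claude-sonnet-4-5"},
)
resp.raise_for_status()
evaluation_id = resp.json()["id"]

# Poll the status until the evaluation completes or fails.
while True:
    status = requests.get(
        f"{BASE_URL}/evaluations/{evaluation_id}", headers=HEADERS
    ).json()["status"]
    print("status:", status)
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(15)  # evaluation can take several minutes for large batches
```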
Viewing Results
Get Evaluation Summary
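The `/summary` endpoint and field names below are illustrative assumptions rather than a documented Dataframer API; the sketch fetches aggregate metrics for a completed evaluation:

```python
import os
import requests

BASE_URL = "https://api.dataframer.example.com/v1"  # hypothetical base URL
HEADERS = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

# Fetch aggregate results for a completed evaluation (illustrative endpoint
# and evaluation ID).
summary = requests.get(
    f"{BASE_URL}/evaluations/eval-456/summary", headers=HEADERS
).json()

print("Conformance rate:", summary.get("conformance_rate"))
print("Average quality score:", summary.get("average_quality_score"))
print("Pass / fail counts:", summary.get("pass_count"), "/", summary.get("fail_count"))
```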
Get Sample-Level Details
View individual sample evaluations; a sketch of a per-sample fetch follows.
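Again assuming hypothetical endpoint and field names, this sketch lists per-sample conformance and quality results:

```python
import os
import requests

BASE_URL = "https://api.dataframer.example.com/v1"  # hypothetical base URL
HEADERS = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

# List per-sample results for a completed evaluation (illustrative endpoint).
samples = requests.get(
    f"{BASE_URL}/evaluations/eval-456/samples", headers=HEADERS
).json()["samples"]

for s in samples:
    verdict = "PASS" if s["conformance"]["passed"] else "FAIL"
    print(s["id"], verdict, "quality:", s["quality_score"])
    for violation in s["conformance"].get("violations", []):
        print("  -", violation)
```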
Understanding Results
Conformance Rate
Percentage of samples that pass all requirements:
- 95-100%: Excellent - specification is clear and well-followed
- 80-95%: Good - minor issues, consider spec refinement
- 60-80%: Fair - review specification for clarity
- < 60%: Poor - specification needs significant revision
Classification Distribution
Check whether samples cover all variation axes appropriately and whether the distribution across axis values is balanced.
Quality Scores
Individual sample quality on a 0-10 scale:
- 9-10: Exceptional quality
- 7-8: Good quality, minor improvements possible
- 5-6: Acceptable, but noticeable issues
- < 5: Poor quality, regeneration recommended
Chat with Samples
Ask questions about specific samples or the entire evaluation. Chat functionality helps you understand evaluation results and identify patterns in failures.
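If your deployment exposes chat over the API, a request might look like the sketch below; the `/chat` endpoint and response field are assumptions for illustration:

```python
import os
import requests

BASE_URL = "https://api.dataframer.example.com/v1"  # hypothetical base URL
HEADERS = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

# Ask a question about the evaluation as a whole (illustrative endpoint).
reply = requests.post(
    f"{BASE_URL}/evaluations/eval-456/chat",
    headers=HEADERS,
    json={"message": "Which requirement is most commonly violated in the failed samples?"},
).json()

print(reply.get("answer"))
```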
Human Labeling
Add manual labels to samples for custom tracking:
- Mark samples for production use
- Track manual review progress
- Flag samples for revision
- Build training datasets
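A label could be attached to a sample with a call along these lines; the endpoint and payload fields are illustrative assumptions:

```python
import os
import requests

BASE_URL = "https://api.dataframer.example.com/v1"  # hypothetical base URL
HEADERS = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

# Attach a manual label to a single sample (illustrative endpoint and payload).
requests.post(
    f"{BASE_URL}/samples/s-002/labels",
    headers=HEADERS,
    json={"label": "needs_revision", "note": "Timestamp format does not match spec"},
).raise_for_status()
```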
Evaluation Models
Different models may provide different evaluation perspectives:
| Model | Best For |
|---|---|
| anthropic/claude-sonnet-4-5 | Balanced, general purpose (default) |
| anthropic/claude-haiku-4-5 | Fast evaluation for quick checks |
| gemini/gemini-2.5-pro | Complex reasoning tasks |
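If the evaluation model can be chosen at creation time, it would typically be passed in the creation request; the `model` parameter and endpoint below are assumptions based on the identifiers in the table above:

```python
import os
import requests

BASE_URL = "https://api.dataframer.example.com/v1"  # hypothetical base URL
HEADERS = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

# Request a faster model for a quick sanity-check evaluation (illustrative payload).
requests.post(
    f"{BASE_URL}/evaluations",
    headers=HEADERS,
    json={"batch_id": "batch-123", "model": "anthropic/claude-haiku-4-5"},
).raise_for_status()
```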
Best Practices
Review Process
- Check conformance rate first
- Review failed samples to identify patterns
- Examine classification distribution
- Sample high and low quality scores
- Update specification based on findings
Iterative Improvement
- Generate small batch (10 samples)
- Review evaluation results
- Refine specification
- Regenerate and re-evaluate
- Repeat until quality is acceptable
- Scale up to larger batches
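The loop below sketches this workflow. The helpers `generate_batch`, `evaluate_batch`, and `refine_spec` are placeholders standing in for the real generation, evaluation, and spec-editing steps, not functions from an actual client library:

```python
# Placeholder stand-ins for real Dataframer calls; replace each with the
# actual generation, evaluation, and spec-editing steps for your project.
def generate_batch(spec: str, n: int) -> list[str]:
    return [f"sample-{i}" for i in range(n)]                # placeholder samples

def evaluate_batch(samples: list[str]) -> dict:
    return {"conformance_rate": 0.9, "avg_quality": 7.5}    # placeholder results

def refine_spec(spec: str, results: dict) -> str:
    return spec + "\n# clarified based on evaluation feedback"  # placeholder edit

spec = "initial specification"
for iteration in range(5):
    samples = generate_batch(spec, n=10)                    # small batch first
    results = evaluate_batch(samples)
    print(f"iteration {iteration}: {results}")
    if results["conformance_rate"] >= 0.95 and results["avg_quality"] >= 8:
        break                                               # quality acceptable - scale up
    spec = refine_spec(spec, results)                       # otherwise refine and retry
```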
Common Issues
Low Conformance:
- Specification requirements too vague
- Conflicting requirements
- Model doesn’t understand format
Unbalanced Distribution:
- Variation axes not clearly defined
- Some values easier to generate than others
Quality Standards
Define acceptable thresholds for your use case:
- Production data: 95%+ conformance, 8+ quality score
- Testing data: 80%+ conformance, 6+ quality score
- Prototyping: 60%+ conformance, any quality score
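A small helper can make these thresholds explicit when gating a batch; the numbers mirror the guidance above, and the structure is illustrative:

```python
# Thresholds mirror the guidance above: (min conformance rate %, min quality score).
THRESHOLDS = {
    "production":  (95, 8),
    "testing":     (80, 6),
    "prototyping": (60, 0),
}

def meets_standard(use_case: str, conformance_rate: float, quality_score: float) -> bool:
    min_rate, min_quality = THRESHOLDS[use_case]
    return conformance_rate >= min_rate and quality_score >= min_quality

print(meets_standard("production", conformance_rate=96.0, quality_score=8.2))  # True
print(meets_standard("testing", conformance_rate=78.0, quality_score=6.5))     # False
```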
Evaluation Metrics Reference
| Metric | Range | Description |
|---|---|---|
| Conformance Rate | 0-100% | Percentage passing all requirements |
| Quality Score | 0-10 | Overall sample quality assessment |
| Coverage | 0-100% | Distribution across variation axes |
| Pass Count | 0-N | Number of samples passing conformance |
| Fail Count | 0-N | Number of samples failing conformance |
Troubleshooting
Evaluation stuck in RUNNING
- Wait at least 5 minutes (evaluation takes time)
- Check status again after 10-15 minutes
- For large batches (>100 samples), wait up to 30 minutes
- Contact support if stuck for >1 hour
All samples failing conformance
- Review specification for clarity
- Check if requirements are too strict
- Verify specification format is correct
- Try regenerating with different model
Inconsistent quality scores
- Different models may score differently
- Quality is subjective for some content types
- Review specific feedback for context
- Consider human labeling for final decisions

