Overview
This tutorial shows you how to evaluate generated samples to ensure they meet your quality standards. Learn how to use automatic evaluation, interpret results, and iterate for improvement.

What You’ll Learn
- Understanding automatic evaluation
- Checking conformance and quality
- Interpreting evaluation metrics
- Using chat to understand results
- Adding human labels
Prerequisites
- Completed sample generation (see Generating Samples)
- Run ID from generation
- API key for authentication
Step 1: Understanding Automatic Evaluation
After generation completes, Dataframer automatically evaluates your samples. No action needed!

What Gets Evaluated:
- Conformance: Do samples meet requirements?
- Quality: Are samples realistic and well-formed?
- Classification: Which variation axes do samples fall into?
Automatic evaluation runs in the background after generation completes.
Step 2: Check Evaluation Status
Verify that evaluation is complete by checking that evaluation.status == "COMPLETED".
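A minimal polling sketch for this check. The `get_status` callable stands in for whatever client call your Dataframer setup uses to fetch the evaluation status; the status strings other than "COMPLETED" are assumptions.

```python
import time

def wait_for_evaluation(get_status, timeout_s=900, poll_s=10):
    """Poll until the evaluation reaches a terminal state.

    get_status: any callable returning the evaluation status string
    (e.g. a thin wrapper around your API client -- the exact client
    call is an assumption and not shown here).
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = get_status()
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(poll_s)
    raise TimeoutError("evaluation did not finish within the timeout")

# Demo with a stub that reports completion on the third poll.
statuses = iter(["PENDING", "RUNNING", "COMPLETED"])
print(wait_for_evaluation(lambda: next(statuses), poll_s=0))  # COMPLETED
```

The timeout default mirrors the "wait up to 15 minutes" guidance under Common Issues below.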
Step 3: View Evaluation Summary
Get high-level evaluation metrics for the run.

Interpreting Summary Metrics

Conformance Rate: 90%
- 9 out of 10 samples meet all requirements
- Excellent result (>90% is the target)

Quality Score
- Overall quality is good
- Above 8.0 is production-ready

Classification
- Good distribution across variation axes
- Balanced coverage of all categories
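If you want to recompute these summary numbers yourself from per-sample results, a minimal sketch follows. The `conforms` and `quality` field names are illustrative assumptions, not the actual evaluation payload schema.

```python
def summarize(samples):
    """Compute conformance rate and mean quality from per-sample
    evaluation records (field names are illustrative)."""
    n = len(samples)
    conformant = sum(1 for s in samples if s["conforms"])
    mean_quality = sum(s["quality"] for s in samples) / n
    return {
        "conformance_rate": conformant / n,
        "mean_quality": round(mean_quality, 2),
    }

# Ten mock samples, one failing -- matching the 90% example above.
samples = [
    {"conforms": i != 3, "quality": 8.0 + (i % 3) * 0.5}
    for i in range(10)
]
print(summarize(samples))  # {'conformance_rate': 0.9, 'mean_quality': 8.45}
```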
Step 4: Review Failed Samples
Understand why samples failed.

Python Analysis Script
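A sketch of such a script: it groups failures by the requirement that was violated, so recurring problems stand out. The `conforms` and `failed_requirements` fields are assumed names for illustration.

```python
from collections import Counter

def failure_breakdown(samples):
    """Count which requirements fail most often, to spot patterns.
    Assumes each failed sample lists the requirements it violated
    (the exact evaluation payload shape is an assumption)."""
    counts = Counter()
    for s in samples:
        if not s["conforms"]:
            counts.update(s.get("failed_requirements", []))
    return counts.most_common()

samples = [
    {"conforms": True,  "failed_requirements": []},
    {"conforms": False, "failed_requirements": ["max_length", "tone"]},
    {"conforms": False, "failed_requirements": ["max_length"]},
]
print(failure_breakdown(samples))  # [('max_length', 2), ('tone', 1)]
```

A requirement that dominates this list is usually the first thing to clarify in the specification.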
Step 5: Analyze Classification Distribution
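One way to check balance is to tally how often each value of a variation axis appears. This sketch assumes each evaluated sample carries a `classification` dict mapping axis to value; that field name is an assumption.

```python
from collections import Counter

def axis_distribution(samples, axis):
    """Return the share of samples classified into each value of
    one variation axis (field names are illustrative)."""
    counts = Counter(s["classification"][axis] for s in samples)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

samples = [
    {"classification": {"tone": "formal"}},
    {"classification": {"tone": "formal"}},
    {"classification": {"tone": "casual"}},
    {"classification": {"tone": "neutral"}},
]
print(axis_distribution(samples, "tone"))
# {'formal': 0.5, 'casual': 0.25, 'neutral': 0.25}
```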
Check whether variation coverage is balanced across axes and values.

Step 6: Use Chat to Understand Results
Ask questions about specific samples:

Example Queries
- "Why did this sample fail conformance?"
- "Which samples scored lowest on quality, and why?"
- "Show me samples that were classified as the rarest variation value."
Step 7: Add Human Labels
Mark samples for your own tracking:
- approved: Ready for production
- rejected: Does not meet standards
- needs_review: Requires manual review
- flagged: Potential issue
- Custom labels as needed
Batch Labeling Script
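A sketch of such a script: it applies the labels above in bulk based on each sample's evaluation results. The decision rule and field names are illustrative, and the call that would persist each label to Dataframer is left as a commented stub, since the real endpoint is not shown here.

```python
def batch_label(samples, label_fn):
    """Assign a label to every sample using a caller-supplied rule.
    Persisting labels via the API is stubbed out (hypothetical call)."""
    labels = {}
    for s in samples:
        labels[s["id"]] = label_fn(s)
        # e.g. client.label_sample(run_id, s["id"], labels[s["id"]])  # hypothetical
    return labels

def rule(sample):
    """Example rule: reject non-conformant samples, approve high
    quality, send the rest to manual review (thresholds are examples)."""
    if not sample["conforms"]:
        return "rejected"
    return "approved" if sample["quality"] >= 8.0 else "needs_review"

samples = [
    {"id": 1, "conforms": True,  "quality": 8.6},
    {"id": 2, "conforms": False, "quality": 7.1},
    {"id": 3, "conforms": True,  "quality": 6.9},
]
print(batch_label(samples, rule))
# {1: 'approved', 2: 'rejected', 3: 'needs_review'}
```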
Step 8: Decide Next Steps
Based on evaluation results:

Scenario 1: Good Results (>90% conformance, >8 quality)
✅ Action: Use samples for production

Scenario 2: Fair Results (70-90% conformance, 6-8 quality)
⚠️ Action: Refine specification and regenerate
- Identify common issues
- Update specification to address issues
- Generate new batch
- Re-evaluate
Scenario 3: Poor Results (<70% conformance, <6 quality)
❌ Action: Major specification revision needed
- Review failed samples in detail
- Rewrite requirements more clearly
- Add specific examples to spec
- Consider different model
- Start with small test batch
Iterative Improvement Process
Workflow:
1. Generate → 2. Evaluate → 3. Review results
4. If good → Use samples
5. If not good → Update spec → Repeat from step 1

Recommended iteration cycle:
1. Generate small batch (10-20 samples)
2. Evaluate results
3. Review failures and quality
4. Update specification
5. Repeat until satisfactory
6. Scale to larger batch
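The loop above can be sketched in a few lines. The `generate`, `evaluate`, and `update_spec` callables stand in for your own pipeline steps; they are stubs, not real Dataframer API methods.

```python
def iterate(generate, evaluate, update_spec,
            target_conformance=0.9, max_rounds=5):
    """Generate -> evaluate -> refine until results hit the target
    conformance rate, or until max_rounds is exhausted."""
    for round_no in range(1, max_rounds + 1):
        samples = generate()
        rate = evaluate(samples)
        if rate >= target_conformance:
            return round_no, rate
        update_spec()  # refine the specification before the next batch
    return max_rounds, rate

# Stub pipeline whose conformance improves each round.
rates = iter([0.6, 0.8, 0.95])
print(iterate(lambda: [], lambda s: next(rates), lambda: None))  # (3, 0.95)
```

Keeping batches small while this loop runs keeps each iteration cheap; only the final, satisfactory round needs to be scaled up.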
Common Issues
Low conformance rate
Problem: Many samples failing requirements

Possible causes:
- Vague requirements
- Contradictory requirements
- Model doesn’t understand the format

Solutions:
- Be more specific in requirements
- Add format examples to the spec
- Simplify complex requirements
- Try a different model
Low quality scores
Problem: Samples score below 6

Possible causes:
- Poor seed data quality
- Unclear specification
- Wrong sample type (short vs long)

Solutions:
- Improve seed data
- Clarify specification requirements
- Use long samples for complex data
- Increase temperature for more diversity
Unbalanced classification
Problem: Some variation values rarely appear

Possible causes:
- Unclear variation descriptions
- Model bias toward certain values
- Too many variation axes

Solutions:
- Clarify variation axis descriptions
- Make all values equally specific
- Reduce the number of axes
- Generate a larger batch
Evaluation takes too long
Normal time: 5-10 minutes for 100 samples

If it takes longer:
- Wait up to 15 minutes
- Check evaluation status
- Large batches take longer
- Contact support if stuck
Quality Benchmarks
By Use Case
| Use Case | Min Conformance | Min Quality | Notes |
|---|---|---|---|
| Production data | 95% | 8.0 | Strict standards |
| Testing/QA | 85% | 7.0 | Moderate standards |
| Research | 70% | 6.0 | Exploratory |
| Prototyping | 60% | 5.0 | Rapid iteration |