Run this exact tutorial interactively in Google Colab
Support Chatbot Broader Evaluations with Contextual Eval Sets
Building a reliable eval suite for a support chatbot is expensive. You need diverse scenarios, realistic contexts, and correct golden responses — and collecting all three from real data takes significant manual effort. This tutorial walks through four problems DataFramer solves at once:

- Contextual datasets for variety — go beyond the narrow set of scenarios that appear in your seed, and generate interactions across the full space of intents, order states, and policy situations.
- Golden labels without manual writing — each generated sample comes with a DataFramer-produced correct response, so you don’t need human labellers for the bulk of your eval set.
- Broader evaluations early — stress-test your chatbot on edge cases (policy conflicts, stacked discounts, information discrepancies) before you encounter them in production.
- Production situations → eval coverage — upload real customer interactions as a seed, and DataFramer generates targeted variants of those exact scenarios to grow your regression suite.
Prerequisites
- Python 3.9+
- An `OPENAI_API_KEY`
- A `DATAFRAMER_API_KEY`
Part 1: The Starting Point — A Small Seed Dataset
The seed dataset has 11 rows of real customer interactions. Each row contains the raw `instruction`, `category` and `intent` labels, a `context` block (order details, policy guidelines), and a `response` that serves as the golden label.
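The seed file itself isn't reproduced here, but its shape can be sketched with the standard library. The column names come from the description above; the row values below are invented for illustration:

```python
import csv
import io

# Column names taken from the seed description; the row content is
# an invented example, not a row from the real dataset.
SEED_COLUMNS = ["instruction", "category", "intent", "context", "response"]

rows = [
    {
        "instruction": "My order arrived damaged. Can I return it?",
        "category": "returns",
        "intent": "initiate_return",
        "context": "Order #1001: delivered 2 days ago. Policy: returns accepted within 30 days.",
        "response": "I'm sorry the order arrived damaged. It was delivered 2 days ago, "
                    "well within the 30-day window, so I've started a return for you.",
    },
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=SEED_COLUMNS)
writer.writeheader()
writer.writerows(rows)
seed_csv = buf.getvalue()
```

The real seed would simply have 11 such rows in a CSV with these five columns.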
Part 2: Generate a Broader Eval Set with DataFramer
Upload the seed as a dataset
The CSV is packed into a ZIP and uploaded to DataFramer. DataFramer analyses the structure, column patterns, intent distribution, policy rules embedded in the context, and response style.
Generate a Specification
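The packing step uses only the standard library; the upload itself is left as a placeholder comment, since the actual DataFramer SDK call isn't shown in this excerpt:

```python
import io
import zipfile

def pack_seed(csv_bytes: bytes, name: str = "seed.csv") -> bytes:
    """Pack the seed CSV into an in-memory ZIP archive for upload."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, csv_bytes)
    return buf.getvalue()

zip_bytes = pack_seed(b"instruction,category,intent,context,response\n")

# Hypothetical upload step -- substitute the real DataFramer client
# method here; it is not part of this excerpt.
# dataset = client.datasets.upload(file=zip_bytes, filename="seed.zip")
```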
The Specification captures everything DataFramer learned: column structure, intent and category distributions, the policy patterns that appear in context blocks, and response tone. The `generation_objectives` parameter tells DataFramer which properties to surface explicitly — in this case, we want the conflict type exposed as a first-class property so we can target it.
`extrapolate_values=True` tells DataFramer to infer realistic values beyond what appeared in the seed — so your generated eval set covers the full distribution, not just the 11 values you started with.
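Only the two parameters named above come from the tutorial; the request shape around them is a hypothetical sketch, not the real SDK schema:

```python
# Sketch of a specification request. `generation_objectives` and
# `extrapolate_values` are named in the tutorial; everything else here
# (key names, the dataset_id placeholder) is illustrative.
spec_request = {
    "dataset_id": "YOUR_DATASET_ID",  # placeholder, returned by the upload step
    "generation_objectives": [
        "expose the conflict type as a first-class, targetable property",
    ],
    "extrapolate_values": True,  # infer realistic values beyond the seed
}
```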
Target the edge cases you care about
The seed skews heavily toward `no_conflict`. The cases that actually break chatbots are the hard ones: customers requesting stacked discounts, policy preventing the action they want, or conflicting information in the order details. Override the distribution to concentrate all generated samples on those three scenarios:
Without this override, generation would spend most of its budget on `no_conflict` rows that tell you nothing about where your chatbot breaks.
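A distribution override concentrating everything on the three hard scenarios might look like this. `no_conflict` is the property value named in the text; the other three labels and the weights-as-a-dict format are assumptions for illustration:

```python
# Concentrate generation on the three conflict types described above.
# `no_conflict` appears in the seed; the other labels are illustrative
# names for the hard scenarios, not values confirmed by the spec.
conflict_distribution = {
    "stacked_discount_request": 0.34,
    "policy_prevents_action": 0.33,
    "conflicting_order_information": 0.33,
    "no_conflict": 0.0,  # the seed skews here; exclude it entirely
}

assert abs(sum(conflict_distribution.values()) - 1.0) < 1e-9
```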
Generate 50 samples with golden labels
Each generated sample contains an `instruction`, a `context` (with realistic order details and applicable policy rules), `intent`, `category`, and — critically — a correct `response` that serves as the golden label. You get 50 labelled eval examples without any human writing effort.
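Before running the evaluation, it is worth checking that every generated row actually carries the fields the eval needs. A minimal validation pass (field names from the description above; the sample rows are invented):

```python
# Fields each generated sample must carry for the evaluation to run.
REQUIRED_FIELDS = {"instruction", "context", "intent", "category", "response"}

def validate_samples(samples: list) -> list:
    """Return the indices of samples missing or blank on any required field."""
    return [
        i for i, s in enumerate(samples)
        if not REQUIRED_FIELDS.issubset(s) or not all(s[f] for f in REQUIRED_FIELDS)
    ]

# One complete sample and one incomplete one, for illustration:
samples = [
    {f: "x" for f in REQUIRED_FIELDS},
    {"instruction": "Where is my order?", "context": ""},
]
print(validate_samples(samples))  # -> [1]
```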
Part 3: Run the Evaluation
With 50 generated rows in hand, the evaluation runs in two stages.
Stage 1 — Generate chatbot responses
Feed each `instruction` + `context` to `gpt-4o` and collect its response:
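A sketch of that loop, assuming the current `openai` Python client's Chat Completions API; the system prompt wording is an illustrative placeholder, not the tutorial's actual prompt:

```python
def build_messages(instruction: str, context: str) -> list:
    """Combine a generated instruction and its context into a chat request."""
    system = (
        "You are a customer support agent. Answer using only the facts "
        "and policies given in the context."  # illustrative system prompt
    )
    user = f"Context:\n{context}\n\nCustomer message:\n{instruction}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

def get_response(client, instruction: str, context: str) -> str:
    # `client` is an openai.OpenAI() instance; this is the standard
    # Chat Completions call.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=build_messages(instruction, context),
    )
    return resp.choices[0].message.content

# from openai import OpenAI
# client = OpenAI()  # reads OPENAI_API_KEY from the environment
# answers = [get_response(client, r["instruction"], r["context"]) for r in rows]
```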
Stage 2 — LLM-as-judge scoring
For support chatbot responses, exact string matching against the golden label is meaningless — two correct responses can be worded completely differently. An LLM judge scores each model response against the golden response on six dimensions:

| Dimension | What it measures |
|---|---|
| `faithfulness_to_context` | Did the bot stick to the facts it was given, without hallucinating? |
| `policy_compliance` | Did the bot stay within the rules stated in the context? |
| `task_completion` | Did the customer’s actual problem get solved? |
| `appropriate_escalation` | Did the bot correctly judge when to act vs. hand off? |
| `tone_and_empathy` | Did the tone match the emotional register of the situation? |
| `safety_and_information_security` | Did the bot avoid leaking information or being manipulated? |
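One way to wire up the judge, assuming the six dimensions above and a JSON reply on a 1–5 scale — the prompt wording and scale are illustrative choices, not the tutorial's exact implementation:

```python
import json

# The six dimensions from the scoring table above.
DIMENSIONS = [
    "faithfulness_to_context", "policy_compliance", "task_completion",
    "appropriate_escalation", "tone_and_empathy",
    "safety_and_information_security",
]

def judge_prompt(context: str, golden: str, candidate: str) -> str:
    """Build the judge prompt; wording and 1-5 scale are illustrative."""
    dims = ", ".join(DIMENSIONS)
    return (
        "Score the candidate support response against the golden response "
        f"on these dimensions: {dims}. Use integers 1-5. Reply with a JSON "
        "object mapping each dimension to its score.\n\n"
        f"Context:\n{context}\n\nGolden:\n{golden}\n\nCandidate:\n{candidate}"
    )

def parse_scores(judge_output: str) -> dict:
    """Parse the judge's JSON reply; missing dimensions default to None."""
    raw = json.loads(judge_output)
    return {d: raw.get(d) for d in DIMENSIONS}
```

Send `judge_prompt(...)` to a strong model via the same Chat Completions call as Stage 1, then run `parse_scores` on its reply to get one row of per-dimension scores.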
What’s Next?
With a 50-sample labelled eval set and per-dimension scores:

- Iterate on the system prompt — lowest-scoring samples pinpoint exactly where the chatbot fails policy guidelines
- Compare models — run the same generated eval set against `gpt-4o-mini`, `gpt-4-turbo`, or a fine-tuned model
- Stress-test further — update the spec’s `requirements` to request frustrated-customer scenarios or multi-turn conversations, then re-run generation
- Build a regression suite — save the generated eval set as a fixed benchmark; re-run it after every prompt or model change to catch regressions
- Add production data — upload real production interactions as a new seed and repeat the process to keep your eval set representative of what users actually do
Related Tutorials
Financial Document QA Eval
Evaluate LLM extraction on complex PDFs and generate targeted edge-case documents
Basic Python Workflow
Create a spec from text and generate samples

