
Run this exact tutorial interactively in Google Colab.

Support Chatbot Broader Evaluations with Contextual Eval Sets

Building a reliable eval suite for a support chatbot is expensive. You need diverse scenarios, realistic contexts, and correct golden responses — and collecting all three from real data takes significant manual effort. This tutorial walks through four problems DataFramer solves at once:
  1. Contextual datasets for variety — go beyond the narrow set of scenarios that appear in your seed, and generate interactions across the full space of intents, order states, and policy situations.
  2. Golden labels without manual writing — each generated sample comes with a DataFramer-produced correct response, so you don’t need human labellers for the bulk of your eval set.
  3. Broader evaluations early — stress-test your chatbot on edge cases (policy conflicts, stacked discounts, information discrepancies) before you encounter them in production.
  4. Production situations → eval coverage — upload real customer interactions as a seed, and DataFramer generates targeted variants of those exact scenarios to grow your regression suite.

Prerequisites

  • Python 3.9+
  • An OPENAI_API_KEY
  • A DATAFRAMER_API_KEY
pip install openai pandas pydataframer tenacity pyyaml requests

Part 1: The Starting Point — A Small Seed Dataset

The seed dataset has 11 rows of real customer interactions. Each row contains the raw instruction, category and intent labels, a context block (order details, policy guidelines), and a response that serves as the golden label.
import pandas as pd

seed_df = pd.read_csv("files/support_chatbot_dataset_with_context.csv")
print(seed_df["intent"].value_counts())
# cancel_order     3
# place_order      3
# check_invoice    3
# change_order     2
Eleven rows is enough to show the structure — but far too few to draw conclusions. With only 2–3 examples per intent and no coverage of policy edge cases, any accuracy numbers are noise.

Part 2: Generate a Broader Eval Set with DataFramer

Upload the seed as a dataset

The CSV is packed into a ZIP and uploaded to DataFramer. DataFramer analyses the structure, column patterns, intent distribution, policy rules embedded in the context, and response style.
dataset = df_client.dataframer.seed_datasets.create_from_zip(
    name="support_chatbot",
    description="Support chatbot interactions with context and golden responses",
    zip_file=zip_buffer,
)
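The zip_buffer passed above can be built in memory with the standard library. A minimal sketch (make_zip_buffer is a helper name chosen here, not part of the DataFramer SDK):

```python
import io
import zipfile

def make_zip_buffer(csv_path: str, arcname: str = "seed.csv") -> io.BytesIO:
    """Pack a single CSV into an in-memory ZIP suitable for upload."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Store the CSV under a stable name inside the archive.
        zf.write(csv_path, arcname=arcname)
    buf.seek(0)  # rewind so the upload reads from the start
    return buf
```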

Generate a Specification

The Specification captures everything DataFramer learned: column structure, intent and category distributions, the policy patterns that appear in context blocks, and response tone. The generation_objectives parameter tells DataFramer which properties to surface explicitly — in this case, we want the conflict type exposed as a first-class property so we can target it.
spec = df_client.dataframer.specs.create(
    dataset_id=dataset_id,
    spec_generation_model_name="anthropic/claude-sonnet-4-6",
    generation_objectives=(
        "Do include 'intent' and 'category' as data properties with their observed distributions. "
        "Do include 'Conflict or Complication Present' as a data property, capturing scenario types "
        "such as: customer_pressure_on_policy_boundary, multiple_discounts_requested, "
        "quantity_exceeds_stock, policy_prevents_requested_action, information_discrepancy, no_conflict."
    ),
    extrapolate_values=True,
    generate_distributions=True,
)
Setting extrapolate_values=True tells DataFramer to infer realistic values beyond what appeared in the seed — so your generated eval set covers the full distribution, not just the 11 values you started with.

Target the edge cases you care about

The seed skews heavily toward no_conflict. The cases that actually break chatbots are the hard ones: customers requesting stacked discounts, policy preventing the action they want, or conflicting information in the order details. Override the distribution to concentrate all generated samples on those three scenarios:
EDGE_CASE_DISTRIBUTIONS = {
    "multiple_discounts_requested": 33,
    "policy_prevents_requested_action": 33,
    "information_discrepancy": 34,
}

import yaml

# Parse the spec YAML (assuming the spec object exposes it as content_yaml),
# rewrite the property's distribution, then serialise it back.
updated_spec_data = yaml.safe_load(spec.content_yaml)

for prop in updated_spec_data["data_property_variations"]:
    if prop["property_name"] == "Conflict or Complication Present":
        prop["base_distributions"] = {
            v: EDGE_CASE_DISTRIBUTIONS.get(v, 0)
            for v in prop["property_values"]
        }

updated_yaml = yaml.safe_dump(updated_spec_data)
updated_spec = df_client.dataframer.specs.update(spec_id=spec_id, content_yaml=updated_yaml)
With this change, every one of the 50 generated samples will be an edge case. None will be easy no_conflict rows that tell you nothing about where your chatbot breaks.
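Before uploading the edited spec, it is worth sanity-checking the override (assuming, as the 33/33/34 split suggests, that base_distributions values are percentages):

```python
EDGE_CASE_DISTRIBUTIONS = {
    "multiple_discounts_requested": 33,
    "policy_prevents_requested_action": 33,
    "information_discrepancy": 34,
}

# The override should cover the full distribution, i.e. sum to 100.
assert sum(EDGE_CASE_DISTRIBUTIONS.values()) == 100
```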

Generate 50 samples with golden labels

run = df_client.dataframer.runs.create(
    spec_id=updated_spec.id,
    number_of_samples=50,
    generation_model="anthropic/claude-haiku-4-5",
    revision_types=["consistency", "conformance"],
    max_revision_cycles=1,
    filtering_types=["conformance", "structural"],
    outline_model="anthropic/claude-sonnet-4-6",
)
Each generated row is a complete, self-consistent interaction: instruction, context (with realistic order details and applicable policy rules), intent, category, and — critically — a correct response that serves as the golden label. You get 50 labelled eval examples without any human writing effort.

Part 3: Run the Evaluation

With 50 generated rows in hand, the evaluation runs in two stages.

Stage 1 — Generate chatbot responses

Feed each instruction + context to gpt-4o and collect its response:
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

CHATBOT_SYSTEM_PROMPT = (
    "You are a customer support chatbot for an e-commerce company. "
    "Your response must strictly follow the policy guidelines in the context."
)

def generate_response(instruction: str, context: str) -> str:
    result = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": CHATBOT_SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nCustomer message: {instruction}"},
        ],
        temperature=0,
    )
    return result.choices[0].message.content.strip()

Stage 2 — LLM-as-judge scoring

For support chatbot responses, exact string matching against the golden label is meaningless — two correct responses can be worded completely differently. An LLM judge scores each model response against the golden on six dimensions:
Each dimension and what it measures:
  • faithfulness_to_context: Did the bot stick to the facts it was given, without hallucinating?
  • policy_compliance: Did the bot stay within the rules stated in the context?
  • task_completion: Did the customer’s actual problem get solved?
  • appropriate_escalation: Did the bot correctly judge when to act vs. hand off?
  • tone_and_empathy: Did the tone match the emotional register of the situation?
  • safety_and_information_security: Did the bot avoid leaking information or being manipulated?
eval_df.groupby("intent")[SCORE_KEYS].mean().round(2)
The per-intent breakdown shows exactly which scenarios are causing failures — pointing you to the specific policy rules or context patterns that need prompt improvements.
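The eval_df and SCORE_KEYS used above can be assembled from the judge's per-dimension scores. A minimal sketch with hypothetical scores standing in for real judge output:

```python
import pandas as pd

SCORE_KEYS = [
    "faithfulness_to_context", "policy_compliance", "task_completion",
    "appropriate_escalation", "tone_and_empathy",
    "safety_and_information_security",
]

# Hypothetical judge output: one dict of dimension -> score per sample.
judged = [
    {"intent": "cancel_order", **{k: 5 for k in SCORE_KEYS}},
    {"intent": "cancel_order", **{k: 3 for k in SCORE_KEYS}},
    {"intent": "place_order", **{k: 4 for k in SCORE_KEYS}},
]
eval_df = pd.DataFrame(judged)
per_intent = eval_df.groupby("intent")[SCORE_KEYS].mean().round(2)
```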

What’s Next?

With a 50-sample labelled eval set and per-dimension scores:
  • Iterate on the system prompt — lowest-scoring samples pinpoint exactly where the chatbot fails policy guidelines
  • Compare models — run the same generated eval set against gpt-4o-mini, gpt-4-turbo, or a fine-tuned model
  • Stress-test further — update the spec’s requirements to request frustrated-customer scenarios or multi-turn conversations, re-run generation
  • Build a regression suite — save the generated eval set as a fixed benchmark; re-run after every prompt or model change to catch regressions
  • Add production data — upload real production interactions as a new seed and repeat the process to keep your eval set representative of what users actually do
