# Support Chatbot Broader Evaluations with Contextual Eval Sets

> Go from 11 hand-labelled seed rows to a 50-sample eval set (or thousands of samples, if needed), complete with golden labels, by using DataFramer to generate targeted edge-case interactions for your support chatbot.

<Card title="Open in Google Colab" icon="book" href="https://colab.research.google.com/github/aimonlabs/dataframer-docs-public/blob/main/support-chatbot-borader-evals-with-contextual-eval-sets/evaluation_support_chatbot_broader_evals.ipynb" horizontal>
  Run this exact tutorial interactively in Google Colab
</Card>

Building a reliable eval suite for a support chatbot is expensive. You need diverse scenarios, realistic contexts, and correct golden responses — and collecting all three from real data takes significant manual effort.

This tutorial walks through four problems DataFramer solves at once:

1. **Contextual datasets for variety** — go beyond the narrow set of scenarios that appear in your seed, and generate interactions across the full space of intents, order states, and policy situations.
2. **Golden labels without manual writing** — each generated sample comes with a DataFramer-produced correct response, so you don't need human labellers for the bulk of your eval set.
3. **Broader evaluations early** — stress-test your chatbot on edge cases (policy conflicts, stacked discounts, information discrepancies) before you encounter them in production.
4. **Production situations → eval coverage** — upload real customer interactions as a seed, and DataFramer generates targeted variants of those exact scenarios to grow your regression suite.

## Prerequisites

* Python 3.9+
* An `OPENAI_API_KEY`
* A `DATAFRAMER_API_KEY`

```bash theme={null}
pip install openai pandas pydataframer tenacity pyyaml requests
```

## Part 1: The Starting Point — A Small Seed Dataset

The seed dataset has 11 rows of real customer interactions. Each row contains the raw `instruction`, `category` and `intent` labels, a `context` block (order details, policy guidelines), and a `response` that serves as the golden label.

```python theme={null}
import pandas as pd

# Load the 11-row seed and count how many examples each intent has.
seed_df = pd.read_csv("files/support_chatbot_dataset_with_context.csv")
print(seed_df["intent"].value_counts())
# cancel_order     3
# place_order      3
# check_invoice    3
# change_order     2
```

Eleven rows is enough to show the structure — but far too few to draw conclusions. With only 2–3 examples per intent and no coverage of policy edge cases, any accuracy numbers are noise.
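
To make the "noise" point concrete, here is a minimal sketch that computes a Wilson score interval for an accuracy measured on a handful of rows. The helper `wilson_interval` and the 2-out-of-3 outcome are illustrative assumptions, not part of the tutorial notebook.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical outcome: the chatbot gets 2 of the 3 cancel_order seed rows right.
low, high = wilson_interval(correct=2, total=3)
print(f"2/3 correct -> plausible accuracy between {low:.2f} and {high:.2f}")
```

With three examples the interval spans roughly 0.21 to 0.94, so a per-intent score from the seed alone cannot distinguish a weak chatbot from a strong one.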

## Part 2: Generate a Broader Eval Set with DataFramer

### Upload the seed as a dataset

The CSV is packed into a ZIP and uploaded to DataFramer. DataFramer analyses the structure, column patterns, intent distribution, policy rules embedded in the context, and response style.

```python theme={null}
dataset = df_client.dataframer.seed_datasets.create_from_zip(
    name="support_chatbot",
    description="Support chatbot interactions with context and golden responses",
    zip_file=zip_buffer,
)
```
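
The snippet above passes a `zip_buffer` that is not shown. One way to build it is an in-memory ZIP made with the standard library, sketched below. The helper name `build_zip_buffer` and the stand-in CSV bytes are assumptions for illustration, and whether `create_from_zip` accepts a file-like object rather than a path is inferred only from the variable name.

```python
import io
import zipfile


def build_zip_buffer(csv_bytes: bytes, arcname: str) -> io.BytesIO:
    """Pack one CSV payload into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(arcname, csv_bytes)
    buffer.seek(0)  # rewind so the consumer reads from the start
    return buffer


# Stand-in content; in the tutorial this would be the bytes of the seed CSV.
sample_csv = b"instruction,category,intent,context,response\n"
zip_buffer = build_zip_buffer(sample_csv, "support_chatbot_dataset_with_context.csv")
```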

### Generate a Specification

The Specification captures everything DataFramer learned: column structure, intent and category distributions, the policy patterns that appear in context blocks, and response tone. The `generation_objectives` parameter tells DataFramer which properties to surface explicitly — in this case, we want the conflict type exposed as a first-class property so we can target it.

```python theme={null}
spec = df_client.dataframer.specs.create(
    dataset_id=dataset_id,
    spec_generation_model_name="anthropic/claude-sonnet-5",
    generation_objectives=(
        "Do include 'intent' and 'category' as data properties with their observed distributions. "
        "Do include 'Conflict or Complication Present' as a data property, capturing scenario types "
        "such as: customer_pressure_on_policy_boundary, multiple_discounts_requested, "
        "quantity_exceeds_stock, policy_prevents_requested_action, information_discrepancy, no_conflict."
    ),
    extrapolate_values=True,
    generate_distributions=True,
)
```

Setting `extrapolate_values=True` tells DataFramer to infer realistic values beyond those that appeared in the seed, so the generated eval set covers the full distribution rather than only the values present in the 11 seed rows.

### Target the edge cases you care about

The seed skews heavily toward `no_conflict`. The cases that actually break chatbots are the hard ones: customers requesting stacked discounts, policy preventing the action they want, or conflicting information in the order details. Override the distribution to concentrate all generated samples on those three scenarios:

```python theme={null}
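import yaml

# Target weights (percentages) for the three edge-case scenarios.
EDGE_CASE_DISTRIBUTIONS = {
    "multiple_discounts_requested": 33,
    "policy_prevents_requested_action": 33,
    "information_discrepancy": 34,
}

# `updated_spec_data` is the spec content parsed into a dict.
# Every value not listed above (including `no_conflict`) gets weight 0.
for prop in updated_spec_data["data_property_variations"]:
    if prop["property_name"] == "Conflict or Complication Present":
        prop["base_distributions"] = {
            v: EDGE_CASE_DISTRIBUTIONS.get(v, 0)
            for v in prop["property_values"]
        }

# Assumption: the edited dict is serialised back to YAML before the update call.
updated_yaml = yaml.safe_dump(updated_spec_data, sort_keys=False)
updated_spec = df_client.dataframer.specs.update(spec_id=spec_id, content_yaml=updated_yaml)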
```
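
Because the override uses `.get(v, 0)`, a misspelled scenario name would silently receive weight 0. The sketch below wraps the same logic with two guards and runs it on a toy spec dict. The helper `override_distribution`, the toy spec, and the requirement that weights sum to 100 are assumptions for illustration. Only the key names are taken from the snippet above.

```python
def override_distribution(spec_data: dict, property_name: str, weights: dict) -> dict:
    """Replace one property's base distribution, rejecting unknown values."""
    if sum(weights.values()) != 100:  # assumed convention: weights are percentages
        raise ValueError("weights should sum to 100")
    for prop in spec_data["data_property_variations"]:
        if prop["property_name"] != property_name:
            continue
        unknown = set(weights) - set(prop["property_values"])
        if unknown:
            raise ValueError(f"values not present in the spec: {sorted(unknown)}")
        prop["base_distributions"] = {
            value: weights.get(value, 0) for value in prop["property_values"]
        }
        return spec_data
    raise KeyError(f"property not found: {property_name}")


# Toy spec with the same shape as the keys used in the tutorial snippet.
toy_spec = {
    "data_property_variations": [
        {
            "property_name": "Conflict or Complication Present",
            "property_values": [
                "no_conflict",
                "multiple_discounts_requested",
                "policy_prevents_requested_action",
                "information_discrepancy",
            ],
            "base_distributions": {},
        }
    ]
}

weights = {
    "multiple_discounts_requested": 33,
    "policy_prevents_requested_action": 33,
    "information_discrepancy": 34,
}
result = override_distribution(toy_spec, "Conflict or Complication Present", weights)
print(result["data_property_variations"][0]["base_distributions"])
```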

With this change, every one of the 50 generated samples will be an edge case. None will be easy `no_conflict` rows that tell you nothing about where your chatbot breaks.

### Generate 50 samples with golden labels

```python theme={null}
run = df_client.dataframer.runs.create(
    spec_id=updated_spec.id,
    number_of_samples=50,
    generation_model="anthropic/claude-haiku-4-5",
    revision_types=["consistency", "conformance"],
    max_revision_cycles=1,
    filtering_types=["conformance", "structural"],
    outline_model="anthropic/claude-sonnet-5",
)
```

Each generated row is a complete, self-consistent interaction: `instruction`, `context` (with realistic order details and applicable policy rules), `intent`, `category`, and — critically — a correct `response` that serves as the golden label. You get 50 labelled eval examples without any human writing effort.
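
Before running the evaluation, it can be worth confirming that every generated row actually carries the five columns listed above. The following is a minimal sketch on plain dicts. The helper `missing_fields` and the sample rows are hypothetical and not part of the DataFramer SDK.

```python
REQUIRED_COLUMNS = ("instruction", "context", "intent", "category", "response")


def missing_fields(row: dict) -> list[str]:
    """Return the required columns that are absent or blank in one row."""
    return [col for col in REQUIRED_COLUMNS if not str(row.get(col) or "").strip()]


# Hypothetical rows standing in for generated samples.
rows = [
    {
        "instruction": "Can I combine two coupons?",
        "context": "Policy: one discount per order.",
        "intent": "place_order",
        "category": "ORDER",
        "response": "Only one discount can be applied per order.",
    },
    {"instruction": "Cancel my order", "context": "", "intent": "cancel_order"},
]

problems = {i: missing_fields(row) for i, row in enumerate(rows) if missing_fields(row)}
print(problems)
```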

## Part 3: Run the Evaluation

With 50 generated rows in hand, the evaluation runs in two stages.

### Stage 1 — Generate chatbot responses

Feed each `instruction` + `context` to `gpt-4o` and collect its response:

```python theme={null}
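from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment.
openai_client = OpenAI()

CHATBOT_SYSTEM_PROMPT = (
    "You are a customer support chatbot for an e-commerce company. "
    "Your response must strictly follow the policy guidelines in the context."
)

def generate_response(instruction: str, context: str) -> str:
    """Ask the chatbot model to answer one customer message given its context."""
    result = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": CHATBOT_SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nCustomer message: {instruction}"},
        ],
        temperature=0,  # keep outputs as repeatable as possible across eval runs
    )
    # `content` is optional in the SDK's types, so fall back to an empty string.
    return (result.choices[0].message.content or "").strip()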
```

### Stage 2 — LLM-as-judge scoring

For support chatbot responses, exact string matching against the golden label is meaningless, because two correct responses can be worded completely differently. Instead, an LLM judge scores each model response against the golden response on six dimensions:

| Dimension                         | What it measures                                                    |
| --------------------------------- | ------------------------------------------------------------------- |
| `faithfulness_to_context`         | Did the bot stick to the facts it was given, without hallucinating? |
| `policy_compliance`               | Did the bot stay within the rules stated in the context?            |
| `task_completion`                 | Did the customer's actual problem get solved?                       |
| `appropriate_escalation`          | Did the bot correctly judge when to act vs. hand off?               |
| `tone_and_empathy`                | Did the tone match the emotional register of the situation?         |
| `safety_and_information_security` | Did the bot avoid leaking information or being manipulated?         |
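
The judge itself is not shown in this tutorial, so the following is only a sketch of one plausible shape for it: a prompt builder plus a strict parser for the judge's JSON reply. The dimension names come from the table above. The 1 to 5 scale, the prompt wording, and the helper names `build_judge_prompt` and `parse_judge_scores` are assumptions.

```python
import json

SCORE_KEYS = [
    "faithfulness_to_context",
    "policy_compliance",
    "task_completion",
    "appropriate_escalation",
    "tone_and_empathy",
    "safety_and_information_security",
]
SCORE_MIN, SCORE_MAX = 1, 5  # assumed scale; the tutorial does not state one


def build_judge_prompt(instruction: str, context: str, golden: str, candidate: str) -> str:
    """Assemble the text sent to the judge model for one eval row."""
    return (
        "You are grading a customer support chatbot reply against a golden response.\n"
        f"Score each dimension from {SCORE_MIN} to {SCORE_MAX}: {', '.join(SCORE_KEYS)}.\n"
        "Reply with a single JSON object mapping each dimension to an integer.\n\n"
        f"Context:\n{context}\n\nCustomer message:\n{instruction}\n\n"
        f"Golden response:\n{golden}\n\nChatbot response:\n{candidate}"
    )


def parse_judge_scores(raw_reply: str) -> dict:
    """Validate the judge's JSON reply and return one integer score per dimension."""
    data = json.loads(raw_reply)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores = {}
    for key in SCORE_KEYS:
        value = data.get(key)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not SCORE_MIN <= value <= SCORE_MAX:
            raise ValueError(f"invalid score for {key}: {value!r}")
        scores[key] = value
    return scores


# Stand-in for a judge reply; a real run would obtain this from the judge model.
example_reply = json.dumps({key: 4 for key in SCORE_KEYS})
scores = parse_judge_scores(example_reply)
print(scores)
```

Rejecting malformed replies up front keeps missing or out-of-range scores from silently skewing the per-intent averages computed next.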

```python theme={null}
eval_df.groupby("intent")[SCORE_KEYS].mean().round(2)
```

The per-intent breakdown shows exactly which scenarios are causing failures — pointing you to the specific policy rules or context patterns that need prompt improvements.

## What's Next?

With a 50-sample labelled eval set and per-dimension scores:

* **Iterate on the system prompt**: the lowest-scoring samples pinpoint exactly where the chatbot violates policy guidelines
* **Compare models**: run the same generated eval set against `gpt-4o-mini`, `gpt-4-turbo`, or a fine-tuned model
* **Stress-test further**: update the spec's `requirements` to request frustrated-customer scenarios or multi-turn conversations, then re-run generation
* **Build a regression suite**: save the generated eval set as a fixed benchmark, and re-run it after every prompt or model change to catch regressions
* **Add production data**: upload real production interactions as a new seed and repeat the process to keep your eval set representative of what users actually do

## Related Tutorials

<CardGroup cols={2}>
  <Card title="Financial Document QA Eval" icon="file-pdf" href="/tutorials/financial-bank-statement-extraction">
    Evaluate LLM extraction on complex PDFs and generate targeted edge-case documents
  </Card>

  <Card title="Basic Use of Python SDK" icon="play" href="/tutorials/basic-python-workflow">
    Create a spec from text and generate samples
  </Card>
</CardGroup>