
Financial Bank Health Statement Document QA

Real-world bank statement PDFs are complex and varied. Before deploying an LLM to extract financial figures from them at scale, you need to know: how accurately does it actually extract the right numbers? This tutorial walks through two connected workflows:
  1. Evaluate extraction quality — ask a fixed set of financial questions across a small set of real bank PDFs and measure how often the model gets the right answer.
  2. Expand the test set with targeted edge cases — use Dataframer to generate new synthetic bank statements that concentrate on the specific scenarios where your model is most likely to fail.

Prerequisites

  • Python 3.9+
  • An OPENAI_API_KEY
  • A DATAFRAMER_API_KEY
pip install openai pymupdf pandas pydataframer tenacity pyyaml requests

Part 1: Evaluate Extraction Quality

Load and parse the PDFs

The tutorial reads bank statement PDFs from a local files/ directory using PyMuPDF and extracts the full text for each document.
import fitz  # PyMuPDF
from pathlib import Path

def extract_pdf_text(pdf_path: Path) -> str:
    doc = fitz.open(pdf_path)
    return "\n\n".join(page.get_text() for page in doc)

Ask the model a fixed question set

A set of five financial questions is asked for every PDF — things like CRE concentration ratios and three-year loan growth rates. The model is instructed to return only the single requested value, making it easy to compare against known answers.
SYSTEM_PROMPT = (
    "You are a precise financial data extraction assistant. "
    "Answer with only the single requested value — no explanation, no units label, no extra text."
)

Score against golden labels

Each model answer is compared to a manually verified golden label. Results are collected into a dataframe so you can see exactly where the model succeeds and where it falls short.
Overall exact-match accuracy: 50.0% (10/20)

Per-question accuracy:
q1_cre_to_tier1_plus_acl_pct          0.75
q2_non_owner_occ_cre_3yr_growth       0.75
q3_single_category_concentration      0.25
q4_1_4_family_residential_to_tier1    0.50
q5_non_depository_growth_vs_tier1     0.25
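The scoring itself can be sketched as a small normalize-and-compare step: parse the first number out of each answer (so "128.4%" and "128.4" agree), then record both an exact match and a looser relative-tolerance match. The file name, question key, and labels below are illustrative, not the tutorial's actual golden data:

```python
import re
from typing import Optional

import pandas as pd

def to_number(answer: str) -> Optional[float]:
    """Pull the first numeric token out of an answer like '128.4%' or '$1,234'."""
    match = re.search(r"-?\d[\d,]*\.?\d*", str(answer))
    return float(match.group().replace(",", "")) if match else None

def score(answer: str, golden: str, rel_tol: float = 0.05) -> dict:
    """Exact match on the parsed number, plus a relative-tolerance match."""
    got, want = to_number(answer), to_number(golden)
    parsed = got is not None and want is not None
    exact = parsed and got == want
    close = parsed and abs(got - want) <= rel_tol * max(abs(want), 1e-9)
    return {"exact_match": exact, "tolerance_match": close}

# One row per (file, question); collect into a dataframe for per-question means.
rows = [{"file": "bank_a.pdf", "question_key": "q1_cre_to_tier1_plus_acl_pct",
         **score("128.4%", "128.4")}]
results = pd.DataFrame(rows)
```

Grouping `results` by `question_key` and averaging the two match columns reproduces a per-question accuracy table like the one above.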
With only four real PDFs, the dataset is too small to draw firm conclusions. Questions like q3 and q5 score 25% — but is that a model problem, a prompt problem, or just an unlucky sample? You need more documents to know.

Part 2: Generate Targeted Edge Cases with Dataframer

Upload the PDFs as a seed dataset

The same four bank statement PDFs are uploaded to Dataframer. Dataframer analyzes their structure, tables, and value patterns to build a Specification — a reusable description of what a bank statement looks like.
dataset = df_client.dataframer.seed_datasets.create_from_zip(
    name="bank_analysis",
    description="Bank Analysis - Concentrations of Credit PDFs",
    zip_file=zip_buffer,
)
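The `zip_file` argument above can be built entirely in memory with the standard library. A minimal sketch, assuming the four statements sit in the same local files/ directory used in Part 1:

```python
import io
import zipfile
from pathlib import Path

# Bundle every seed PDF into an in-memory ZIP archive for upload.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    for pdf in sorted(Path("files").glob("*.pdf")):
        zf.write(pdf, arcname=pdf.name)
zip_buffer.seek(0)  # rewind so the upload reads from the start of the buffer
```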

Generate a Specification

The Specification captures the data properties that appear in the documents — loan category concentrations, capital ratios, growth rates — along with the distribution of values observed across your seed files.
spec = df_client.dataframer.specs.create(
    dataset_id=dataset_id,
    spec_generation_model_name="anthropic/claude-opus-4-6",
    generation_objectives="Include '1-4 Family Construction concentration level' and 'Multifamily concentration level' as data properties.",
    extrapolate_values=False,
    generate_distributions=True,
)

Edit the Specification to target edge cases

This is where Dataframer becomes powerful. Instead of hoping your four PDFs happen to cover the edge cases you care about, you control exactly which values get generated and how often. For example: the real data shows 1-4 Family Construction concentration values spread between 0% and 25%. But you specifically want to test whether your model handles high-stress cases near 95–99% correctly. You update the spec to concentrate all generated documents on those values:
prop["property_values"] = [...existing_values..., 95, 99]
prop["base_distributions"] = {v: (50 if v in (95, 99) else 0) for v in prop["property_values"]}

updated_spec = df_client.dataframer.specs.update(spec_id=spec_id, content_yaml=updated_yaml)
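End to end, the edit is a load-modify-dump round trip over the spec YAML. A minimal sketch; the `data_properties` layout and starting values below are hypothetical stand-ins for whatever your generated spec actually contains, so adapt the field names to the YAML you download:

```python
import yaml

# Hypothetical spec fragment: a list of properties, each with observed
# values and per-value generation weights (actual schema may differ).
spec_yaml = """
data_properties:
  - name: 1-4 Family Construction concentration level
    property_values: [0, 5, 10, 25]
    base_distributions: {0: 25, 5: 25, 10: 25, 25: 25}
"""
spec = yaml.safe_load(spec_yaml)

prop = spec["data_properties"][0]
prop["property_values"] = prop["property_values"] + [95, 99]
# Put all generation weight on the high-stress values (50 + 50 = 100).
prop["base_distributions"] = {
    v: (50 if v in (95, 99) else 0) for v in prop["property_values"]
}

updated_yaml = yaml.safe_dump(spec, sort_keys=False)
```

The resulting `updated_yaml` is what gets passed as `content_yaml` in the update call above.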

Run generation

With the updated spec, generate a batch of synthetic bank statements. Each document will be a realistic, internally consistent bank statement — but with concentration values drawn from the targeted edge-case distribution you specified.
run = df_client.dataframer.runs.create(
    spec_id=updated_spec.id,
    number_of_samples=5,
    generation_model="anthropic/claude-opus-4-6",
    revision_types=["conformance"],
    filtering_types=["conformance"],
    tools=["calculator"],   # ensures numerical consistency across all tables
)

Download the generated documents

Once the run completes, download all the generated files as a ZIP archive and extract them locally for review and evaluation.
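The unpacking step can be sketched with the standard library. The helper below only handles extraction; how you fetch the archive bytes depends on the Dataframer client, so the download itself is shown as a hypothetical comment, and the `generated_files` destination directory is an assumption:

```python
import io
import zipfile

def extract_zip_bytes(data: bytes, dest: str = "generated_files") -> list:
    """Unpack a downloaded results archive and return the member file names."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        zf.extractall(dest)
        return zf.namelist()

# Hypothetical usage once the run completes and you have the archive bytes,
# e.g. from an HTTP GET against the run's results URL:
# names = extract_zip_bytes(archive_bytes)
```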

Part 3: Re-run the Evaluation

With a larger, targeted dataset in hand, you have two options for labeling:

  • Automatic labels — if you know the exact values that were specified in the distributions (e.g. you requested 95% or 99% for a given field), those can serve as ground truth directly. This works well when the spec fully determines the answer.
  • Human review — for more complex questions where the right answer depends on multi-step reasoning across the document (like q3 and q5), have a reviewer go through the generated PDFs and record the correct answers.

Then join those labels back into the evaluation dataframe:
reviewer_labels = pd.read_csv("reviewer_labels.csv")
reviewed_df = df.merge(reviewer_labels, on=["file", "question_key"], how="left")
reviewed_df.groupby("question_key")[["exact_match", "tolerance_match"]].mean()
Either way, you now have a principled way to build a test suite that covers exactly the scenarios you care about — not just the ones that happened to appear in four real-world PDFs.