Run this tutorial interactively in Google Colab.
Financial Bank Health Statement Document QA
Real-world bank statement PDFs are complex and varied. Before deploying an LLM to extract financial figures from them at scale, you need to know: how accurately does it actually extract the right numbers? This tutorial walks through two connected workflows:
- Evaluate extraction quality — ask a fixed set of financial questions across a small set of real bank PDFs and measure how often the model gets the right answer.
- Expand the test set with targeted edge cases — use Dataframer to generate new synthetic bank statements that concentrate on the specific scenarios where your model is most likely to fail.
Prerequisites
- Python 3.9+
- An OPENAI_API_KEY
- A DATAFRAMER_API_KEY
Part 1: Evaluate Extraction Quality
Load and parse the PDFs
The tutorial reads bank statement PDFs from a local files/ directory using PyMuPDF and extracts the full text of each document.
Ask the model a fixed question set
A set of five financial questions is asked for every PDF — things like CRE concentration ratios and three-year loan growth rates. The model is instructed to return only the single requested value, making it easy to compare against known answers.
Score against golden labels
Each model answer is compared to a manually verified golden label. Results are collected into a dataframe so you can see exactly where the model succeeds and where it falls short. Questions q3 and q5 score 25% — but is that a model problem, a prompt problem, or just an unlucky sample? You need more documents to know.
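The ask-and-score loop can be sketched as below. The question text, golden labels, filenames, and model name are illustrative placeholders, not the tutorial's exact values, and the OpenAI import is deferred into `ask()` so the scoring half runs without an API key:

```python
import pandas as pd


def ask(doc_text: str, question: str, model: str = "gpt-4o-mini") -> str:
    """Ask one question about one document; the prompt demands a bare value."""
    from openai import OpenAI  # requires the openai package and OPENAI_API_KEY

    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with only the single requested value."},
            {"role": "user", "content": f"{question}\n\nDocument:\n{doc_text}"},
        ],
    )
    return resp.choices[0].message.content.strip()


def score(answers: dict, golden: dict) -> pd.DataFrame:
    """Compare model answers to golden labels, keyed by (filename, question_id)."""
    rows = [
        {"file": f, "question": q, "answer": answers.get((f, q)),
         "golden": gold, "correct": answers.get((f, q)) == gold}
        for (f, q), gold in golden.items()
    ]
    return pd.DataFrame(rows)


# Example scoring with placeholder answers (no API call made here):
golden = {("statement_a.pdf", "q1"): "312%", ("statement_a.pdf", "q3"): "24.5%"}
answers = {("statement_a.pdf", "q1"): "312%", ("statement_a.pdf", "q3"): "31%"}
print(score(answers, golden).groupby("question")["correct"].mean())
```

In a real run you would populate `answers` by calling `ask(documents[fname], question)` for every document-question pair.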
Part 2: Generate Targeted Edge Cases with Dataframer
Upload the PDFs as a seed dataset
The same four bank statement PDFs are uploaded to Dataframer. Dataframer analyzes their structure, tables, and value patterns to build a Specification — a reusable description of what a bank statement looks like.
Generate a Specification
The Specification captures the data properties that appear in the documents — loan category concentrations, capital ratios, growth rates — along with the distribution of values observed across your seed files.
Edit the Specification to target edge cases
This is where Dataframer becomes powerful. Instead of hoping your four PDFs happen to cover the edge cases you care about, you control exactly which values get generated and how often. For example: the real data shows 1-4 Family Construction concentration values spread between 0% and 25%, but you specifically want to test whether your model handles high-stress cases near 95–99% correctly. You update the spec to concentrate all generated documents on those values.
Run generation
With the updated spec, generate a batch of synthetic bank statements. Each document will be a realistic, internally consistent bank statement — but with concentration values drawn from the targeted edge-case distribution you specified.
Download the generated documents
Once the run completes, download all the generated files as a ZIP archive and extract them locally for review and evaluation.
Part 3: Re-run the Evaluation
With a larger, targeted dataset in hand, you have two options for labeling:
- Automatic labels — if you know the exact values that were specified in the distributions (e.g. you requested 95% or 99% for a given field), those can serve as ground truth directly. This works well when the spec fully determines the answer.
- Human review — for more complex questions where the right answer depends on multi-step reasoning across the document (like q3 and q5), have a reviewer go through the generated PDFs and record the correct answers.
Then join those labels back into the evaluation dataframe:
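A minimal sketch of that join, using hypothetical filenames and column names (`file`, `question`, `answer`, `golden`); adapt them to your own dataframe:

```python
import pandas as pd

# Model answers over the generated documents.
results = pd.DataFrame({
    "file": ["gen_001.pdf", "gen_001.pdf", "gen_002.pdf"],
    "question": ["q3", "q5", "q3"],
    "answer": ["24.5%", "312%", "18.0%"],
})

# Reviewer-recorded (or spec-derived) golden labels.
labels = pd.DataFrame({
    "file": ["gen_001.pdf", "gen_001.pdf", "gen_002.pdf"],
    "question": ["q3", "q5", "q3"],
    "golden": ["24.5%", "310%", "18.0%"],
})

# Left-join labels onto results on the (file, question) key, then re-score.
merged = results.merge(labels, on=["file", "question"], how="left")
merged["correct"] = merged["answer"] == merged["golden"]
print(merged.groupby("question")["correct"].mean())
```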
Related Tutorials
Basic Python Workflow
Create a spec from text and generate samples
Multi-Folder Workflow
Generate from multi-file or folder-based seed data

