Financial Bank Statement Extraction with Golden Labels
Generate synthetic financial statement PDFs with co-located Q&A golden labels using a multi-folder seed dataset, then evaluate an LLM’s financial data extraction accuracy on the generated documents.
This tutorial shows how to build an evaluation dataset for an LLM that extracts financial figures from bank statement PDFs, where the golden Q&A labels travel with the PDFs as part of the seed data and are regenerated alongside each synthetic document, substantially reducing the human time spent labeling your datasets. The workflow:
Seed DataFramer with multi-folder data — each folder holds one bank statement PDF and a qa-pairs.csv with five golden Q&A pairs.
Generate a Specification — DataFramer learns the structure of both the PDF and the Q&A file.
Edit the Specification — steer the distribution toward a specific bank profile and enforce numerical consistency.
Run generation — produce synthetic bank statement folders, each with a matching qa-pairs.csv.
Evaluate — extract text from a generated PDF, ask the questions, and measure exact-match accuracy against the generated golden labels.
The seed data lives in a files/ directory. Each sub-folder contains one bank statement PDF and a qa-pairs.csv with five question/answer pairs covering CRE concentration ratios, loan growth rates, and similar metrics. Here is an example of one of the source documents:
```python
from pathlib import Path

import pandas as pd

FILES_DIR = Path("files")
sample_dirs = sorted([d for d in FILES_DIR.iterdir() if d.is_dir()])

# Collect the PDF and golden-label CSV from each seed folder.
samples = []
for sample_dir in sample_dirs:
    pdf_paths = list(sample_dir.glob("*.pdf"))
    qa_path = sample_dir / "qa-pairs.csv"
    pdf_path = pdf_paths[0] if pdf_paths else None
    qa_df = pd.read_csv(qa_path) if qa_path.exists() else None
    samples.append({
        "dir": sample_dir,
        "pdf_path": pdf_path,
        "qa_path": qa_path,
        "qa_df": qa_df,
    })
    pdf_name = pdf_path.name if pdf_path else "(no PDF)"
    q_count = len(qa_df) if qa_df is not None else 0
    print(f" {sample_dir.name}: {pdf_name} | {q_count} Q&A pairs")
```
The four seed samples cover Bank of America, Morgan Stanley, JP Morgan, and Cathay Bank — each with five Q&A pairs.
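To see the golden-label format the rest of the tutorial relies on, peek at the first sample. The CSV's question and answer columns are the same ones the evaluation loop reads later:

```python
# Show the golden Q&A pairs for the first seed folder (columns: question, answer).
print(samples[0]["qa_df"].head())
```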
Pack every sample folder into a ZIP preserving the folder hierarchy and upload it to DataFramer. The nested structure (sample1/, sample2/, …) tells DataFramer to treat this as a multi-folder dataset — it will generate complete folders, each containing a PDF and a qa-pairs.csv.
```python
import io
import zipfile
from datetime import datetime

# Pack each sample folder into the archive, preserving the folder hierarchy.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    for sample in samples:
        for file_path in sorted(sample["dir"].iterdir()):
            arcname = f"{sample['dir'].name}/{file_path.name}"
            zf.write(file_path, arcname=arcname)
zip_buffer.seek(0)

# df_client is the authenticated DataFramer client set up earlier in the notebook.
dataset = df_client.dataframer.seed_datasets.create_from_zip(
    name=f"bank_statements_golden_labels_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    description="Bank financial statement PDFs with golden Q&A pairs — multi-folder seed dataset",
    zip_file=zip_buffer,
)
dataset_id = dataset.id
print(f"Dataset ID   : {dataset_id}")
print(f"Dataset type : {dataset.dataset_type}")
print(f"Files        : {dataset.file_count}")
print(f"Folders      : {dataset.folder_count}")
```
DataFramer analyzes the seed folders and produces a Specification. The generation_objectives tell DataFramer that qa-pairs.csv contains golden Q&A pairs for the bank statement PDF in the same folder, and that Bank Profile should be a named data property.
```python
# spec_name is assumed to be defined above as any descriptive string for the spec.
spec = df_client.dataframer.specs.create(
    dataset_id=dataset_id,
    name=spec_name,
    spec_generation_model_name="anthropic/claude-opus-4-6-thinking",
    generation_objectives=(
        "Do include Bank Profile as a data property. "
        "As an example, a value could be 'CRE-heavy community bank with Total CRE >300% of Tier 1'"
    ),
    extrapolate_values=False,
    generate_distributions=True,
)
```
After the spec completes you can inspect what properties and value distributions DataFramer discovered — things like Peer Group, CRE Concentration Level, Geographic Region, and HTML Styling Variant alongside Bank Profile.
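The spec content round-trips as YAML, and the edits below operate on a dict parsed from it. A minimal loading sketch follows; the accessor for the YAML content (shown here as spec.content_yaml) is an assumption and may be named differently in your client version:

```python
import yaml

spec_id = spec.id

# Assumption: the completed spec exposes its YAML content as spec.content_yaml;
# check your DataFramer client for the exact accessor.
updated_spec_data = yaml.safe_load(spec.content_yaml)["spec"]

# List the discovered properties and how many candidate values each has.
for prop in updated_spec_data["data_property_variations"]:
    print(f"{prop['property_name']}: {len(prop['property_values'])} values")
```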
Override the Bank Profile distribution so that every generated sample is a CRE-heavy community bank. All other profile values receive zero weight.
```python
TARGET_BANK_PROFILE = "CRE-heavy community bank with Total CRE >300% of Tier 1"

# Zero out every Bank Profile value except the target so all samples draw it.
for prop in updated_spec_data["data_property_variations"]:
    if prop["property_name"] == "Bank Profile":
        prop["base_distributions"] = {
            v: (100 if v == TARGET_BANK_PROFILE else 0)
            for v in prop["property_values"]
        }
```
The spec DataFramer generates can be considerably richer than what this tutorial edits. Each property supports conditional distributions — the probability of a value can depend on the value of another property. For example, if Bank Profile is "CRE-heavy community bank with Total CRE >300% of Tier 1", the CRE Concentration Level distribution can automatically shift toward higher buckets, while a different profile keeps it low. When you retrieve the spec YAML you will see conditional_distributions blocks alongside base_distributions; you can add or modify these by hand to express arbitrarily complex relationships between properties before running generation.
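As an illustration only, a hand-added conditional block might look like the sketch below. The inner key names (condition_property, condition_value, distributions) and the bucket labels are assumptions for the sketch, not DataFramer's documented schema; mirror the structure you see in your own retrieved spec YAML instead:

```python
# Illustrative sketch: shift CRE Concentration Level upward whenever the
# Bank Profile condition holds. Key names and bucket labels are assumed;
# copy the exact structure from your retrieved spec YAML.
for prop in updated_spec_data["data_property_variations"]:
    if prop["property_name"] == "CRE Concentration Level":
        prop.setdefault("conditional_distributions", []).append({
            "condition_property": "Bank Profile",
            "condition_value": TARGET_BANK_PROFILE,
            "distributions": {"Very High": 60, "High": 40},
        })
```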
Append a strict requirement so the generation model verifies every number via the calculator tool and confirms that every Q&A answer is exactly derivable from the PDF:
```python
import yaml

CONSISTENCY_REQUIREMENT = """
Very important: the document must be 100% internally consistent. All generated numbers must be consistent with each other. You MUST use the calculator tool to verify every single numerical relationship in the document for every number.

Before using the calculator, first plan the complete structure of the document: every table, every row label, every column, every entity/location. Decide what line items exist and how they roll up into subtotals and totals. Only after that create a SINGLE, COMPLETE script with ALL calculations included in that one script, using the calculator to verify, derive, or compute every single number that will appear in the document. Specifically:
- This applies across ALL columns and ALL rows of ALL tables; everything must be verified in ONE script.
- You are not allowed to rely solely on mental arithmetic for any calculation.
- If the document contains multiple locations, entities, departments, or time periods, the calculator must compute numbers for ALL of them.
- Every subtotal and total must be verified as the exact sum of its component line items. Every percentage must be verified as the correct division, rounded to the displayed precision.

There is NO tolerance for even small errors — if any sum or percentage deviates from its correct value even by $0.01, this is a grave problem.

Additionally, every answer in qa-pairs.csv MUST be exactly derivable from the numbers present in the accompanying PDF.
"""

# Append the requirement to whatever requirements the spec already carries,
# then write the updated YAML back to DataFramer.
existing_requirements = updated_spec_data.get("requirements") or ""
updated_spec_data["requirements"] = existing_requirements + CONSISTENCY_REQUIREMENT

updated_yaml = yaml.dump({"spec": updated_spec_data}, allow_unicode=True, sort_keys=False)
updated_spec = df_client.dataframer.specs.update(spec_id=spec_id, content_yaml=updated_yaml)
```
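With the spec updated, start generation. The call below is a sketch only: the method name (runs.create) and its parameters are assumptions, not the documented DataFramer API, so use the run-creation call your client actually exposes:

```python
# Assumption: method and parameter names here are illustrative, not the documented API.
run = df_client.dataframer.runs.create(spec_id=spec_id, num_folders=10)
print(f"Run started: {run.id}")  # poll or wait until the run completes
```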
Download the completed run as a ZIP archive and extract it locally. Here is an example of a generated bank statement document:

Each extracted folder contains a bank statement PDF, a qa-pairs.csv with generated golden labels, and a folder.metadata file with DataFramer's generation tags. First unpack the archive and pick a sample folder (sketch below), then inspect the generation tags to confirm the spec properties were applied as configured.
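A minimal unpacking sketch, assuming the run archive was saved locally as generated_run.zip (the file name, and how you download the archive, depend on your setup):

```python
import zipfile
from pathlib import Path

import pandas as pd

# Assumption: the downloaded run archive is named generated_run.zip.
OUTPUT_DIR = Path("generated")
with zipfile.ZipFile("generated_run.zip") as zf:
    zf.extractall(OUTPUT_DIR)

generated_dirs = sorted(d for d in OUTPUT_DIR.iterdir() if d.is_dir())
first_sample_dir = generated_dirs[0]
first_pdf = next(first_sample_dir.glob("*.pdf"))
qa_df = pd.read_csv(first_sample_dir / "qa-pairs.csv")  # generated golden labels
print(f"Inspecting {first_sample_dir.name}: {first_pdf.name}, {len(qa_df)} Q&A pairs")
```

With first_sample_dir, first_pdf, and the generated qa_df in hand, read the generation tags: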
```python
import json

with open(first_sample_dir / "folder.metadata") as f:
    folder_meta = json.load(f)

for key, value in folder_meta.get("generation_tags", {}).items():
    print(f" {key}: {value}")
```
Before running automated evaluation, review the generated Q&A pairs alongside their source PDFs to verify the labels are accurate and well-formed. Because the model was required to derive every answer directly from the PDF, most labels should be exact — but edge cases (multi-step calculations, ambiguous table formatting) are worth a manual spot-check.
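A quick way to eyeball them is to print each generated pair for side-by-side review against the PDF:

```python
# Print the generated golden labels for a manual spot-check against the PDF.
for _, row in qa_df.iterrows():
    print(f"Q: {row['question']}")
    print(f"A: {row['answer']}\n")
```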
Extract text from the first generated PDF, ask each golden question using OpenAI, and compare the model answer to the golden label:
```python
import fitz  # PyMuPDF
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a precise financial data extraction assistant. "
    "The user will provide the full text of a bank financial statement and ask a specific question. "
    "Answer with only the single requested value — no explanation, no units label, no extra text. "
    "If answering with a number, answer with 2 decimal places."
)

# Extract the raw text of the generated PDF.
doc = fitz.open(first_pdf)
pdf_text = "\n\n".join(page.get_text() for page in doc)

# Ask each golden question and record whether the model answer matches exactly.
eval_results = []
for _, row in qa_df.iterrows():
    question = row["question"]
    golden = str(row["answer"])
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Financial statement:\n\n{pdf_text}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    model_answer = response.choices[0].message.content.strip()
    eval_results.append({
        "question": question,
        "model_answer": model_answer,
        "golden": golden,
        "exact_match": model_answer == golden,
    })

results_df = pd.DataFrame(eval_results)
accuracy = results_df["exact_match"].mean()
print(f"Exact-match accuracy: {accuracy:.1%} ({int(results_df['exact_match'].sum())}/{len(results_df)})")
```
Because the golden labels were generated alongside the PDF with a strict derivability requirement, you can be confident that a correct extraction model will match them exactly — making this a high-signal eval dataset.