
Financial Bank Health Statement Document QA

Real-world bank statement PDFs are complex and varied. Before deploying an LLM to extract financial figures from them at scale, you need to know: how accurately does it actually extract the right numbers? This tutorial walks through two connected workflows:
  1. Evaluate extraction quality — ask a fixed set of financial questions across a small set of real bank PDFs and measure how often the model gets the right answer.
  2. Expand the test set with targeted edge cases — use Dataframer to generate new synthetic bank statements that concentrate on the specific scenarios where your model is most likely to fail.

Prerequisites

  • Python 3.9+
  • An OPENAI_API_KEY
  • A DATAFRAMER_API_KEY
pip install openai pymupdf pandas pydataframer tenacity pyyaml requests

Part 1: Evaluate Extraction Quality

Load and parse the PDFs

The tutorial reads bank statement PDFs from a local files/ directory using PyMuPDF and extracts the full text for each document.
import fitz  # PyMuPDF
from pathlib import Path

def extract_pdf_text(pdf_path: Path) -> str:
    doc = fitz.open(pdf_path)
    return "\n\n".join(page.get_text() for page in doc)

Ask the model a fixed question set

A set of five financial questions is asked for every PDF — things like CRE concentration ratios and three-year loan growth rates. The model is instructed to return only the single requested value, making it easy to compare against known answers.
SYSTEM_PROMPT = (
    "You are a precise financial data extraction assistant. "
    "Answer with only the single requested value — no explanation, no units label, no extra text."
)

Score against golden labels

Each model answer is compared to a manually verified golden label. Results are collected into a dataframe so you can see exactly where the model succeeds and where it falls short.
Overall exact-match accuracy: 50.0% (10/20)

Per-question accuracy:
q1_cre_to_tier1_plus_acl_pct          0.75
q2_non_owner_occ_cre_3yr_growth       0.75
q3_single_category_concentration      0.25
q4_1_4_family_residential_to_tier1    0.50
q5_non_depository_growth_vs_tier1     0.25
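The scoring itself can be sketched as a small normalize-and-compare step: parse the first number out of each answer (so "128.4%" and "128.4" agree), then record both an exact match and a looser relative-tolerance match. The file name, question key, and labels below are illustrative, not the tutorial's actual golden data:

```python
import re
from typing import Optional

import pandas as pd

def to_number(answer: str) -> Optional[float]:
    """Pull the first numeric token out of an answer like '128.4%' or '$1,234'."""
    match = re.search(r"-?\d[\d,]*\.?\d*", str(answer))
    return float(match.group().replace(",", "")) if match else None

def score(answer: str, golden: str, rel_tol: float = 0.05) -> dict:
    """Exact match on the parsed number, plus a relative-tolerance match."""
    got, want = to_number(answer), to_number(golden)
    parsed = got is not None and want is not None
    exact = parsed and got == want
    close = parsed and abs(got - want) <= rel_tol * max(abs(want), 1e-9)
    return {"exact_match": exact, "tolerance_match": close}

# One row per (file, question); collect into a dataframe for per-question means.
rows = [{"file": "bank_a.pdf", "question_key": "q1_cre_to_tier1_plus_acl_pct",
         **score("128.4%", "128.4")}]
results = pd.DataFrame(rows)
```

Grouping `results` by `question_key` and averaging the two match columns reproduces a per-question accuracy table like the one above.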
With only four real PDFs, the dataset is too small to draw firm conclusions. Questions like q3 and q5 score 25% — but is that a model problem, a prompt problem, or just an unlucky sample? You need more documents to know.

Part 2: Generate Targeted Edge Cases with Dataframer

Upload the PDFs as a seed dataset

The same four bank statement PDFs are uploaded to Dataframer. Dataframer analyzes their structure, tables, and value patterns to build a Specification — a reusable description of what a bank statement looks like.
dataset = df_client.dataframer.seed_datasets.create_from_zip(
    name="bank_analysis",
    description="Bank Analysis - Concentrations of Credit PDFs",
    zip_file=zip_buffer,
)
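The `zip_file` argument above can be built entirely in memory with the standard library. A minimal sketch, assuming the four statements sit in the same local files/ directory used in Part 1:

```python
import io
import zipfile
from pathlib import Path

# Bundle every seed PDF into an in-memory ZIP archive for upload.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    for pdf in sorted(Path("files").glob("*.pdf")):
        zf.write(pdf, arcname=pdf.name)
zip_buffer.seek(0)  # rewind so the upload reads from the start of the buffer
```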

Generate a Specification

The Specification captures the data properties that appear in the documents — loan category concentrations, capital ratios, growth rates — along with the distribution of values observed across your seed files.
spec = df_client.dataframer.specs.create(
    dataset_id=dataset_id,
    spec_generation_model_name="anthropic/claude-opus-4-6",
    generation_objectives="Include '1-4 Family Construction concentration level' and 'Multifamily concentration level' as data properties.",
    extrapolate_values=False,
    generate_distributions=True,
)

Edit the Specification to target edge cases

This is where Dataframer becomes powerful. Instead of hoping your four PDFs happen to cover the edge cases you care about, you control exactly which values get generated and how often. For example: the real data shows 1-4 Family Construction concentration values spread between 0% and 25%. But you specifically want to test whether your model handles high-stress cases near 95–99% correctly. You update the spec to concentrate all generated documents on those values:
prop["property_values"] = [...existing_values..., 95, 99]
prop["base_distributions"] = {v: (50 if v in (95, 99) else 0) for v in prop["property_values"]}

updated_spec = df_client.dataframer.specs.update(spec_id=spec_id, content_yaml=updated_yaml)
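End to end, the edit is a load-modify-dump round trip over the spec YAML. A minimal sketch; the `data_properties` layout and starting values below are hypothetical stand-ins for whatever your generated spec actually contains, so adapt the field names to the YAML you download:

```python
import yaml

# Hypothetical spec fragment: a list of properties, each with observed
# values and per-value generation weights (actual schema may differ).
spec_yaml = """
data_properties:
  - name: 1-4 Family Construction concentration level
    property_values: [0, 5, 10, 25]
    base_distributions: {0: 25, 5: 25, 10: 25, 25: 25}
"""
spec = yaml.safe_load(spec_yaml)

prop = spec["data_properties"][0]
prop["property_values"] = prop["property_values"] + [95, 99]
# Put all generation weight on the high-stress values (50 + 50 = 100).
prop["base_distributions"] = {
    v: (50 if v in (95, 99) else 0) for v in prop["property_values"]
}

updated_yaml = yaml.safe_dump(spec, sort_keys=False)
```

The resulting `updated_yaml` is what gets passed as `content_yaml` in the update call above.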

Run generation

With the updated spec, generate a batch of synthetic bank statements. Each document will be a realistic, internally consistent bank statement — but with concentration values drawn from the targeted edge-case distribution you specified.
run = df_client.dataframer.runs.create(
    spec_id=updated_spec.id,
    number_of_samples=5,
    generation_model="anthropic/claude-opus-4-6",
    revision_types=["conformance"],
    filtering_types=["conformance"],
    tools=["calculator"],   # ensures numerical consistency across all tables
)

Download the generated documents

Once the run completes, download all the generated files as a ZIP archive and extract them locally for review and evaluation.
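The unpacking step can be sketched with the standard library. The helper below only handles extraction; how you fetch the archive bytes depends on the Dataframer client, so the download itself is shown as a hypothetical comment, and the `generated_files` destination directory is an assumption:

```python
import io
import zipfile

def extract_zip_bytes(data: bytes, dest: str = "generated_files") -> list:
    """Unpack a downloaded results archive and return the member file names."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        zf.extractall(dest)
        return zf.namelist()

# Hypothetical usage once the run completes and you have the archive bytes,
# e.g. from an HTTP GET against the run's results URL:
# names = extract_zip_bytes(archive_bytes)
```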

Part 3: Re-run the Evaluation

With a larger, targeted dataset in hand, you have two options for labeling:

  • Automatic labels — if you know the exact values that were specified in the distributions (e.g. you requested 95% or 99% for a given field), those can serve as ground truth directly. This works well when the spec fully determines the answer.
  • Human review — for more complex questions where the right answer depends on multi-step reasoning across the document (like q3 and q5), have a reviewer go through the generated PDFs and record the correct answers.

Then join those labels back into the evaluation dataframe:
reviewer_labels = pd.read_csv("reviewer_labels.csv")
reviewed_df = df.merge(reviewer_labels, on=["file", "question_key"], how="left")
reviewed_df.groupby("question_key")[["exact_match", "tolerance_match"]].mean()
Either way, you now have a principled way to build a test suite that covers exactly the scenarios you care about — not just the ones that happened to appear in four real-world PDFs.