Fraud Detection: Labeled Training Data for Impossible Geographic Jumps
Training a fraud detection model requires labeled examples of fraud. The problem has two parts: fraud is rare in production data, and the specific patterns you want to catch, like impossible geographic jumps, may never appear in your seed data at all. An impossible geo-jump is a credit card transaction sequence where the same card appears in physically separated locations within a time window too short for legitimate travel: New York at 9:00 AM, Los Angeles at 9:58 AM. These patterns are strong fraud signals, but if your seed data doesn't contain them, you have no labels to learn from. The conventional solution is to hand-label examples, which is slow, or wait for enough production fraud to accumulate, which can take years. DataFramer offers a third path: teach DataFramer a concept it didn't see in your seed, let it shape the generated data around that concept, and collect the golden labels from its annotations, without changing your expected output schema. This tutorial demonstrates exactly that:

- Seed from unlabeled transactions: upload 10 real transaction rows. No `geo_jump_flag` column, no fraud labels of any kind.
- Introduce a novel property: instruct DataFramer to understand and encode geo-jump as a generation-time concept via `generation_objectives`.
- Control the fraud rate: override the distribution so generated data has the fraud density you need for training, not the near-zero rate of real logs.
- Collect golden labels from annotations: DataFramer annotates which generated transactions it intended as geo-jumps, giving you ground truth without touching the output schema.
- Validate with physics: confirm annotation quality by checking impossible transitions against haversine distance, closing the loop between intent and reality.
Prerequisites
- Python 3.9+
- A `DATAFRAMER_API_KEY`
Part 1: The Seed Dataset
The seed contains 10 real credit card transaction rows. Each row records a payment event with full context: timestamp, city, amount, currency, payment channel, authentication method, device trust level, and derived velocity features. There is no `geo_jump_flag` column. No fraud labels of any kind.
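To make the shape of the seed concrete, here is a sketch of two such rows built with the standard library. The column names follow the fields described above but are assumptions, not the tutorial's actual CSV header:

```python
import csv
import io

# Illustrative seed rows; column names are assumptions based on the
# fields described above, not the tutorial's actual CSV header.
columns = ["timestamp", "city", "amount", "currency", "payment_channel",
           "auth_method", "device_trust", "txn_count_last_hour"]
rows = [
    ["2024-03-01T09:00:00Z", "New York", "42.50", "USD", "pos", "chip_pin", "trusted", "1"],
    ["2024-03-01T12:30:00Z", "New York", "18.00", "USD", "online", "3ds", "trusted", "2"],
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(columns)
writer.writerows(rows)
seed_csv = buf.getvalue()
```

Note what is missing: no `geo_jump_flag`, no label column of any kind. That is the starting point.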
Part 2: Introduce Geo-Jump as a Novel Property
Upload the seed
Pack the CSV into a ZIP and upload it. DataFramer analyzes the column structure, value distributions, and temporal patterns in the sequence.

Generate a Specification and introduce the fraud concept
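The packing step uses nothing beyond the standard library. The filenames here are illustrative, and the commented-out upload call is a hypothetical sketch of a client method, not DataFramer's actual API:

```python
import zipfile
from pathlib import Path

# Write the seed CSV, then pack it into a ZIP for upload. Filenames are
# illustrative; the commented upload call is a hypothetical sketch, not
# DataFramer's actual client API.
Path("transactions_seed.csv").write_text(
    "timestamp,city,amount,currency\n2024-03-01T09:00:00Z,New York,42.50,USD\n"
)
with zipfile.ZipFile("seed.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("transactions_seed.csv")

# client.upload_seed("seed.zip")  # hypothetical call; see the DataFramer docs
```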
The Specification is DataFramer's learned model of your data. It captures distributions across every property (cities, payment channels, auth methods, device trust levels, velocity features) along with consistency rules that govern the sequence. The `generation_objectives` parameter is where you inject the concept that was never in your data. You are not adding a new column to the output; you are teaching DataFramer what a geo-jump means, so it can use that understanding to shape what it generates and annotate accordingly:
`extrapolate_values=True` tells DataFramer to infer realistic values beyond the 10-row seed, so generated city sequences cover the full US geography rather than just the handful of cities that appeared in your sample.
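As a minimal sketch, the request might carry parameters like the following. Only the names `generation_objectives` and `extrapolate_values` come from this tutorial; the dict shape and the objective wording are illustrative assumptions, not DataFramer's actual API:

```python
# Hypothetical request parameters. Only the two parameter names below are
# taken from the tutorial; the structure and wording are assumptions.
spec_request = {
    "generation_objectives": (
        "A 'geo_jump' is the same card transacting in two cities that are "
        "physically too far apart to travel between in the elapsed time, "
        "e.g. New York at 9:00 AM and Los Angeles at 9:58 AM. Encode this "
        "concept and annotate each generated transaction with it."
    ),
    "extrapolate_values": True,  # infer realistic values beyond the 10-row seed
}
```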
Override the fraud distribution
By default, DataFramer would generate data that mirrors the fraud rate of your seed. Since the seed has no fraud labels at all, that would produce mostly or entirely normal transactions, which is useless for training a classifier. Retrieve the YAML spec and override the `geo_jump_flag` distribution. The base distribution sets the overall fraud rate; `conditional_distributions` then pins specific states to a much higher fraud rate, targeting the corridors you care about most:
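As a rough illustration, such an override might look like the fragment below. The key names and nesting are assumptions about the spec format, not DataFramer's actual YAML schema; only the 40% base rate and the `conditional_distributions` idea come from the text:

```yaml
# Hypothetical spec fragment -- structure is illustrative only
geo_jump_flag:
  distribution:
    "true": 0.40     # overall fraud rate: dense enough for training signal
    "false": 0.60
  conditional_distributions:
    - when:
        city: "New York"
      distribution:
        "true": 0.70   # pin key corridors to a higher fraud rate
        "false": 0.30
```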
Generate 100 labeled samples (or however many you like)
DataFramer supports models from Anthropic, OpenAI, Google, and open-source providers, so you can mix and match to balance cost against quality. In this example, a capable reasoning model drives the spec analysis and outline, while a faster, cheaper model handles the bulk row generation.

The generated rows match your seed schema exactly, with no `geo_jump_flag` in sight. But DataFramer produces annotations alongside the generated rows: a record of which transactions it generated as geo-jumps. Those annotations are your golden labels, produced as a natural byproduct of generation.
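The split between rows and annotations is sketched below with hand-written stand-in data. The `rows`/`annotations` key names are assumptions about the response format, but the point they illustrate is from the text: labels come from the annotations while the row schema stays untouched:

```python
# Stand-in for a generation result; key names are assumptions, and the two
# rows are hand-written examples, not real DataFramer output.
result = {
    "rows": [
        {"timestamp": "2024-03-01T09:00:00Z", "city": "New York", "amount": 42.5},
        {"timestamp": "2024-03-01T09:58:00Z", "city": "Los Angeles", "amount": 310.0},
    ],
    "annotations": [{"geo_jump": False}, {"geo_jump": True}],
}

# Golden labels live in the annotations; the rows keep the seed schema
labels = [int(a["geo_jump"]) for a in result["annotations"]]
assert "geo_jump_flag" not in result["rows"][0]
```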
Part 3: Validate Label Quality with Physics
DataFramer encoded the geo-jump concept and shaped generation around it. Now verify it did so correctly, using a ground truth that requires no human judgment: the speed of light.

Compute haversine distances
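The haversine formula gives the great-circle distance between two latitude/longitude points; a standard-library implementation fits in a few lines:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km: mean Earth radius

# New York (40.71, -74.01) to Los Angeles (34.05, -118.24): roughly 3,900 km
nyc_to_la = haversine_km(40.71, -74.01, 34.05, -118.24)
```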
Flag impossible transitions
Sort transactions by timestamp and check each consecutive pair. If the required speed exceeds 500 km/h (faster than any commercial flight, accounting for boarding), it is a geo-jump:

Confirm annotation quality
Compare DataFramer's annotations against the computed `geo_jump_detected` flag to measure label quality. Any disagreement points to a generated sequence where the timestamps and cities were inconsistent, which the conformance revision step should have already minimized.
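The two validation steps, flagging impossible transitions and checking them against the annotations, can be sketched end to end. The two-transaction sequence and the annotation list are illustrative stand-ins for real generated output; the haversine helper is redefined so the snippet runs standalone:

```python
from datetime import datetime
from math import radians, sin, cos, asin, sqrt

MAX_SPEED_KMH = 500  # faster than any commercial flight once boarding is included

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in km (redefined here so the snippet is standalone)
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Toy sequence with coordinates attached; in practice, map each generated
# city name to (lat, lon) first. The annotations are illustrative stand-ins
# for the labels DataFramer returns.
txns = [
    {"ts": datetime(2024, 3, 1, 9, 0),  "lat": 40.71, "lon": -74.01},   # New York
    {"ts": datetime(2024, 3, 1, 9, 58), "lat": 34.05, "lon": -118.24},  # Los Angeles
]
annotations = [False, True]

txns.sort(key=lambda t: t["ts"])
geo_jump_detected = [False]  # the first transaction has no prior point
for prev, curr in zip(txns, txns[1:]):
    dist = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    hours = (curr["ts"] - prev["ts"]).total_seconds() / 3600
    geo_jump_detected.append(hours == 0 or dist / hours > MAX_SPEED_KMH)

agreement = sum(d == a for d, a in zip(geo_jump_detected, annotations)) / len(annotations)
```

The New York to Los Angeles hop requires roughly 4,000 km/h, so it is flagged; full agreement with the annotations means the labels are physics-consistent.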
What’s Next?
With physics-validated, annotated training rows:

- Train a classifier: use the DataFramer annotations as the target and the 17 transaction features as inputs; the controlled 40% fraud rate gives gradient signal that real production data at 0.1% cannot.
- Scale up: increase `number_of_samples` to 1,000 or 10,000; DataFramer maintains all distributional properties and temporal consistency at any scale.
- Target specific corridors: update the spec's city distributions to concentrate generation on routes your model currently under-represents.
- Add new fraud concepts: introduce additional novel properties via `generation_objectives` (velocity anomalies, channel-switch patterns, device trust reversals) the same way geo-jump was introduced here.
- Augment with real data: mix generated samples with any labeled production transactions you do have; the output schema is identical to your seed, so they combine without transformation.
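Because generated and real rows share one schema, augmentation really is plain concatenation. A minimal sketch with the standard library, using inline CSV strings as illustrative stand-ins for your actual files:

```python
import csv
import io

# Generated and real rows share one header, so they concatenate directly;
# the inline CSV strings stand in for your actual generated and real files.
def read_rows(text):
    return list(csv.DictReader(io.StringIO(text)))

generated_csv = "timestamp,city,amount\n2024-03-01T09:58:00Z,Los Angeles,310.0\n"
real_csv = "timestamp,city,amount\n2024-03-01T09:00:00Z,New York,42.5\n"
combined = read_rows(generated_csv) + read_rows(real_csv)
```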
Related Tutorials
Support Chatbot Broader Evals
Generate targeted edge-case eval sets with golden labels for LLM evaluation
Insurance Model Drift Detection
Generate data that simulates distribution shift to stress-test model stability

