

Fraud Detection: Labeled Training Data for Impossible Geographic Jumps

Training a fraud detection model requires labeled examples of fraud. The problem has two parts: fraud is rare in production data, and the specific patterns you want to catch, like impossible geographic jumps, may never appear in your seed data at all. An impossible geo-jump is a credit card transaction sequence where the same card appears in physically separated locations within a time window too short for legitimate travel: New York at 9:00 AM, Los Angeles at 9:58 AM. These patterns are strong fraud signals, but if your seed data doesn't contain them, you have no labels to learn from.

The conventional solution is to hand-label examples, which is slow, or wait for enough production fraud to accumulate, which can take years. DataFramer offers a third path: teach DataFramer a concept it didn't see in your seed, let it shape the generated data around that concept, and collect the golden labels from its annotations, without changing your expected output schema. This tutorial demonstrates exactly that:
  1. Seed from unlabeled transactions: upload 10 real transaction rows. No geo_jump_flag column, no fraud labels of any kind.
  2. Introduce a novel property: instruct DataFramer to understand and encode geo-jump as a generation-time concept via generation_objectives.
  3. Control the fraud rate: override the distribution so generated data has the fraud density you need for training, not the near-zero rate of real logs.
  4. Collect golden labels from annotations: DataFramer annotates which generated transactions it intended as geo-jumps — giving you ground truth without touching the output schema.
  5. Validate with physics: confirm annotation quality by checking impossible transitions against haversine distance, closing the loop between intent and reality.

Prerequisites

  • Python 3.9+
  • A DATAFRAMER_API_KEY
pip install pandas pydataframer tenacity pyyaml requests

Part 1: The Seed Dataset

The seed contains 10 real credit card transaction rows. Each row records a payment event with full context: timestamp, city, amount, currency, payment channel, authentication method, device trust level, and derived velocity features. There is no geo_jump_flag column. No fraud labels of any kind.
import pandas as pd

seed_df = pd.read_csv("files/fraud_dataset_geo_jump.csv")
print(seed_df.columns.tolist())
# ['transaction_id', 'customer_name', 'event_timestamp', 'amount', 'currency',
#  'merchant_category_code', 'city', 'state', 'country', 'payment_channel',
#  'auth_method', 'device_trust_level', 'txn_velocity_24h',
#  'days_since_last_txn', 'account_age_months', 'txns_today', 'declines_30d']
Ten rows is enough to show DataFramer the schema and the statistical texture of the data: value ranges, temporal patterns, how cities and channels and auth methods are distributed. What it cannot show DataFramer is the fraud concept — because that concept was never encoded. That is introduced in the next step.

Part 2: Introduce Geo-Jump as a Novel Property

Upload the seed

Pack the CSV into a ZIP and upload it. DataFramer analyzes the column structure, value distributions, and temporal patterns in the sequence.
dataset = df_client.dataframer.seed_datasets.create_from_zip(
    name="fraud_geo_jump",
    description="Transaction sequences for geo-jump fraud simulation",
    zip_file=zip_buffer,
)
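The call above assumes `df_client` is an initialized pydataframer client and that `zip_buffer` already holds the packed seed. Building that in-memory ZIP needs only the standard library; a minimal sketch (the helper name `pack_seed` is illustrative, and the archive name matches the CSV from Part 1):

```python
import io
import zipfile

def pack_seed(csv_path: str) -> io.BytesIO:
    """Pack a seed CSV into an in-memory ZIP suitable for upload."""
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        # Store the file under a stable archive name, regardless of local path.
        zf.write(csv_path, arcname="fraud_dataset_geo_jump.csv")
    zip_buffer.seek(0)  # rewind so the upload reads from the start
    return zip_buffer
```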

Generate a Specification and introduce the fraud concept

The Specification is DataFramer’s learned model of your data. It captures distributions across every property (cities, payment channels, auth methods, device trust levels, velocity features) along with consistency rules that govern the sequence. The generation_objectives parameter is where you inject the concept that was never in your data. You are not adding a new column to the output — you are teaching DataFramer what a geo-jump means, so it can use that understanding to shape what it generates and annotate accordingly:
spec = df_client.dataframer.specs.create(
    dataset_id=dataset_id,
    spec_generation_model_name="anthropic/claude-sonnet-4-6",
    generation_objectives=(
        "Include 'geo_jump_flag' as a data property with values 'normal' and 'geo_jump'. "
        "A geo_jump occurs when consecutive transactions for the same customer appear in "
        "cities that are physically impossible to reach given the elapsed time. "
        "Split 'event_timestamp' into discrete fields: year, month, day, hour, minute, second. "
        "Derived fields (txn_velocity_24h, days_since_last_txn, txns_today) must be "
        "calculated from sequence context, not generated as static values."
    ),
    extrapolate_values=True,
    generate_distributions=True,
)
extrapolate_values=True tells DataFramer to infer realistic values beyond the 10-row seed, so generated city sequences cover the full US geography rather than just the handful of cities that appeared in your sample.

Override the fraud distribution

By default, DataFramer would generate data that mirrors the fraud rate of your seed — which, since the seed has no fraud labels at all, would produce mostly or entirely normal transactions. That is useless for training a classifier. Retrieve the YAML spec and override the geo_jump_flag distribution. The base distribution sets the overall fraud rate; conditional_distributions then pins specific states to a much higher fraud rate, targeting the corridors you care about most:
import copy
import yaml

# spec_data: the spec's YAML content parsed into a dict
# (retrieval from the specs API elided here).
GEO_PROP = "geo_jump_flag"

GEO_DISTRIBUTIONS = {
    "normal": 60,
    "geo_jump": 40,
}

GEO_STATE_CONDITIONAL_DISTRIBUTIONS = {
    "state": {
        "new_york":   {"geo_jump": 90, "normal": 10},
        "california": {"geo_jump": 90, "normal": 10},
        "florida":    {"geo_jump": 90, "normal": 10},
    }
}

updated_spec_data = copy.deepcopy(spec_data)

for prop in updated_spec_data.get("data_property_variations", []):
    if prop["property_name"] == GEO_PROP:
        prop["base_distributions"] = {
            v: GEO_DISTRIBUTIONS.get(v, 0)
            for v in prop["property_values"]
        }
        prop["conditional_distributions"] = GEO_STATE_CONDITIONAL_DISTRIBUTIONS

# Serialize the edited dict back to YAML (pyyaml) before pushing the update.
updated_yaml = yaml.safe_dump(updated_spec_data, sort_keys=False)

updated_spec = df_client.dataframer.specs.update(
    spec_id=spec_id,
    content_yaml=updated_yaml,
)
The base distribution gives 40% geo-jump across all generated transactions. The conditional distributions override that for New York, California, and Florida: transactions in those states will be geo-jumps 90% of the time. DataFramer shapes the entire temporal and geographic sequence around these targets — this is not random oversampling of existing rows, it is targeted generation of the exact scenarios you care about.
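A weight typo in the override silently skews the generated fraud rate, so it is worth checking that every distribution sums to 100 before pushing the update. A small helper, shown on copies of the dictionaries defined above so the snippet runs standalone:

```python
def check_weights(dist: dict, label: str) -> None:
    """Fail loudly if a distribution's weights do not sum to 100."""
    total = sum(dist.values())
    assert total == 100, f"{label} weights sum to {total}, expected 100"

# Copies of the distributions defined above.
check_weights({"normal": 60, "geo_jump": 40}, "base")
for state, dist in {
    "new_york":   {"geo_jump": 90, "normal": 10},
    "california": {"geo_jump": 90, "normal": 10},
    "florida":    {"geo_jump": 90, "normal": 10},
}.items():
    check_weights(dist, f"conditional[{state}]")
```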

Generate 100 labeled samples (or however many you like)

DataFramer supports models from Anthropic, OpenAI, Google, and open-source providers, so you can mix and match to balance cost against quality. In this example, a capable reasoning model drives the spec analysis and outline, while a faster, cheaper model handles the bulk row generation:
run = df_client.dataframer.runs.create(
    spec_id=updated_spec.id,
    number_of_samples=100,
    generation_model="anthropic/claude-haiku-4-5",  # fast, low-cost bulk generation
    revision_types=["consistency", "conformance"],
    max_revision_cycles=1,
    filtering_types=["conformance", "structural"],
    outline_model="anthropic/claude-sonnet-4-6",    # higher-quality reasoning for structure
)
The output schema is identical to your seed: the same 17 columns, no geo_jump_flag in sight. But DataFramer produces annotations alongside the generated rows: a record of which transactions it generated as geo-jumps. Those annotations are your golden labels, produced as a natural byproduct of generation.

Part 3: Validate Label Quality with Physics

DataFramer encoded the geo-jump concept and shaped generation around it. Now verify it did so correctly, using a ground truth that requires no human judgment: the physical limits of travel speed.

Compute haversine distances

import math

CITY_COORDS = {
    "new_york":      (40.7128, -74.0060),
    "los_angeles":   (34.0522, -118.2437),
    "chicago":       (41.8781, -87.6298),
    "san_francisco": (37.7749, -122.4194),
    "miami":         (25.7617, -80.1918),
    # ... full city list
}

def haversine_km(coord1: tuple, coord2: tuple) -> float:
    lat1, lon1 = math.radians(coord1[0]), math.radians(coord1[1])
    lat2, lon2 = math.radians(coord2[0]), math.radians(coord2[1])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    return 6371 * 2 * math.asin(math.sqrt(a))

Flag impossible transitions

For each customer, sort transactions by timestamp and check each consecutive pair. If the implied travel speed exceeds 500 km/h (faster than any commercial flight once boarding and ground time are accounted for), it is a geo-jump:
GEO_JUMP_SPEED_THRESHOLD_KMH = 500

def detect_geo_jumps(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["event_timestamp"] = pd.to_datetime(df["event_timestamp"])
    df["geo_jump_detected"] = False

    # Geo-jumps are defined per customer: only consecutive transactions
    # on the same card can be physically contradictory.
    for _, group in df.groupby("customer_name"):
        idx = group.sort_values("event_timestamp").index.tolist()
        for prev, curr in zip(idx, idx[1:]):
            city_a = df.loc[prev, "city"]
            city_b = df.loc[curr, "city"]
            if city_a == city_b or city_a not in CITY_COORDS or city_b not in CITY_COORDS:
                continue
            time_diff_h = (
                df.loc[curr, "event_timestamp"] - df.loc[prev, "event_timestamp"]
            ).total_seconds() / 3600
            if time_diff_h > 0:
                distance_km = haversine_km(CITY_COORDS[city_a], CITY_COORDS[city_b])
                if distance_km / time_diff_h > GEO_JUMP_SPEED_THRESHOLD_KMH:
                    df.loc[curr, "geo_jump_detected"] = True

    return df
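The check can be exercised end to end on the intro's example pair: New York at 9:00 AM, Los Angeles at 9:58 AM. The distance helper is repeated here so the snippet runs on its own:

```python
import math

# Repeated from above so this snippet is self-contained.
CITY_COORDS = {"new_york": (40.7128, -74.0060), "los_angeles": (34.0522, -118.2437)}

def haversine_km(coord1: tuple, coord2: tuple) -> float:
    lat1, lon1 = math.radians(coord1[0]), math.radians(coord1[1])
    lat2, lon2 = math.radians(coord2[0]), math.radians(coord2[1])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    return 6371 * 2 * math.asin(math.sqrt(a))

dist_km = haversine_km(CITY_COORDS["new_york"], CITY_COORDS["los_angeles"])
elapsed_h = 58 / 60  # 9:00 AM -> 9:58 AM
speed_kmh = dist_km / elapsed_h
print(f"{dist_km:.0f} km in 58 minutes implies {speed_kmh:.0f} km/h")
```

The pair is roughly 3,900 km apart, so 58 minutes implies a speed of about 4,000 km/h, far above the 500 km/h threshold: a clear geo-jump.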

Confirm annotation quality

generated_df = detect_geo_jumps(generated_df)
print(generated_df["geo_jump_detected"].value_counts())
# True     85
# False    15
The physics check confirms that transactions DataFramer annotated as geo-jumps correspond to genuinely impossible city transitions. DataFramer understood the concept, generated data consistent with it, and annotated it correctly — all from a seed that contained none of this information. Cross-check DataFramer’s annotations against geo_jump_detected to measure label quality. Any disagreement points to a generated sequence where the timestamps and cities were inconsistent, which the conformance revision step should have already minimized.
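A minimal cross-check sketch, assuming the annotations can be reduced to a transaction_id-to-label mapping; the exact retrieval shape depends on the DataFramer API, and the mapping and IDs below are purely illustrative:

```python
import pandas as pd

# Hypothetical: annotations reduced to transaction_id -> intended label.
annotations = {"txn_001": "geo_jump", "txn_002": "normal", "txn_003": "geo_jump"}

# Hypothetical slice of the physics-validated output.
generated_df = pd.DataFrame({
    "transaction_id": ["txn_001", "txn_002", "txn_003"],
    "geo_jump_detected": [True, False, False],
})

# Agreement rate between what DataFramer intended and what physics detected.
intended = generated_df["transaction_id"].map(annotations) == "geo_jump"
agreement = (intended == generated_df["geo_jump_detected"]).mean()
print(f"annotation/physics agreement: {agreement:.0%}")  # 2 of 3 rows agree here
```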

What’s Next?

With physics-validated, annotated training rows:
  • Train a classifier: use the DataFramer annotations as the target and the 17 transaction features as inputs; the controlled 40% fraud rate gives gradient signal that real production data at 0.1% cannot
  • Scale up: increase number_of_samples to 1,000 or 10,000; DataFramer maintains all distributional properties and temporal consistency at any scale
  • Target specific corridors: update the spec’s city distributions to concentrate generation on routes your model currently under-represents
  • Add new fraud concepts: introduce additional novel properties via generation_objectives — velocity anomalies, channel-switch patterns, device trust reversals — the same way geo-jump was introduced here
  • Augment with real data: mix generated samples with any labeled production transactions you do have; the output schema is identical to your seed, so they combine without transformation
