
Simulating Population Shift to Expose Model Drift in Life Insurance Underwriting

If you trained an underwriting model on a 2015 applicant population, what happens when the 2026 population looks meaningfully different? This tutorial shows how to use DataFramer to create that shifted population on purpose, then run a frozen underwriting model against both groups to see whether the score distribution has drifted.
Key finding: when scored against the shifted 2026 population — older applicants, higher BMI, more medical conditions — the frozen 2015 model classified 37.8 percentage points fewer applicants as high-risk (classes 7–8), and the mean raw score dropped from 5.6 to 4.4. The model looked at a sicker population and concluded it was safer. This is the shape of silent model drift: nothing looks broken, scores are produced, policies are issued, and the financial exposure only becomes visible in claims data months or years later.
You will walk through the same stages as the notebook:
  1. Train a frozen 2015 underwriting model on the Prudential underwriting dataset.
  2. Learn the seven class thresholds that define that model's 2015 calibration.
  3. Profile the 2015 applicant population as the model’s baseline environment.
  4. Use DataFramer to build a Specification from the 2015 population and shift it to simulate a 2026 applicant pool.
  5. Generate the shifted 2026 synthetic population.
  6. Score both populations with the unchanged 2015 model and compare the risk distributions.
  7. Frame the business question the model cannot answer on its own.

Prerequisites

  • Python 3.9+
  • A DATAFRAMER_API_KEY
  • train.csv from the Prudential underwriting dataset placed in a local files/ directory
pip install xgboost scikit-learn matplotlib pandas pydataframer tenacity pyyaml requests scipy

Part 1: Train the Frozen 2015 Underwriting Model

The notebook starts from Prudential underwriting data with 128 columns. To stay within DataFramer’s seed limits while preserving enough underwriting signal, it selects 29 original fields and adds one derived feature:
  • Biometrics such as Ins_Age, Ht, Wt, and BMI (all normalized to [0, 1])
  • Product attributes from Product_Info_*
  • Employment and insured profile variables
  • Insurance and family history variables
  • Medical_History_1
  • Med_Keywords_Count, which aggregates the 48 binary Medical_Keyword_* columns into a single count
  • Response, the underwriter-assigned risk class from 1 to 8
raw_df = pd.read_csv(FILES_DIR / "train.csv")

keyword_cols = [c for c in raw_df.columns if c.startswith("Medical_Keyword_")]
raw_df["Med_Keywords_Count"] = raw_df[keyword_cols].sum(axis=1)

df = raw_df[FEATURE_COLS + [TARGET_COL]].copy()
The model is an XGBRegressor. Instead of predicting the 1–8 risk classes directly, it predicts a continuous score and then learns seven thresholds that map that score back onto Prudential’s eight underwriting classes. Those thresholds are optimized by maximizing quadratic weighted kappa (QWK) via Nelder-Mead.
model = XGBRegressor(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1,
    verbosity=0,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=100)
Those thresholds are the key calibration artifact. They are learned on the 2015 training population and then held fixed for the rest of the tutorial.
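The threshold search itself is not reproduced above. A minimal sketch of what a QWK-maximizing threshold fit could look like (the `predict_classes` and `learn_thresholds` names and the starting cut points are assumptions, not the notebook's exact code):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import cohen_kappa_score

def predict_classes(scores, thresholds):
    # Map a continuous score onto classes 1..8 via 7 sorted cut points
    return np.searchsorted(np.sort(thresholds), scores) + 1

def learn_thresholds(scores, y_true):
    """Pick 7 cut points that maximize quadratic weighted kappa (QWK)."""
    init = np.arange(1.5, 8.5)  # 1.5, 2.5, ..., 7.5 as a starting guess

    def neg_qwk(th):
        preds = predict_classes(scores, th)
        return -cohen_kappa_score(y_true, preds, weights="quadratic")

    res = minimize(neg_qwk, init, method="Nelder-Mead")
    return np.sort(res.x)
```

Because QWK is piecewise constant in the thresholds, Nelder-Mead is a pragmatic choice: it needs no gradients and simply walks the simplex until kappa stops improving.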

Part 2: Establish the 2015 Calibration

The seven thresholds learned in Part 1 are the calibration artifact. They map the model's continuous score back onto Prudential's 8-level underwriting scale, are calibrated on the 2015 training population, and are held fixed for the rest of the analysis. That frozen threshold set is the 2015 calibration.

Part 3: Profile the 2015 Applicant Population

Before simulating drift, score the original applicant population with the frozen model and record the baseline distribution of predicted underwriting classes.
all_preds_2015 = model.predict(engineer_features(df[FEATURE_COLS]))
classes_2015 = predict_classes(all_preds_2015, THRESHOLDS)
This baseline records the mean raw score (5.6), the share of applicants in high-risk classes 7 and 8, and the full per-class distribution. Everything that follows is measured against it.
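A small helper makes those baseline numbers explicit (the function name is illustrative; the notebook may compute these inline):

```python
import numpy as np

def summarize_population(raw_scores, classes):
    """Profile a scored population: mean raw score, high-risk share
    (classes 7-8), and the full per-class percentage distribution."""
    classes = np.asarray(classes)
    return {
        "mean_raw_score": float(np.mean(raw_scores)),
        "pct_high_risk": float(np.isin(classes, [7, 8]).mean() * 100),
        "class_pcts": {c: float((classes == c).mean() * 100)
                       for c in range(1, 9)},
    }
```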

Part 4: Build and Shift a DataFramer Specification

Build the spec from the 2015 population

Upload the full 2015 applicant table as a seed dataset. DataFramer analyzes the schema, value ranges, and inter-feature relationships, then produces a Specification that captures the population in a reusable form.
seed_csv = io.BytesIO(seed_df.to_csv(index=False).encode("utf-8"))
seed_csv.name = "insurance_seed.csv"

dataset = df_client.dataframer.seed_datasets.create_with_files(
    name=f"insurance_seed_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    description="Life insurance applicant seed dataset (2015 population)",
    dataset_type="SINGLE_FILE",
    files=[seed_csv],
)

spec = df_client.dataframer.specs.create(
    dataset_id=dataset_id,
    name=f"insurance_2015_spec_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    spec_generation_model_name="anthropic/claude-sonnet-4-6",
    extrapolate_values=True,
    generate_distributions=True,
)
Once the spec is ready, inspect the analyzed data properties and their base distributions. This is the point where the original underwriting population becomes editable.
spec = df_client.dataframer.specs.retrieve(spec_id=spec_id)
config = yaml.safe_load(spec.content_yaml)
spec_data = config.get("spec", config)

Shift the spec to simulate a 2026 population

The tutorial edits the spec to represent three directional changes in the applicant pool:
  • Ins_Age: increase the mean by 7 years
  • BMI: increase the mean by 9%
  • Med_Keywords_Count: increase the mean by 50%
Instead of hand-editing percentages arbitrarily, the notebook uses exponential tilting. That method finds the minimum-KL-divergence adjustment to the existing distribution that hits the new target mean exactly, preserving distribution shape as much as possible.
SHIFT_TARGETS = {
    "ins_age":            {"delta": 7 / INS_AGE_NORM_DENOMINATOR, "relative": False},
    "bmi":                {"delta": 0.09,  "relative": True},
    "med_keywords_count": {"delta": 0.50,  "relative": True},
}
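`shift_distribution` is not reproduced in the notebook excerpt. One way the exponential-tilting step could be implemented is to reweight the observed values by `exp(t * x)` and solve for the tilt strength `t` that lands the weighted mean exactly on the target (the function name and bracket bounds are assumptions):

```python
import numpy as np
from scipy.optimize import brentq

def tilt_weights(values, target_mean):
    """Exponential tilting: reweight samples by exp(t * x) so the weighted
    mean equals target_mean. Among all reweightings that hit the target,
    this one minimizes KL divergence from the original distribution."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()  # center for numerical stability

    def gap(t):
        w = np.exp(t * centered)
        return np.dot(w / w.sum(), values) - target_mean

    # Assumes target_mean is reachable within this bracket of tilt strengths
    t = brentq(gap, -50.0, 50.0)
    w = np.exp(t * centered)
    return w / w.sum()
```

Positive `t` tilts probability mass toward larger values (older, higher-BMI applicants); `t = 0` recovers the original uniform weights.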
For each shifted property, the notebook recalculates the base_distributions inside the spec and writes the updated YAML back to DataFramer:
new_dist = shift_distribution(values, current_dist, target_mean)
prop["base_distributions"] = new_dist

updated_yaml = yaml.dump({"spec": updated_spec_data}, allow_unicode=True, sort_keys=False)
updated_spec = df_client.dataframer.specs.update(spec_id=spec_id, content_yaml=updated_yaml)
This is the core drift-testing move: you are not waiting for real 2026 data to accumulate. You are editing the 2015 population description directly and expressing a plausible future applicant pool explicitly and reproducibly.

Part 5: Generate the Shifted Population

With the updated spec in place, generate a synthetic population of 500 applicants:
run = df_client.dataframer.runs.create(
    spec_id=updated_spec.id,
    number_of_samples=500,
    generation_model="anthropic/claude-haiku-4-5",
    outline_model="anthropic/claude-haiku-4-5",
    filtering_types=["conformance", "structural"],
    skip_outline=True
)
After the run completes, download and load the generated CSV files:
dl = poll_download(run_id)
download_url = dl if isinstance(dl, str) else dl.download_url

zip_data = requests.get(download_url).content
with zipfile.ZipFile(io.BytesIO(zip_data)) as zf:
    zf.extractall(output_dir)

gen_dfs = [pd.read_csv(f) for f in sorted(output_dir.glob("*.csv"))]
gen_df = pd.concat(gen_dfs, ignore_index=True)
At this point you can confirm that the synthetic population actually moved in the intended direction by comparing the means and histograms for age, BMI, and medical keyword count across the original and generated data.
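That directional check could be sketched as follows (the helper name is illustrative, and it assumes both tables share the shifted column names):

```python
import pandas as pd

def compare_means(orig, gen, cols):
    """Tabulate mean shifts between the original and generated populations."""
    rows = []
    for c in cols:
        m0, m1 = orig[c].mean(), gen[c].mean()
        rows.append({"feature": c, "mean_2015": m0, "mean_2026": m1,
                     "pct_change": 100 * (m1 - m0) / m0})
    return pd.DataFrame(rows)
```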

Part 6: Score Both Populations with the Same 2015 Calibration

Now apply the unchanged model and unchanged thresholds to both populations:
preds_2015 = model.predict(engineer_features(df[FEATURE_COLS]))
classes_2015 = predict_classes(preds_2015, THRESHOLDS)

gen_features = gen_df.reindex(columns=FEATURE_COLS).copy()
preds_2026 = model.predict(engineer_features(gen_features))
classes_2026 = predict_classes(preds_2026, THRESHOLDS)
The tutorial then compares:
  • Risk-class percentages for classes 1 through 8
  • The share of applicants landing in high-risk classes 7 and 8
  • The full raw score distributions under the fixed 2015 thresholds
for c in range(1, 9):
    p15 = (classes_2015 == c).mean() * 100
    p26 = (classes_2026 == c).mean() * 100
    delta = p26 - p15
    print(f"  {c:<6} {p15:>7.1f}% {p26:>7.1f}%  {delta:+.1f}pp")

The Business Question the Model Cannot Answer on Its Own

This is the key business takeaway from the notebook. The frozen 2015 model, with the same weights and the same class thresholds, classified 37.8 percentage points fewer applicants from the shifted 2026 population as high-risk (classes 7–8). The mean raw score dropped from 5.6 to 4.4. The model looked at a population that is older, heavier, and carries more medical conditions — and concluded it was safer. There are two plausible explanations:
  1. The 2015 weights no longer reflect reality. A feature that was rare and strongly correlated with mortality risk in 2015 may now be common and well-managed. The model still applies the 2015 coefficient to a signal that has lost its predictive content, systematically underpricing risk for exactly the applicants who pose the greatest exposure.
  2. The model is extrapolating outside its training regime. The 2015 training set never saw enough applicants with this specific combination of age, BMI, and keyword count. When pushed outside its training distribution, the model collapses toward the center — producing lower predicted risk scores precisely where the uncertainty is highest.
The model cannot tell you which explanation is true. That is the exact point of the exercise. DataFramer gives you a controlled way to construct the shifted population and expose the behavior change before you have enough real outcomes data to settle the question. It turns “something feels off” into a concrete validation scenario for underwriting, model-risk, and calibration teams.

What DataFramer Enables Here

  • Create a realistic shifted underwriting population without exposing real policyholder data
  • Control which applicant characteristics move and by how much
  • Re-run the same drift scenario against multiple model versions or calibration strategies
  • Reproduce the scenario reliably across model versions or underwriting rules
  • Do the analysis in hours instead of waiting months for naturally accumulated data with the right distribution
The next step is to attach real outcomes to a 2026 validation cohort, compare predicted classes with actual claim frequency by class, and determine whether the observed collapse in high-risk classification is explained by genuine risk landscape change, stale calibration, out-of-distribution extrapolation, or some combination of the three.
