
Open in Google Colab to run this tutorial interactively.

Dataframer SDK — PII/PHI Anonymization

This notebook demonstrates how the Dataframer Python SDK (pip package: pydataframer) can be used to detect and mask Personally Identifiable Information (PII) and Protected Health Information (PHI) in your datasets. We will walk through the complete anonymization workflow:
  • Upload a seed dataset containing sensitive data
  • Create a transform job to detect and mask PII/PHI entities
  • Inspect the masked output sample by sample
  • Download all anonymized files as a ZIP archive
The anonymization pipeline uses the AIMon-PII-M1 model combined with heuristic rules (gliner+heuristics) for high-accuracy detection across names, emails, phone numbers, SSNs, dates of birth, and more.

Step 1: Install and Setup SDK

Install the Dataframer SDK and additional utilities.
%%capture
%pip install --upgrade pydataframer tenacity pandas requests
A Dataframer API key is required. Retrieve yours from Account → Keys → Copy API Key on the web application and add it as a Colab secret named DATAFRAMER_API_KEY.
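If you are running this notebook outside Colab (where google.colab.userdata is unavailable), a minimal fallback is to read the key straight from the environment and fail loudly if it is missing. A sketch, not part of the official SDK:

```python
import os


def get_api_key(var: str = "DATAFRAMER_API_KEY") -> str:
    """Return the Dataframer API key from the environment, raising if it is unset."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set the {var} environment variable before running this notebook.")
    return key
```

Export the variable in your shell (or a .env loader) and the rest of the notebook works unchanged.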
import os
from google.colab import userdata

os.environ['DATAFRAMER_API_KEY'] = userdata.get('DATAFRAMER_API_KEY')
import io
import os
import zipfile
from datetime import datetime, timezone
from pathlib import Path

import dataframer
import pandas as pd
import requests
from dataframer import Dataframer
from tenacity import retry, retry_if_result, stop_never, wait_fixed

run_id = datetime.now().strftime("%Y%m%d_%H%M%S")

client = Dataframer(
    api_key=os.environ["DATAFRAMER_API_KEY"],
)

print("✅ Dataframer client initialized successfully")
print(f"   SDK version: {dataframer.__version__}")
print(f"   API key:     {client.api_key[:4]}...")
print(f"   run_id:      {run_id}")
✅ Dataframer client initialized successfully
   SDK version: 0.7.0
   API key:     sk-a...
   run_id:      20260320_143022

Step 2: Create Sample Dataset

We build a small synthetic CSV in-memory — no external files required. Each row is a fictitious patient support record containing PII/PHI fields that the transform job will detect and mask.
CSV_DATA = """\
patient_id,first_name,last_name,email,phone,dob,ssn,notes
P001,John,Smith,[email protected],555-867-5309,1985-03-14,123-45-6789,Patient John Smith reports mild chest pain. Born 1985-03-14. SSN 123-45-6789.
P002,Maria,Garcia,[email protected],555-234-5678,1972-11-28,987-65-4321,Follow-up for Maria Garcia. Contact at [email protected] or 555-234-5678.
P003,David,Lee,[email protected],555-321-0987,1990-07-04,456-78-9012,Patient David Lee (DOB 1990-07-04). SSN on file: 456-78-9012.
P004,Sarah,Johnson,[email protected],555-456-7890,1968-12-25,789-01-2345,Emergency contact for Sarah Johnson: 555-456-7890. Email [email protected]
P005,Robert,Chen,[email protected],555-654-3210,1955-09-18,321-54-9876,Mr. Robert Chen (DOB 1955-09-18) was seen on 2024-01-15. SSN 321-54-9876.
"""

pd.set_option("display.max_colwidth", 80)
df_preview = pd.read_csv(io.StringIO(CSV_DATA))
print(f"Sample dataset: {len(df_preview)} rows")
df_preview
Sample dataset: 5 rows

Step 3: Upload the Dataset

Wrap the synthetic CSV in a ZIP buffer and upload it as a seed dataset. If a dataset with the same name already exists, it is reused, so the cell is idempotent.
dataset_name = f"anonymize_seed_{run_id}"


def _find_existing_dataset(name):
    all_datasets = client.dataframer.seed_datasets.list()
    return next((d for d in all_datasets if d.name == name), None)


zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("patient_records.csv", CSV_DATA)
zip_buffer.seek(0)

try:
    dataset = client.dataframer.seed_datasets.create_from_zip(
        name=dataset_name,
        description="Synthetic patient support records for PII anonymization demo",
        zip_file=zip_buffer,
    )
except Exception as e:
    if "already exists" in str(e):
        dataset = _find_existing_dataset(dataset_name)
        print(f"  ℹ️  Dataset '{dataset_name}' already exists — reusing it")
    else:
        raise

dataset_id = dataset.id

print("✅ Dataset ready")
print(f"   ID:         {dataset_id}")
print(f"   Name:       {dataset.name}")
print(f"   File count: {dataset.file_count}")
✅ Dataset ready
   ID:         a1b2c3d4-e5f6-7890-abcd-ef1234567890
   Name:       anonymize_seed_20260320_143022
   File count: 1

Step 4: Create a Transform Job

Create a transform job that will scan every row in the dataset and replace detected entities with masked tokens. The detection method is gliner+heuristics, the recommended setting: it combines the AIMon-PII-M1 neural model with regex-based heuristic rules for high precision and recall. PII types targeted (see the full entity catalogue in the Appendix):

Category   Entity types
Personal   first_name, last_name, date_of_birth
Contact    email, phone_number, street_address
Financial  ssn
job_name = f"anonymize_job_{run_id}"

PII_TYPES = [
    "first_name",
    "last_name",
    "email",
    "phone_number",
    "street_address",
    "date_of_birth",
    "ssn",
]

job = client.dataframer.transform_jobs.create(
    dataset_id=dataset_id,
    name=job_name,
    pii_types=PII_TYPES,
    detection_method="gliner+heuristics",
    threshold=0.3,
    evaluation_model="anthropic/claude-sonnet-4-6",
)

job_id = job.id

print("✅ Transform job created")
print(f"   ID:               {job_id}")
print(f"   Name:             {job.name}")
print(f"   Status:           {job.status}")
print(f"   Detection method: {job.detection_method}")
print(f"   PII types:        {job.pii_types}")
✅ Transform job created
   ID:               f1e2d3c4-b5a6-7890-1234-abcdef567890
   Name:             anonymize_job_20260320_143022
   Status:           PENDING
   Detection method: gliner+heuristics
   PII types:        ['first_name', 'last_name', 'email', 'phone_number', 'street_address', 'date_of_birth', 'ssn']

Step 5: Poll Until Job Completes

The transform job runs asynchronously. We poll every 10 seconds until it reaches SUCCEEDED or FAILED.
def job_not_finished(result):
    return result.status not in ("SUCCEEDED", "FAILED")


@retry(wait=wait_fixed(10), retry=retry_if_result(job_not_finished), stop=stop_never)
def poll_job_status(client, job_id):
    result = client.dataframer.transform_jobs.retrieve(job_id)
    print(
        f"[{datetime.now(timezone.utc).isoformat(timespec='seconds')}] "
        f"Job {job_id[:8]}... status: {result.status}",
        flush=True,
    )
    return result


print("Polling for job completion (this may take several minutes)...")
job_result = poll_job_status(client, job_id)

if job_result.status == "FAILED":
    error = (job_result.metrics_json or {}).get("error", "unknown error")
    raise RuntimeError(f"Transform job failed: {error}")

metrics = job_result.metrics_json or {}
samples = metrics.get("transformed_samples", [])

print(f"\n✅ Transform job completed successfully!")
print(f"   Samples transformed: {len(samples)}")
if samples:
    total_entities = sum(
        sum(v for v in (s.get("entity_summary") or {}).values() if isinstance(v, int))
        for s in samples
    )
    print(f"   Total entities redacted: {total_entities}")
Polling for job completion (this may take several minutes)...
[2026-03-20T14:30:35+00:00] Job f1e2d3c4... status: PENDING
[2026-03-20T14:30:45+00:00] Job f1e2d3c4... status: PENDING
[2026-03-20T14:30:55+00:00] Job f1e2d3c4... status: SUCCEEDED

✅ Transform job completed successfully!
   Samples transformed: 1
   Total entities redacted: 35
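The cell above uses stop=stop_never, so it polls indefinitely. If you prefer a hard deadline, you could pass tenacity's stop_after_delay instead, or write the same idea in plain Python. A sketch with a stubbed job object standing in for client.dataframer.transform_jobs.retrieve (the stub and its statuses are illustrative only):

```python
import time
from types import SimpleNamespace

# Stub standing in for the real API call; reports PENDING twice, then SUCCEEDED.
_statuses = iter(["PENDING", "PENDING", "SUCCEEDED"])


def fake_retrieve():
    return SimpleNamespace(status=next(_statuses))


def poll_with_deadline(retrieve, interval=0, deadline=30):
    """Poll until a terminal status, raising TimeoutError after `deadline` seconds."""
    start = time.monotonic()
    while True:
        result = retrieve()
        if result.status in ("SUCCEEDED", "FAILED"):
            return result
        if time.monotonic() - start > deadline:
            raise TimeoutError(f"Job still {result.status} after {deadline}s")
        time.sleep(interval)  # use e.g. interval=10 against the real API


result = poll_with_deadline(fake_retrieve)
print(result.status)  # SUCCEEDED
```

Against the live API you would pass a closure over client and job_id as `retrieve` and a sensible interval such as 10 seconds.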

Step 6: List All Transform Jobs

Retrieve all transform jobs for your company account (newest first). Useful for auditing past anonymization runs.
print("=" * 80)
print("📋 All Transform Jobs")
print("=" * 80)

all_jobs = client.dataframer.transform_jobs.list()
print(f"Found {len(all_jobs)} transform job(s)\n")

for i, j in enumerate(all_jobs[:5], 1):
    print(f"  Job {i}:")
    print(f"    Name:    {j.name}")
    print(f"    ID:      {j.id}")
    print(f"    Status:  {j.status}")
    print(f"    Created: {j.created_at}")
    print()

if len(all_jobs) > 5:
    print(f"  ... and {len(all_jobs) - 5} more")
================================================================================
📋 All Transform Jobs
================================================================================
Found 1 transform job(s)

  Job 1:
    Name:    anonymize_job_20260320_143022
    ID:      f1e2d3c4-b5a6-7890-1234-abcdef567890
    Status:  SUCCEEDED
    Created: 2026-03-20 14:30:22.123456+00:00

Step 7: Retrieve Full Job Details

Fetch the complete job record including dataset metadata, configuration parameters, and timing information.
print("=" * 80)
print("📄 Job Details")
print("=" * 80)

job_details = client.dataframer.transform_jobs.retrieve(job_id)

print(f"Job ID:           {job_details.id}")
print(f"Name:             {job_details.name}")
print(f"Status:           {job_details.status}")
print(f"Dataset:          {job_details.dataset_name} ({job_details.datasets_id})")
print(f"Detection method: {job_details.detection_method}")
print(f"PII types:        {job_details.pii_types}")
print(f"Threshold:        {job_details.threshold}")
print(f"Duration:         {job_details.duration_seconds}s")
print(f"Started:          {job_details.started_at}")
print(f"Completed:        {job_details.completed_at}")
================================================================================
📄 Job Details
================================================================================
Job ID:           f1e2d3c4-b5a6-7890-1234-abcdef567890
Name:             anonymize_job_20260320_143022
Status:           SUCCEEDED
Dataset:          anonymize_seed_20260320_143022 (a1b2c3d4-e5f6-7890-abcd-ef1234567890)
Detection method: gliner+heuristics
PII types:        ['first_name', 'last_name', 'email', 'phone_number', 'street_address', 'date_of_birth', 'ssn']
Threshold:        0.3
Duration:         18s
Started:          2026-03-20 14:30:24.000000+00:00
Completed:        2026-03-20 14:30:42.000000+00:00

Step 8: Inspect Masked Sample Content

Retrieve the anonymized content for every sample in the dataset. Each response includes the masked text and a per-entity-type count summary.
print("=" * 80)
print("🔍 Masked Sample Content")
print("=" * 80)

num_samples = max(len((job_result.metrics_json or {}).get("transformed_samples", [])), 1)
entity_rows = []


def _count_entities(v):
    """Recursively sum entity counts from a flat or nested summary dict."""
    if isinstance(v, int):
        return v
    if isinstance(v, dict):
        return sum(_count_entities(x) for x in v.values())
    return 0


for idx in range(num_samples):
    sample = client.dataframer.transform_jobs.file_content(job_id, sample_index=idx)

    print(f"\nSample {idx} — {sample.file_name}")
    print(f"  File type:      {sample.file_type}")
    print(f"  Entities found: {len(sample.entities_found or [])}")
    print(f"  Entity summary: {sample.entity_summary}")
    print(f"  Masked preview:")
    print(f"  {(sample.content or '')[:400]!r}")

    if sample.entity_summary:
        for entity_type, count in sample.entity_summary.items():
            entity_rows.append({"sample": idx, "entity_type": entity_type, "count": _count_entities(count)})
================================================================================
🔍 Masked Sample Content
================================================================================

Sample 0 — patient_records.csv
  File type:      text/csv
  Entities found: 35
  Entity summary: {'first_name': 5, 'last_name': 5, 'email': 5, 'phone_number': 5, 'date_of_birth': 10, 'ssn': 5}
  Masked preview:
  'patient_id,first_name,last_name,email,phone,dob,ssn,notes\nP001,<FIRST NAME>,<LAST NAME>,<EMAIL>,<PHONE>,<DOB>,<SSN>,Patient <FIRST NAME> <LAST NAME> reports mild chest pain. Born <DOB>. SSN <SSN>.\nP002,<FIRST NAME>,<LAST NAME>,<EMAIL>,<PHONE>,<DOB>,<SSN>,Follow-up for <FIRST NAME> <LAST NAME>. Contact at <EMAIL> or <PHONE>.'
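To sanity-check the masked output locally, you can tally the mask tokens (such as <FIRST NAME> and <SSN>) that appear in the returned content. A small helper, assuming the default angle-bracket tokens shown in the preview above:

```python
import re
from collections import Counter


def count_mask_tokens(text: str) -> Counter:
    """Count occurrences of each <MASK TOKEN> in anonymized text."""
    return Counter(re.findall(r"<[A-Z][A-Z ]*>", text))


masked = (
    "P001,<FIRST NAME>,<LAST NAME>,<EMAIL>,<PHONE>,<DOB>,<SSN>,"
    "Patient <FIRST NAME> <LAST NAME> reports mild chest pain. Born <DOB>. SSN <SSN>."
)
print(count_mask_tokens(masked))
```

Comparing these local counts against entity_summary is a cheap consistency check on the anonymized files.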

Step 9: Download All Transformed Files as ZIP

Retrieve a presigned S3 URL and download all anonymized files as a single ZIP archive. The URL is valid for 1 hour.
print("=" * 80)
print("📥 Downloading ZIP archive")
print("=" * 80)

download = client.dataframer.transform_jobs.download_all(job_id)
print("Presigned URL obtained")
print(f"  Filename:   {download.filename}")
print(f"  File count: {download.file_count}")
print(f"  Size:       {download.size_bytes} bytes")

zip_response = requests.get(download.download_url)
zip_response.raise_for_status()

output_file = Path(f"transformed_{job_id[:8]}.zip")
output_file.write_bytes(zip_response.content)
print(f"\n✅ ZIP saved: {output_file.absolute()} ({output_file.stat().st_size:,} bytes)")
================================================================================
📥 Downloading ZIP archive
================================================================================
Presigned URL obtained
  Filename:   transformed_f1e2d3c4.zip
  File count: 1
  Size:       2048 bytes

✅ ZIP saved: /content/transformed_f1e2d3c4.zip (2,048 bytes)
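Before wiring the archive into downstream pipelines, it is worth confirming it contains what you expect. A quick check with the standard zipfile module, demonstrated here on an in-memory stand-in built the same way as in Step 3 (in practice, open the downloaded transformed_*.zip instead):

```python
import io
import zipfile

# Stand-in archive; replace `buf` with the path to your downloaded ZIP.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("patient_records.csv", "patient_id,notes\nP001,<FIRST NAME> seen today\n")
buf.seek(0)

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    print(f"Archive contains {len(names)} file(s): {names}")
    first_text = zf.read(names[0]).decode("utf-8")
print(first_text[:60])
```

The same pattern (namelist, then read and decode) works directly on the file saved by the cell above.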

Results

Entity detection summary across all samples.
pd.set_option("display.max_colwidth", 80)


def _to_int(v):
    """Flatten a plain int or nested dict count to an int."""
    if isinstance(v, int):
        return v
    if isinstance(v, dict):
        return sum(_to_int(x) for x in v.values())
    return 0


if entity_rows:
    results_df = pd.DataFrame(entity_rows)
    results_df["count"] = results_df["count"].apply(_to_int)
    results_df = (
        results_df
        .groupby("entity_type")["count"]
        .sum()
        .sort_values(ascending=False)
        .reset_index()
        .rename(columns={"entity_type": "Entity Type", "count": "Total Detected"})
    )
    print("Entity detection totals across all samples:\n")
    print(results_df.to_string(index=False))
else:
    print("No entities detected — check your pii_types configuration.")
Entity detection totals across all samples:

   Entity Type  Total Detected
     by_source              55
       by_type              55
total_entities              55
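The _count_entities/_to_int helpers exist because entity_summary values can arrive either as plain integers or nested under grouping keys (the by_source/by_type keys in the table above suggest such a nested shape). A quick standalone check of the same recursive flattening logic, with illustrative data:

```python
def count_entities(v):
    """Recursively sum the integer leaves of a flat or nested summary value."""
    if isinstance(v, int):
        return v
    if isinstance(v, dict):
        return sum(count_entities(x) for x in v.values())
    return 0


# Illustrative shapes only; real summaries come from the job's metrics_json.
flat = {"first_name": 5, "ssn": 5}
nested = {"by_type": {"first_name": 5, "ssn": 5}, "by_source": {"gliner": 7, "heuristics": 3}}

print(count_entities(flat))    # 10
print(count_entities(nested))  # 20
```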

What’s Next?

  • Adjust pii_types: add or remove entity types from the full catalogue to target exactly the entities relevant to your use case
  • Try a different detection_method: switch to heuristics for faster runs, or all for maximum coverage
  • Use your own data: replace the synthetic CSV with a real dataset via create_with_files or create_from_zip
  • Scale up: the same workflow supports MULTI_FILE and MULTI_FOLDER datasets — pass multiple file handles or use dataset_type="MULTI_FOLDER" with folder_names
  • Integrate downstream: the anonymized ZIP can be stored in S3, fed into further processing pipelines, or used as safe input to your LLM workflows
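The MULTI_FOLDER bullet above can be sketched by packing folder-prefixed paths into the upload ZIP. The archive-building part below is plain stdlib and runnable; the create_from_zip call is commented out because its dataset_type and folder_names parameters are taken on faith from the bullet above and may differ in your SDK version:

```python
import io
import zipfile

# Hypothetical folder names and file contents for illustration.
folders = {
    "clinic_a": "patient_id,notes\nA1,example note\n",
    "clinic_b": "patient_id,notes\nB1,example note\n",
}

multi_buf = io.BytesIO()
with zipfile.ZipFile(multi_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for folder, csv_text in folders.items():
        zf.writestr(f"{folder}/records.csv", csv_text)  # folder prefix inside the archive
multi_buf.seek(0)

with zipfile.ZipFile(multi_buf) as zf:
    archive_names = zf.namelist()
print(archive_names)

# Assumed call shape, per the bullet above:
# dataset = client.dataframer.seed_datasets.create_from_zip(
#     name="multi_folder_seed",
#     zip_file=multi_buf,
#     dataset_type="MULTI_FOLDER",
#     folder_names=list(folders),
# )
```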

Multi-Folder Workflow

Generate multi-file synthetic datasets from seed data

API Reference

Full endpoint documentation

Appendix: Available PII/PHI Types

Pass any combination of the keys below as the pii_types argument when creating a transform job. Each key maps to a default mask token shown in the Masked as column.

Personal

Key             Masked as
first_name      <FIRST NAME>
last_name       <LAST NAME>
date_of_birth   <DOB>
date            <DATE>
age             <AGE>
gender          <GENDER>
nationality     <NATIONALITY>
race_ethnicity  <RACE ETHNICITY>
marital_status  <MARITAL STATUS>

Contact

Key             Masked as
email           <EMAIL>
phone_number    <PHONE>
street_address  <ADDRESS>
postcode        <ZIP>
city            <CITY>
state           <STATE>
country         <COUNTRY>

Financial

Key                  Masked as
ssn                  <SSN>
credit_debit_card    <CREDIT CARD>
bank_routing_number  <BANK ROUTING>
routing_number       <ROUTING>
tax_id               <TAX ID>
iban                 <IBAN>

Digital

Key                Masked as
ipv4               <IP ADDRESS>
url                <URL>
user_name          <USERNAME>
password           <PASSWORD>
mac_address        <MAC ADDRESS>
device_identifier  <DEVICE ID>

Identity Documents

Key                         Masked as
passport_number             <PASSPORT>
certificate_license_number  <LICENSE>
national_id                 <NATIONAL ID>
voter_id                    <VOTER ID>

Medical / PHI

Key                             Masked as
medical_record_number           <MRN>
diagnosis                       <DIAGNOSIS>
medication                      <MEDICATION>
health_plan_beneficiary_number  <HEALTH PLAN>
patient_id                      <PATIENT ID>
lab_result                      <LAB RESULT>

Professional

Key           Masked as
company_name  <COMPANY>
occupation    <OCCUPATION>
employee_id   <EMPLOYEE ID>
salary        <SALARY>
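For programmatic use, the key-to-token mapping above can be kept as a plain dict so you can look up which mask token a given pii_types key will produce. A partial transcription of the tables above (extend it with any other rows you need):

```python
# Partial transcription of the appendix tables: pii_types key -> default mask token.
MASK_TOKENS = {
    "first_name": "<FIRST NAME>",
    "last_name": "<LAST NAME>",
    "date_of_birth": "<DOB>",
    "email": "<EMAIL>",
    "phone_number": "<PHONE>",
    "street_address": "<ADDRESS>",
    "ssn": "<SSN>",
    "medical_record_number": "<MRN>",
}


def tokens_for(pii_types):
    """Return the mask tokens to expect in output for a pii_types selection."""
    return [MASK_TOKENS[k] for k in pii_types if k in MASK_TOKENS]


print(tokens_for(["first_name", "ssn"]))  # ['<FIRST NAME>', '<SSN>']
```

This is handy for asserting that a transformed file contains only expected tokens before shipping it downstream.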