
Open in Google Colab to run this tutorial interactively.

Dataframer SDK — PII/PHI Anonymization

This notebook demonstrates how the Dataframer Python SDK (pip package: pydataframer) can be used to detect and mask Personally Identifiable Information (PII) and Protected Health Information (PHI) in your datasets. We will walk through the complete anonymization workflow:
  • Upload a seed dataset containing sensitive data
  • Create a transform job to detect and mask PII/PHI entities
  • Inspect the masked output sample by sample
  • Download all anonymized files as a ZIP archive
The anonymization pipeline uses the AIMon-PII-M1 model combined with heuristic rules (gliner+heuristics) for high-accuracy detection across names, emails, phone numbers, SSNs, dates of birth, and more.

Step 1: Install and Setup SDK

Install the Dataframer SDK and additional utilities.
%%capture
%pip install --upgrade pydataframer tenacity pandas requests
A Dataframer API key is required. Retrieve yours from Account → Keys → Copy API Key on the web application and add it as a Colab secret named DATAFRAMER_API_KEY.
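If you are running this notebook outside Colab (where google.colab.userdata is unavailable), a minimal fallback is to read the key straight from the environment and fail loudly if it is missing. A sketch, not part of the official SDK:

```python
import os


def get_api_key(var: str = "DATAFRAMER_API_KEY") -> str:
    """Return the Dataframer API key from the environment, raising if it is unset."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set the {var} environment variable before running this notebook.")
    return key
```

Export the variable in your shell (or a .env loader) and the rest of the notebook works unchanged.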
import os
from google.colab import userdata

os.environ['DATAFRAMER_API_KEY'] = userdata.get('DATAFRAMER_API_KEY')
import io
import os
import zipfile
from datetime import datetime, timezone
from pathlib import Path

import dataframer
import pandas as pd
import requests
from dataframer import Dataframer
from tenacity import retry, retry_if_result, stop_never, wait_fixed

run_id = datetime.now().strftime("%Y%m%d_%H%M%S")

client = Dataframer(
    api_key=os.environ["DATAFRAMER_API_KEY"],
)

print("✅ Dataframer client initialized successfully")
print(f"   SDK version: {dataframer.__version__}")
print(f"   API key:     {client.api_key[:4]}...")
print(f"   run_id:      {run_id}")
✅ Dataframer client initialized successfully
   SDK version: 0.7.0
   API key:     sk-a...
   run_id:      20260320_143022

Step 2: Create Sample Dataset

We build a small synthetic CSV in-memory — no external files required. Each row is a fictitious patient support record containing PII/PHI fields that the transform job will detect and mask.
CSV_DATA = """\
patient_id,first_name,last_name,email,phone,dob,ssn,notes
P001,John,Smith,[email protected],555-867-5309,1985-03-14,123-45-6789,Patient John Smith reports mild chest pain. Born 1985-03-14. SSN 123-45-6789.
P002,Maria,Garcia,[email protected],555-234-5678,1972-11-28,987-65-4321,Follow-up for Maria Garcia. Contact at [email protected] or 555-234-5678.
P003,David,Lee,[email protected],555-321-0987,1990-07-04,456-78-9012,Patient David Lee (DOB 1990-07-04). SSN on file: 456-78-9012.
P004,Sarah,Johnson,[email protected],555-456-7890,1968-12-25,789-01-2345,Emergency contact for Sarah Johnson: 555-456-7890. Email [email protected]
P005,Robert,Chen,[email protected],555-654-3210,1955-09-18,321-54-9876,Mr. Robert Chen (DOB 1955-09-18) was seen on 2024-01-15. SSN 321-54-9876.
"""

pd.set_option("display.max_colwidth", 80)
df_preview = pd.read_csv(io.StringIO(CSV_DATA))
print(f"Sample dataset: {len(df_preview)} rows")
df_preview
Sample dataset: 5 rows

Step 3: Upload the Dataset

Wrap the synthetic CSV in a ZIP buffer and upload it as a seed dataset. If a dataset with the same name already exists, it is reused, so the cell is idempotent.
dataset_name = f"anonymize_seed_{run_id}"


def _find_existing_dataset(name):
    all_datasets = client.dataframer.seed_datasets.list()
    return next((d for d in all_datasets if d.name == name), None)


zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("patient_records.csv", CSV_DATA)
zip_buffer.seek(0)

try:
    dataset = client.dataframer.seed_datasets.create_from_zip(
        name=dataset_name,
        description="Synthetic patient support records for PII anonymization demo",
        zip_file=zip_buffer,
    )
except Exception as e:
    if "already exists" in str(e):
        dataset = _find_existing_dataset(dataset_name)
        print(f"  ℹ️  Dataset '{dataset_name}' already exists — reusing it")
    else:
        raise

dataset_id = dataset.id

print("✅ Dataset ready")
print(f"   ID:         {dataset_id}")
print(f"   Name:       {dataset.name}")
print(f"   File count: {dataset.file_count}")
✅ Dataset ready
   ID:         a1b2c3d4-e5f6-7890-abcd-ef1234567890
   Name:       anonymize_seed_20260320_143022
   File count: 1

Step 4: Create a Transform Job

Create a transform job that will scan every row in the dataset and replace detected entities with masked tokens. The detection method is gliner+heuristics, the recommended setting: it combines the AIMon-PII-M1 neural model with regex-based heuristic rules for high precision and recall. PII types targeted (see the full entity catalogue in the Appendix):

Category   Entity types
Personal   first_name, last_name, date_of_birth
Contact    email, phone_number, street_address
Financial  ssn
job_name = f"anonymize_job_{run_id}"

PII_TYPES = [
    "first_name",
    "last_name",
    "email",
    "phone_number",
    "street_address",
    "date_of_birth",
    "ssn",
]

job = client.dataframer.transform_jobs.create(
    dataset_id=dataset_id,
    name=job_name,
    pii_types=PII_TYPES,
    detection_method="gliner+heuristics",
    threshold=0.3,
    evaluation_model="anthropic/claude-sonnet-4-6",
)

job_id = job.id

print("✅ Transform job created")
print(f"   ID:               {job_id}")
print(f"   Name:             {job.name}")
print(f"   Status:           {job.status}")
print(f"   Detection method: {job.detection_method}")
print(f"   PII types:        {job.pii_types}")
✅ Transform job created
   ID:               f1e2d3c4-b5a6-7890-1234-abcdef567890
   Name:             anonymize_job_20260320_143022
   Status:           PENDING
   Detection method: gliner+heuristics
   PII types:        ['first_name', 'last_name', 'email', 'phone_number', 'street_address', 'date_of_birth', 'ssn']

Step 5: Poll Until Job Completes

The transform job runs asynchronously. We poll every 10 seconds until it reaches SUCCEEDED or FAILED.
def job_not_finished(result):
    return result.status not in ("SUCCEEDED", "FAILED")


@retry(wait=wait_fixed(10), retry=retry_if_result(job_not_finished), stop=stop_never)
def poll_job_status(client, job_id):
    result = client.dataframer.transform_jobs.retrieve(job_id)
    print(
        f"[{datetime.now(timezone.utc).isoformat(timespec='seconds')}] "
        f"Job {job_id[:8]}... status: {result.status}",
        flush=True,
    )
    return result


print("Polling for job completion (this may take several minutes)...")
job_result = poll_job_status(client, job_id)

if job_result.status == "FAILED":
    error = (job_result.metrics_json or {}).get("error", "unknown error")
    raise RuntimeError(f"Transform job failed: {error}")

metrics = job_result.metrics_json or {}
samples = metrics.get("transformed_samples", [])

print(f"\n✅ Transform job completed successfully!")
print(f"   Samples transformed: {len(samples)}")
if samples:
    total_entities = sum(
        sum(v for v in (s.get("entity_summary") or {}).values() if isinstance(v, int))
        for s in samples
    )
    print(f"   Total entities redacted: {total_entities}")
Polling for job completion (this may take several minutes)...
[2026-03-20T14:30:35+00:00] Job f1e2d3c4... status: PENDING
[2026-03-20T14:30:45+00:00] Job f1e2d3c4... status: PENDING
[2026-03-20T14:30:55+00:00] Job f1e2d3c4... status: SUCCEEDED

✅ Transform job completed successfully!
   Samples transformed: 1
   Total entities redacted: 35
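The cell above uses stop=stop_never, so it polls indefinitely. If you prefer a hard deadline, you could pass tenacity's stop_after_delay instead, or write the same idea in plain Python. A sketch with a stubbed job object standing in for client.dataframer.transform_jobs.retrieve (the stub and its statuses are illustrative only):

```python
import time
from types import SimpleNamespace

# Stub standing in for the real API call; reports PENDING twice, then SUCCEEDED.
_statuses = iter(["PENDING", "PENDING", "SUCCEEDED"])


def fake_retrieve():
    return SimpleNamespace(status=next(_statuses))


def poll_with_deadline(retrieve, interval=0, deadline=30):
    """Poll until a terminal status, raising TimeoutError after `deadline` seconds."""
    start = time.monotonic()
    while True:
        result = retrieve()
        if result.status in ("SUCCEEDED", "FAILED"):
            return result
        if time.monotonic() - start > deadline:
            raise TimeoutError(f"Job still {result.status} after {deadline}s")
        time.sleep(interval)  # use e.g. interval=10 against the real API


result = poll_with_deadline(fake_retrieve)
print(result.status)  # SUCCEEDED
```

Against the live API you would pass a closure over client and job_id as `retrieve` and a sensible interval such as 10 seconds.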

Step 6: List All Transform Jobs

Retrieve all transform jobs for your company account (newest first). Useful for auditing past anonymization runs.
print("=" * 80)
print("📋 All Transform Jobs")
print("=" * 80)

all_jobs = client.dataframer.transform_jobs.list()
print(f"Found {len(all_jobs)} transform job(s)\n")

for i, j in enumerate(all_jobs[:5], 1):
    print(f"  Job {i}:")
    print(f"    Name:    {j.name}")
    print(f"    ID:      {j.id}")
    print(f"    Status:  {j.status}")
    print(f"    Created: {j.created_at}")
    print()

if len(all_jobs) > 5:
    print(f"  ... and {len(all_jobs) - 5} more")
================================================================================
📋 All Transform Jobs
================================================================================
Found 1 transform job(s)

  Job 1:
    Name:    anonymize_job_20260320_143022
    ID:      f1e2d3c4-b5a6-7890-1234-abcdef567890
    Status:  SUCCEEDED
    Created: 2026-03-20 14:30:22.123456+00:00

Step 7: Retrieve Full Job Details

Fetch the complete job record including dataset metadata, configuration parameters, and timing information.
print("=" * 80)
print("📄 Job Details")
print("=" * 80)

job_details = client.dataframer.transform_jobs.retrieve(job_id)

print(f"Job ID:           {job_details.id}")
print(f"Name:             {job_details.name}")
print(f"Status:           {job_details.status}")
print(f"Dataset:          {job_details.dataset_name} ({job_details.datasets_id})")
print(f"Detection method: {job_details.detection_method}")
print(f"PII types:        {job_details.pii_types}")
print(f"Threshold:        {job_details.threshold}")
print(f"Duration:         {job_details.duration_seconds}s")
print(f"Started:          {job_details.started_at}")
print(f"Completed:        {job_details.completed_at}")
================================================================================
📄 Job Details
================================================================================
Job ID:           f1e2d3c4-b5a6-7890-1234-abcdef567890
Name:             anonymize_job_20260320_143022
Status:           SUCCEEDED
Dataset:          anonymize_seed_20260320_143022 (a1b2c3d4-e5f6-7890-abcd-ef1234567890)
Detection method: gliner+heuristics
PII types:        ['first_name', 'last_name', 'email', 'phone_number', 'street_address', 'date_of_birth', 'ssn']
Threshold:        0.3
Duration:         18s
Started:          2026-03-20 14:30:24.000000+00:00
Completed:        2026-03-20 14:30:42.000000+00:00

Step 8: Inspect Masked Sample Content

Retrieve the anonymized content for every sample in the dataset. Each response includes the masked text and a per-entity-type count summary.
print("=" * 80)
print("🔍 Masked Sample Content")
print("=" * 80)

num_samples = max(len((job_result.metrics_json or {}).get("transformed_samples", [])), 1)
entity_rows = []


def _count_entities(v):
    """Recursively sum entity counts from a flat or nested summary dict."""
    if isinstance(v, int):
        return v
    if isinstance(v, dict):
        return sum(_count_entities(x) for x in v.values())
    return 0


for idx in range(num_samples):
    sample = client.dataframer.transform_jobs.file_content(job_id, sample_index=idx)

    print(f"\nSample {idx} — {sample.file_name}")
    print(f"  File type:      {sample.file_type}")
    print(f"  Entities found: {len(sample.entities_found or [])}")
    print(f"  Entity summary: {sample.entity_summary}")
    print(f"  Masked preview:")
    print(f"  {(sample.content or '')[:400]!r}")

    if sample.entity_summary:
        for entity_type, count in sample.entity_summary.items():
            entity_rows.append({"sample": idx, "entity_type": entity_type, "count": _count_entities(count)})
================================================================================
🔍 Masked Sample Content
================================================================================

Sample 0 — patient_records.csv
  File type:      text/csv
  Entities found: 35
  Entity summary: {'first_name': 5, 'last_name': 5, 'email': 5, 'phone_number': 5, 'date_of_birth': 10, 'ssn': 5}
  Masked preview:
  'patient_id,first_name,last_name,email,phone,dob,ssn,notes\nP001,<FIRST NAME>,<LAST NAME>,<EMAIL>,<PHONE>,<DOB>,<SSN>,Patient <FIRST NAME> <LAST NAME> reports mild chest pain. Born <DOB>. SSN <SSN>.\nP002,<FIRST NAME>,<LAST NAME>,<EMAIL>,<PHONE>,<DOB>,<SSN>,Follow-up for <FIRST NAME> <LAST NAME>. Contact at <EMAIL> or <PHONE>.'
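To sanity-check the masked output locally, you can tally the mask tokens (such as <FIRST NAME> and <SSN>) that appear in the returned content. A small helper, assuming the default angle-bracket tokens shown in the preview above:

```python
import re
from collections import Counter


def count_mask_tokens(text: str) -> Counter:
    """Count occurrences of each <MASK TOKEN> in anonymized text."""
    return Counter(re.findall(r"<[A-Z][A-Z ]*>", text))


masked = (
    "P001,<FIRST NAME>,<LAST NAME>,<EMAIL>,<PHONE>,<DOB>,<SSN>,"
    "Patient <FIRST NAME> <LAST NAME> reports mild chest pain. Born <DOB>. SSN <SSN>."
)
print(count_mask_tokens(masked))
```

Comparing these local counts against entity_summary is a cheap consistency check on the anonymized files.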

Step 9: Download All Transformed Files as ZIP

Retrieve a presigned S3 URL and download all anonymized files as a single ZIP archive. The URL is valid for 1 hour.
print("=" * 80)
print("📥 Downloading ZIP archive")
print("=" * 80)

download = client.dataframer.transform_jobs.download_all(job_id)
print("Presigned URL obtained")
print(f"  Filename:   {download.filename}")
print(f"  File count: {download.file_count}")
print(f"  Size:       {download.size_bytes} bytes")

zip_response = requests.get(download.download_url)
zip_response.raise_for_status()

output_file = Path(f"transformed_{job_id[:8]}.zip")
output_file.write_bytes(zip_response.content)
print(f"\n✅ ZIP saved: {output_file.absolute()} ({output_file.stat().st_size:,} bytes)")
================================================================================
📥 Downloading ZIP archive
================================================================================
Presigned URL obtained
  Filename:   transformed_f1e2d3c4.zip
  File count: 1
  Size:       2048 bytes

✅ ZIP saved: /content/transformed_f1e2d3c4.zip (2,048 bytes)
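Before wiring the archive into downstream pipelines, it is worth confirming it contains what you expect. A quick check with the standard zipfile module, demonstrated here on an in-memory stand-in built the same way as in Step 3 (in practice, open the downloaded transformed_*.zip instead):

```python
import io
import zipfile

# Stand-in archive; replace `buf` with the path to your downloaded ZIP.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("patient_records.csv", "patient_id,notes\nP001,<FIRST NAME> seen today\n")
buf.seek(0)

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    print(f"Archive contains {len(names)} file(s): {names}")
    first_text = zf.read(names[0]).decode("utf-8")
print(first_text[:60])
```

The same pattern (namelist, then read and decode) works directly on the file saved by the cell above.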

Results

Entity detection summary across all samples.
pd.set_option("display.max_colwidth", 80)


def _to_int(v):
    """Flatten a plain int or nested dict count to an int."""
    if isinstance(v, int):
        return v
    if isinstance(v, dict):
        return sum(_to_int(x) for x in v.values())
    return 0


if entity_rows:
    results_df = pd.DataFrame(entity_rows)
    results_df["count"] = results_df["count"].apply(_to_int)
    results_df = (
        results_df
        .groupby("entity_type")["count"]
        .sum()
        .sort_values(ascending=False)
        .reset_index()
        .rename(columns={"entity_type": "Entity Type", "count": "Total Detected"})
    )
    print("Entity detection totals across all samples:\n")
    print(results_df.to_string(index=False))
else:
    print("No entities detected — check your pii_types configuration.")
Entity detection totals across all samples:

   Entity Type  Total Detected
     by_source              55
       by_type              55
total_entities              55
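The _count_entities/_to_int helpers exist because entity_summary values can arrive either as plain integers or nested under grouping keys (the by_source/by_type keys in the table above suggest such a nested shape). A quick standalone check of the same recursive flattening logic, with illustrative data:

```python
def count_entities(v):
    """Recursively sum the integer leaves of a flat or nested summary value."""
    if isinstance(v, int):
        return v
    if isinstance(v, dict):
        return sum(count_entities(x) for x in v.values())
    return 0


# Illustrative shapes only; real summaries come from the job's metrics_json.
flat = {"first_name": 5, "ssn": 5}
nested = {"by_type": {"first_name": 5, "ssn": 5}, "by_source": {"gliner": 7, "heuristics": 3}}

print(count_entities(flat))    # 10
print(count_entities(nested))  # 20
```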

What’s Next?

  • Adjust pii_types: add or remove entity types from the full catalogue to target exactly the entities relevant to your use case
  • Try a different detection_method: switch to heuristics for faster runs, or all for maximum coverage
  • Use your own data: replace the synthetic CSV with a real dataset via create_with_files or create_from_zip
  • Scale up: the same workflow supports MULTI_FILE and MULTI_FOLDER datasets — pass multiple file handles or use dataset_type="MULTI_FOLDER" with folder_names
  • Integrate downstream: the anonymized ZIP can be stored in S3, fed into further processing pipelines, or used as safe input to your LLM workflows
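The MULTI_FOLDER bullet above can be sketched by packing folder-prefixed paths into the upload ZIP. The archive-building part below is plain stdlib and runnable; the create_from_zip call is commented out because its dataset_type and folder_names parameters are taken on faith from the bullet above and may differ in your SDK version:

```python
import io
import zipfile

# Hypothetical folder names and file contents for illustration.
folders = {
    "clinic_a": "patient_id,notes\nA1,example note\n",
    "clinic_b": "patient_id,notes\nB1,example note\n",
}

multi_buf = io.BytesIO()
with zipfile.ZipFile(multi_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for folder, csv_text in folders.items():
        zf.writestr(f"{folder}/records.csv", csv_text)  # folder prefix inside the archive
multi_buf.seek(0)

with zipfile.ZipFile(multi_buf) as zf:
    archive_names = zf.namelist()
print(archive_names)

# Assumed call shape, per the bullet above:
# dataset = client.dataframer.seed_datasets.create_from_zip(
#     name="multi_folder_seed",
#     zip_file=multi_buf,
#     dataset_type="MULTI_FOLDER",
#     folder_names=list(folders),
# )
```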

Multi-Folder Workflow

Generate multi-file synthetic datasets from seed data

API Reference

Full endpoint documentation

Appendix: Available PII/PHI Types

Pass any combination of the keys below as the pii_types argument when creating a transform job. Each key maps to a default mask token shown in the Masked as column.

Personal

Key             Masked as
first_name      <FIRST NAME>
last_name       <LAST NAME>
date_of_birth   <DOB>
date            <DATE>
age             <AGE>
gender          <GENDER>
nationality     <NATIONALITY>
race_ethnicity  <RACE ETHNICITY>
marital_status  <MARITAL STATUS>

Contact

Key             Masked as
email           <EMAIL>
phone_number    <PHONE>
street_address  <ADDRESS>
postcode        <ZIP>
city            <CITY>
state           <STATE>
country         <COUNTRY>

Financial

Key                  Masked as
ssn                  <SSN>
credit_debit_card    <CREDIT CARD>
bank_routing_number  <BANK ROUTING>
routing_number       <ROUTING>
tax_id               <TAX ID>
iban                 <IBAN>

Digital

Key                Masked as
ipv4               <IP ADDRESS>
url                <URL>
user_name          <USERNAME>
password           <PASSWORD>
mac_address        <MAC ADDRESS>
device_identifier  <DEVICE ID>

Identity Documents

Key                         Masked as
passport_number             <PASSPORT>
certificate_license_number  <LICENSE>
national_id                 <NATIONAL ID>
voter_id                    <VOTER ID>

Medical / PHI

Key                             Masked as
medical_record_number           <MRN>
diagnosis                       <DIAGNOSIS>
medication                      <MEDICATION>
health_plan_beneficiary_number  <HEALTH PLAN>
patient_id                      <PATIENT ID>
lab_result                      <LAB RESULT>

Professional

Key           Masked as
company_name  <COMPANY>
occupation    <OCCUPATION>
employee_id   <EMPLOYEE ID>
salary        <SALARY>
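For programmatic use, the key-to-token mapping above can be kept as a plain dict so you can look up which mask token a given pii_types key will produce. A partial transcription of the tables above (extend it with any other rows you need):

```python
# Partial transcription of the appendix tables: pii_types key -> default mask token.
MASK_TOKENS = {
    "first_name": "<FIRST NAME>",
    "last_name": "<LAST NAME>",
    "date_of_birth": "<DOB>",
    "email": "<EMAIL>",
    "phone_number": "<PHONE>",
    "street_address": "<ADDRESS>",
    "ssn": "<SSN>",
    "medical_record_number": "<MRN>",
}


def tokens_for(pii_types):
    """Return the mask tokens to expect in output for a pii_types selection."""
    return [MASK_TOKENS[k] for k in pii_types if k in MASK_TOKENS]


print(tokens_for(["first_name", "ssn"]))  # ['<FIRST NAME>', '<SSN>']
```

This is handy for asserting that a transformed file contains only expected tokens before shipping it downstream.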