In this notebook we demonstrate how the DataFramer Python SDK (pip package: pydataframer) can be used to detect and mask Personally Identifiable Information (PII) and Protected Health Information (PHI) in your datasets.

We will walk through the complete anonymization workflow:
Upload a seed dataset containing sensitive data
Create an anonymization run to detect and mask PII/PHI entities
Inspect the masked output sample by sample
Download all anonymized files as a ZIP archive
The anonymization pipeline uses the AIMon-PII-M1 model combined with heuristic rules (aimon_pii_m1+heuristics) for high-accuracy detection across names, emails, phone numbers, SSNs, dates of birth, and more.
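To give a feel for what the heuristic half of such a detector does, here is a minimal sketch of regex-based masking. The patterns and the mask tokens (`[SSN]`, `[PHONE]`, `[DATE]`) are hypothetical and chosen for illustration only; they are not the rules or tokens used by the aimon_pii_m1+heuristics pipeline.

```python
import re

# Hypothetical patterns and mask tokens, for illustration only; these are
# not the rules or tokens used by the aimon_pii_m1+heuristics pipeline.
HEURISTIC_RULES = [
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("[DATE]", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]


def mask_with_heuristics(text):
    """Replace every regex match with its mask token and count the hits."""
    counts = {}
    for token, pattern in HEURISTIC_RULES:
        text, n = pattern.subn(token, text)
        if n:
            counts[token] = n
    return text, counts


masked, counts = mask_with_heuristics(
    "Patient John Smith reports mild chest pain. Born 1985-03-14. SSN 123-45-6789."
)
print(masked)
print(counts)
```

In this sketch the name "John Smith" passes through unmasked, because fixed patterns only catch rigidly formatted values. That gap is one plausible reason to pair heuristics with a learned model.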
A Dataframer API key is required. Retrieve yours from Account → Keys → Copy API Key on the web application and add it as a Colab secret named DATAFRAMER_API_KEY.
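One way to read the secret is sketched below. The helper name `load_api_key` is hypothetical, and the fallback to an environment variable outside Colab is an assumption added here for convenience; `google.colab.userdata` is only importable inside Colab.

```python
import os


def load_api_key(name="DATAFRAMER_API_KEY"):
    """Return the API key from Colab secrets, falling back to the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    key = userdata.get(name) if userdata is not None else os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it as a Colab secret or env var.")
    return key
```

Calling `load_api_key()` once near the top of the notebook keeps the key out of the code cells themselves.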
We build a small synthetic CSV in-memory — no external files required. Each row is a fictitious patient support record containing PII/PHI fields that the anonymization run will detect and mask.
import io

import pandas as pd

# Synthetic, fictitious patient support records with PII/PHI fields.
CSV_DATA = """\
patient_id,first_name,last_name,email,phone,dob,ssn,notes
P001,John,Smith,[email protected],555-867-5309,1985-03-14,123-45-6789,Patient John Smith reports mild chest pain. Born 1985-03-14. SSN 123-45-6789.
P002,Maria,Garcia,[email protected],555-234-5678,1972-11-28,987-65-4321,Follow-up for Maria Garcia. Contact at [email protected] or 555-234-5678.
P003,David,Lee,[email protected],555-321-0987,1990-07-04,456-78-9012,Patient David Lee (DOB 1990-07-04). SSN on file: 456-78-9012.
P004,Sarah,Johnson,[email protected],555-456-7890,1968-12-25,789-01-2345,Emergency contact for Sarah Johnson: 555-456-7890. Email [email protected]
P005,Robert,Chen,[email protected],555-654-3210,1955-09-18,321-54-9876,Mr. Robert Chen (DOB 1955-09-18) was seen on 2024-01-15. SSN 321-54-9876.
"""

# Show up to 80 characters per column so the notes field stays readable.
pd.set_option("display.max_colwidth", 80)
df_preview = pd.read_csv(io.StringIO(CSV_DATA))
print(f"Sample dataset: {len(df_preview)} rows")
df_preview
Create an anonymization run that will scan every row in the dataset and replace detected entities with masked tokens.

Detection method: aimon_pii_m1+heuristics (the recommended setting). Combines the AIMon-PII-M1 neural model with regex-based heuristic rules for high precision and recall.

PII types targeted (see the full entity catalogue):
The anonymization run executes asynchronously. We poll every 10 seconds until it reaches SUCCEEDED or FAILED.
from datetime import datetime, timezone

from tenacity import retry, retry_if_result, stop_never, wait_fixed


def run_not_finished(result):
    """Return True while the run has not reached a terminal status."""
    return result.status not in ("SUCCEEDED", "FAILED")


# Poll every 10 seconds, with no upper bound on the number of attempts.
@retry(wait=wait_fixed(10), retry=retry_if_result(run_not_finished), stop=stop_never)
def poll_run_status(client, run_id):
    result = client.dataframer.anonymization_runs.retrieve(run_id)
    print(
        f"[{datetime.now(timezone.utc).isoformat(timespec='seconds')}] "
        f"Run {run_id[:8]}... status: {result.status}",
        flush=True,
    )
    return result


print("Polling for run completion (this may take several minutes)...")
run_result = poll_run_status(client, run_id)

if run_result.status == "FAILED":
    error = (run_result.results or {}).get("error", "unknown error")
    raise RuntimeError(f"Anonymization run failed: {error}")

# Summarize what was redacted across the whole run.
results = run_result.results or {}
entity_summary = (results.get("entity_summary") or {}).get("overall", {})
print(f"\n✅ Anonymization run completed successfully!")
print(f" Files processed: {results.get('samples_processed', 0)}")
print(f" Total entities redacted: {sum(entity_summary.values())}")
print(f" Entity breakdown: {entity_summary}")
Polling for run completion (this may take several minutes)...
[2026-03-20T14:30:35+00:00] Run f1e2d3c4... status: PENDING
[2026-03-20T14:30:45+00:00] Run f1e2d3c4... status: PENDING
[2026-03-20T14:30:55+00:00] Run f1e2d3c4... status: SUCCEEDED

✅ Anonymization run completed successfully!
 Files processed: 1
 Total entities redacted: 35
 Entity breakdown: {'first_name': 5, 'last_name': 5, 'email': 5, 'phone_number': 5, 'date_of_birth': 10, 'ssn': 5}
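The polling cell uses stop_never, so it keeps polling for as long as the run stays in a non-terminal state. If you prefer an upper bound, the sketch below shows the same loop with a deadline, using only the standard library. The helper name `poll_until_done` and its parameters are hypothetical, and the run is simulated with a canned sequence of statuses rather than a real API call.

```python
import time


def poll_until_done(fetch_status, interval_s=10.0, timeout_s=1800.0, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in ("SUCCEEDED", "FAILED"):
            return status
        # Give up if the next wait would take us past the deadline.
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"Run still {status} after {timeout_s:.0f}s")
        sleep(interval_s)


# Simulated run: two PENDING polls, then SUCCEEDED (no real API call).
fake_statuses = iter(["PENDING", "PENDING", "SUCCEEDED"])
final = poll_until_done(lambda: next(fake_statuses), sleep=lambda s: None)
print(final)
```

With tenacity, replacing stop_never with stop_after_delay should give a comparable bound while keeping the decorator style of the original cell.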
Adjust pii_types: add or remove entity types from the full catalogue to target exactly the entities relevant to your use case
Try a different detection_method: switch to heuristics for faster runs, or all for maximum coverage
Use your own data: replace the synthetic CSV with a real dataset via create_with_files or create_from_zip
Scale up: the same workflow supports MULTI_FILE and MULTI_FOLDER datasets — pass multiple file handles or use dataset_type="MULTI_FOLDER" with folder_names
Integrate downstream: the anonymized ZIP can be stored in S3, fed into further processing pipelines, or used as safe input to your LLM workflows
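As a starting point for downstream integration, the sketch below reads every CSV out of a ZIP archive held in memory. It assumes you already have the archive as bytes and that it contains CSV files; the helper name `read_csvs_from_zip`, the file name `patients.csv`, and the `[SSN]` token are hypothetical stand-ins, and the real archive layout may differ.

```python
import csv
import io
import zipfile


def read_csvs_from_zip(zip_bytes):
    """Return {member name: list of row dicts} for every CSV in the archive."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            text = archive.read(name).decode("utf-8")
            tables[name] = list(csv.DictReader(io.StringIO(text)))
    return tables


# Stand-in archive built locally; a real one would come from the download step.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("patients.csv", "patient_id,ssn\nP001,[SSN]\n")

tables = read_csvs_from_zip(buffer.getvalue())
print(tables["patients.csv"][0])
```

Working on bytes in memory avoids writing anonymized files to local disk before they are handed to the next stage.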
Pass any combination of the keys below as the pii_types argument when creating an anonymization run. Each key maps to a default mask token shown in the Masked as column.