Dataframer SDK — PII/PHI Anonymization
In this notebook we demonstrate how the Dataframer Python SDK (pip package: `pydataframer`) can be used to detect and mask Personally Identifiable Information (PII) and Protected Health Information (PHI) in your datasets.
We will walk through the complete anonymization workflow:
- Upload a seed dataset containing sensitive data
- Create a transform job to detect and mask PII/PHI entities
- Inspect the masked output sample by sample
- Download all anonymized files as a ZIP archive
Detection uses the AIMon-PII-M1 model combined with heuristic rules (`gliner+heuristics`) for high-accuracy detection across names, emails, phone numbers, SSNs, dates of birth, and more.
Step 1: Install and Setup SDK
Install the Dataframer SDK and additional utilities, then make your API key available as `DATAFRAMER_API_KEY`.
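The setup can be sketched as follows, assuming the key is supplied through an environment variable named `DATAFRAMER_API_KEY`. The helper `load_api_key` is a hypothetical name used only for illustration, and the client construction is left out because its exact signature is not shown on this page.

```python
# Install first (package name as given above):
#   pip install pydataframer
import os


def load_api_key(env=os.environ, name="DATAFRAMER_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error.

    `env` is any mapping; it defaults to the process environment.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before running the notebook."
        )
    return key
```

Failing early with a readable message is usually easier to debug than an authentication error surfacing several steps later.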
Step 2: Create Sample Dataset
We build a small synthetic CSV in memory; no external files are required. Each row is a fictitious patient support record containing PII/PHI fields that the transform job will detect and mask.
Step 3: Upload the Dataset
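The preparation for the upload can be sketched with the standard library alone: build the fictitious CSV in memory, then wrap it in a ZIP buffer. The column names, sample values, and file name below are illustrative assumptions, and the upload call itself is omitted because its exact signature is not shown on this page.

```python
import csv
import io
import zipfile

# Entirely fictitious records; the column names are illustrative assumptions.
ROWS = [
    {"record_id": "1",
     "note": "Patient Jane Doe (DOB 1984-03-12) called from 555-0100 about a refill."},
    {"record_id": "2",
     "note": "John Roe, SSN 000-00-0000, asked to update email to john.roe@example.com."},
]


def build_csv_bytes(rows):
    """Serialize a list of dicts to UTF-8 CSV bytes without touching disk."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def wrap_in_zip(filename, payload):
    """Return a BytesIO holding a ZIP archive with a single member."""
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(filename, payload)
    zip_buf.seek(0)  # rewind so the buffer can be read from the start
    return zip_buf


zip_buffer = wrap_in_zip("patient_support.csv", build_csv_bytes(ROWS))
```

The resulting `zip_buffer` is a file-like object, which is the shape most upload helpers accept.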
Wrap the synthetic CSV in a ZIP buffer and upload it as a seed dataset. If a dataset with the same name already exists, it is reused, which makes the step idempotent.
Step 4: Create a Transform Job
Create a transform job that will scan every row in the dataset and replace detected entities with masked tokens.
Detection method: `gliner+heuristics` (the recommended setting). It combines the AIMon-PII-M1 neural model with regex-based heuristic rules for high precision and recall.
PII types targeted (see the full entity catalogue):
| Category | Entity types |
|---|---|
| Personal | first_name, last_name, date_of_birth |
| Contact | email, phone_number, street_address |
| Financial | ssn |
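The job settings above can be collected and sanity-checked before submission. This is a minimal sketch: `build_transform_config` is a hypothetical helper, the returned dict only mirrors the `pii_types` and `detection_method` names used on this page, and the set of detection methods contains only the values mentioned here, which may not be exhaustive.

```python
# Entity types targeted in this tutorial, grouped as in the table above.
TARGETED_TYPES = {
    "Personal": ["first_name", "last_name", "date_of_birth"],
    "Contact": ["email", "phone_number", "street_address"],
    "Financial": ["ssn"],
}

# Detection methods named on this page (assumed, not necessarily complete).
DETECTION_METHODS = {"gliner+heuristics", "heuristics", "all"}


def build_transform_config(types_by_category, detection_method="gliner+heuristics"):
    """Flatten the category table into a list and validate the method name."""
    if detection_method not in DETECTION_METHODS:
        raise ValueError(f"Unknown detection method: {detection_method!r}")
    pii_types = [t for types in types_by_category.values() for t in types]
    if len(set(pii_types)) != len(pii_types):
        raise ValueError("Duplicate entity type in configuration")
    return {"detection_method": detection_method, "pii_types": pii_types}


config = build_transform_config(TARGETED_TYPES)
```

The two values in `config` are what would be handed to the job-creation call.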
Step 5: Poll Until Job Completes
The transform job runs asynchronously. We poll every 10 seconds until it reaches `SUCCEEDED` or `FAILED`.
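The polling pattern can be sketched independently of the SDK. Here `get_status` is a hypothetical zero-argument callable standing in for whatever SDK call returns the job's current status string. The timeout is an added safeguard that the text above does not mention.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def poll_until_done(get_status, interval_s=10.0, timeout_s=1800.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call `get_status` every `interval_s` seconds until a terminal state.

    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job still {status!r} after {timeout_s} seconds")
        sleep(interval_s)
```

Returning the final status instead of raising on `FAILED` leaves the caller free to decide how to handle a failed job.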
Step 6: List All Transform Jobs
Retrieve all transform jobs for your company account (newest first). This is useful for auditing past anonymization runs.
Step 7: Retrieve Full Job Details
Fetch the complete job record, including dataset metadata, configuration parameters, and timing information.
Step 8: Inspect Masked Sample Content
Retrieve the anonymized content for every sample in the dataset. Each response includes the masked text and a per-entity-type count summary.
Step 9: Download All Transformed Files as ZIP
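As a rough sketch of the download step, the code below assumes the presigned URL has already been obtained as a plain string; how the SDK returns it is not shown on this page. Both function names are hypothetical, and only the standard library is used.

```python
import io
import urllib.request
import zipfile


def download_bytes(url, timeout_s=60):
    """Fetch the full response body from `url` (for example a presigned link)."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read()


def read_zip_members(zip_bytes):
    """Return {member name: content bytes} for every file in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            name: zf.read(name)
            for name in zf.namelist()
            if not name.endswith("/")  # skip directory entries
        }
```

A typical use would be `files = read_zip_members(download_bytes(url))`, run soon after the URL is issued since presigned links expire.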
Retrieve a presigned S3 URL and download all anonymized files as a single ZIP archive. The URL is valid for 1 hour.
Results
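A cross-sample entity summary can be approximated by counting mask tokens in the masked text. The sketch below assumes each sample's masked text is available as a string and that tokens follow the `<UPPERCASE NAME>` form listed in the appendix. It does not rely on the per-entity counts the service returns.

```python
import re
from collections import Counter

# Matches tokens such as <EMAIL> or <FIRST NAME>; captures the inner label.
MASK_TOKEN = re.compile(r"<([A-Z][A-Z ]*)>")


def summarize_entities(masked_texts):
    """Count mask tokens per label across an iterable of masked strings."""
    totals = Counter()
    for text in masked_texts:
        totals.update(MASK_TOKEN.findall(text))
    return totals
```

Calling `summarize_entities(texts).most_common()` gives the labels ordered by frequency, which is a convenient shape for a results table.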
Entity detection summary across all samples.
What's Next?
- Adjust `pii_types`: add or remove entity types from the full catalogue to target exactly the entities relevant to your use case
- Try a different `detection_method`: switch to `heuristics` for faster runs, or `all` for maximum coverage
- Use your own data: replace the synthetic CSV with a real dataset via `create_with_files` or `create_from_zip`
- Scale up: the same workflow supports `MULTI_FILE` and `MULTI_FOLDER` datasets; pass multiple file handles or use `dataset_type="MULTI_FOLDER"` with `folder_names`
- Integrate downstream: the anonymized ZIP can be stored in S3, fed into further processing pipelines, or used as safe input to your LLM workflows
Multi-Folder Workflow: Generate multi-file synthetic datasets from seed data
API Reference: Full endpoint documentation
Appendix: Available PII/PHI Types
Pass any combination of the keys below as the `pii_types` argument when creating a transform job. Each key maps to a default mask token shown in the "Masked as" column.
Personal
| Key | Masked as |
|---|---|
| first_name | <FIRST NAME> |
| last_name | <LAST NAME> |
| date_of_birth | <DOB> |
| date | <DATE> |
| age | <AGE> |
| gender | <GENDER> |
| nationality | <NATIONALITY> |
| race_ethnicity | <RACE ETHNICITY> |
| marital_status | <MARITAL STATUS> |
Contact
| Key | Masked as |
|---|---|
| email | <EMAIL> |
| phone_number | <PHONE> |
| street_address | <ADDRESS> |
| postcode | <ZIP> |
| city | <CITY> |
| state | <STATE> |
| country | <COUNTRY> |
Financial
| Key | Masked as |
|---|---|
| ssn | <SSN> |
| credit_debit_card | <CREDIT CARD> |
| bank_routing_number | <BANK ROUTING> |
| routing_number | <ROUTING> |
| tax_id | <TAX ID> |
| iban | <IBAN> |
Digital
| Key | Masked as |
|---|---|
| ipv4 | <IP ADDRESS> |
| url | <URL> |
| user_name | <USERNAME> |
| password | <PASSWORD> |
| mac_address | <MAC ADDRESS> |
| device_identifier | <DEVICE ID> |
Identity Documents
| Key | Masked as |
|---|---|
| passport_number | <PASSPORT> |
| certificate_license_number | <LICENSE> |
| national_id | <NATIONAL ID> |
| voter_id | <VOTER ID> |
Medical / PHI
| Key | Masked as |
|---|---|
| medical_record_number | <MRN> |
| diagnosis | <DIAGNOSIS> |
| medication | <MEDICATION> |
| health_plan_beneficiary_number | <HEALTH PLAN> |
| patient_id | <PATIENT ID> |
| lab_result | <LAB RESULT> |
Professional
| Key | Masked as |
|---|---|
| company_name | <COMPANY> |
| occupation | <OCCUPATION> |
| employee_id | <EMPLOYEE ID> |
| salary | <SALARY> |
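To make the key-to-token mapping concrete, here is a toy regex-only masker for three of the keys above. It is an illustration under simplifying assumptions, not the service's detection logic: the patterns are deliberately naive, and `mask_text` is a hypothetical helper.

```python
import re

# (key, default mask token from the tables above, illustrative pattern)
RULES = [
    ("email", "<EMAIL>", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("ssn", "<SSN>", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("ipv4", "<IP ADDRESS>", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def mask_text(text, pii_types):
    """Replace matches for the selected keys and return (masked text, counts)."""
    counts = {}
    for key, token, pattern in RULES:
        if key not in pii_types:
            continue  # only mask the entity types the caller asked for
        text, n = pattern.subn(token, text)
        if n:
            counts[key] = n
    return text, counts
```

Selecting a subset of keys leaves the other entities untouched, which mirrors how `pii_types` narrows what a transform job masks.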

