# PII/PHI Anonymization

> Detect and mask Personally Identifiable Information and Protected Health Information in your datasets using the DataFramer SDK

<Card title="Open in Google Colab" icon="book" href="https://colab.research.google.com/github/aimonlabs/dataframer-docs-public/blob/main/anonymize-workflow.ipynb" horizontal>
  Run this exact tutorial interactively in Google Colab
</Card>

# DataFramer SDK — PII/PHI Anonymization

This notebook demonstrates how the [DataFramer](https://dataframer.ai) Python [SDK](https://pypi.org/project/pydataframer) (pip package: `pydataframer`) detects and masks **Personally Identifiable Information (PII)** and **Protected Health Information (PHI)** in your datasets.

We will walk through the complete anonymization workflow:

* Upload a seed dataset containing sensitive data
* Create an anonymization run to detect and mask PII/PHI entities
* Inspect the masked output sample by sample
* Download all anonymized files as a ZIP archive

The anonymization pipeline uses the `AIMon-PII-M1` model combined with heuristic rules (`aimon_pii_m1+heuristics`) for high-accuracy detection across names, emails, phone numbers, SSNs, dates of birth, and more.
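
To give a feel for what the heuristic half of such a pipeline can look like, here is a minimal regex-based masking sketch. The patterns and the `mask_with_regex` helper are hypothetical and written for this illustration only; they are not DataFramer's actual rules, which are not documented on this page.

```python
import re

# Hypothetical patterns for illustration; not the rules used by the service.
PATTERNS = {
    "<SSN>": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "<EMAIL>": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "<PHONE>": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def mask_with_regex(text: str) -> str:
    """Replace every match of each pattern with its mask token."""
    for token, pattern in PATTERNS.items():
        text = pattern.sub(token, text)
    return text


masked = mask_with_regex(
    "Contact at m.garcia@healthnet.org or 555-234-5678. SSN 987-65-4321."
)
print(masked)
```

Fixed-format values such as SSNs and phone numbers suit patterns like these. Names have no fixed format, which is one reason to pair rules with a learned model.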

### Step 1: Install and Setup SDK

Install the [DataFramer SDK](https://pypi.org/project/pydataframer) and additional utilities.

```python theme={null}
%%capture
%pip install --upgrade pydataframer tenacity pandas requests
```

A DataFramer API key is required. Retrieve yours from **Account → Keys → Copy API Key** on the [web application](https://app.dataframer.ai) and add it as a Colab secret named `DATAFRAMER_API_KEY`.

```python theme={null}
import os
from google.colab import userdata

os.environ['DATAFRAMER_API_KEY'] = userdata.get('DATAFRAMER_API_KEY')
```
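
Outside Colab, `google.colab` is not importable. The sketch below shows one possible fallback, assuming the key has been exported as the environment variable `DATAFRAMER_API_KEY`. The `load_api_key` helper is hypothetical and not part of the SDK.

```python
import os


def load_api_key(env=None, name="DATAFRAMER_API_KEY"):
    """Return the API key from an environment mapping, or raise a clear error."""
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the tutorial")
    return key


# Demonstration with an explicit mapping instead of the real environment.
print(load_api_key({"DATAFRAMER_API_KEY": "sk-example"})[:4] + "...")
```

Failing early with a named error is easier to debug than a `KeyError` raised later during client construction.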

```python theme={null}
import io
import os
import zipfile
from datetime import datetime, timezone
from pathlib import Path

import dataframer
import pandas as pd
import requests
from dataframer import Dataframer
from tenacity import retry, retry_if_result, stop_never, wait_fixed

run_id = datetime.now().strftime("%Y%m%d_%H%M%S")

client = Dataframer(
    api_key=os.environ["DATAFRAMER_API_KEY"],
)

print("✅ Dataframer client initialized successfully")
print(f"   SDK version: {dataframer.__version__}")
print(f"   API key:     {client.api_key[:4]}...")
print(f"   run_id:      {run_id}")
```

```
✅ Dataframer client initialized successfully
   SDK version: 0.7.0
   API key:     sk-a...
   run_id:      20260320_143022
```

### Step 2: Create Sample Dataset

We build a small synthetic CSV in memory, so no external files are required. Each row is a fictitious patient support record containing PII/PHI fields that the anonymization run will detect and mask.

```python theme={null}
CSV_DATA = """\
patient_id,first_name,last_name,email,phone,dob,ssn,notes
P001,John,Smith,john.smith@email.com,555-867-5309,1985-03-14,123-45-6789,Patient John Smith reports mild chest pain. Born 1985-03-14. SSN 123-45-6789.
P002,Maria,Garcia,m.garcia@healthnet.org,555-234-5678,1972-11-28,987-65-4321,Follow-up for Maria Garcia. Contact at m.garcia@healthnet.org or 555-234-5678.
P003,David,Lee,dlee@provider.com,555-321-0987,1990-07-04,456-78-9012,Patient David Lee (DOB 1990-07-04). SSN on file: 456-78-9012.
P004,Sarah,Johnson,sarah.j@clinic.org,555-456-7890,1968-12-25,789-01-2345,Emergency contact for Sarah Johnson: 555-456-7890. Email sarah.j@clinic.org
P005,Robert,Chen,r.chen@medcenter.com,555-654-3210,1955-09-18,321-54-9876,Mr. Robert Chen (DOB 1955-09-18) was seen on 2024-01-15. SSN 321-54-9876.
"""

pd.set_option("display.max_colwidth", 80)
df_preview = pd.read_csv(io.StringIO(CSV_DATA))
print(f"Sample dataset: {len(df_preview)} rows")
df_preview
```

```
Sample dataset: 5 rows
```

### Step 3: Upload the Dataset

Wrap the synthetic CSV in a ZIP buffer and upload it as a seed dataset. If a dataset with the same name already exists (for example, when the cell is re-run), it is reused, which makes this step idempotent.

```python theme={null}
dataset_name = f"anonymize_seed_{run_id}"


def _find_existing_dataset(name):
    all_datasets = client.dataframer.seed_datasets.list()
    return next((d for d in all_datasets if d.name == name), None)


zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("patient_records.csv", CSV_DATA)
zip_buffer.seek(0)

try:
    dataset = client.dataframer.seed_datasets.create_from_zip(
        name=dataset_name,
        description="Synthetic patient support records for PII anonymization demo",
        zip_file=zip_buffer,
    )
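except Exception as e:
    # Treat only an "already exists" error as a name clash; re-raise anything else.
    if "already exists" not in str(e):
        raise
    dataset = _find_existing_dataset(dataset_name)
    if dataset is None:
        # The lookup found nothing, so surface the original error.
        raise
    print(f"  ℹ️  Dataset '{dataset_name}' already exists — reusing it")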

dataset_id = dataset.id

print("✅ Dataset ready")
print(f"   ID:         {dataset_id}")
print(f"   Name:       {dataset.name}")
print(f"   File count: {dataset.file_count}")
```

```
✅ Dataset ready
   ID:         a1b2c3d4-e5f6-7890-abcd-ef1234567890
   Name:       anonymize_seed_20260320_143022
   File count: 1
```

### Step 4: Create an Anonymization Run

Create an anonymization run that will scan every row in the dataset and replace detected entities with masked tokens.

**Detection method**: `aimon_pii_m1+heuristics` — the recommended setting. Combines the `AIMon-PII-M1` neural model with regex-based heuristic rules for high precision and recall.

**PII types targeted** (see the [full entity catalogue](#appendix-piiphi-types)):

| Category  | Entity types                               |
| --------- | ------------------------------------------ |
| Personal  | `first_name`, `last_name`, `date_of_birth` |
| Contact   | `email`, `phone_number`, `street_address`  |
| Financial | `ssn`                                      |

```python theme={null}
PII_TYPES = [
    "first_name",
    "last_name",
    "email",
    "phone_number",
    "street_address",
    "date_of_birth",
    "ssn",
]

run = client.dataframer.anonymization_runs.create(
    dataset_id=dataset_id,
    pii_types=PII_TYPES,
    detection_method="aimon_pii_m1+heuristics",
)

run_id = run.id

print("✅ Anonymization run created")
print(f"   ID: {run_id}")
```

```
✅ Anonymization run created
   ID: f1e2d3c4-b5a6-7890-1234-abcdef567890
```

### Step 5: Poll Until Run Completes

The anonymization run executes asynchronously. We poll every 10 seconds until it reaches `SUCCEEDED` or `FAILED`.

```python theme={null}
def run_not_finished(result):
    return result.status not in ("SUCCEEDED", "FAILED")


@retry(wait=wait_fixed(10), retry=retry_if_result(run_not_finished), stop=stop_never)
def poll_run_status(client, run_id):
    result = client.dataframer.anonymization_runs.retrieve(run_id)
    print(
        f"[{datetime.now(timezone.utc).isoformat(timespec='seconds')}] "
        f"Run {run_id[:8]}... status: {result.status}",
        flush=True,
    )
    return result


print("Polling for run completion (this may take several minutes)...")
run_result = poll_run_status(client, run_id)

if run_result.status == "FAILED":
    error = (run_result.results or {}).get("error", "unknown error")
    raise RuntimeError(f"Anonymization run failed: {error}")

results = run_result.results or {}
entity_summary = (results.get("entity_summary") or {}).get("overall", {})

print(f"\n✅ Anonymization run completed successfully!")
print(f"   Files processed:       {results.get('samples_processed', 0)}")
print(f"   Total entities redacted: {sum(entity_summary.values())}")
print(f"   Entity breakdown:      {entity_summary}")
```

```
Polling for run completion (this may take several minutes)...
[2026-03-20T14:30:35+00:00] Run f1e2d3c4... status: PENDING
[2026-03-20T14:30:45+00:00] Run f1e2d3c4... status: PENDING
[2026-03-20T14:30:55+00:00] Run f1e2d3c4... status: SUCCEEDED

✅ Anonymization run completed successfully!
   Files processed:       1
   Total entities redacted: 35
   Entity breakdown:      {'first_name': 5, 'last_name': 5, 'email': 5, 'phone_number': 5, 'date_of_birth': 10, 'ssn': 5}
```
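
The polling loop above uses `stop_never`, so it keeps polling indefinitely if a run never reaches a terminal status. The sketch below shows a bounded variant using only the standard library. The `poll_until_done` helper, its default timeout, and the scripted status sequence are assumptions made for illustration.

```python
import time

TERMINAL_STATUSES = ("SUCCEEDED", "FAILED")


def poll_until_done(fetch_status, interval=10.0, timeout=1800.0, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        # Give up if the next poll would land after the deadline.
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"run still {status} after {timeout:.0f}s")
        sleep(interval)


# Stand-in for the real status lookup: a scripted sequence of statuses.
fake_statuses = iter(["PENDING", "PENDING", "SUCCEEDED"])
final = poll_until_done(lambda: next(fake_statuses), interval=0.01, timeout=1.0)
print(final)
```

With the SDK, `fetch_status` could be `lambda: client.dataframer.anonymization_runs.retrieve(run_id).status`. If you prefer to stay with `tenacity`, replacing `stop_never` with `stop_after_delay(...)` gives a similar bound.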

### Step 6: List All Anonymization Runs

Retrieve all anonymization runs for your company account (newest first). Useful for auditing past anonymization runs.

```python theme={null}
print("=" * 80)
print("📋 All Anonymization Runs")
print("=" * 80)

all_runs = client.dataframer.anonymization_runs.list()
print(f"Found {len(all_runs)} anonymization run(s)\n")

for i, r in enumerate(all_runs[:5], 1):
    print(f"  Run {i}:")
    print(f"    ID:      {r.id}")
    print(f"    Status:  {r.status}")
    print(f"    Created: {r.created_at}")
    print()

if len(all_runs) > 5:
    print(f"  ... and {len(all_runs) - 5} more")
```

```
================================================================================
📋 All Anonymization Runs
================================================================================
Found 1 anonymization run(s)

  Run 1:
    ID:      f1e2d3c4-b5a6-7890-1234-abcdef567890
    Status:  SUCCEEDED
    Created: 2026-03-20 14:30:22.123456+00:00
```

### Step 7: Retrieve Full Run Details

Fetch the complete run record including dataset metadata, configuration parameters, and timing information.

```python theme={null}
print("=" * 80)
print("📄 Run Details")
print("=" * 80)

run_details = client.dataframer.anonymization_runs.retrieve(run_id)

print(f"Run ID:           {run_details.id}")
print(f"Status:           {run_details.status}")
print(f"Dataset:          {run_details.dataset_name} ({run_details.dataset_id})")
print(f"Detection method: {run_details.detection_method}")
print(f"PII types:        {run_details.pii_types}")
print(f"Duration:         {run_details.duration_seconds}s")
print(f"Completed:        {run_details.completed_at}")
```

```
================================================================================
📄 Run Details
================================================================================
Run ID:           f1e2d3c4-b5a6-7890-1234-abcdef567890
Status:           SUCCEEDED
Dataset:          anonymize_seed_20260320_143022 (a1b2c3d4-e5f6-7890-abcd-ef1234567890)
Detection method: aimon_pii_m1+heuristics
PII types:        ['first_name', 'last_name', 'email', 'phone_number', 'street_address', 'date_of_birth', 'ssn']
Duration:         18s
Completed:        2026-03-20 14:30:42.000000+00:00
```

### Step 8: Inspect Masked Sample Content

Download each anonymized file and display a preview. Entity counts come from the run's `results` (retrieved in Step 5).

```python theme={null}
print("=" * 80)
print("🔍 Masked Sample Content")
print("=" * 80)

anonymized_files = run_result.anonymized_files or []

for file in anonymized_files:
    download_info = client.dataframer.anonymization_runs.files.download(run_id, file_id=file.id)

    # Fetch the actual file content via the presigned URL
    file_response = requests.get(download_info.download_url, timeout=60)
    file_response.raise_for_status()
    content = file_response.text

    print(f"\nFile {file.id} — {download_info.file_name}")
    print(f"  Content type: {download_info.content_type}")
    print(f"  Size:         {file.size_in_bytes} bytes")
    print(f"  Masked preview:")
    print(f"  {content[:400]!r}")

# Entity summary from run results
entity_summary = (results.get("entity_summary") or {}).get("overall", {})
entity_rows = [{"entity_type": k, "count": v} for k, v in entity_summary.items()]
```

```
================================================================================
🔍 Masked Sample Content
================================================================================

File abc123 — patient_records.csv
  Content type: text/csv
  Size:         1234 bytes
  Masked preview:
  'patient_id,first_name,last_name,email,phone,dob,ssn,notes\nP001,<FIRST NAME>,<LAST NAME>,<EMAIL>,<PHONE>,<DOB>,<SSN>,Patient <FIRST NAME> <LAST NAME> reports mild chest pain. Born <DOB>. SSN <SSN>.\nP002,<FIRST NAME>,<LAST NAME>,<EMAIL>,<PHONE>,<DOB>,<SSN>,Follow-up for <FIRST NAME> <LAST NAME>. Contact at <EMAIL> or <PHONE>.'
```
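
As a cross-check on the reported entity counts, you can count the mask tokens in the downloaded text yourself. The sketch below assumes every token has the `<UPPERCASE WORDS>` shape seen in the preview above. The `count_mask_tokens` helper is hypothetical, and the input is an excerpt of that preview rather than a live download.

```python
import re
from collections import Counter

# Excerpt of the masked preview shown above, used as stand-in input.
masked_text = (
    "P001,<FIRST NAME>,<LAST NAME>,<EMAIL>,<PHONE>,<DOB>,<SSN>,"
    "Patient <FIRST NAME> <LAST NAME> reports mild chest pain. Born <DOB>. SSN <SSN>.\n"
)

# Assumed token shape: angle brackets around uppercase letters and spaces.
MASK_TOKEN = re.compile(r"<[A-Z][A-Z ]*>")


def count_mask_tokens(text: str) -> Counter:
    """Count occurrences of each mask token in the text."""
    return Counter(MASK_TOKEN.findall(text))


counts = count_mask_tokens(masked_text)
for token, n in counts.most_common():
    print(f"{token:<14} {n}")
```

In the loop above you would pass `content` instead of `masked_text`. Counts derived this way may differ from the run's entity summary if your data already contains text in angle brackets.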

### Step 9: Download All Anonymized Files as ZIP

Retrieve a presigned URL and download all anonymized files as a single ZIP archive. The URL is valid for 1 hour.

```python theme={null}
print("=" * 80)
print("📥 Downloading ZIP archive")
print("=" * 80)

download = client.dataframer.anonymization_runs.download(run_id)
print("Presigned URL obtained")
print(f"  URL:    {download.download_url}")
print(f"  Status: {download.status}")

zip_response = requests.get(download.download_url, timeout=60)
zip_response.raise_for_status()

output_file = Path(f"anonymized_{run_id[:8]}.zip")
output_file.write_bytes(zip_response.content)
print(f"\n✅ ZIP saved: {output_file.absolute()} ({output_file.stat().st_size:,} bytes)")
```

```
================================================================================
📥 Downloading ZIP archive
================================================================================
Presigned URL obtained
  URL:    https://s3.amazonaws.com/...
  Status: ready

✅ ZIP saved: /content/anonymized_f1e2d3c4.zip (2,048 bytes)
```
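
Before writing the archive to disk, you may want to verify it in memory and list what it contains. The sketch below builds a small ZIP locally to stand in for the downloaded bytes. The `list_zip_members` helper is hypothetical, and the layout of a real archive is not documented here.

```python
import io
import zipfile


def list_zip_members(zip_bytes: bytes) -> dict:
    """Map each member name to its uncompressed size in bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # testzip() returns the first corrupt member name, or None if all are fine.
        bad = zf.testzip()
        if bad is not None:
            raise ValueError(f"corrupt member: {bad}")
        return {info.filename: info.file_size for info in zf.infolist()}


# Build a small archive locally to stand in for the downloaded bytes.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("patient_records.csv", "patient_id,ssn\nP001,<SSN>\n")

members = list_zip_members(buf.getvalue())
print(members)
```

With the real download, pass `zip_response.content` to the helper.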

## Results

The table below summarizes the detected entities across all files, sorted by count.

```python theme={null}
if entity_rows:
    results_df = (
        pd.DataFrame(entity_rows)
        .sort_values("count", ascending=False)
        .reset_index(drop=True)
        .rename(columns={"entity_type": "Entity Type", "count": "Total Detected"})
    )
    print("Entity detection totals across all files:\n")
    print(results_df.to_string(index=False))
else:
    print("No entities detected — check your pii_types configuration.")
```

```
Entity detection totals across all files:

   Entity Type  Total Detected
  date_of_birth             10
    first_name               5
     last_name               5
         email               5
  phone_number               5
           ssn               5
```

## What's Next?

* **Adjust `pii_types`**: add or remove entity types from the [full catalogue](#appendix-piiphi-types) to target exactly the entities relevant to your use case
* **Try a different `detection_method`**: switch to `heuristics` for faster runs, or `all` for maximum coverage
* **Use your own data**: replace the synthetic CSV with a real dataset via `create_with_files` or `create_from_zip`
* **Scale up**: the same workflow supports `MULTI_FILE` and `MULTI_FOLDER` datasets — pass multiple file handles or use `dataset_type="MULTI_FOLDER"` with `folder_names`
* **Integrate downstream**: the anonymized ZIP can be stored in S3, fed into further processing pipelines, or used as safe input to your LLM workflows
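
Before passing anonymized output downstream, it can be worth confirming that no original value survived masking. The sketch below does a rough substring check of selected columns from the original CSV against the masked text. The `find_leaks` helper and the shortened sample data are hypothetical.

```python
import csv
import io


def find_leaks(original_csv: str, masked_text: str, columns) -> list:
    """Return (column, value) pairs from the original CSV still present in the masked text."""
    leaks = []
    for row in csv.DictReader(io.StringIO(original_csv)):
        for col in columns:
            value = (row.get(col) or "").strip()
            if value and value in masked_text:
                leaks.append((col, value))
    return leaks


original = "patient_id,email,ssn\nP001,john.smith@email.com,123-45-6789\n"
masked_ok = "patient_id,email,ssn\nP001,<EMAIL>,<SSN>\n"
masked_bad = "patient_id,email,ssn\nP001,<EMAIL>,123-45-6789\n"

print(find_leaks(original, masked_ok, ["email", "ssn"]))
print(find_leaks(original, masked_bad, ["email", "ssn"]))
```

Substring matching works best for distinctive values such as emails and SSNs. Short values such as a three-letter surname can match unrelated text and produce false positives.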

<CardGroup cols={2}>
  <Card title="Folder Generation" icon="folder-tree" href="/tutorials/multi-folder-workflow">
    Generate multi-file synthetic datasets from seed data
  </Card>

  <Card title="API Reference" icon="code" href="/api-reference">
    Full endpoint documentation
  </Card>
</CardGroup>

***

<h2 id="appendix-piiphi-types">
  Appendix: Available PII/PHI Types
</h2>

Pass any combination of the keys below as the `pii_types` argument when creating an anonymization run. Each key maps to a default mask token shown in the **Masked as** column.

### Personal

| Key              | Masked as          |
| ---------------- | ------------------ |
| `first_name`     | `<FIRST NAME>`     |
| `last_name`      | `<LAST NAME>`      |
| `date_of_birth`  | `<DOB>`            |
| `date`           | `<DATE>`           |
| `age`            | `<AGE>`            |
| `gender`         | `<GENDER>`         |
| `nationality`    | `<NATIONALITY>`    |
| `race_ethnicity` | `<RACE ETHNICITY>` |
| `marital_status` | `<MARITAL STATUS>` |

### Contact

| Key              | Masked as   |
| ---------------- | ----------- |
| `email`          | `<EMAIL>`   |
| `phone_number`   | `<PHONE>`   |
| `street_address` | `<ADDRESS>` |
| `postcode`       | `<ZIP>`     |
| `city`           | `<CITY>`    |
| `state`          | `<STATE>`   |
| `country`        | `<COUNTRY>` |

### Financial

| Key                   | Masked as        |
| --------------------- | ---------------- |
| `ssn`                 | `<SSN>`          |
| `credit_debit_card`   | `<CREDIT CARD>`  |
| `bank_routing_number` | `<BANK ROUTING>` |
| `routing_number`      | `<ROUTING>`      |
| `tax_id`              | `<TAX ID>`       |
| `iban`                | `<IBAN>`         |

### Digital

| Key                 | Masked as       |
| ------------------- | --------------- |
| `ipv4`              | `<IP ADDRESS>`  |
| `url`               | `<URL>`         |
| `user_name`         | `<USERNAME>`    |
| `password`          | `<PASSWORD>`    |
| `mac_address`       | `<MAC ADDRESS>` |
| `device_identifier` | `<DEVICE ID>`   |

### Identity Documents

| Key                          | Masked as       |
| ---------------------------- | --------------- |
| `passport_number`            | `<PASSPORT>`    |
| `certificate_license_number` | `<LICENSE>`     |
| `national_id`                | `<NATIONAL ID>` |
| `voter_id`                   | `<VOTER ID>`    |

### Medical / PHI

| Key                              | Masked as       |
| -------------------------------- | --------------- |
| `medical_record_number`          | `<MRN>`         |
| `diagnosis`                      | `<DIAGNOSIS>`   |
| `medication`                     | `<MEDICATION>`  |
| `health_plan_beneficiary_number` | `<HEALTH PLAN>` |
| `patient_id`                     | `<PATIENT ID>`  |
| `lab_result`                     | `<LAB RESULT>`  |

### Professional

| Key            | Masked as       |
| -------------- | --------------- |
| `company_name` | `<COMPANY>`     |
| `occupation`   | `<OCCUPATION>`  |
| `employee_id`  | `<EMPLOYEE ID>` |
| `salary`       | `<SALARY>`      |
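
Catalogue keys do not always match column names. In the tutorial, the CSV column is `phone` while the key is `phone_number`. The sketch below checks a `pii_types` list locally before a run is created. `KNOWN_PII_TYPES` is a hand-copied subset of the tables above, and `check_pii_types` is a hypothetical helper. How the API itself reacts to an unknown key is not covered on this page.

```python
import difflib

# Hand-copied subset of the catalogue above; extend it with the keys you use.
KNOWN_PII_TYPES = {
    "first_name", "last_name", "date_of_birth", "email",
    "phone_number", "street_address", "postcode", "ssn",
}


def check_pii_types(requested):
    """Map each unknown key to a list holding its closest known key, if any."""
    problems = {}
    for key in requested:
        if key not in KNOWN_PII_TYPES:
            problems[key] = difflib.get_close_matches(key, sorted(KNOWN_PII_TYPES), n=1)
    return problems


problems = check_pii_types(["first_name", "emial", "phone", "ssn"])
for key, suggestions in problems.items():
    hint = f" (did you mean {suggestions[0]!r}?)" if suggestions else ""
    print(f"unknown pii type: {key!r}{hint}")
```

An empty result means every requested key is in the local set. It does not guarantee that the service accepts the list.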
