Open in Google Colab to run this tutorial interactively.
Dataframer SDK Demo
In this notebook we demonstrate how the Dataframer Python SDK (pip package: pydataframer) can be used to generate large volumes of high-quality synthetic data, where each individual sample can be arbitrarily large.
We will specifically demonstrate how to generate Electronic Health Records (EHRs), where each sample represents a patient health record. These can include lab test results, diagnoses, medical opinions, and so on.
Step 1: Install and Setup SDK
Install the Dataframer SDK and a few other useful utilities.
Using base URL: https://df-api.dataframer.ai
Dataframer SDK version: 0.2.1
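For reference, here is a minimal install-and-setup sketch that would produce output like the above. The import path, client class name, and the DATAFRAMER_API_KEY environment variable are assumptions rather than confirmed SDK details; consult the pydataframer documentation for the real entry point.

```python
# Install the SDK in Colab (package name taken from this tutorial):
#   !pip install pydataframer

import os

# Assumed import path, client class, and auth mechanism; check the package docs.
from dataframer import Dataframer

client = Dataframer(
    base_url="https://df-api.dataframer.ai",   # base URL shown in the output above
    api_key=os.environ["DATAFRAMER_API_KEY"],  # assumed environment variable
)
```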
Step 2: Upload data
EHR records are uploaded as multiple folders. Each folder contains multiple files, where each file is a patient health record: a lab test report, a medical opinion from a doctor on the overall case, a case history, and so on. The structure of the folders should look like this:
List All Datasets
This API allows you to list all datasets that have been uploaded across your entire company.
Name: patient_dataset_1
ID: 8cb54375-63ce-485b-b382-561f7239064d
Type: MULTI_FILE
Files: 2 | Folders: 0
Created: 2025-12-09 23:45:06
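A sketch of what the listing call might look like, assuming the client exposes a datasets.list() method and that the returned objects carry attributes matching the printed fields; both are assumptions, not confirmed SDK signatures.

```python
# List every dataset uploaded across the company (method and field names assumed).
datasets = client.datasets.list()
for ds in datasets:
    print("Name:", ds.name)
    print("ID:", ds.id)
    print("Type:", ds.dataset_type)
    print("Files:", ds.file_count, "| Folders:", ds.folder_count)
    print("Created:", ds.created_at)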
Retrieve Dataset Details
This API demonstrates how to retrieve a specific dataset given a dataset ID.
Name: patient_dataset_1
Type: MULTI_FILE
Description: Patient record dataset uploaded from Google Drive ZIP
Created: 2025-12-09 23:45:06.348298+00:00
📁 Contents:
Files: 2
Folders: 0
🔧 Compatibility:
Short samples: ✅
Long samples: ✅
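A corresponding retrieval sketch, assuming a datasets.retrieve() method keyed by the dataset ID from the listing above; the attribute names are likewise assumptions.

```python
# Retrieve one dataset by ID and print the fields shown above.
dataset_id = "8cb54375-63ce-485b-b382-561f7239064d"  # ID from the listing above
dataset = client.datasets.retrieve(dataset_id=dataset_id)

print("Name:", dataset.name)
print("Type:", dataset.dataset_type)
print("Description:", dataset.description)
print("Files:", dataset.file_count, "| Folders:", dataset.folder_count)
```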
Step 3: Generate Specification via AI Analysis
A specification (or “spec”) is a detailed description that captures the structure, patterns, and requirements of your data. Dataframer automatically generates specifications by analyzing your seed data.
Task ID: d44e783e-9bce-48c0-9095-c7a9c729bd5e
Analysis completed successfully! ✅
Spec ID: cd7f343f-4c87-446c-b919-7f4cafb3b6e8
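A sketch of how the analysis might be triggered and polled. Only the resulting spec_id is used later via the specs.retrieve() call named in this notebook; the specs.generate() and tasks.retrieve() method names, the status values, and the result field are assumptions.

```python
import time

# Kick off AI analysis of the seed dataset (assumed method name).
task = client.specs.generate(dataset_id=dataset_id)
print("Task ID:", task.id)

# Poll until the analysis finishes (assumed polling call and status values).
while True:
    task = client.tasks.retrieve(task_id=task.id)
    if task.status in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(10)

spec_id = task.spec_id  # assumed result field holding the generated spec's ID
print("Spec ID:", spec_id)
```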
List All Specs
This API allows you to view all the specifications across the entire company.
Review Generated Specification
For the dataset for which a specification was triggered above, view the results of the specification generation via the specs.retrieve(spec_id=spec_id) API.
Config YAML length: 21232 chars
Data property variations:
• Primary diagnosis category: 10 values
• Patient age group: 5 values
• Patient gender: 2 values
• Admission acuity and initial location: 7 values
• Length of stay: 6 values
• Number of specialists consulted: 5 values
• Procedures performed during hospitalization: 7 values
• Complexity of medication changes at discharge: 4 values
• Social complexity and discharge planning needs: 7 values
• Number of follow-up appointments scheduled: 4 values
• Document formatting style: 5 values
• Hospital course narrative style: 5 values
• Level of medical abbreviation density: 3 values
• Inclusion of prognostic information: 3 values
• Patient education and counseling documentation detail: 3 values
• Complication or adverse event occurrence: 4 values
• Diagnostic certainty at discharge: 4 values
• Severity of primary condition: 4 values
• Number of secondary diagnoses: 4 values
• Hospital system/institution name style: 5 values
• Geographic region indicators: 5 values
• Race/ethnicity mention: 3 values
• Chief complaint presentation style: 3 values
• Lab values presentation format: 4 values
• Discharge instructions emphasis on warning signs: 4 values
• Medication list format at discharge: 4 values
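The retrieval sketch below uses the specs.retrieve(spec_id=spec_id) call named above; the config_yaml attribute and the YAML layout of the data properties are assumptions based on the printed summary.

```python
import yaml

spec = client.specs.retrieve(spec_id=spec_id)
config_yaml = spec.config_yaml                      # assumed attribute name
print(f"Config YAML length: {len(config_yaml)} chars")

# Summarize how many values each data property can take (assumed YAML layout).
config = yaml.safe_load(config_yaml)
for prop in config.get("data_properties", []):
    print(f"• {prop['name']}: {len(prop['values'])} values")
```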
Step 4: Update Specification (Optional)
This cell demonstrates how to programmatically update a given specification. Here, we will add a new data property called Case Severity with the values 'Critical', 'Severe', 'Moderate', and 'Mild', and an expected distribution of [10, 25, 40, 25] across those values.
✓ Updated requirements for medical context
✓ Updated spec. New version: 2
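A sketch of the update: load the spec's config, append the Case Severity property described above, and push a new version. The specs.update() call and the YAML schema are assumptions; adapt them to the real spec format.

```python
import yaml

# Load the current config (specs.retrieve is the call named in this tutorial;
# the config_yaml attribute and YAML layout are assumptions).
spec = client.specs.retrieve(spec_id=spec_id)
config = yaml.safe_load(spec.config_yaml)

# Add the new data property with its expected distribution (percentages).
config.setdefault("data_properties", []).append({
    "name": "Case Severity",
    "values": ["Critical", "Severe", "Moderate", "Mild"],
    "distribution": [10, 25, 40, 25],
})

# Push the change as a new spec version (assumed method name and parameters).
updated = client.specs.update(spec_id=spec_id, config_yaml=yaml.safe_dump(config))
print("Updated spec. New version:", updated.version)
```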
Step 5: Generate Multi-Folder Samples
Once the spec has been generated and finalized after any manual modifications, we will use it to generate synthetic data. Refer to this document for more details on sample generation.
Task ID: df53ce8b-8102-4b49-b9aa-7a0a2b476fa5
Run ID : 063287b0-d01e-4e14-8fc9-bb0cb3ad3436
Status : ACCEPTED
Generation completed successfully! ✅
Run ID : 063287b0-d01e-4e14-8fc9-bb0cb3ad3436
Run state: SUCCEEDED
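A generation sketch, assuming a runs.create()/runs.retrieve() pair; the state values ACCEPTED and SUCCEEDED come from the output above, while the method names and the num_samples parameter are assumptions.

```python
import time

# Launch a generation run from the finalized spec (assumed method and parameter).
run = client.runs.create(spec_id=spec_id, num_samples=3)
print("Run ID :", run.id)
print("Status :", run.state)

# Poll until the run reaches a terminal state (assumed polling call).
while run.state not in ("SUCCEEDED", "FAILED"):
    time.sleep(30)
    run = client.runs.retrieve(run_id=run.id)
print("Run state:", run.state)
```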
Step 6: Evaluate Generated Samples
While Dataframer evaluates each sample as it is generated, it also supports a post-generation evaluation. This API shows how to evaluate the generated dataset. Read the documentation for more details.
Evaluation ID: c3c00680-ca98-4611-b0f1-ffe7b1b6a478
Created at : 2025-12-09 23:51:21.646228+00:00
Evaluation completed successfully! ✅
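A post-generation evaluation sketch, assuming evaluations.create() and evaluations.retrieve() methods keyed by the run ID; the names and status values are assumptions.

```python
import time

# Trigger an evaluation of the completed run (assumed method names and fields).
evaluation = client.evaluations.create(run_id=run.id)
print("Evaluation ID:", evaluation.id)
print("Created at   :", evaluation.created_at)

# Poll until the evaluation finishes.
while evaluation.status not in ("SUCCEEDED", "FAILED"):
    time.sleep(15)
    evaluation = client.evaluations.retrieve(evaluation_id=evaluation.id)
```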
Step 7: Download Generated Folders with Metadata
List Generated Files
This API lists all the files that were present in the generated dataset.
Total files: 3
📄 File 1:
Name: generated_sample_1.txt
ID: txt_sample_1
Status: Completed
Size: 16733 bytes
Type: text/plain
Details: Completed successfully in 0 iterations
Model: litellm/anthropic/claude-sonnet-4-5-20250929
📄 File 2:
Name: generated_sample_2.txt
ID: txt_sample_2
Status: Completed
Size: 14441 bytes
Type: text/plain
Details: Completed successfully in 0 iterations
Model: litellm/anthropic/claude-sonnet-4-5-20250929
📄 File 3:
Name: generated_sample_3.txt
ID: txt_sample_3
Status: Completed
Size: 18705 bytes
Type: text/plain
Details: Completed successfully in 0 iterations
Model: litellm/anthropic/claude-sonnet-4-5-20250929
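A listing sketch, assuming the run exposes a list_files()-style call and that each file object carries the attributes printed above; both are assumptions.

```python
# List the files produced by the run (method and attribute names assumed).
files = client.runs.list_files(run_id=run.id)
print("Total files:", len(files))
for i, f in enumerate(files, start=1):
    print(f"📄 File {i}: {f.name} ({f.size} bytes, {f.content_type}), status: {f.status}")
```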
Download All Files as ZIP
This API allows you to download all the generated files as a compressed ZIP file.
📦 ZIP file: /content/generated_samples_7528a30a-a46a-4042-9b82-8358872d37d9.zip
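A download sketch, assuming the SDK returns the ZIP contents as raw bytes from a download_zip()-style call; the method name and return type are assumptions.

```python
# Download all generated files as one ZIP archive and write it to disk.
zip_path = f"/content/generated_samples_{run.id}.zip"
zip_bytes = client.runs.download_zip(run_id=run.id)   # assumed: returns bytes
with open(zip_path, "wb") as fh:
    fh.write(zip_bytes)
print("📦 ZIP file:", zip_path)
```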
Cleanup (Optional)
Uncomment the code to delete the spec and datasets generated in this notebook.
Deletion is commented out for safety
Uncomment the code above to delete when ready
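A cleanup sketch, with the delete calls left commented out to mirror the notebook's safety default; specs.delete() and datasets.delete() are assumed method names.

```python
# Cleanup calls, commented out for safety. Uncomment when you are ready to
# delete the spec and dataset created in this notebook.
# client.specs.delete(spec_id=spec_id)           # assumed method name
# client.datasets.delete(dataset_id=dataset_id)  # assumed method name
```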

