Run this exact tutorial interactively in Google Colab
Dataframer SDK Demo
In this notebook we demonstrate how the Dataframer Python SDK (pip package: pydataframer) can be used to generate large volumes of high-quality synthetic data in which each sample can be arbitrarily large.
Specifically, we generate Electronic Health Records (EHR), where each sample represents a patient health record. A record can include lab test results, diagnoses, medical opinions, and so on.
Step 1: Install and Setup SDK
Install the Dataframer SDK and a few other useful utilities.
Step 2: Upload data
EHR records are uploaded as multiple folders. Each folder contains multiple files, and each file is a patient health record. A health record can be a lab test report, a medical opinion from a doctor on the overall case, a case history, etc. The structure of the folders should look like this:
List All Datasets
This API allows you to list all datasets that have been uploaded across your entire company.
Retrieve Dataset Details
This API retrieves a specific dataset given its dataset ID.
Step 3: Generate Specification via AI Analysis
A specification (or “spec”) is a detailed description that captures the structure, patterns, and requirements of your data. Dataframer automatically generates specifications by analyzing your seed data.
List All Specs
This API allows you to view all the specifications across the entire company.
Review Generated Specification
To view the results of the specification generation triggered above, call the specs.retrieve(spec_id=spec_id) API.
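If generation is still running when the result is requested, a small polling loop around the retrieve call can help. The following is a minimal sketch, not the SDK's documented workflow: wait_for_spec is a hypothetical helper, and the status field and its "pending"/"done" values are assumptions used only to drive the example. A stub stands in for the real call so the code runs without credentials.

```python
import time
from types import SimpleNamespace


def wait_for_spec(retrieve, spec_id, is_ready, timeout_s=300.0, interval_s=5.0):
    """Call retrieve(spec_id=...) until is_ready(spec) is true or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        spec = retrieve(spec_id=spec_id)
        if is_ready(spec):
            return spec
        if time.monotonic() >= deadline:
            raise TimeoutError(f"spec {spec_id!r} not ready after {timeout_s} seconds")
        time.sleep(interval_s)


# Stub responses: the first call reports unfinished work, the second a finished spec.
# The "status" attribute and its values are hypothetical, not taken from the SDK.
_responses = iter([
    SimpleNamespace(id="spec-123", status="pending"),
    SimpleNamespace(id="spec-123", status="done"),
])


def fake_retrieve(spec_id):
    """Stand-in for the real retrieve call; ignores spec_id and replays the stubs."""
    return next(_responses)


spec = wait_for_spec(
    fake_retrieve,
    spec_id="spec-123",
    is_ready=lambda s: s.status == "done",  # replace with a check that fits the real response
    timeout_s=5.0,
    interval_s=0.01,
)
print(spec.status)
```

With the real SDK, the first argument would be the retrieve method of whatever client object the notebook constructs, and is_ready would inspect whichever field of the returned spec signals completion.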
Step 4: Update Specification (Optional)
This cell demonstrates how to programmatically update a given specification. Here, we add a new data property called Case Severity with values 'Critical', 'Severe', 'Moderate', 'Mild' and expected distributions of [10, 25, 40, 25] for these values, respectively.
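Before sending such an update, it can be worth checking that the property is internally consistent. The sketch below uses plain Python only. The make_property helper and the dict keys are hypothetical and do not reflect the schema the SDK expects, and reading the distribution as percentages that sum to 100 is an assumption based on the numbers above.

```python
def make_property(name, values, distribution):
    """Build a property description after checking values against percentages."""
    if len(values) != len(distribution):
        raise ValueError("each value needs exactly one expected percentage")
    if any(p < 0 for p in distribution):
        raise ValueError("percentages must be non-negative")
    if sum(distribution) != 100:
        raise ValueError(f"percentages sum to {sum(distribution)}, expected 100")
    # Hypothetical layout: adapt the keys to whatever the spec update call expects.
    return {"name": name, "values": list(values), "distribution": list(distribution)}


case_severity = make_property(
    "Case Severity",
    ["Critical", "Severe", "Moderate", "Mild"],
    [10, 25, 40, 25],
)

# Show each value next to its expected share of generated samples.
for value, pct in zip(case_severity["values"], case_severity["distribution"]):
    print(f"{value}: {pct}%")
```

Validating up front means a mismatched list length or a distribution that does not sum to 100 is caught locally rather than after the specification has been changed.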

