In this notebook, we demonstrate how the DataFramer Python SDK (pip package: pydataframer) can be used to generate large, high-quality synthetic datasets in which each sample can be arbitrarily large. Specifically, we generate Electronic Health Records (EHR), where each sample represents a patient health record. These records can include lab test results, diagnoses, medical opinions, etc.
Initialize the client. A DataFramer API key is required for this step. It can be retrieved in the web application by navigating to Account -> Keys -> Copy API Key.
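To keep the key out of the notebook itself, it can be read from an environment variable before the client is constructed. The following is a minimal sketch, assuming a hypothetical variable name `DATAFRAMER_API_KEY`; the helper `load_api_key` is illustrative and is not part of pydataframer.

```python
import os


def load_api_key(env_var: str = "DATAFRAMER_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} to the key copied from Account -> Keys.")
    return key
```

The returned string can then be passed to the client in whatever way the SDK expects.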
EHR records are uploaded as multiple folders. Each folder contains multiple files, where each file is a patient health record. A health record can be a lab test report, a medical opinion from a doctor on the overall case, a case history, etc. The structure of the folders should look like this:
import io
from datetime import datetime

import requests

# Download the seed ZIP from Google Drive into an in-memory buffer
zip_buffer = io.BytesIO(
    requests.get(
        "https://drive.google.com/uc?export=download&id=1V4mY8_c5lXHUa9pYxmBbzAFfkK8b-PJk"
    ).content
)

# Upload the ZIP as a new seed dataset with a timestamped name
dataset = client.dataframer.seed_datasets.create_from_zip(
    name=f"patient_dataset_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
    description="Patient record dataset uploaded from Google Drive ZIP",
    zip_file=zip_buffer,
)
print(f"Upload complete\nDataset ID: {dataset.id}")

# Store the dataset_id for later use
dataset_id = dataset.id
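If the seed records live on local disk rather than on Google Drive, an equivalent in-memory ZIP can be built with the standard library. This is a sketch only: the helper name `zip_directory_to_buffer` is hypothetical, and whether the resulting archive is accepted depends on the folder layout the service expects.

```python
import io
import zipfile
from pathlib import Path


def zip_directory_to_buffer(root: Path) -> io.BytesIO:
    """Pack every file under `root` into an in-memory ZIP, keeping relative paths."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to `root` so the folder structure is preserved
                archive.write(path, arcname=path.relative_to(root).as_posix())
    buffer.seek(0)  # rewind so the consumer reads from the start
    return buffer
```

The returned buffer plays the same role as `zip_buffer` in the upload call above.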
A specification (or “spec”) is a detailed description that captures the structure, patterns, and requirements of your data. DataFramer automatically generates specifications by analyzing your seed data.
This API allows you to view all the specifications across the entire company.
# Retrieve all specs to get the spec ID
specs = client.dataframer.specs.list()
print("Available specs:")
for spec in specs:
    print(f"  - {spec.name} (ID: {spec.id})")
    print(f"    Dataset: {spec.dataset_name}")
Available specs:
  - Spec for dataset a023c3fb-e6eb-4368-a12a-d3385cb7b6a2 (ID: cce55d46-8741-44a0-9ce0-836961d487f9)
    Dataset: patient_dataset
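Later cells refer to a `spec_id` variable, which has to be picked from this listing. One way to do that is sketched below, assuming only the `id` and `dataset_name` attributes printed above; the helper `find_spec_id` and the stand-in objects are hypothetical and exist only to make the example self-contained.

```python
from types import SimpleNamespace


def find_spec_id(specs, dataset_name_prefix: str) -> str:
    """Return the ID of the first spec whose dataset name starts with the prefix."""
    for spec in specs:
        if (spec.dataset_name or "").startswith(dataset_name_prefix):
            return spec.id
    raise LookupError(f"No spec found for dataset prefix {dataset_name_prefix!r}")


# Stand-in objects mirroring the attributes printed above (name, id, dataset_name)
example_specs = [
    SimpleNamespace(name="Spec A", id="spec-1", dataset_name="other_dataset"),
    SimpleNamespace(name="Spec B", id="spec-2", dataset_name="patient_dataset"),
]
example_spec_id = find_spec_id(example_specs, "patient_dataset")
```

With the real listing, the same helper would be called as `find_spec_id(specs, "patient_dataset")`.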
For the dataset whose specification generation was triggered above, retrieve the result with the specs.retrieve(spec_id=spec_id) API.
import yaml

# Get the spec (latest content_yaml is returned directly)
spec = client.dataframer.specs.retrieve(spec_id=spec_id)
print(f"Content YAML length: {len(spec.content_yaml)} chars")

# Parse the config to see data properties
config = yaml.safe_load(spec.content_yaml)

# Access spec data
spec_data = config.get('spec', config)
print("\nData property variations:")
if 'data_property_variations' in spec_data:
    for prop in spec_data['data_property_variations']:
        print(f"  - {prop['property_name']}: {len(prop['property_values'])} values")
Content YAML length: 21232 chars

Data property variations:
  - Primary diagnosis category: 10 values
  - Patient age group: 5 values
  - Patient gender: 2 values
  - Admission acuity and initial location: 7 values
  - Length of stay: 6 values
  - Number of specialists consulted: 5 values
  - Procedures performed during hospitalization: 7 values
  - Complexity of medication changes at discharge: 4 values
  - Social complexity and discharge planning needs: 7 values
  - Number of follow-up appointments scheduled: 4 values
  - Document formatting style: 5 values
  - Hospital course narrative style: 5 values
  - Level of medical abbreviation density: 3 values
  - Inclusion of prognostic information: 3 values
  - Patient education and counseling documentation detail: 3 values
  - Complication or adverse event occurrence: 4 values
  - Diagnostic certainty at discharge: 4 values
  - Severity of primary condition: 4 values
  - Number of secondary diagnoses: 4 values
  - Hospital system/institution name style: 5 values
  - Geographic region indicators: 5 values
  - Race/ethnicity mention: 3 values
  - Chief complaint presentation style: 3 values
  - Lab values presentation format: 4 values
  - Discharge instructions emphasis on warning signs: 4 values
  - Medication list format at discharge: 4 values
This cell demonstrates how to programmatically update a specification. Here, we add a new data property called Case Severity with the values 'Critical', 'Severe', 'Moderate', and 'Mild', and an expected distribution of [10, 25, 40, 25] across those values, respectively.
# Get the spec (latest content_yaml is returned directly)
spec = client.dataframer.specs.retrieve(spec_id=spec_id)

# Parse the current config
current_config = yaml.safe_load(spec.content_yaml)
spec_data = current_config.get('spec', current_config)

# Example: Add a new data property variation for EHR
if 'data_property_variations' in spec_data:
    new_property = {
        'property_name': 'Case Severity',
        'property_values': ['Critical', 'Severe', 'Moderate', 'Mild'],
        'base_distributions': {
            'Critical': 10,
            'Severe': 25,
            'Moderate': 40,
            'Mild': 25
        },
        'conditional_distributions': {}
    }
    spec_data['data_property_variations'].append(new_property)
    print(f"Added new property: {new_property['property_name']}")

    # Update requirements for medical context
    if 'requirements' in spec_data:
        spec_data['requirements'] += (
            "\n\nGenerated patient cases must maintain medical accuracy "
            "and include appropriate clinical correlations between symptoms, "
            "test results, and diagnoses."
        )
        print("Updated requirements for medical context")

# Convert back to YAML
new_content_yaml = yaml.dump(current_config, default_flow_style=False, sort_keys=False)

# Update the spec (creates a new version automatically)
updated_spec = client.dataframer.specs.update(
    spec_id=spec_id,
    content_yaml=new_content_yaml,
)
print("\nSpec updated successfully")
Added new property: Case Severity
Updated requirements for medical context

Spec updated successfully
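Before sending an edited spec back, it can help to check that a new property is internally consistent. The sketch below assumes that `base_distributions` holds percentages that should sum to 100, as the Case Severity example does, and that every listed value needs an entry; neither rule is confirmed by the SDK, and `validate_property` is a hypothetical helper.

```python
def validate_property(prop: dict, expected_total: float = 100.0) -> None:
    """Raise ValueError if the distribution keys or total look inconsistent.

    Assumes base_distributions holds percentages that should sum to 100.
    """
    values = prop["property_values"]
    dist = prop["base_distributions"]
    missing = [v for v in values if v not in dist]
    extra = [k for k in dist if k not in values]
    if missing or extra:
        raise ValueError(f"Mismatched values: missing={missing}, extra={extra}")
    total = sum(dist.values())
    if abs(total - expected_total) > 1e-9:
        raise ValueError(f"Distribution sums to {total}, expected {expected_total}")


case_severity = {
    "property_name": "Case Severity",
    "property_values": ["Critical", "Severe", "Moderate", "Mild"],
    "base_distributions": {"Critical": 10, "Severe": 25, "Moderate": 40, "Mild": 25},
    "conditional_distributions": {},
}
validate_property(case_severity)  # passes silently: keys match and the total is 100
```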
While DataFramer evaluates each sample as it is generated, it also supports post-generation evaluation. This API evaluates the generated dataset. Read the documentation for more details.
This API allows you to download all the generated files as a compressed ZIP file. The download is asynchronous: the first request triggers ZIP generation, and the same request is then repeated until the download URL is ready.
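from pathlib import Path

import requests
from tenacity import retry, retry_if_result, stop_never, wait_fixed

print("Downloading generated files with metadata as ZIP...")

def download_not_ready(response):
    # The download URL is absent or None until ZIP generation has finished
    return not hasattr(response, 'download_url') or response.download_url is None

# Poll every 2 seconds, with no retry limit, until the download URL is available
@retry(
    wait=wait_fixed(2),
    retry=retry_if_result(download_not_ready),
    stop=stop_never,
    before_sleep=lambda rs: print("  ZIP generation in progress, waiting..."),
)
def poll_download(client, run_id):
    return client.dataframer.runs.files.download_all(run_id=run_id)

response = poll_download(client, run_id)
zip_response = requests.get(response.download_url)
zip_response.raise_for_status()  # fail early instead of saving an error page as a ZIP
output_file = Path(f"generated_samples_{run_id}.zip")
output_file.write_bytes(zip_response.content)
print("\nDownload complete!")
print(f"ZIP file: {output_file.absolute()}")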
Downloading generated files with metadata as ZIP...

Download complete!
ZIP file: /content/generated_samples_7528a30a-a46a-4042-9b82-8358872d37d9.zip
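The polling above never gives up, which can leave a notebook hanging if ZIP generation stalls. A bounded variant using only the standard library is sketched below; `poll_until` and its parameters are hypothetical names for illustration, and the timeout values are arbitrary.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], T],
    is_ready: Callable[[T], bool],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch` repeatedly until `is_ready(result)` is true or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if is_ready(result):
            return result
        # Stop if the next attempt would start after the deadline
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"Not ready after {timeout_s} seconds")
        sleep(interval_s)
```

In this notebook, `fetch` would wrap the `download_all(run_id=run_id)` call and `is_ready` would check that `download_url` is present and not None.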