Multi-folder datasets are used when each sample consists of multiple related files organized in folders. This tutorial covers the complete workflow from creating a multi-folder dataset to generating and evaluating complex multi-file samples using the Dataframer Python SDK.
Each folder represents one complete sample. For a medical EHR dataset, we’ll create two patient case folders:
from pathlib import Path

# Create patient case folders
Path("Dataset/patient_case_001").mkdir(parents=True, exist_ok=True)
Path("Dataset/patient_case_002").mkdir(parents=True, exist_ok=True)

print("✓ Created patient case folders")
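The folders are empty at this point; each one needs its per-patient files before the dataset can be zipped and uploaded. Here is a minimal sketch that seeds both case folders — the filenames and contents (demographics.txt, lab_results.txt, clinical_notes.txt) are hypothetical placeholders for your real EHR files:

from pathlib import Path

# Hypothetical seed files; substitute your real per-patient content
seed_files = {
    "demographics.txt": "Age: 54\nSex: F\n",
    "lab_results.txt": "HbA1c: 7.2%\nLDL: 130 mg/dL\n",
    "clinical_notes.txt": "Patient presents with fatigue and polyuria.\n",
}

for case_folder in sorted(Path("Dataset").iterdir()):
    if case_folder.is_dir():
        for filename, content in seed_files.items():
            (case_folder / filename).write_text(content)
        print(f"✓ Seeded {case_folder.name} with {len(seed_files)} files")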
import io
import zipfile
from pathlib import Path

dataset_name = "EHR_patient_records_demo"
dataset_folder_path = Path("Dataset")  # Folder created above

# Create a ZIP file in memory containing the entire Dataset folder
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zip_file:
    for patient_folder in sorted(dataset_folder_path.iterdir()):
        if patient_folder.is_dir():
            for file_path in sorted(patient_folder.iterdir()):
                if file_path.is_file():
                    # Add file to ZIP with folder structure preserved
                    arcname = f"{patient_folder.name}/{file_path.name}"
                    zip_file.write(file_path, arcname)
                    print(f"📦 Added: {arcname}")

zip_buffer.seek(0)

# Upload ZIP file - backend auto-detects structure and validates
print("\n🚀 Uploading dataset...")
dataset_response = client.dataframer.datasets.create_from_zip(
    name=dataset_name,
    description="Electronic Health Records (EHR) dataset with multiple patient cases",
    zip_file=zip_buffer,
)

print(f"\n✅ Dataset created: {dataset_name}")
print(f"   ID: {dataset_response.id}")
print(f"   Type: {dataset_response.dataset_type} (auto-detected)")
print(f"   Files: {dataset_response.file_count} | Folders: {dataset_response.folder_count}")

# Store the dataset_id for later use
dataset_id = dataset_response.id
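Before moving on, you can double-check exactly what was uploaded by re-reading the in-memory archive:

# Rewind and list the archive contents to verify the folder structure
zip_buffer.seek(0)
with zipfile.ZipFile(zip_buffer) as zf:
    for name in zf.namelist():
        print(f"  • {name}")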
Use an LLM to automatically analyze your dataset and create a spec. This process reads all the files, analyzes patterns in the input data, and generates a spec (a blueprint) that drives synthetic data generation.
# Create a spec by analyzing the dataset
spec_name = f"Spec for {dataset_name}"

analysis_result = client.dataframer.analyze.create(
    dataset_id=dataset_id,
    name=spec_name,
    # Optional: Specify the AI model for analysis
    # Note: This model can be used to generate evals data but not data to train competing models
    analysis_model_name="claude-sonnet-4-5",
    # Optional: Analysis configuration
    extrapolate_values=True,       # Extrapolate new values beyond existing ranges
    generate_distributions=True,   # Generate statistical distributions
)

print(f"✓ Created spec '{spec_name}' via AI analysis")
print(f"  Task ID: {analysis_result.task_id}")
print(f"  Status: {analysis_result.status}")
Recommended AI Models:
claude-sonnet-4-5* (recommended for quality)
claude-sonnet-4-5-thinking*
claude-haiku-4-5* (fast & cheap)
deepseek-ai/DeepSeek-V3.1
moonshotai/Kimi-K2-Instruct
openai/gpt-oss-120b (slow)
deepseek-ai/DeepSeek-R1-0528-tput (slow)
Qwen/Qwen2.5-72B-Instruct-Turbo
* These models can be used to generate evals data but not data to train competing models.
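Analysis runs asynchronously, so wait for the task to complete before working with the spec. Below is a minimal polling sketch; the status-retrieval call used here (client.dataframer.analyze.retrieve) is an assumption — check the SDK reference for the exact task API and status values:

import time

# Assumed task-status call; adjust to your SDK's actual API.
while True:
    task = client.dataframer.analyze.retrieve(task_id=analysis_result.task_id)
    print(f"Analysis status: {task.status}")
    if task.status in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(30)  # Analysis of multi-folder datasets can take several minutes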
# Retrieve all specs to get the spec ID
specs = client.dataframer.specs.list()

print("Available specs:")
for spec in specs:
    print(f"  • {spec.name} (ID: {spec.id})")
    print(f"    Dataset: {spec.dataset_name}")
    print(f"    Latest version: {spec.latest_version}")

# Store the spec_id for generation
spec_id = specs[0].id  # Use your newly created spec
import yaml

# Get the latest version
versions = client.dataframer.specs.versions.list(spec_id=spec_id)
latest_version = client.dataframer.specs.versions.retrieve(
    spec_id=spec_id,
    version_id=versions[0].id,
)

# Parse the current config
current_config = yaml.safe_load(latest_version.config_yaml)
spec_data = current_config.get('spec', current_config)

# Example: Add a new data property variation for EHR
if 'data_property_variations' in spec_data:
    new_property = {
        'property_name': 'Case Severity',
        'property_values': ['Critical', 'Severe', 'Moderate', 'Mild'],
        'base_distributions': {
            'Critical': 10,
            'Severe': 25,
            'Moderate': 40,
            'Mild': 25
        },
        'conditional_distributions': {}
    }
    spec_data['data_property_variations'].append(new_property)
    print(f"✓ Added new property: {new_property['property_name']}")

    # Update requirements for medical context
    if 'requirements' in spec_data:
        spec_data['requirements'] += (
            "\n\nGenerated patient cases must maintain medical accuracy "
            "and include appropriate clinical correlations between symptoms, "
            "test results, and diagnoses."
        )
        print("✓ Updated requirements for medical context")

# Convert back to YAML
new_config_yaml = yaml.dump(current_config, default_flow_style=False, sort_keys=False)

# Update the spec (creates a new version automatically)
updated_spec = client.dataframer.specs.update(
    spec_id=spec_id,
    config_yaml=new_config_yaml,                         # Your edits
    results_yaml=latest_version.results_yaml,            # Historical reference
    orig_results_yaml=latest_version.orig_results_yaml,  # Backup
    runtime_params=latest_version.runtime_params,        # Metadata
)

print(f"\n✓ Updated spec. New version: {updated_spec.latest_version}")
YAML Fields Explained:
config_yaml: Your updated configuration (used for generation) - REQUIRED
results_yaml: Original AI analysis from version 1 (never changes)
orig_results_yaml: Backup of original analysis
runtime_params: Metadata about analysis/generation
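After the update, you can confirm that a new version was created using the same versions API shown above (this assumes the returned list supports len(); adjust if your SDK paginates results):

# List versions again; the update should have appended a new one
versions = client.dataframer.specs.versions.list(spec_id=spec_id)
print(f"Spec now has {len(versions)} version(s)")
print(f"Latest version: {updated_spec.latest_version}")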
# Create a generation run
# Note: The Claude models below can be used to generate evals data
# but not data to train competing models
generation_result = client.dataframer.generate.create(
    spec_id=spec_id,
    generation_model="claude-sonnet-4-5",
    number_of_samples=5,
    sample_type="long",  # Multi-folder requires "long" for proper file relationships

    # Advanced configuration for long samples
    outline_model="claude-sonnet-4-5",
    enable_revisions=True,
    max_revision_cycles=2,
    outline_thinking_budget=2000,
    revision_model="claude-sonnet-4-5",
    revision_thinking_budget=1500,
)

print("✓ Started generation run")
print(f"  Task ID: {generation_result.task_id}")
print(f"  Run ID: {generation_result.run_id}")
print(f"  Status: {generation_result.status}")

# Store for later use
task_id = generation_result.task_id
run_id = generation_result.run_id
Available Generation Models:
claude-sonnet-4-5* (recommended for quality)
claude-sonnet-4-5-thinking*
claude-haiku-4-5* (fast & cheap)
deepseek-ai/DeepSeek-V3.1
moonshotai/Kimi-K2-Instruct
openai/gpt-oss-120b (slow)
deepseek-ai/DeepSeek-R1-0528-tput (slow)
Qwen/Qwen2.5-72B-Instruct-Turbo
* These models can be used to generate evals data but not data to train competing models.
Multi-folder generation requires the "long" sample type for proper file relationships.
Generation takes 5-15 minutes per sample folder.
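Because of that runtime, poll the run until it reaches 'SUCCEEDED' before moving on to evaluation. A minimal sketch; the run-retrieval call (client.dataframer.runs.retrieve) is an assumed method name — check the SDK reference:

import time

# Assumed run-status call; adjust to your SDK's actual API.
while True:
    run = client.dataframer.runs.retrieve(run_id=run_id)
    print(f"Run status: {run.status}")
    if run.status in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(60)  # Expect 5-15 minutes per sample folder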
Once generation is complete, the evaluation step can automatically tag the generated samples with appropriate labels, show how the data is distributed across the various attributes, and run ad hoc queries against the generated dataset.
# Create an evaluation for the completed run
# Note: Run must be in 'SUCCEEDED' status with generated files
print(f"Creating evaluation for run: {run_id}")

evaluation = client.dataframer.evaluations.create(
    run_id=run_id,
    # Note: This model can be used to generate evals data but not data to train competing models
    evaluation_model="claude-sonnet-4-5",
)

print("\nEvaluation created")
print(f"  Evaluation ID: {evaluation.id}")
print(f"  Status: {evaluation.status}")
print(f"  Created at: {evaluation.created_at}")

# Store the evaluation_id for later use
evaluation_id = evaluation.id
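Evaluation also runs asynchronously. A hedged polling sketch, assuming an evaluations.retrieve method exists; consult the SDK reference for the actual call:

import time

# Assumed evaluation-status call; adjust to your SDK's actual API.
while True:
    ev = client.dataframer.evaluations.retrieve(evaluation_id=evaluation_id)
    print(f"Evaluation status: {ev.status}")
    if ev.status in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(30)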
from pathlib import Path

print("📥 Downloading generated files with metadata as ZIP...")

# Download ZIP file from backend
# The ZIP contains:
# - All generated files with folder structure
# - .metadata files with evaluation tags/classifications
# - top_level.metadata with evaluation summary
downloaded_zip = client.dataframer.runs.generated_files.download_all(
    run_id=run_id,
)

# Save ZIP file
output_file = Path(f"generated_samples_{run_id}.zip")
output_file.write_bytes(downloaded_zip.read())

print("\n✅ Download complete!")
print(f"📦 ZIP file: {output_file.absolute()}")
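Once the download finishes, you can unpack the archive and inspect the evaluation metadata with the standard library. This sketch assumes the .metadata files contain JSON; adjust the parsing if your files use a different format:

import json
import zipfile
from pathlib import Path

extract_dir = Path(f"generated_samples_{run_id}")
with zipfile.ZipFile(output_file) as zf:
    zf.extractall(extract_dir)

# Print the evaluation summary (format assumed to be JSON)
summary_path = extract_dir / "top_level.metadata"
if summary_path.exists():
    print(json.loads(summary_path.read_text()))

# List per-file metadata carrying evaluation tags/classifications
for meta in sorted(extract_dir.rglob("*.metadata")):
    print(f"  • {meta.relative_to(extract_dir)}")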
When you’re done with testing, you can delete the dataset. First, delete all specs that reference it:
# ⚠️ Warning: This action cannot be undone. All files will be permanently deleted.

# Step 1: Get all specs for this dataset
all_specs = client.dataframer.specs.list()
dataset_specs = [spec for spec in all_specs if spec.dataset_name == dataset_name]
print(f"Found {len(dataset_specs)} spec(s) referencing this dataset")

# Step 2: Delete all specs that reference this dataset
for spec in dataset_specs:
    print(f"  Deleting spec: {spec.name} (ID: {spec.id})")
    # Uncomment to delete the spec
    # client.dataframer.specs.delete(spec_id=spec.id)
    # print(f"  ✓ Deleted spec {spec.id}")

# Step 3: Delete the dataset
# Note: Cannot delete a dataset that is referenced by any specs.
# Uncomment to delete the dataset after deleting all specs
# client.dataframer.datasets.delete(dataset_id=dataset_id)
# print(f"✓ Deleted dataset {dataset_id}")

print("\n⚠️ Deletion is commented out for safety")
print("Uncomment the code above to delete when ready")
✅ Check requirements first: Review the Folder Structure Requirements before starting
✅ Consistent file naming: Use the same filenames across all folders when possible
✅ Clear relationships: Document how files relate in the specification requirements
✅ Explicit requirements: List all required files explicitly in the spec
✅ Test with small batches: Generate 3-5 folders first to verify quality
✅ Validate structure: Check folder structure before uploading
✅ Quality seed data: Provide high-quality, consistent examples
✅ Monitor file sizes: Keep files under 1MB for optimal performance
✅ Use appropriate models:
claude-sonnet-4-5* for highest quality
claude-sonnet-4-5-thinking*
claude-haiku-4-5* for fast & cheap generation
deepseek-ai/DeepSeek-V3.1
moonshotai/Kimi-K2-Instruct
openai/gpt-oss-120b (slow)
deepseek-ai/DeepSeek-R1-0528-tput (slow)
Qwen/Qwen2.5-72B-Instruct-Turbo
* These models can be used to generate evals data but not data to train competing models.
✅ Long sample type: Always use sample_type="long" for multi-folder generation