
Run this tutorial interactively in Google Colab

Overview

Multi-folder datasets are used when each sample consists of multiple related files organized in folders. This tutorial covers the complete workflow from creating a multi-folder dataset to generating and evaluating complex multi-file samples using the Dataframer Python SDK.

What You’ll Learn

  • When to use multi-folder datasets
  • Setting up the Dataframer SDK
  • Creating multi-folder datasets with files
  • Generating specifications via AI analysis
  • Generating multi-file samples
  • Evaluating and downloading generated folders

Prerequisites

  • Python 3.8 or higher
  • Understanding of dataset types
  • Multiple related files to use as seed data
  • API key for authentication (set as DATAFRAMER_API_KEY environment variable)

Use Cases

Multi-folder datasets are perfect for:

Medical Records (EHR)
patient_case_001/
├── chest_xray_report.txt
└── discharge_summary.md

patient_case_002/
├── blood_work.txt
└── clinical_notes.md

Project Documentation
project_alpha/
├── README.md
├── requirements.txt
└── design_doc.md

project_beta/
├── README.md
├── api_spec.md
└── deployment_guide.txt

Multi-Language Content
article_001/
├── english.txt
├── spanish.txt
└── metadata.md

article_002/
├── english.txt
├── french.txt
└── metadata.md

Step 1: Install and Setup SDK

Install the Dataframer SDK

pip install --upgrade pydataframer dotenv pyyaml

Setup and Initialize Client

from pathlib import Path
from dataframer import Dataframer
import os

# Initialize the Dataframer client
client = Dataframer(
    api_key=os.getenv('DATAFRAMER_API_KEY')
)

print("✓ Dataframer client initialized successfully")
print(f"  Using base URL: {client.base_url}")

# Check SDK version
import dataframer
print(f"Dataframer SDK version: {dataframer.__version__}")

Step 2: Prepare Folder Structure

Folder Structure Requirements

Before creating your dataset, review these requirements carefully.

Do:
  • Create folders at the root level (Dataset → patient_case_001, patient_case_002, etc.)
  • Include at least 2 sample folders
  • Use supported file types: .md, .txt, .json, .csv, .jsonl (see all formats)
  • Keep files under 1MB each
  • Stay under 50MB total dataset size
  • Limit to 20 files per folder, 1000 files total
  • Maintain consistent structure across folders
Don’t:
  • Put files directly in root (must be in folders)
  • Exceed 2 folder levels (Dataset → subfolder → files only)
  • Include empty folders
  • Use unsupported file types (.pdf, .docx, .xlsx, etc.)
  • Mix different structures between folders
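These limits can be sanity-checked locally before uploading. A minimal sketch, with the limits hardcoded from the list above (`validate_dataset` is a local helper, not part of the SDK):

```python
from pathlib import Path

SUPPORTED = {".md", ".txt", ".json", ".csv", ".jsonl"}
MAX_FILE_BYTES = 1 * 1024 * 1024       # 1MB per file
MAX_TOTAL_BYTES = 50 * 1024 * 1024     # 50MB total
MAX_FILES_PER_FOLDER = 20
MAX_TOTAL_FILES = 1000

def validate_dataset(root: Path) -> list:
    """Return a list of problems found; an empty list means the structure looks OK."""
    problems = []
    folders = [p for p in root.iterdir() if p.is_dir()]
    if len(folders) < 2:
        problems.append("need at least 2 sample folders")
    for stray in root.iterdir():
        if stray.is_file():
            problems.append(f"{stray.name}: files must live inside folders, not the root")
    total_files = total_bytes = 0
    for folder in folders:
        entries = list(folder.iterdir())
        if not entries:
            problems.append(f"{folder.name}: empty folder")
        files = [p for p in entries if p.is_file()]
        if len(files) > MAX_FILES_PER_FOLDER:
            problems.append(f"{folder.name}: more than {MAX_FILES_PER_FOLDER} files")
        for p in entries:
            if p.is_dir():
                problems.append(f"{folder.name}/{p.name}: exceeds 2 folder levels")
                continue
            if p.suffix.lower() not in SUPPORTED:
                problems.append(f"{folder.name}/{p.name}: unsupported type {p.suffix}")
            if p.stat().st_size > MAX_FILE_BYTES:
                problems.append(f"{folder.name}/{p.name}: over 1MB")
            total_files += 1
            total_bytes += p.stat().st_size
    if total_files > MAX_TOTAL_FILES:
        problems.append(f"{total_files} files exceeds the {MAX_TOTAL_FILES}-file limit")
    if total_bytes > MAX_TOTAL_BYTES:
        problems.append("total dataset size exceeds 50MB")
    return problems

# e.g. problems = validate_dataset(Path("Dataset"))
```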

Create Organized Folders

Each folder represents one complete sample. For a medical EHR dataset, we’ll create two patient case folders:
from pathlib import Path

# Create patient case folders
Path("Dataset/patient_case_001").mkdir(parents=True, exist_ok=True)
Path("Dataset/patient_case_002").mkdir(parents=True, exist_ok=True)

print("✓ Created patient case folders")

Add Files to Each Folder

from pathlib import Path

# Create directories
Path("Dataset/patient_case_001").mkdir(parents=True, exist_ok=True)
Path("Dataset/patient_case_002").mkdir(parents=True, exist_ok=True)

# Patient Case 001 - Chest X-ray Report
with open("Dataset/patient_case_001/chest_xray_report.txt", "w") as f:
    f.write("Patient: Patient 1\n")
    f.write("Date: January 15, 2024\n")
    f.write("Findings: Clear lung fields, no infiltrates or masses detected.\n")
    f.write("Impression: Normal chest radiograph.\n")

# Patient Case 001 - Discharge Summary
with open("Dataset/patient_case_001/discharge_summary.md", "w") as f:
    f.write("# Discharge Summary\n\n")
    f.write("**Patient:** Patient 1\n")
    f.write("**Date:** January 15, 2024\n\n")
    f.write("## Clinical Course\n")
    f.write("Patient presented for routine checkup. All vitals within normal range.\n\n")
    f.write("## Discharge Diagnosis\n")
    f.write("Healthy status maintained.\n")

# Patient Case 002 - Blood Work
with open("Dataset/patient_case_002/blood_work.txt", "w") as f:
    f.write("Patient: Patient 2\n")
    f.write("Date: January 20, 2024\n")
    f.write("WBC: 7.2 K/uL (Normal)\n")
    f.write("RBC: 4.8 M/uL (Normal)\n")
    f.write("Platelets: 250 K/uL (Normal)\n")

# Patient Case 002 - Clinical Notes
with open("Dataset/patient_case_002/clinical_notes.md", "w") as f:
    f.write("# Clinical Notes\n\n")
    f.write("**Patient:** Patient 2\n")
    f.write("**Date:** January 20, 2024\n\n")
    f.write("## Visit Reason\n")
    f.write("Annual physical examination.\n\n")
    f.write("## Assessment\n")
    f.write("Patient in excellent health. All laboratory values within normal limits.\n")

print("✓ Created 2 patient case folders with files")

Verify Your Folder Structure

from pathlib import Path

dataset_folder_path = Path("Dataset")

if dataset_folder_path.exists():
    print(f"✓ Found Dataset folder at: {dataset_folder_path.absolute()}")
    print(f"\nFolder structure:")
    
    for patient_folder in sorted(dataset_folder_path.iterdir()):
        if patient_folder.is_dir():
            print(f"\n📁 {patient_folder.name}/")
            for file in sorted(patient_folder.iterdir()):
                size_kb = file.stat().st_size / 1024
                print(f"   📄 {file.name} ({size_kb:.1f} KB)")
else:
    print("Dataset folder not found")

Step 3: Create Multi-Folder Dataset

The SDK provides a simple ZIP-based upload method.

Create ZIP and Upload

import zipfile
import io
from pathlib import Path

dataset_name = "EHR_patient_records_demo"
dataset_folder_path = Path("Dataset")

# Create a ZIP file in memory containing the entire Dataset folder
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zip_file:
    for patient_folder in sorted(dataset_folder_path.iterdir()):
        if patient_folder.is_dir():
            for file_path in sorted(patient_folder.iterdir()):
                if file_path.is_file():
                    # Add file to ZIP with folder structure preserved
                    arcname = f"{patient_folder.name}/{file_path.name}"
                    zip_file.write(file_path, arcname)
                    print(f"📦 Added: {arcname}")

zip_buffer.seek(0)

# Upload ZIP file - backend auto-detects structure and validates
print("\n🚀 Uploading dataset...")
dataset_response = client.dataframer.datasets.create_from_zip(
    name=dataset_name,
    description="Electronic Health Records (EHR) dataset with multiple patient cases",
    zip_file=zip_buffer
)

print(f"\n✅ Dataset created: {dataset_name}")
print(f"   ID: {dataset_response.id}")
print(f"   Type: {dataset_response.dataset_type} (auto-detected)")
print(f"   Files: {dataset_response.file_count} | Folders: {dataset_response.folder_count}")

# Store the dataset_id for later use
dataset_id = dataset_response.id

List All Datasets

# List all datasets to verify creation
datasets = client.dataframer.datasets.list()

print("=" * 80)
print(f"Found {len(datasets)} dataset(s)")
print("=" * 80)

for i, dataset in enumerate(datasets, 1):
    print(f"\n📁 Dataset {i}:")
    print(f"  Name: {dataset.name}")
    print(f"  ID: {dataset.id}")
    print(f"  Type: {dataset.dataset_type_display}")
    print(f"  Files: {dataset.file_count} | Folders: {dataset.folder_count}")
    print(f"  Created: {dataset.created_at.strftime('%Y-%m-%d %H:%M:%S')}")

Retrieve Dataset Details

# Get detailed information about the dataset
dataset_info = client.dataframer.datasets.retrieve(dataset_id=dataset_id)

print("📋 Dataset Information:")
print("=" * 80)
print(f"ID: {dataset_info.id}")
print(f"Name: {dataset_info.name}")
print(f"Type: {dataset_info.dataset_type} ({dataset_info.dataset_type_display})")
print(f"Description: {dataset_info.description}")
print(f"Created: {dataset_info.created_at}")
print()
print(f"📁 Contents:")
print(f"  Files: {dataset_info.file_count}")
print(f"  Folders: {dataset_info.folder_count}")
print()
print(f"🔧 Compatibility:")
compat = dataset_info.short_sample_compatibility
print(f"  Short samples: {'✅' if compat.is_short_samples_compatible else '❌'}")
print(f"  Long samples: {'✅' if compat.is_long_samples_compatible else '❌'}")
if compat.reason:
    print(f"  Reason: {compat.reason}")
print("=" * 80)

Step 4: Generate a Spec using the Analysis API

Use an LLM to automatically analyze your dataset and create a spec. The analysis reads every seed file, identifies patterns across them, and generates a spec (a blueprint) that drives synthetic data generation.

Start Analysis

# Create a spec by analyzing the dataset
spec_name = f"Spec for {dataset_info.name}"

analysis_result = client.dataframer.analyze.create(
    dataset_id=dataset_id,
    name=spec_name,

    # Optional: Specify the AI model for analysis
    analysis_model_name="claude-sonnet-4-5",
    # Note: This model can be used to generate evals data but not data to train competing models

    # Optional: Analysis configuration
    extrapolate_values=True,        # Extrapolate new values beyond existing ranges
    generate_distributions=True,    # Generate statistical distributions
)

print(f"✓ Created spec '{spec_name}' via AI analysis")
print(f"  Task ID: {analysis_result.task_id}")
print(f"  Status: {analysis_result.status}")
Recommended AI Models:
  • claude-sonnet-4-5* (recommended for quality)
  • claude-sonnet-4-5-thinking*
  • claude-haiku-4-5* (fast & cheap)
  • deepseek-ai/DeepSeek-V3.1
  • moonshotai/Kimi-K2-Instruct
  • openai/gpt-oss-120b (slow)
  • deepseek-ai/DeepSeek-R1-0528-tput (slow)
  • Qwen/Qwen2.5-72B-Instruct-Turbo
* These models can be used to generate evals data but not data to train competing models.

Check Analysis Status

The analysis is asynchronous and may take 3-10 minutes for multi-folder datasets:
# Poll for analysis completion
task_id = analysis_result.task_id
status = client.dataframer.analyze.get_status(task_id=task_id)

print(f"Analysis status: {status['status']}")
# Status values: PENDING, RUNNING, COMPLETED, FAILED
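The status check above returns a single snapshot; for unattended runs you can poll until a terminal state. A minimal sketch, assuming the status endpoint keeps returning a dict with a "status" key as shown above (`wait_for` is a local helper, not part of the SDK):

```python
import time

def wait_for(get_status, task_id, timeout_s=1800, interval_s=30):
    """Poll a status endpoint until it reports COMPLETED or FAILED."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status(task_id=task_id)["status"]
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(interval_s)
    raise TimeoutError(f"task {task_id} still running after {timeout_s}s")

# e.g. wait_for(client.dataframer.analyze.get_status, task_id)
```

The same helper can poll generation in Step 6 (for example with client.dataframer.generate.retrieve_status), since it reports the same status values.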

List All Specs

# Retrieve all specs to get the spec ID
specs = client.dataframer.specs.list()

print("Available specs:")
for spec in specs:
    print(f"  • {spec.name} (ID: {spec.id})")
    print(f"    Dataset: {spec.dataset_name}")
    print(f"    Latest version: {spec.latest_version}")

# Store the spec_id for generation
spec_id = specs[0].id  # Use your newly created spec
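Relying on list order (`specs[0]`) can pick the wrong spec once you have several. A safer lookup by the name chosen earlier in this step (`find_spec_id` is a local helper, not part of the SDK):

```python
def find_spec_id(specs, name):
    """Return the id of the first spec whose name matches, or raise if absent."""
    matches = [s.id for s in specs if s.name == name]
    if not matches:
        raise LookupError(f"no spec named {name!r}")
    return matches[0]

# spec_id = find_spec_id(specs, spec_name)
```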

Review Generated Specification

The specification captures the structure and patterns from your dataset:
# Get the latest version details
spec = client.dataframer.specs.retrieve(spec_id=spec_id)
versions = client.dataframer.specs.versions.list(spec_id=spec_id)

if len(versions) > 0:
    latest_version = client.dataframer.specs.versions.retrieve(
        spec_id=spec_id,
        version_id=versions[0].id
    )
    
    print(f"Latest version: {latest_version.version}")
    print(f"Config YAML length: {len(latest_version.config_yaml)} chars")
    
    # Parse the config to see data properties
    import yaml
    config = yaml.safe_load(latest_version.config_yaml)
    
    # Access spec data
    spec_data = config.get('spec', config)
    
    print(f"\nData property variations:")
    if 'data_property_variations' in spec_data:
        for prop in spec_data['data_property_variations']:
            print(f"  • {prop['property_name']}: {len(prop['property_values'])} values")

Step 5: Update Spec (Optional)

You can update the spec by modifying the config YAML and creating a new version:

Modify Specification Config

import yaml

# Get the latest version
versions = client.dataframer.specs.versions.list(spec_id=spec_id)
latest_version = client.dataframer.specs.versions.retrieve(
    spec_id=spec_id,
    version_id=versions[0].id
)

# Parse the current config
current_config = yaml.safe_load(latest_version.config_yaml)
spec_data = current_config.get('spec', current_config)

# Example: Add a new data property variation for EHR
if 'data_property_variations' in spec_data:
    new_property = {
        'property_name': 'Case Severity',
        'property_values': ['Critical', 'Severe', 'Moderate', 'Mild'],
        'base_distributions': {
            'Critical': 10,
            'Severe': 25,
            'Moderate': 40,
            'Mild': 25
        },
        'conditional_distributions': {}
    }
    spec_data['data_property_variations'].append(new_property)
    print(f"✓ Added new property: {new_property['property_name']}")
    
    # Update requirements for medical context
    if 'requirements' in spec_data:
        spec_data['requirements'] += (
            "\n\nGenerated patient cases must maintain medical accuracy "
            "and include appropriate clinical correlations between symptoms, "
            "test results, and diagnoses."
        )
        print(f"✓ Updated requirements for medical context")

# Convert back to YAML
new_config_yaml = yaml.dump(current_config, default_flow_style=False, sort_keys=False)

# Update the spec (creates a new version automatically)
updated_spec = client.dataframer.specs.update(
    spec_id=spec_id,
    config_yaml=new_config_yaml,  # Your edits
    results_yaml=latest_version.results_yaml,  # Historical reference
    orig_results_yaml=latest_version.orig_results_yaml,  # Backup
    runtime_params=latest_version.runtime_params  # Metadata
)

print(f"\n✓ Updated spec. New version: {updated_spec.latest_version}")
YAML Fields Explained:
  • config_yaml: Your updated configuration (used for generation) - REQUIRED
  • results_yaml: Original AI analysis from version 1 (never changes)
  • orig_results_yaml: Backup of original analysis
  • runtime_params: Metadata about analysis/generation

Step 6: Generate Multi-Folder Samples

Generate new synthetic patient record folders based on your specification:

Start Generation Run

# Create a generation run
generation_result = client.dataframer.generate.create(
    spec_id=spec_id,
    
    generation_model="claude-sonnet-4-5",
    # Note: This model can be used to generate evals data but not data to train competing models
    number_of_samples=5,
    sample_type="long",  # Multi-folder requires "long" for proper file relationships
    
    ## Advanced configuration for long samples

    outline_model="claude-sonnet-4-5",
    # Note: This model can be used to generate evals data but not data to train competing models
    enable_revisions=True,
    max_revision_cycles=2,
    outline_thinking_budget=2000,
    
    revision_model="claude-sonnet-4-5",
    # Note: This model can be used to generate evals data but not data to train competing models
    revision_thinking_budget=1500,
)

print(f"✓ Started generation run")
print(f"  Task ID: {generation_result.task_id}")
print(f"  Run ID: {generation_result.run_id}")
print(f"  Status: {generation_result.status}")

# Store for later use
task_id = generation_result.task_id
run_id = generation_result.run_id
Available Generation Models:
  • claude-sonnet-4-5* (recommended for quality)
  • claude-sonnet-4-5-thinking*
  • claude-haiku-4-5* (fast & cheap)
  • deepseek-ai/DeepSeek-V3.1
  • moonshotai/Kimi-K2-Instruct
  • openai/gpt-oss-120b (slow)
  • deepseek-ai/DeepSeek-R1-0528-tput (slow)
  • Qwen/Qwen2.5-72B-Instruct-Turbo
* These models can be used to generate evals data but not data to train competing models.
Multi-folder generation requires the "long" sample type for proper file relationships. Generation takes 5-15 minutes per sample folder.

Monitor Generation Status

# Check generation status
status = client.dataframer.generate.retrieve_status(task_id=task_id)
print(f"Generation status: {status['status']}")
# Status values: PENDING, RUNNING, COMPLETED, FAILED

List All Runs

# View all generation runs
runs = client.dataframer.runs.list()

print("All generation runs:")
for run in runs:
    print(f"  • Run {run.id}: {run.status}")
    print(f"    Spec: {run.spec_name}")
    print(f"    Samples: {run.number_of_samples}")

Retrieve Run Status

# Get detailed run status
run_status = client.dataframer.runs.status(run_id=run_id)
print(f"Run status: {run_status['status']}")

Step 7: Evaluate Generated Samples

Once generation is complete, the evaluation step automatically tags the generated samples with appropriate labels, lets you observe how the data is distributed across its attributes, and supports ad hoc queries against the generated dataset.

Create Evaluation

# Create an evaluation for the completed run
# Note: Run must be in 'SUCCEEDED' status with generated files

print(f"Creating evaluation for run: {run_id}")

evaluation = client.dataframer.evaluations.create(
    run_id=run_id,
    evaluation_model="claude-sonnet-4-5"
    # Note: This model can be used to generate evals data but not data to train competing models
)

print(f"\nEvaluation created")
print(f"  Evaluation ID: {evaluation.id}")
print(f"  Status: {evaluation.status}")
print(f"  Created at: {evaluation.created_at}")

# Store the evaluation_id for later use
evaluation_id = evaluation.id

Check Evaluation Status

# Poll this endpoint until status is 'COMPLETED' or 'FAILED'
evaluation_results = client.dataframer.evaluations.retrieve(
    evaluation_id=evaluation_id
)

print(f"Evaluation Status: {evaluation_results.status}")

if evaluation_results.status == 'FAILED':
    print(f"\n❌ Evaluation failed")
    if evaluation_results.error_message:
        print(f"  Error: {evaluation_results.error_message}")
elif evaluation_results.status == 'COMPLETED':
    print(f"\n✅ Evaluation completed successfully")

Evaluation Checks

The evaluation process analyzes:
  • All required files present in each folder
  • File formats are correct
  • Data consistency across files
  • Content quality of each file
  • Conformance to spec requirements
  • Distribution of data property variations

Step 8: Download Generated Folders with Metadata

After evaluation completes, download the generated files with their evaluation metadata.

List Generated Files

# Get generated files for the run
result = client.dataframer.runs.generated_files.list(run_id=run_id)

print("📁 Generated Files:")
print("=" * 80)
print(f"Run ID: {result.run_id}")
print(f"Total files: {len(result.generated_files)}")
print("=" * 80)

for i, file in enumerate(result.generated_files, 1):
    print(f"\n📄 File {i}:")
    print(f"  Name: {file.name}")
    print(f"  ID: {file.id}")
    print(f"  Status: {file.status}")
    print(f"  Size: {file.size} bytes")
    print(f"  Type: {file.type}")
    if file.status_details:
        print(f"  Details: {file.status_details}")
    if file.generation_model:
        print(f"  Model: {file.generation_model}")

Download All Files as ZIP

from pathlib import Path

print(f"📥 Downloading generated files with metadata as ZIP...")

# Download ZIP file from backend
# The ZIP contains:
# - All generated files with folder structure
# - .metadata files with evaluation tags/classifications
# - top_level.metadata with evaluation summary
downloaded_zip = client.dataframer.runs.generated_files.download_all(
    run_id=run_id
)

# Save ZIP file
output_file = Path(f"generated_samples_{run_id}.zip")
output_file.write_bytes(downloaded_zip.read())

print(f"\n✅ Download complete!")
print(f"📦 ZIP file: {output_file.absolute()}")
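To inspect what was downloaded, you can extract the ZIP and locate the evaluation metadata. A minimal local sketch, assuming only the ZIP layout described in the comments above (`extract_and_list_metadata` is a helper of ours, not part of the SDK):

```python
import zipfile
from pathlib import Path

def extract_and_list_metadata(zip_path, out_dir):
    """Extract the downloaded ZIP and return the evaluation .metadata files."""
    out_dir = Path(out_dir)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
    return sorted(out_dir.rglob("*.metadata"))

# e.g.:
# for meta in extract_and_list_metadata(output_file, "generated_samples"):
#     print(meta, meta.read_text()[:200])
```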

Cleanup: Delete Dataset

When you’re done with testing, you can delete the dataset. First, delete all specs that reference it:
## ⚠️ Warning: This action cannot be undone. All files will be permanently deleted.

## Step 1: Get all specs for this dataset
all_specs = client.dataframer.specs.list()
dataset_specs = [spec for spec in all_specs if spec.dataset_name == dataset_name]

print(f"Found {len(dataset_specs)} spec(s) referencing this dataset")

## Step 2: Delete all specs that reference this dataset
for spec in dataset_specs:
    print(f"  Deleting spec: {spec.name} (ID: {spec.id})")
    ## Uncomment to delete the spec
    # client.dataframer.specs.delete(spec_id=spec.id)
    # print(f" ✓ Deleted spec {spec.id}")

## Step 3: Delete the dataset
## Note: Cannot delete a dataset that is referenced by any specs.
## Uncomment to delete the dataset after deleting all specs
# client.dataframer.datasets.delete(dataset_id=dataset_id)
# print(f"✓ Deleted dataset {dataset_id}")

print("\n⚠️ Deletion is commented out for safety")
print("Uncomment the code above to delete when ready")

Common Issues

Problem: Files are being skipped during upload
Causes:
  • Using unsupported file types
Solution:
  • Check the Folder Structure Requirements for supported file types
  • Convert your data to a supported format
  • Verify file extensions before upload
  • Check the skipped files list during upload
Problem: Dataset creation fails with an error
Causes:
  • No valid files to upload (all skipped)
  • Files exceed size limits
  • Too many files
  • Total size too large
Solution:
  • Ensure at least one supported file type exists
  • Review the Folder Structure Requirements for all limits
  • Split large files into smaller chunks
  • Reduce number of files per folder
  • Check total dataset size
Problem: Some folders are missing files
Causes:
  • Unclear specification
  • Model timeout
  • Complex file relationships
Solution:
  • Explicitly list required files in spec
  • Provide clear file format examples
  • Simplify file relationships
  • Try generating fewer samples
  • Use sample_type="long" for multi-folder
Problem: Data doesn’t match across files in a folder
Causes:
  • Specification doesn’t emphasize consistency
  • Files treated independently
Solution:
  • Add consistency requirements to spec
  • Provide examples of consistent folders
  • Use explicit cross-file constraints in requirements
  • Update spec to emphasize relationships between files
Problem: Generation is slower than expected
Expected: 5-15 minutes per folder
If much slower:
  • Multi-folder is inherently slower than single files
  • Large/complex files take longer
  • Check status endpoint for progress
  • Consider reducing sample count
  • Try using faster models like “claude-haiku-4-5” or open source alternatives
Problem: Cannot import dataframer module
Causes:
  • SDK not installed
  • Wrong virtual environment
  • Installation failed
Solution:
pip install --upgrade pydataframer dotenv pyyaml
  • Verify installation: pip show pydataframer
  • Check Python version (3.8+)

Best Practices

  • Check requirements first: Review the Folder Structure Requirements before starting
  • Consistent file naming: Use the same filenames across all folders when possible
  • Clear relationships: Document how files relate in the specification requirements
  • Explicit requirements: List all required files explicitly in the spec
  • Test with small batches: Generate 3-5 folders first to verify quality
  • Validate structure: Check folder structure before uploading
  • Quality seed data: Provide high-quality, consistent examples
  • Monitor file sizes: Keep files under 1MB for optimal performance
  • Use appropriate models: claude-sonnet-4-5* for highest quality, claude-haiku-4-5* for fast & cheap generation, or the open-source alternatives listed in Step 6
  • Long sample type: Always use sample_type="long" for multi-folder generation
* These models can be used to generate evals data but not data to train competing models.

Next Steps