Apr 15 2026
New Features
🛡️ Contradiction Check
A new LLM-based quality gate that catches logically contradictory attribute combinations in sampled data (e.g. “sunny weather” + “heavy rain”) and automatically re-samples.
📊 Distribution Breakdown
Distribution analysis now shows three separate views for each property axis:
- Requested: what the spec asks for (target percentages)
- Expected: what’s achievable given the sample count (accounts for conditional probability cascades)
- Evaluated: what an LLM classifier determined each sample actually is
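As a rough illustration of why Expected can differ from Requested, the sketch below converts target percentages into whole-sample counts using largest-remainder rounding. The function name and the rounding rule are assumptions made for illustration; DataFramer's actual calculation, which also accounts for conditional probability cascades, is not documented here.

```python
from math import floor


def expected_distribution(requested: dict[str, float], n_samples: int) -> dict[str, float]:
    """Hypothetical helper: nearest achievable percentages for n_samples.

    `requested` maps each property value to a target percentage (summing to 100).
    """
    quotas = {k: p / 100 * n_samples for k, p in requested.items()}
    counts = {k: floor(q) for k, q in quotas.items()}
    leftover = n_samples - sum(counts.values())
    # Give the leftover samples to the values with the largest fractional remainder.
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return {k: 100 * c / n_samples for k, c in counts.items()}


requested = {"sunny": 50.0, "rainy": 30.0, "snowy": 20.0}
expected = expected_distribution(requested, n_samples=4)
print(expected)
```

With only four samples, a 30% / 20% split cannot be hit exactly, so the achievable percentages drift away from the requested ones; the gap shrinks as the sample count grows.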
⚡ Progressive Results
The run detail page now shows samples and cost as they complete in real time, instead of waiting for the entire run to finish.
🔒 Anonymization API
Released the first mature version of the anonymization (PII/PHI redaction) API, with full documentation, SDK support, and MCP integration.
🚀 Faster App Experience
Faster page loads across the app: initial navigation, run details, and profile pages all render noticeably quicker.
🔑 Auth Reliability
Fixed several edge cases that could cause login failures or unexpected logouts.
Mar 23 2026
New Features
📄 New PDF Generation Engine
The PDF pipeline has been rebuilt from the ground up. DataFramer now generates a unique visual style for each document: layout, fonts, colors, and structure.
Highlights:
- Automatic visual QA: Every generated PDF goes through automated revision cycles and quality checks to verify its visual quality.
🔢 Calculator Tool
A sandboxed Python execution environment is now available to the LLM during data generation, ensuring numerical accuracy across all generated tables and documents.
Highlights:
- Arithmetically correct output: The LLM computes totals and percentages and cross-checks the figures in the document, eliminating hallucinated numbers that don’t add up. These checks run multiple times, both before and after the figures are written into the document, which keeps error rates extremely low.
💰 New Cost Controls
New run parameters give you fine-grained control over generation cost and speed.
Features:
- One-shot generation: Generates the entire document in a single LLM call instead of the multi-step outline → sections → concatenation process. Enabled by default; disable it when using weaker models or if the document is too long for even state-of-the-art models.
- Selective revision types: Instead of all-or-nothing document revisions, you can now enable only the specific revision passes you need for your use case.
🎯 Conformance Filtering
A new post-generation quality gate that ensures every sample in your dataset actually matches its target specification and desired properties.
Highlights:
- Automatic regeneration: Documents that clearly violate their target properties are automatically discarded and regenerated from scratch, with no manual review needed.
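The discard-and-regenerate behavior can be pictured as a simple retry loop. The sketch below is only an illustration under stated assumptions: `generate_conforming`, the two callables, and the `max_attempts` cap are hypothetical names, and the toy generator and check stand in for the real generation and conformance steps.

```python
import random
from typing import Callable


def generate_conforming(
    generate: Callable[[], dict],
    conforms: Callable[[dict], bool],
    max_attempts: int = 5,  # assumed cap; the source does not state one
) -> dict:
    """Regenerate from scratch until a sample passes the gate or attempts run out."""
    for _ in range(max_attempts):
        sample = generate()
        if conforms(sample):
            return sample
        # A non-conforming sample is simply dropped; nothing is patched in place.
    raise RuntimeError(f"no conforming sample after {max_attempts} attempts")


# Toy stand-ins for the generator and the conformance check.
rng = random.Random(0)


def toy_generate() -> dict:
    return {"tone": rng.choice(["formal", "casual"])}


def toy_conforms(sample: dict) -> bool:
    return sample["tone"] == "formal"


sample = generate_conforming(toy_generate, toy_conforms, max_attempts=50)
print(sample)
```

Capping the number of attempts is a design choice in this sketch: it keeps a target that can never be satisfied from regenerating forever.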
🔒 PII/PHI Anonymization
A new experimental tool for redacting sensitive information from existing datasets at scale and at extremely low cost (~$0.1 / million tokens).
Capabilities:
- Quality evaluation: An optional LLM-based evaluation step measures precision, recall, and F1 of the redaction.
- Broad file support: Works with CSV, JSON, JSONL, Markdown, and plain text datasets.
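For readers unfamiliar with the three metrics, here is a minimal sketch of how precision, recall, and F1 relate for a redaction task. It assumes exact-match character spans compared against a labeled reference set; the function name and the example spans are hypothetical, and DataFramer's LLM-based evaluation may score matches differently.

```python
def redaction_scores(predicted: set, gold: set) -> tuple[float, float, float]:
    """Precision, recall, and F1 over exact-match (start, end) spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: two of three redactions are correct, one sensitive span is missed.
gold = {(0, 8), (20, 32), (40, 51)}
predicted = {(0, 8), (20, 32), (60, 64)}
precision, recall, f1 = redaction_scores(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision falls when the tool redacts text that was not sensitive, while recall falls when sensitive text slips through; F1 is the harmonic mean of the two.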
Feb 23 2026
New Features
📄 PDF Support
DataFramer now supports PDFs as a first-class file type across datasets and generation workflows.
Highlights:
- Dataset ingestion: .pdf is now accepted anywhere you upload dataset files, alongside .txt, .md, .json, .csv, and .jsonl.
- Template prompts for PDFs: you can pass a prompt to control the visual style of generated PDFs (e.g., “Professional corporate style with blue headers”).
Feb 10 2026
New Features
📝 New Blog Posts & Studies
New research and tutorials on the DataFramer blog:
- “How to Generate 50K-Token Documents: Same LLM, Different Results”: a benchmark study comparing DataFramer vs. raw Claude Sonnet 4.5 for long-form text generation, with a companion dataset on HuggingFace
- “Generation of Synthetic Text2SQL Data with 100% Validity”: a tutorial on generating diverse verified text-to-SQL samples using DataFramer
🧱 Databricks Integration
Full integration with Databricks for data ingestion, generation, and model hosting. See the Databricks integration guide for a full walkthrough.
Capabilities:
- pydataframer-databricks: a new Python package for working with DataFramer directly from Databricks notebooks. Includes DatabricksConnector for fetching sample data from Unity Catalog tables and loading generated data back into Delta tables via service principal M2M OAuth.
- Databricks native models: Databricks-hosted models can be used for specs, generation, evaluation, and chat. A DataFramer admin configures service principal credentials once in the DataFramer UI, and any team member can then select databricks/ models without passing credentials in API calls.
💰 Cost & Time Estimates
See estimated cost and generation time before starting a run. Because DataFramer uses an agentic generation workflow with multiple LLM calls per sample, costs were previously difficult to predict. The estimator uses a simulated model of the full generation pipeline to produce forecasts before you commit to a run.
How it works:
- Estimates update live as you adjust sample count, model, dataset type, and other parameters on the Create Run page
- Accounts for all stages of generation: outline, content, revision cycles, and evaluation
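To give a feel for how a per-stage estimate adds up, the sketch below multiplies assumed call and token counts by an assumed price. Every number, the `Stage` class, and the `estimate_cost` function are hypothetical; the real estimator simulates the full pipeline and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    calls_per_sample: int
    tokens_per_call: int


def estimate_cost(stages: list[Stage], n_samples: int, usd_per_million_tokens: float) -> float:
    """Sum tokens over every stage and sample, then price them at a flat rate."""
    tokens_per_sample = sum(s.calls_per_sample * s.tokens_per_call for s in stages)
    total_tokens = tokens_per_sample * n_samples
    return total_tokens * usd_per_million_tokens / 1_000_000


# Illustrative numbers only; they are not DataFramer's real figures.
stages = [
    Stage("outline", calls_per_sample=1, tokens_per_call=2_000),
    Stage("content", calls_per_sample=4, tokens_per_call=3_000),
    Stage("revision", calls_per_sample=2, tokens_per_call=5_000),
    Stage("evaluation", calls_per_sample=1, tokens_per_call=4_000),
]
cost = estimate_cost(stages, n_samples=100, usd_per_million_tokens=10.0)
print(f"estimated cost: ${cost:.2f}")
```

Because each sample passes through several stages, the total scales with both the sample count and the number of calls per stage, which is why a single-call mental model tends to underestimate cost.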
🔌 Public API
Stable public REST API for programmatic access to the full DataFramer workflow: datasets, specs, generation, evaluation, and red-teaming. The API went through a major overhaul to reach a stable, consistent interface.
Highlights:
- Python SDK (pydataframer) with typed methods for every endpoint
- Thoroughly documented in the API Reference with Python code examples for every endpoint
🤖 MCP Server
DataFramer is now available as an MCP (Model Context Protocol) server, allowing AI assistants like Claude Code, Cursor, and other MCP-compatible clients to interact with the platform directly.
Capabilities:
- Upload datasets, create specs, generate data, and download results, all through natural conversation with an AI assistant
- Unlike the raw API, MCP also provides your AI assistant with detailed instructions on how to use DataFramer effectively — so it can guide you through the entire workflow conversationally
- See API & MCP for setup instructions
📄 PDF Generation
Generate synthetic PDF documents with custom styling. Describe the visual style you want (e.g., “professional corporate style with blue headers”) and DataFramer generates styled PDFs automatically.
Capabilities:
- Full PDF input/output: use PDF seed examples and generate new PDF documents
- Custom styling via a natural language prompt that controls headers, fonts, colors, and layout
Jan 8 2026
New Features
🌱 Seedless Generation
Generate high-quality synthetic data without requiring any seed examples. Simply describe what you want and let DataFramer create it from scratch.
How to create a spec (blueprint for the data) without uploading examples:
- Select “Seedless” as the specification type in the spec creation wizard
- Provide a spec name and generation objectives
- Set your target token range (e.g., 2,000-5,000 tokens)
👥 Admin Tools
New internal administration capabilities for managing teams and users.
Features:
- Role-based access control with Admin and User roles
- Admins can promote/demote users between Admin and User roles
- Company-wide user visibility and management from the Profile page
💳 Billing System
Usage-based billing with transparent pricing and detailed invoicing.
How it works:
- Calendar month billing cycles (1st to last day of each month)
- Run Details page now shows the cost of your run
- Failed task cost exclusion - you’re not charged for failed runs
- Team and Enterprise plan types
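The calendar-month cycle and the failed-task exclusion can be sketched with the standard library. The task dictionaries, the "failed" status string, and both function names are assumptions for illustration, not DataFramer's billing schema.

```python
import calendar
from datetime import date


def billing_cycle(day: date) -> tuple[date, date]:
    """Return the first and last day of the calendar month containing `day`."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    return day.replace(day=1), day.replace(day=last_day)


def billable_total(tasks: list[dict]) -> float:
    """Sum task costs, skipping any task marked as failed."""
    return sum(t["cost"] for t in tasks if t["status"] != "failed")


start, end = billing_cycle(date(2026, 2, 10))
print(start, end)

# Hypothetical task records; only the non-failed ones count toward the bill.
tasks = [
    {"cost": 1.25, "status": "succeeded"},
    {"cost": 0.75, "status": "failed"},
    {"cost": 2.00, "status": "succeeded"},
]
total = billable_total(tasks)
print(total)
```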
🗃️ ToolBox - SQL Execution Environment
Multi-database SQL validation engine for generating high-quality Text-to-SQL datasets.
Capabilities:
- Validates both schema DDL and query SQL
- Parallel testing against 3 databases: PostgreSQL, MySQL, SQLite
- REST API integration for programmatic access
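As a minimal sketch of what validating a schema and a query against one engine can look like, the example below uses Python's built-in sqlite3 module with an in-memory database. It covers SQLite only; the function name is hypothetical, and checking PostgreSQL and MySQL would require running servers and their drivers, which are omitted here.

```python
import sqlite3


def validate_sqlite(schema_ddl: str, query: str) -> tuple[bool, str]:
    """Run the DDL, then the query, in a throwaway in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)   # fails on invalid DDL
        conn.execute(query).fetchall()   # fails if the query does not fit the schema
        return True, ""
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()


ddl = "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);"
ok, _ = validate_sqlite(ddl, "SELECT id, total FROM orders WHERE total > 10")
bad, error = validate_sqlite(ddl, "SELECT missing_col FROM orders")
print(ok, bad, error)
```

Executing against an empty database is enough to catch syntax errors and references to tables or columns that do not exist, without needing any sample rows.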
🤖 Gemini 3 Pro Support
Full integration of Google’s latest Gemini 3 Pro models across the platform.
Capabilities:
- Minimal reasoning mode (gemini/gemini-3-pro-preview) and high reasoning mode (gemini/gemini-3-pro-preview-thinking)
- 1 million token context window
- Available for spec analysis, generation, evaluation, red-teaming, and chat

