New Features

🛡️ Contradiction Check

A new LLM-based quality gate that catches logically contradictory attribute combinations in sampled data (e.g. “sunny weather” + “heavy rain”) and automatically re-samples.
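
The gate-and-resample behavior can be pictured as a rejection loop. The sketch below is illustrative only: `sample_with_gate`, `weather_rule`, the axis names, and the `max_attempts` cap are all hypothetical, and a hard-coded rule stands in for the LLM judgment that the real check relies on.

```python
import random
from typing import Callable, Dict, List

Attributes = Dict[str, str]


def sample_attributes(axes: Dict[str, List[str]], rng: random.Random) -> Attributes:
    """Draw one value per axis, independently and uniformly at random."""
    return {axis: rng.choice(values) for axis, values in axes.items()}


def sample_with_gate(
    axes: Dict[str, List[str]],
    is_contradictory: Callable[[Attributes], bool],
    rng: random.Random,
    max_attempts: int = 20,  # hypothetical cap, not a documented setting
) -> Attributes:
    """Re-sample until the gate accepts a combination or attempts run out."""
    for _ in range(max_attempts):
        candidate = sample_attributes(axes, rng)
        if not is_contradictory(candidate):
            return candidate
    raise RuntimeError(f"no consistent combination found in {max_attempts} attempts")


def weather_rule(attrs: Attributes) -> bool:
    """Hard-coded stand-in for the LLM judgment (assumption for this sketch)."""
    return attrs["weather"] == "sunny" and attrs["precipitation"] == "heavy rain"


# Hypothetical property axes mirroring the example in the text.
AXES = {
    "weather": ["sunny", "overcast"],
    "precipitation": ["none", "heavy rain"],
}

if __name__ == "__main__":
    print(sample_with_gate(AXES, weather_rule, random.Random(0)))
```

Capping the number of attempts is a design choice in this sketch: it keeps the loop from spinning forever when a spec makes every combination contradictory.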

📊 Distribution Breakdown

Distribution analysis now shows three separate views for each property axis:
  • Requested — what the spec asks for (target percentages)
  • Expected — what’s achievable given the sample count (accounts for conditional probability cascades)
  • Evaluated — what an LLM classifier determined each sample actually is
The “Expected” calculation now properly accounts for conditional probability cascades instead of naively using base distributions.
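
To see why base distributions alone can mislead, consider one property whose values depend on another. The sketch below is an assumption about how such an "Expected" view could be computed, not DataFramer's documented method: the property names and probabilities are made up, and largest-remainder rounding is just one plausible way to model "achievable given the sample count".

```python
from typing import Dict

Distribution = Dict[str, float]


def marginal_of_child(
    parent: Distribution,
    child_given_parent: Dict[str, Distribution],
) -> Distribution:
    """Propagate the parent distribution through P(child | parent)."""
    out: Distribution = {}
    for p_value, p_prob in parent.items():
        for c_value, c_prob in child_given_parent[p_value].items():
            out[c_value] = out.get(c_value, 0.0) + p_prob * c_prob
    return out


def achievable_counts(dist: Distribution, n: int) -> Dict[str, int]:
    """Turn probabilities into whole-sample counts (largest-remainder rounding)."""
    raw = {k: v * n for k, v in dist.items()}
    counts = {k: int(x) for k, x in raw.items()}  # floor, values are non-negative
    leftover = n - sum(counts.values())
    by_remainder = sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


# Illustrative numbers only (hypothetical spec).
DOC_TYPE = {"invoice": 0.7, "memo": 0.3}
HAS_TABLE_GIVEN_DOC_TYPE = {
    "invoice": {"yes": 0.9, "no": 0.1},
    "memo": {"yes": 0.2, "no": 0.8},
}

if __name__ == "__main__":
    # P(has_table = yes) = 0.7 * 0.9 + 0.3 * 0.2 = 0.69
    expected = marginal_of_child(DOC_TYPE, HAS_TABLE_GIVEN_DOC_TYPE)
    print(expected)
    print(achievable_counts(expected, 10))
```

With only 10 samples, a 69% share cannot be hit exactly, which is the kind of gap between requested and achievable percentages that a separate view makes visible.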

⚡ Progressive Results

The run detail page now shows samples and cost as they complete in real time, instead of waiting for the entire run to finish.

🔒 Anonymization API

Released the first mature version of the anonymization (PII/PHI redaction) API, with full documentation, SDK support, and MCP integration.

🚀 Faster App Experience

Pages now load faster across the app: initial navigation, run details, and profile pages all render noticeably quicker.

🔑 Auth Reliability

Fixed several edge cases that could cause login failures or unexpected logouts.

New Features

📄 New PDF Generation Engine

The PDF pipeline has been rebuilt from the ground up. DataFramer now generates a unique visual style for each document: layout, fonts, colors, structure.
Highlights:
  • Automatic visual QA: Every generated PDF goes through automated revision cycles and checks to ensure visual quality.

🔢 Calculator Tool

A sandboxed Python execution environment is now available to the LLM during data generation, ensuring numerical accuracy across all generated tables and documents.
Highlights:
  • Arithmetically correct output: The LLM computes totals and percentages and cross-checks the figures in the document, eliminating hallucinated numbers that don’t add up. These checks run multiple times, both before and after the figures are written into the document, keeping error rates extremely low.

💰 New Cost Controls

New run parameters give you fine-grained control over generation cost and speed.
Features:
  • One-shot generation: Generates the entire document in a single LLM call instead of the multi-step outline → sections → concatenation process. Enabled by default; disable it when using weaker models or when the document is too long for even state-of-the-art models.
  • Selective revision types: Instead of all-or-nothing document revisions, you can now enable only the specific revision passes you need for your use case.

🎯 Conformance Filtering

A new post-generation quality gate that ensures every sample in your dataset actually matches its target specification and desired properties.
Highlights:
  • Automatic regeneration: Documents that clearly violate their target properties are automatically discarded and regenerated from scratch, with no manual review needed.

🔒 PII/PHI Anonymization

A new experimental tool for redacting sensitive information from existing datasets at scale and at extremely low cost (~$0.1 / million tokens).
Capabilities:
  • Quality evaluation: An optional LLM-based evaluation step measures precision, recall, and F1 of the redaction.
  • Broad file support: Works with CSV, JSON, JSONL, Markdown, and plain text datasets.
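
For readers unfamiliar with the three metrics reported by the optional evaluation step, the sketch below computes them for a single document. It assumes redactions are compared as exact character spans against a reference annotation; that matching rule, the function name `redaction_scores`, and the example spans are assumptions made for illustration, and the LLM-based evaluation may judge matches differently.

```python
from typing import Dict, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of a redacted region


def redaction_scores(predicted: Set[Span], gold: Set[Span]) -> Dict[str, float]:
    """Precision, recall, and F1 using exact span matches (an assumption)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    predicted = {(0, 5), (10, 15), (20, 25)}           # spans the tool redacted
    gold = {(0, 5), (10, 15), (30, 40), (50, 55)}      # spans that should be redacted
    print(redaction_scores(predicted, gold))
```

Low precision means text was redacted unnecessarily; low recall means sensitive text was missed, which is usually the costlier failure for PII/PHI.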

New Features

📄 PDF Support

DataFramer now supports PDFs as a first-class file type across datasets and generation workflows.
Highlights:
  • Dataset ingestion: .pdf is now accepted anywhere you upload dataset files, alongside .txt, .md, .json, .csv, and .jsonl.
  • Template prompts for PDFs: You can pass a prompt to control the visual style of generated PDFs (e.g., “Professional corporate style with blue headers”).

New Features

📝 New Blog Posts & Studies

New research and tutorials are available on the DataFramer blog.

🧱 Databricks Integration

Full integration with Databricks for data ingestion, generation, and model hosting. See the Databricks integration guide for a full walkthrough.
Capabilities:
  • pydataframer-databricks: a new Python package for working with DataFramer directly from Databricks notebooks. Includes DatabricksConnector for fetching sample data from Unity Catalog tables and loading generated data back into Delta tables via service principal M2M OAuth.
  • Databricks native models: Databricks-hosted models can be used for specs, generation, evaluation, and chat. A DataFramer admin configures service principal credentials once in the DataFramer UI, and any team member can then select databricks/ models without passing credentials in API calls.

💰 Cost & Time Estimates

See estimated cost and generation time before starting a run. Because DataFramer uses an agentic generation workflow with multiple LLM calls per sample, costs were previously difficult to predict. The estimator uses a simulated model of the full generation pipeline to produce forecasts before you commit to a run.
How it works:
  • Estimates update live as you adjust sample count, model, dataset type, and other parameters on the Create Run page
  • Accounts for all stages of generation: outline, content, revision cycles, and evaluation
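
As a rough mental model of why a multi-stage pipeline needs an estimator, the sketch below sums per-stage token costs across all samples. It is a deliberately simple closed-form stand-in, not the simulated model the estimator actually uses: the `Stage` fields, the function name, and every number are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Stage:
    """One pipeline stage; all fields are hypothetical inputs for this sketch."""
    name: str
    calls_per_sample: float       # average LLM calls this stage makes per sample
    input_tokens_per_call: int
    output_tokens_per_call: int


def estimate_cost_usd(
    stages: Iterable[Stage],
    samples: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Sum token costs over every stage, then scale by the sample count."""
    per_sample = sum(
        s.calls_per_sample
        * (
            s.input_tokens_per_call * usd_per_million_input
            + s.output_tokens_per_call * usd_per_million_output
        )
        for s in stages
    ) / 1_000_000
    return samples * per_sample


if __name__ == "__main__":
    # Placeholder figures, chosen only to make the arithmetic easy to follow.
    pipeline = [
        Stage("outline", 1, 1_000, 500),
        Stage("content", 4, 2_000, 1_500),
        Stage("revision", 2, 3_000, 1_000),
        Stage("evaluation", 1, 3_000, 200),
    ]
    print(f"${estimate_cost_usd(pipeline, 100, 1.0, 2.0):.2f}")
```

Changing the sample count or the per-token prices changes the total immediately, which mirrors how a live estimate can react as run parameters are adjusted.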

🔌 Public API

Stable public REST API for programmatic access to the full DataFramer workflow: datasets, specs, generation, evaluation, and red-teaming. The API went through a major overhaul to reach a stable, consistent interface.
Highlights:
  • Python SDK (pydataframer) with typed methods for every endpoint
  • Thoroughly documented in the API Reference with Python code examples for every endpoint

🤖 MCP Server

DataFramer is now available as an MCP (Model Context Protocol) server, allowing AI assistants such as Claude Code and Cursor, along with other MCP-compatible clients, to interact with the platform directly.
Capabilities:
  • Upload datasets, create specs, generate data, and download results, all through natural conversation with an AI assistant
  • Unlike the raw API, MCP also provides your AI assistant with detailed instructions on how to use DataFramer effectively, so it can guide you through the entire workflow conversationally
  • See API & MCP for setup instructions

📄 PDF Generation

Generate synthetic PDF documents with custom styling. Describe the visual style you want (e.g., “professional corporate style with blue headers”) and DataFramer generates styled PDFs automatically.
Capabilities:
  • Full PDF input/output: use PDF seed examples and generate new PDF documents
  • Custom styling via a natural language prompt that controls headers, fonts, colors, and layout

New Features

🌱 Seedless Generation

Generate high-quality synthetic data without requiring any seed examples. Simply describe what you want and let DataFramer create it from scratch.
How to create a spec (blueprint for the data) without uploading examples:
  1. Select “Seedless” as the specification type in the spec creation wizard
  2. Provide a spec name and generation objectives
  3. Set your target token range (e.g., 2,000-5,000 tokens)

👥 Admin Tools

New internal administration capabilities for managing teams and users.
Features:
  • Role-based access control with Admin and User roles
  • Admins can promote/demote users between Admin and User roles
  • Company-wide user visibility and management from the Profile page

💳 Billing System

Usage-based billing with transparent pricing and detailed invoicing.
How it works:
  • Calendar month billing cycles (1st to last day of each month)
  • Run Details page now shows the cost of your run
  • Failed task cost exclusion: you’re not charged for failed runs
  • Team and Enterprise plan types
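
A calendar-month cycle means the window depends only on the date of usage, including in leap years. As a small illustration (the helper name `billing_cycle` is hypothetical and this is not DataFramer's billing code), the bounds can be derived with the standard library:

```python
import calendar
from datetime import date
from typing import Tuple


def billing_cycle(day: date) -> Tuple[date, date]:
    """Return the first and last day of the calendar month containing `day`."""
    last_day = calendar.monthrange(day.year, day.month)[1]  # number of days in the month
    return date(day.year, day.month, 1), date(day.year, day.month, last_day)


if __name__ == "__main__":
    start, end = billing_cycle(date(2024, 2, 15))
    print(start, end)
```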

🗃️ ToolBox - SQL Execution Environment

Multi-database SQL validation engine for generating high-quality Text-to-SQL datasets.
Capabilities:
  • Validates both schema DDL and query SQL
  • Parallel testing against 3 databases: PostgreSQL, MySQL, SQLite
  • REST API integration for programmatic access
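
The idea of validating both the schema and the query can be sketched for one engine using Python's built-in `sqlite3` module. This covers SQLite only (PostgreSQL and MySQL would need running servers and their own drivers), the function name `validate_sqlite` is hypothetical, and the approach shown is an assumption rather than a description of the ToolBox internals.

```python
import sqlite3
from typing import Optional


def validate_sqlite(schema_ddl: str, query_sql: str) -> Optional[str]:
    """Run the DDL, then the query, on a throwaway in-memory database.

    Returns None when both succeed, otherwise the database error message.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)        # may contain several statements
        conn.execute(query_sql).fetchall()    # single statement against empty tables
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return str(exc)
    finally:
        conn.close()
    return None


if __name__ == "__main__":
    ddl = "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);"
    print(validate_sqlite(ddl, "SELECT name FROM users WHERE id = 1"))
    print(validate_sqlite(ddl, "SELECT email FROM users"))
```

Running the query against empty tables catches syntax errors and references to missing tables or columns, but it says nothing about whether the query returns the intended rows.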

🤖 Gemini 3 Pro Support

Full integration of Google’s latest Gemini 3 Pro models across the platform.
Capabilities:
  • Minimal reasoning mode (gemini/gemini-3-pro-preview) and high reasoning mode (gemini/gemini-3-pro-preview-thinking)
  • 1 million token context window
  • Available for spec analysis, generation, evaluation, red-teaming, and chat