Prerequisites
Service Principal Permissions (one-time admin setup)
This notebook must be run using a service principal that has access to the required Unity Catalog objects. A Databricks admin should create a service principal and grant it:
- USE CATALOG on the catalog
- USE SCHEMA, CREATE TABLE, SELECT, and MODIFY on the schema
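The grants above can be sketched as Unity Catalog SQL statements. The catalog, schema, and service-principal names below are placeholders, not values from this demo; an admin would substitute real names and run each statement (e.g. via spark.sql or the SQL editor).

```python
# Placeholders -- substitute the real catalog, schema, and principal names.
catalog = "my_catalog"                      # hypothetical catalog name
schema = f"{catalog}.my_schema"             # hypothetical schema name
principal = "`my-service-principal`"        # UC principals are backtick-quoted

# Unity Catalog GRANT statements matching the permissions listed above.
grants = [
    f"GRANT USE CATALOG ON CATALOG {catalog} TO {principal}",
    f"GRANT USE SCHEMA ON SCHEMA {schema} TO {principal}",
    f"GRANT CREATE TABLE ON SCHEMA {schema} TO {principal}",
    f"GRANT SELECT ON SCHEMA {schema} TO {principal}",
    f"GRANT MODIFY ON SCHEMA {schema} TO {principal}",
]

for statement in grants:
    print(statement)
    # In a Databricks notebook, an admin could execute each with:
    # spark.sql(statement)
```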
Dataframer API Key
A Dataframer API key is required for this demo. It can be retrieved from the web application by navigating to Account -> Keys -> Copy API Key. Note that you can use the fully hosted Dataframer solution or an on-prem deployment (reach out to [email protected] for more details).
Databricks Secrets Setup
This notebook expects the following secrets to be stored in a Databricks secret scope. In this example, we use a scope named dataframer.
- DATAFRAMER_API_KEY
- DATABRICKS_HTTP_PATH
- DATABRICKS_CLIENT_ID
- DATABRICKS_CLIENT_SECRET
- DATABRICKS_SERVER_HOSTNAME
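Inside a notebook, each of these secrets is read with dbutils.secrets.get. The helper below is a minimal sketch that falls back to environment variables so the same code can also be exercised outside Databricks; the fallback is an assumption for local runs, not part of the demo.

```python
import os

def get_secret(scope: str, key: str) -> str:
    """Read a secret from Databricks Secrets; fall back to environment
    variables when running outside a Databricks notebook (e.g. locally)."""
    try:
        # dbutils is injected into the notebook environment by Databricks.
        return dbutils.secrets.get(scope=scope, key=key)  # noqa: F821
    except NameError:
        return os.environ[key]

# Example (uses the environment fallback outside Databricks):
os.environ.setdefault("DATAFRAMER_API_KEY", "dummy-key-for-local-runs")
api_key = get_secret("dataframer", "DATAFRAMER_API_KEY")
```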
Step 1: Install and Setup SDK
Install the Dataframer SDK and the Databricks connector package for Dataframer.
Initialize the Dataframer client and the DatabricksConnector
In this step, we initialize the Dataframer client using an API key stored securely in Databricks Secrets under the dataframer scope. We also initialize the DatabricksConnector with the same scope so it can read data from, and persist data into, Unity Catalog.
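The DatabricksConnector needs the connection settings stored in the secret scope. The exact constructor signature of the Dataframer client and connector is an assumption here (check the SDK documentation for the real form); this sketch only shows assembling the settings from the secret keys listed in the prerequisites.

```python
import os

def read_secret(key: str) -> str:
    """Stand-in for dbutils.secrets.get("dataframer", key) so the sketch
    runs outside Databricks; inside a notebook, read the secret scope."""
    return os.environ.get(key, f"<{key}>")

# Connection settings assembled from the secrets listed in the prerequisites.
# The dict keys below are illustrative names, not the SDK's documented ones.
connector_config = {
    "server_hostname": read_secret("DATABRICKS_SERVER_HOSTNAME"),
    "http_path": read_secret("DATABRICKS_HTTP_PATH"),
    "client_id": read_secret("DATABRICKS_CLIENT_ID"),
    "client_secret": read_secret("DATABRICKS_CLIENT_SECRET"),
}
```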
Fetch sample data
For this demo, we use sample data available in the Databricks catalog, specifically the samples.bakehouse.media_customer_reviews table. To keep the example lightweight, we select only the first 25 rows, export them as a CSV file, and upload the file to Dataframer.
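The limit-and-export step can be sketched as follows. In the notebook the rows come from Unity Catalog via Spark; here a small stand-in list (hypothetical column names) keeps the sketch self-contained.

```python
import csv
import io

# In Databricks the rows would come from Unity Catalog, e.g.:
#   rows = spark.table("samples.bakehouse.media_customer_reviews").limit(25).collect()
# The list below is a stand-in with hypothetical columns.
rows = [{"review_id": i, "review": f"Sample review {i}"} for i in range(1, 31)]

sample = rows[:25]  # keep only the first 25 rows for a lightweight seed file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["review_id", "review"])
writer.writeheader()
writer.writerows(sample)
csv_text = buffer.getvalue()  # CSV content to upload to Dataframer as the seed
```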
Step 2: Upload data
Prepare CSV and upload to Dataframer
To use Dataframer, only a small sample of data is needed. To derive these samples from a table, we recommend creating a CSV file that contains the relevant rows from the table and supplying it as a seed dataset to Dataframer.
Retrieve Dataset Details
This API demonstrates how to retrieve a specific dataset given a dataset ID.
Step 3: Generate Specification via the Analysis API
A specification (or “spec”) is a detailed description that captures the structure, patterns, and requirements of your data. Think of a spec as a blueprint for your data generation task. Dataframer automatically generates specifications by analyzing your seed data. This cell ensures a specification exists for the dataset by reusing an existing one or generating a new one.
Review Generated Specification
This cell retrieves the latest version of the generated specification and inspects key properties inferred from the dataset, such as data property variations.
Step 4: Update Specification (Optional)
This cell demonstrates how to programmatically update a given specification. To keep this demo simple, the update is applied only when the specification is newly created (i.e., when the latest version is 1). If the specification has already been updated, this step is skipped. In this step, we add a new data property called Review Detail Level with the values 'Very brief', 'Brief', 'Moderate', and 'Detailed' and the expected distribution [15, 30, 35, 20].
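The new property can be sketched as a plain dict before submitting the update. The field names here are hypothetical (the real shape depends on the Dataframer spec schema); the values and distribution are the ones used in this step, and a sanity check confirms the percentages sum to 100.

```python
# Hypothetical payload shape for the new data property; field names are
# illustrative -- consult the Dataframer spec schema for the exact form.
new_property = {
    "name": "Review Detail Level",
    "values": ["Very brief", "Brief", "Moderate", "Detailed"],
    "distribution": [15, 30, 35, 20],  # expected percentage per value
}

# Sanity checks before submitting the spec update.
assert len(new_property["values"]) == len(new_property["distribution"])
assert sum(new_property["distribution"]) == 100
```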
Step 5: Generate New Samples
Once the spec is generated and finalized after any manual modifications, we use it to generate synthetic data. Refer to this document for more details on sample generation.
Step 6: Evaluate Generated Samples
While Dataframer evaluates each sample as it is generated, it also supports a post-generation evaluation. This API shows how to evaluate the generated dataset. Read the documentation for more details.
Step 7: Download Generated Files
List Generated Files
This API lists all the files present in the generated dataset.
Download All Files as ZIP
This API allows you to download all the generated files, with metadata, as a compressed ZIP file.
Load generated data into a Delta table
This cell writes the generated data into a Delta table <catalog>.<schema>.<table_name>. If the table already exists, its contents are overwritten.
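The write step can be sketched as below. The catalog, schema, and table names are placeholders, and the Spark calls (shown in comments) only run inside a Databricks notebook; the generated rows are assumed to have been loaded into a pandas DataFrame from the downloaded files.

```python
# Placeholders for the Unity Catalog destination -- substitute real names.
catalog, schema, table_name = "my_catalog", "my_schema", "generated_reviews"
full_table_name = f"{catalog}.{schema}.{table_name}"

# In a Databricks notebook, assuming `generated_pdf` is a pandas DataFrame
# built from the downloaded CSV, the write looks like:
#   spark_df = spark.createDataFrame(generated_pdf)
#   spark_df.write.mode("overwrite").saveAsTable(full_table_name)
# mode("overwrite") replaces the table contents when the table exists.
```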

