Open in Google Colab

Run the full tutorial interactively in Google Colab

Fine-Tuning with Synthetic Training Data

Fine-tuning requires dozens to hundreds of training examples, but collecting that many high-quality samples by hand is slow and expensive. If you already have a few examples that capture the style, format, or behavior you want, DataFramer can analyze those examples and generate a full training dataset that preserves the style while introducing diversity in content. This tutorial demonstrates the workflow end-to-end: starting from just 2 hand-written examples, generating 100 diverse training samples with DataFramer, fine-tuning, and evaluating how well the style transfers to unseen topics.

The approach

  1. Upload a few examples as seed data — these define the target style
  2. Describe what should be consistent via generation objectives (structure, tone, formatting rules)
  3. DataFramer analyzes the seeds and discovers axes of variation: here, from just 2 examples, it extrapolated 30 values for “Specific topic being explained” across 20 values for “Scientific or technical domain of the topic”
  4. Generate training samples that are diverse in content but consistent in style
  5. Fine-tune on the generated dataset using any training provider

What the notebook covers

The Colab notebook walks through this with a concrete example: teaching a model to write technical explanations in a specific style (result-first structure, one idea per paragraph, specific transition phrases, no metaphors, etc.).

Generating training data: The 2 seed examples are uploaded as a JSON file, and a spec is created with generation_objectives describing the style criteria and extrapolate_values=True so DataFramer can invent new topics beyond the seeds. 100 samples are generated in a single run.

Fine-tuning: The generated samples are converted to JSONL chat format and used to fully fine-tune Meta-Llama-3-8B-Instruct on Together AI (20 epochs, lr=1e-5).

Evaluation: The fine-tuned model is tested on held-out questions spanning technology, math, and philosophy, none of which appeared during training. An LLM judge scores each response against the 8 style criteria:
Question              C1  C2  C3  C4  C5  C6  C7  C8  Total
-----------------------------------------------------------
Touchscreen            ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓  8/8
Noise-canceling        ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓  8/8
QR code                ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓  8/8
Derivative             ✓   ✓   ✓   ✓   ✓   ✗   ✗   ✓  6/8
Consciousness          ✓   ✗   ✓   ✓   ✓   ✓   ✗   ✓  6/8
...
-----------------------------------------------------------
Overall: 78/128 (61%)
The model achieves 61% style adherence overall, with perfect scores on several topics and solid generalization to domains (math, philosophy) not present in the training data.
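The JSONL conversion step mentioned above is a small transformation. Here is a minimal sketch, assuming each generated sample carries question and answer fields (illustrative names; the notebook shows the exact schema it uses):

```python
import json

def to_chat_jsonl(samples, path):
    """Convert generated samples to JSONL chat format: one JSON object
    per line with a "messages" list of user/assistant turns, the shape
    most fine-tuning APIs accept."""
    with open(path, "w") as f:
        for s in samples:
            record = {
                "messages": [
                    {"role": "user", "content": s["question"]},
                    {"role": "assistant", "content": s["answer"]},
                ]
            }
            f.write(json.dumps(record) + "\n")

# Example with one sample; the notebook converts all 100
samples = [
    {
        "question": "How does a QR code store data?",
        "answer": "Data is encoded as a grid of black and white modules. ...",
    }
]
to_chat_jsonl(samples, "train.jsonl")
```

The resulting train.jsonl is what gets uploaded to the training provider.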

Why synthetic diversity matters

The notebook also runs a baseline experiment: duplicating the 2 seed examples 50 times each to create 100 “training” samples. The result is catastrophic memorization: the model learns the content of the seeds, not the style. When asked about touchscreens, it hallucinates GPS satellites and microwave frequencies from the original examples. In this case, diverse synthetic training data is the difference between memorizing content and learning style.
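The baseline dataset is pure repetition, which can be sketched in one line (field names illustrative, answers truncated):

```python
# The two seeds from the tutorial, abbreviated
seeds = [
    {"question": "How does GPS determine your location?", "answer": "..."},
    {"question": "How does a microwave oven heat food?", "answer": "..."},
]

# 50 copies of each seed -> 100 samples with zero content diversity
baseline = [dict(s) for s in seeds for _ in range(50)]
```

Same sample count as the synthetic run, but only 2 distinct contents, which is why the model memorizes instead of generalizing.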

What’s Next?

Quickstart

Get started with DataFramer

API Reference

Full endpoint documentation