Case Study

How we 4×'d our training corpus in 7 days without writing a single article

From 229 to 854 training pairs in a week using WizardLM-style evolution. The architectural decision, the bug that nearly killed it, and what we learned about LLM data augmentation.

DMOOP Engineering · June 1, 2026 · 7 min read

The starting line

Seven days ago the DMOOP training corpus had 229 pairs across 11 marketing intents and 7 asset types. Today it has 854 across 12 intents and 9 asset types. We added zero new scraping sources, didn't write a single article ourselves, and stayed inside the same Groq free-tier quota.

The lever was a single architectural choice we'd been postponing: WizardLM-style evol-instruct on the existing pipeline.

The before state

The pipeline ran like this. Tavily scraped marketing articles four times a day. Each article ran through Groq's 8B-instant model with an asset-type-aware prompt (case studies → Situation/Approach/Result, playbooks → numbered steps, etc.) that produced 3 Q&A training pairs per article. Pairs were inserted with quality=1.0 and surfaced to the Tuned model via trigram-similarity retrieval.
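For readers unfamiliar with trigram retrieval, here is a minimal sketch of the idea. It assumes a Jaccard score over padded character trigrams and matching against the question text only; the post does not say how the production similarity is computed, and the names trigrams, trigramSimilarity, retrieve, and StoredPair are hypothetical.

```typescript
// Hypothetical helper: collect the lowercase character trigrams of each word.
// Padding each word with two leading spaces and one trailing space is one
// common convention; the production pipeline may tokenize differently.
function trigrams(text: string): Set<string> {
  const grams = new Set<string>();
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
  for (const word of words) {
    const padded = `  ${word} `;
    for (let i = 0; i + 3 <= padded.length; i++) {
      grams.add(padded.slice(i, i + 3));
    }
  }
  return grams;
}

// Jaccard similarity: shared trigrams divided by all distinct trigrams.
function trigramSimilarity(a: string, b: string): number {
  const ga = trigrams(a);
  const gb = trigrams(b);
  if (ga.size === 0 && gb.size === 0) return 0;
  let shared = 0;
  ga.forEach((g) => {
    if (gb.has(g)) shared++;
  });
  return shared / (ga.size + gb.size - shared);
}

// Assumed shape of a stored pair, for illustration only.
interface StoredPair {
  question: string;
  answer: string;
}

// Rank stored pairs by similarity between the query and each stored question.
function retrieve(query: string, pairs: StoredPair[], topK: number): StoredPair[] {
  return pairs
    .map((pair) => ({ pair, score: trigramSimilarity(query, pair.question) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((x) => x.pair);
}
```

Because the score depends only on surface character overlap, rewording a question shifts its trigram set, which is one reason evolved variants can widen what retrieval is able to match.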

The corpus was growing, but only linearly. At 229 pairs with a ~20-article-per-cron yield, hitting 1,000 pairs would take ~10 weeks. We needed a multiplier on every article we already scraped.

The evol-instruct decision

The WizardLM evolution method takes a single training pair and rewrites it through three "lenses" — making it more specific, more tactical, or more strategic. The same source article that produces "How do I build an ABM list?" becomes "How do I build an ABM list for a B2B SaaS targeting Series A-B founders?" (specific), "Walk me through the step-by-step in HubSpot + 6sense" (tactical), and "Given a $2M ARR target, how should I prioritize tier-1 vs tier-2 accounts?" (strategic).
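As a rough illustration of the three lenses, the sketch below builds one rewrite prompt per lens from an original pair. The instruction wording, the requested JSON keys, and the names Lens, LENS_INSTRUCTIONS, buildEvolutionPrompt, and evolutionPrompts are all assumptions made for this example, not the prompts used in production.

```typescript
type Lens = "specific" | "tactical" | "strategic";

// Hypothetical instruction text for each lens.
const LENS_INSTRUCTIONS: Record<Lens, string> = {
  specific: "Rewrite the question for a narrower audience, industry, or company stage.",
  tactical: "Rewrite the question so the answer is step-by-step execution in named tools.",
  strategic: "Rewrite the question so the answer weighs trade-offs against a business goal.",
};

// Assumed shape of an original pair, for illustration only.
interface QAPair {
  question: string;
  answer: string;
}

// Build the prompt sent to the model for one lens.
function buildEvolutionPrompt(lens: Lens, pair: QAPair): string {
  return [
    LENS_INSTRUCTIONS[lens],
    'Return a single JSON object with the keys "question" and "answer".',
    `Original question: ${pair.question}`,
    `Original answer: ${pair.answer}`,
  ].join("\n");
}

const LENSES: Lens[] = ["specific", "tactical", "strategic"];

// One original pair fans out into three prompts, one per lens.
function evolutionPrompts(pair: QAPair): string[] {
  return LENSES.map((lens) => buildEvolutionPrompt(lens, pair));
}
```

Keeping the lens as an explicit type makes it straightforward to log and count output per lens, which matters once one lens starts failing on its own.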

The math: 3 originals × (1 + 3 variants) = 12 pairs per article. Same Tavily quota, 4× the corpus output. Each variant lives in the same training_pairs table with is_evolved = true and a parent_pair_id FK back to the original, so we can audit augmentation quality and slice retrieval to originals-only if needed.
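The arithmetic and the audit columns can be sketched as follows. The column names is_evolved, parent_pair_id, and quality come from the text above; the remaining fields and the helper names pairsPerArticle, originalsOnly, and variantsOf are assumed for illustration.

```typescript
// Assumed row shape: only is_evolved, parent_pair_id, and quality are
// named in the post; id, question, and answer are illustrative.
interface TrainingPair {
  id: number;
  question: string;
  answer: string;
  quality: number;
  is_evolved: boolean;
  parent_pair_id: number | null; // null for originals
}

const ORIGINALS_PER_ARTICLE = 3;
const VARIANTS_PER_ORIGINAL = 3; // specific, tactical, strategic

// 3 originals x (1 original + 3 variants) = 12 pairs per article.
function pairsPerArticle(): number {
  return ORIGINALS_PER_ARTICLE * (1 + VARIANTS_PER_ORIGINAL);
}

// Slice retrieval to originals only.
function originalsOnly(pairs: TrainingPair[]): TrainingPair[] {
  return pairs.filter((p) => !p.is_evolved);
}

// Audit the variants generated from one original.
function variantsOf(pairs: TrainingPair[], parentId: number): TrainingPair[] {
  return pairs.filter((p) => p.is_evolved && p.parent_pair_id === parentId);
}
```

Dividing pairsPerArticle() by ORIGINALS_PER_ARTICLE gives the 4× multiplier quoted above.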

What broke

Day 2: tactical evolution silently went to zero pairs/day while specific and strategic kept landing. The corpus was growing, but the tactical lens — the one we cared most about — was dead.

The cause turned out to be a JSON-parsing edge case. Our parser used /\{[\s\S]*\}/ to extract the JSON object from the model's response. Strategic and specific produced single-paragraph prose with no internal newlines. Tactical, by design, produced numbered execution steps with raw newlines inside the JSON string values, which JSON.parse rejects with a syntax error because unescaped control characters are not allowed inside JSON strings. On top of that, our 1,400-token output cap occasionally truncated the response mid-JSON, leaving an incomplete object. In both cases the parse failed and the tactical pair was dropped without surfacing an error.
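The failure is easy to reproduce. The sketch below uses the regex quoted above together with JSON.parse; the function name survivesOldParser and the sample responses are invented for illustration.

```typescript
// The extraction pattern quoted in the post.
const OBJECT_PATTERN = /\{[\s\S]*\}/;

// Hypothetical stand-in for the old parsing path: true if a pair would
// have been extracted, false if it would have been dropped.
function survivesOldParser(response: string): boolean {
  const match = response.match(OBJECT_PATTERN);
  if (match === null) return false; // no {...} span at all, e.g. truncated output
  try {
    JSON.parse(match[0]);
    return true;
  } catch {
    return false; // e.g. a raw newline inside a string value
  }
}

// Single-paragraph prose, as strategic and specific produced: parses fine.
const strategicStyle = 'Here you go: {"answer": "Prioritize tier-1 accounts first."}';

// "\n" in this literal is a real newline character inside the JSON string
// value, which JSON.parse rejects.
const tacticalStyle = 'Here you go: {"answer": "1. Export the list\n2. Score accounts"}';

// Output cut off before the closing brace, as a token cap can cause.
const truncatedStyle = '{"answer": "1. Export the list, then score acc';
```

With these samples, survivesOldParser returns true for strategicStyle and false for both tacticalStyle and truncatedStyle, which matches the symptom: two lenses kept landing while the third produced nothing.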

The fix was a brace-counting, escape-aware JSON walker that finds the first complete object and escapes raw newlines inside string values before parsing. We also raised the output cap from 1,400 to 2,000 tokens. Tactical resumed.
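A walker of this kind might look like the sketch below. It is a minimal reconstruction under stated assumptions, not the production code: the names extractFirstJsonObject and parseModelJson are hypothetical, and escaping tabs and carriage returns alongside newlines is an extra choice made here.

```typescript
// Walk the response from the first "{", tracking brace depth and whether we
// are inside a string. Raw newlines, carriage returns, and tabs found inside
// strings are rewritten as JSON escapes. Returns the repaired text of the
// first complete object, or null if the object never closes (truncation).
function extractFirstJsonObject(response: string): string | null {
  const start = response.indexOf("{");
  if (start === -1) return null;

  let depth = 0;
  let inString = false;
  let escaped = false;
  let repaired = "";

  for (let i = start; i < response.length; i++) {
    const ch = response.charAt(i);

    if (inString) {
      if (escaped) {
        repaired += ch; // character after a backslash is copied as-is
        escaped = false;
      } else if (ch === "\\") {
        repaired += ch;
        escaped = true;
      } else if (ch === '"') {
        repaired += ch;
        inString = false;
      } else if (ch === "\n") {
        repaired += "\\n";
      } else if (ch === "\r") {
        repaired += "\\r";
      } else if (ch === "\t") {
        repaired += "\\t";
      } else {
        repaired += ch;
      }
      continue;
    }

    repaired += ch;
    if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
      if (depth === 0) return repaired; // first complete object ends here
    }
  }
  return null; // ran out of input before the object closed
}

// Parse the first complete object in a model response, or return null.
function parseModelJson(response: string): unknown {
  const candidate = extractFirstJsonObject(response);
  if (candidate === null) return null;
  try {
    return JSON.parse(candidate);
  } catch {
    return null;
  }
}
```

Because braces are only counted outside strings, a "}" inside an answer or trailing chatter after the object no longer confuses extraction. Returning null for truncated output also gives the caller a clear signal to log or retry instead of dropping the pair silently.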

What we learned

Three observations that may transfer to other RAG-corpus efforts:

  1. Augmentation has lens-specific failure modes. Strategic answers are short and clean. Tactical answers are long and structured. Treating them as interchangeable in the parsing layer was the bug we shipped.

  2. Quality stratification is cheap and load-bearing. Originals get quality=1.0. Evolved pairs get quality=0.85. Pairs caught by the source-leakage guard get quality=0.5 (invisible to retrieval). One numeric field, with no schema migration, separates three distinct concerns at once.

  3. Self-augmentation has a quality ceiling we haven't hit yet. Conventional wisdom says synthetic data converges to lower quality than real data. At our scale (854 pairs) and our ratio (61% originals : 39% evolved) we don't see any retrieval quality regression. The thumbs-up rate from real users on Tuned answers is roughly flat across pre-evolution and post-evolution periods.
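The quality tiers from the second observation can be sketched as below. The three quality values come from the text; the retrieval cutoff RETRIEVAL_MIN_QUALITY is an assumed value (anything above 0.5 and at most 0.85 would behave the same), and the helper names are hypothetical. The sketch also assumes the leakage guard can flag any pair, which the post does not state.

```typescript
const QUALITY_ORIGINAL = 1.0;
const QUALITY_EVOLVED = 0.85;
const QUALITY_LEAKED = 0.5;

// Assumption: the post only says 0.5 is invisible to retrieval; this exact
// cutoff is illustrative.
const RETRIEVAL_MIN_QUALITY = 0.8;

// Assign the quality tier at insert time.
function assignQuality(isEvolved: boolean, caughtByLeakGuard: boolean): number {
  if (caughtByLeakGuard) return QUALITY_LEAKED;
  return isEvolved ? QUALITY_EVOLVED : QUALITY_ORIGINAL;
}

// Minimal row shape needed for the filter below.
interface ScoredPair {
  id: number;
  quality: number;
}

// Pairs below the cutoff stay in the table for auditing but are not retrieved.
function visibleToRetrieval<T extends ScoredPair>(pairs: T[]): T[] {
  return pairs.filter((p) => p.quality >= RETRIEVAL_MIN_QUALITY);
}
```

Keeping leaked pairs in the table at a low quality value, instead of deleting them, preserves the evidence needed to tune the guard later.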

What's next

By next month the corpus should clear 3,000 pairs at the current growth rate. The real test of whether evolution is actually working will be the retrieval similarity scores — if evolved pairs are matching at lower similarity than originals, they're noise. So far they're matching at parity.
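One way to track that parity is to compare mean match similarity for evolved and original pairs over a log of retrieval hits. The sketch below assumes such a log exists with one row per hit; the RetrievalHit shape and the function names mean and similarityGap are hypothetical.

```typescript
// Assumed log row: one entry per retrieved pair, with the similarity score
// at which it matched the user's query.
interface RetrievalHit {
  is_evolved: boolean;
  similarity: number;
}

// Arithmetic mean, or null when there is nothing to average.
function mean(values: number[]): number | null {
  if (values.length === 0) return null;
  let total = 0;
  for (const v of values) total += v;
  return total / values.length;
}

// Mean similarity of evolved hits minus mean similarity of original hits.
// A clearly negative gap would suggest evolved pairs match worse than
// originals; a gap near zero is parity. Returns null if either group is empty.
function similarityGap(hits: RetrievalHit[]): number | null {
  const originals = mean(hits.filter((h) => !h.is_evolved).map((h) => h.similarity));
  const evolved = mean(hits.filter((h) => h.is_evolved).map((h) => h.similarity));
  if (originals === null || evolved === null) return null;
  return evolved - originals;
}
```

How large a gap counts as noise is a judgment call that depends on score variance, so a threshold is deliberately left out of the sketch.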
