Whitepaper

The Marketing Intel Taxonomy: a 13×13 schema for B2B content

Why we organized DMOOP's training corpus as a cross-product of 13 asset types and 13 marketing intents — and what we'd do differently if we were starting over.

DMOOP Research, May 22, 2026

Abstract

DMOOP organizes its marketing training corpus as a 13-asset-type × 13-intent matrix — a 169-cell taxonomy that drives both scraping coverage and retrieval ranking. This document walks through how the schema was designed, where it leaks, and what we'd structure differently if we were starting over today.

Section 1 — Why a taxonomy at all

The naive approach to a marketing training corpus is "scrape a lot, dedupe, embed." This approach treats marketing knowledge as a single bag of unstructured text. It produces a model that knows everything about marketing in general and nothing about your specific task. A user asking "draft an ABM playbook for tier-1 accounts" and a user asking "explain Q3 attribution math" pull from the same undifferentiated pile.

A taxonomy lets retrieval discriminate. When a user's prompt is classified as "abm/playbook," the retrieval RPC can prefer pairs from the same cell. The Tuned model's answer then inherits the structural shape that the asset type implies: numbered steps for playbooks, Situation/Approach/Result for case studies, hook + CTA breakdown for social posts.

Section 2 — The 13 asset types

We chose asset types along a structural axis, not a topical one. The question is "what shape does this content have," not "what is it about." The 13:

Asset type     Structural shape
article        Free-form journalistic prose with a thesis
report         Cited research with charts and benchmarks
case_study     Situation → Approach → Result with metrics
whitepaper     Long-form research with primary citations
playbook       Sequenced tactical steps with timing
ebook          Multi-chapter book-length guide
guide          How-to with hierarchical sections
template       Pre-filled framework with slot variables
social_post    Short hook + thread + CTA
ad_campaign    Insight → Creative → Channel → Result
newsletter     Curated recap with editorial commentary
podcast        Conversational long-form with quotes
video          Webinar / explainer with structured arguments

The schema is intentionally orthogonal. A whitepaper about ABM has a different shape than a case study about ABM, even though both are about ABM. The Tuned model needs to learn both shapes.

Section 3 — The 13 intents

Intents are topical. They answer "what is this marketing problem about." The 13: seo, aeo_geo, abm, buyer_signals, company_signals, demand_gen, ad_copy, email, analytics, competitor, orm, strategy, trend. Plus an implicit general fallback.

These cluster into four functional groups: pipeline (abm, buyer_signals, company_signals, demand_gen), content (ad_copy, email, seo, aeo_geo), measurement (analytics), and strategic (competitor, orm, strategy, trend). The grouping matters for routing: pipeline-tagged retrieval can prefer pairs that mention named platforms (6sense, Bombora, Mutiny); strategic-tagged retrieval can prefer pairs with budget tradeoff framing.
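The two axes and the four groups can be written down as plain data, which makes the 13 × 13 arithmetic easy to check. This is a minimal sketch, not DMOOP's actual schema code: the constant names and the group_for helper are hypothetical, and mapping an unknown intent to "general" is one assumed reading of the implicit fallback.

```python
from itertools import product

# The 13 asset types (structural axis), as listed in Section 2.
ASSET_TYPES = [
    "article", "report", "case_study", "whitepaper", "playbook", "ebook",
    "guide", "template", "social_post", "ad_campaign", "newsletter",
    "podcast", "video",
]

# The 13 intents (topical axis), keyed by functional group.
INTENT_GROUPS = {
    "pipeline": ["abm", "buyer_signals", "company_signals", "demand_gen"],
    "content": ["ad_copy", "email", "seo", "aeo_geo"],
    "measurement": ["analytics"],
    "strategic": ["competitor", "orm", "strategy", "trend"],
}

INTENTS = [intent for members in INTENT_GROUPS.values() for intent in members]

# Reverse lookup: intent -> functional group.
GROUP_OF = {
    intent: group
    for group, members in INTENT_GROUPS.items()
    for intent in members
}

# Every cell of the cross-product, named "intent/asset_type".
CELLS = [f"{intent}/{asset}" for intent, asset in product(INTENTS, ASSET_TYPES)]


def group_for(intent: str) -> str:
    """Return the routing group for an intent (assumed 'general' fallback)."""
    return GROUP_OF.get(intent, "general")


print(len(ASSET_TYPES), len(INTENTS), len(CELLS))
print(group_for("abm"), group_for("orm"), group_for("something_else"))
```

Keeping the groups as the source of truth and deriving the flat intent list from them means a new intent cannot be added without also assigning it a routing group.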

Section 4 — Where the taxonomy leaks

Three places we wish we'd designed differently:

Multi-intent content. A "GTM playbook for SaaS" is plausibly strategy/playbook and demand_gen/playbook. We assign one intent at scrape time, picked by the Tavily query that surfaced it. Retrieval misses on the alternate framing. A multi-intent column with weights would fix this.

Mixed-shape content. Many real articles are 60% blog post + 30% case study + 10% playbook. We classify by the dominant shape, which means the case study material inside an "article" never gets retrieval boost when a case-study-shaped question comes in. A multi-asset-type label would help.
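Both of these leaks point at the same fix: weighted labels instead of a single value per axis. The sketch below shows one possible record shape. The LabeledDoc class, the cell_weight helper, and the choice to multiply the two label weights are all assumptions made for illustration; the 0.5/0.5 intent split and the "abm" intent on the second document are invented examples, while the 60/30/10 shape split comes from the paragraph above.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledDoc:
    """A scraped document with weighted labels on both axes (hypothetical shape)."""
    doc_id: str
    intents: dict[str, float] = field(default_factory=dict)
    asset_types: dict[str, float] = field(default_factory=dict)


def cell_weight(doc: LabeledDoc, intent: str, asset_type: str) -> float:
    """Weight of a document in one cell: product of its two label weights.

    Multiplying is an assumption; a missing label counts as weight 0.
    """
    return doc.intents.get(intent, 0.0) * doc.asset_types.get(asset_type, 0.0)


# A "GTM playbook for SaaS" that is plausibly both strategy and demand_gen.
gtm = LabeledDoc(
    doc_id="gtm-playbook-for-saas",
    intents={"strategy": 0.5, "demand_gen": 0.5},
    asset_types={"playbook": 1.0},
)

# An article that is 60% blog post, 30% case study, 10% playbook.
mixed = LabeledDoc(
    doc_id="mixed-shape-article",
    intents={"abm": 1.0},
    asset_types={"article": 0.6, "case_study": 0.3, "playbook": 0.1},
)

print(cell_weight(gtm, "demand_gen", "playbook"))
print(cell_weight(mixed, "abm", "case_study"))
print(cell_weight(mixed, "seo", "case_study"))
```

With this shape, a case-study-shaped question can still reach the case study material inside the mixed article, at a reduced weight, instead of missing it entirely.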

Asset types we should retire. Three asset types (ebook, podcast, video) produce less than 5% of our training pairs because Tavily can't crawl their primary distribution channels. We keep the labels because the structural shape is real and content does occasionally land. But we'd cap them at 5% of the scrape budget instead of treating them as equally weighted.

Section 5 — The cross-product

The matrix has 169 cells, of which roughly 110 are populated in practice. The most productive cells: demand_gen/report (33 pairs), strategy/article (24), seo/report (21), analytics/report (19), orm/case_study (18). The least productive: most of the */podcast and */video cells.
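Counts like these fall out of a simple tally over (intent, asset_type) labels. The following is a small sketch with a toy sample rather than the real corpus; the cell_density function and the sample data are hypothetical.

```python
from collections import Counter

TOTAL_CELLS = 13 * 13  # 169


def cell_density(pairs):
    """Count training pairs per "intent/asset_type" cell."""
    return Counter(f"{intent}/{asset_type}" for intent, asset_type in pairs)


# Toy sample of labeled training pairs (illustrative only).
sample = [
    ("demand_gen", "report"),
    ("demand_gen", "report"),
    ("seo", "report"),
    ("orm", "case_study"),
]

density = cell_density(sample)
populated = len(density)            # cells with at least one pair
coverage = populated / TOTAL_CELLS  # fraction of the matrix that is populated

print(density.most_common(1))
print(populated, round(coverage, 3))
```

Applied to the figures above, about 110 populated cells out of 169 is a coverage of roughly 65%.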

The distribution shape matters because retrieval ranking is a function of both similarity and density. A dense cell produces sharper similarity differentiation. A sparse cell produces wider similarity bands. If we were ranking strictly by similarity, sparse cells would dominate every retrieval — exactly the wrong outcome.

The RPC compensates with a composite score: similarity × 0.65 + intent_match × 0.25 + quality × 0.10. Intent match smooths out the density differential.
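As a worked example of that formula, the sketch below scores two hypothetical candidates. The weights are the ones stated above; the function name, the assumption that every input lies in [0, 1], and the treatment of intent_match as 1.0 for a same-intent pair and 0.0 otherwise are illustrative guesses, not the RPC's actual definition.

```python
def composite_score(similarity: float, intent_match: float, quality: float) -> float:
    """Blend the three signals with the weights given in the text.

    All inputs are assumed to be normalized to [0, 1].
    """
    return 0.65 * similarity + 0.25 * intent_match + 0.10 * quality


# Candidate from an off-intent cell with slightly higher raw similarity.
off_intent = composite_score(similarity=0.82, intent_match=0.0, quality=0.7)

# Candidate from the matching intent with slightly lower raw similarity.
on_intent = composite_score(similarity=0.74, intent_match=1.0, quality=0.7)

print(round(off_intent, 3))  # 0.603
print(round(on_intent, 3))   # 0.801
```

Under these assumptions the on-intent pair wins despite its lower similarity, which is the smoothing effect the composite score is meant to provide.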

Section 6 — What we'd do differently

If we were starting over:

  1. Multi-label intent and asset_type from day one. Don't force a single value when content is plausibly two.
  2. Confidence weights on each label. "70% playbook, 30% case study" is better than picking one.
  3. A "freshness" axis separate from the days query parameter. A 2-year-old playbook on Google Ads bidding is still useful; a 2-month-old AEO article may already be stale.
  4. An explicit "evergreen" tag for content that retrieval can downweight on freshness-sensitive queries without removing it entirely.
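Items 3 and 4 could be combined into a single freshness weight. The sketch below is one possible reading, not a description of anything DMOOP runs: it assumes exponential decay with a per-intent half-life, and it treats the evergreen tag as a floor so tagged content is downweighted but never drops out. Every constant and name here is hypothetical.

```python
# Hypothetical half-lives in days; fast-moving intents decay sooner.
HALF_LIFE_DAYS = {"aeo_geo": 60}
DEFAULT_HALF_LIFE_DAYS = 365

# Hypothetical floor: evergreen content never falls below this weight.
EVERGREEN_FLOOR = 0.5


def freshness_weight(age_days: float, intent: str, evergreen: bool = False) -> float:
    """Exponential decay by age, with a floor for evergreen-tagged content."""
    half_life = HALF_LIFE_DAYS.get(intent, DEFAULT_HALF_LIFE_DAYS)
    weight = 0.5 ** (age_days / half_life)
    if evergreen:
        weight = max(weight, EVERGREEN_FLOOR)
    return weight


# A 2-month-old AEO article has already lost half its weight.
print(freshness_weight(60, "aeo_geo"))

# A 2-year-old playbook, untagged versus tagged evergreen.
print(freshness_weight(730, "strategy"))
print(freshness_weight(730, "strategy", evergreen=True))
```

Unlike a hard cutoff on a days parameter, a weight like this lets old but durable content stay retrievable while still ranking below fresh material on time-sensitive queries.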

The schema is the model's worldview. Designing it well is one of the highest-leverage things you can do in a domain-specific RAG system.
