Whitepaper

Responsible AI for marketing tools: a 4-layer framework

What 'enterprise-pitchable' AI safety actually looks like in production. Input moderation, output moderation, injection detection, PII redaction — and the order they should run in.

DMOOP Engineering May 30, 2026 10 min read

TL;DR

A defensible "Responsible AI" layer for a marketing tool needs four functions running on every chat: input moderation, output moderation, prompt-injection detection, and PII redaction on uploaded content. The order matters. The failure modes are subtle. Most enterprise checklists ask for these four; the gap between checkbox and working implementation is where teams lose trust.

Why this matters

Enterprise security reviews of AI marketing tools converge on the same six questions:

Does it filter unsafe input from the user?
Does it filter unsafe output from the model?
Can the user override your system prompt with a prompt injection?
Where does customer PII go if it's in an uploaded brand document?
Are incidents logged with severity and excerpt?
Does any of this slow the response by more than a second?

A "yes" to all six is the table-stakes bar for procurement at any mid-market or larger buyer. None of these are individually hard. Wiring them together so they all run on every chat turn, fail open under load, and don't dominate latency is the interesting engineering problem.

Layer 1 — Input moderation

The model never sees an unfiltered user message. Before the chat route calls Groq, the user query passes through Llama Guard 4 (Meta's open-source guardrail model, hosted on Groq for sub-100ms latency). Llama Guard returns either "safe" or "unsafe" plus a hazard category code (S1-S14 from the MLCommons taxonomy).

We allow S13 (elections; marketers may legitimately discuss them). We block S1-S4 (violent crimes, non-violent crimes, sex-related crimes, child sexual exploitation) and S9-S12 (indiscriminate weapons, hate, suicide and self-harm, sexual content). These rules are categorical and not negotiable per user.

When input is flagged, the user gets a polished refusal that names the category, not a generic "I can't help with that." Transparency about what was caught reduces support volume.
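The category policy above could be applied to Llama Guard's raw reply roughly as follows. This is a minimal Python sketch, not the production code: it assumes the reply is either "safe" or "unsafe" followed by a line of comma-separated codes, and the names BLOCKED, LABELS, parse_verdict, and decide_input are hypothetical. How S5-S8 and S14 are handled is not stated here, so the sketch leaves them unblocked purely as a placeholder assumption.

```python
from __future__ import annotations

from dataclasses import dataclass

# Blocked per the policy above (S1-S4 and S9-S12). S13 (elections) is allowed.
# Assumption: S5-S8 and S14 are left unblocked because the text does not cover them.
BLOCKED = {"S1", "S2", "S3", "S4", "S9", "S10", "S11", "S12"}

# Hypothetical display names used in the refusal message.
LABELS = {
    "S1": "violent crimes",
    "S2": "non-violent crimes",
    "S3": "sex-related crimes",
    "S4": "child sexual exploitation",
    "S9": "indiscriminate weapons",
    "S10": "hate",
    "S11": "suicide and self-harm",
    "S12": "sexual content",
}


@dataclass(frozen=True)
class InputDecision:
    blocked: bool
    categories: tuple[str, ...]
    refusal: str | None


def parse_verdict(raw: str) -> tuple[bool, list[str]]:
    """Return (is_safe, codes) for a reply of 'safe' or 'unsafe' plus a code line."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    # An empty reply is treated as safe, matching a fail-open posture.
    if not lines or lines[0].lower() == "safe":
        return True, []
    codes: list[str] = []
    if len(lines) > 1:
        codes = [c.strip().upper() for c in lines[1].split(",") if c.strip()]
    return False, codes


def decide_input(raw_verdict: str) -> InputDecision:
    """Block only when at least one returned code is in the blocked set."""
    is_safe, codes = parse_verdict(raw_verdict)
    hits = tuple(c for c in codes if c in BLOCKED)
    if is_safe or not hits:
        return InputDecision(False, tuple(codes), None)
    names = ", ".join(LABELS.get(c, c) for c in hits)
    return InputDecision(True, hits, f"I can't help with this request. It was flagged for: {names}.")
```

Naming the category in the refusal string is what keeps the message from collapsing into a generic "I can't help with that."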

Layer 2 — Prompt-injection detection

Distinct from input moderation. Injection isn't about what the user is asking — it's about whether the user is trying to override the system, exfiltrate the brand documents, or jailbreak the safety contract.

Two sublayers:

Regex pack (~10 patterns): ignore previous instructions, reveal your system prompt, <|system|>, DAN, pretend you have no rules. Always runs. Cost: microseconds.
LLM judge (Groq 8B-instant with a tight classifier prompt): catches paraphrased attempts the regex misses. Skipped on follow-up turns (mid-conversation refinement requests like "shorter version" were the dominant false positive). Skipped on messages under 80 chars (too short for plausible injection).

A flagged injection returns a refusal naming which pattern matched. Pattern names ("exfiltrate_brand_docs", "system_prompt_reveal") become the language of the admin incident log.
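The two sublayers and the skip rules can be sketched in Python as below. Treat it as an illustration under assumptions: the regular expressions are simplified stand-ins for the real pack, every pattern name other than "system_prompt_reveal" is invented for the example, and the judge argument is a hypothetical callable standing in for the LLM classifier call.

```python
import re

# Simplified stand-ins for the regex pack; names are illustrative.
INJECTION_PATTERNS = {
    "ignore_previous_instructions": re.compile(
        r"ignore\s+(all\s+)?(the\s+)?(previous|prior|above)\s+instructions", re.I
    ),
    "system_prompt_reveal": re.compile(
        r"(reveal|show|print|repeat)\s+(me\s+)?(your|the)\s+system\s+prompt", re.I
    ),
    "chat_template_token": re.compile(r"<\|\s*system\s*\|>", re.I),
    "dan_jailbreak": re.compile(r"\bDAN\b"),  # case-sensitive so the name "Dan" passes
    "no_rules_roleplay": re.compile(r"pretend\s+(that\s+)?you\s+have\s+no\s+rules", re.I),
}

MIN_JUDGE_CHARS = 80  # messages shorter than this skip the LLM judge


def regex_hits(message):
    """Names of every pattern that matches. Always runs."""
    return [name for name, pat in INJECTION_PATTERNS.items() if pat.search(message)]


def should_run_judge(message, is_follow_up):
    """The judge is skipped on follow-up turns and on short messages."""
    return not is_follow_up and len(message) >= MIN_JUDGE_CHARS


def detect_injection(message, is_follow_up, judge=None):
    """Return matched pattern names, or ['llm_judge'] if only the judge fires."""
    hits = regex_hits(message)
    if hits:
        return hits
    if judge is not None and should_run_judge(message, is_follow_up):
        if judge(message):  # hypothetical callable returning True for an injection
            return ["llm_judge"]
    return []
```

Running the regex pack first means the judge is only paid for on messages the cheap check could not decide.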

Layer 3 — Output moderation

Same Llama Guard model, but now on the assistant's response. Runs as a sidecar after streaming completes — does NOT block the streaming experience. If the response is flagged, an inline warning appends below the answer and the incident is logged.

The right design choice here is "log and warn" rather than "block." Blocking output after the user has already seen it streaming is jarring; warning is honest. The vast majority of output-side flags are false positives where the user asked about, e.g., a competitor's defamation lawsuit and Llama Guard incorrectly flagged the answer as containing defamation.
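A rough Python sketch of the "stream first, moderate after" flow is shown below. It makes several assumptions: the token stream is an async iterator of strings, moderate is a hypothetical coroutine returning a Llama Guard style verdict, and emit and log_incident are placeholder callbacks for the UI and the incident table. The warning text is invented for the example.

```python
import asyncio

# Hypothetical warning copy appended below a flagged answer.
WARNING = "\n\n[Warning: this response was flagged by our safety check and has been logged.]"


async def stream_then_moderate(chunks, moderate, emit, log_incident):
    """Emit every chunk as it arrives, then moderate the full text once."""
    parts = []
    async for chunk in chunks:
        emit(chunk)  # the user sees tokens immediately; nothing waits on moderation
        parts.append(chunk)
    full_text = "".join(parts)

    verdict = await moderate(full_text)  # sidecar check, runs after streaming completes
    if verdict.strip().lower().startswith("unsafe"):
        emit(WARNING)  # warn inline rather than retract what was already shown
        log_incident(
            {"kind": "output_unsafe", "action_taken": "flagged", "excerpt": full_text[:500]}
        )
    return full_text


# Stand-ins used only to exercise the sketch.
async def fake_stream():
    for piece in ["Hello", ", ", "world"]:
        yield piece


async def fake_moderate(text):
    return "unsafe\nS5"
```

Because moderation only starts after the last chunk, it adds nothing to time-to-first-token; the cost is that a flagged answer has already been read by the time the warning appears.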

Layer 4 — PII redaction on uploads

The single most important layer for enterprise trust. When a user uploads a brand document (PDF/DOCX/XLSX/PPTX), customer PII inside that document must never reach the server. The solution: parse the document client-side in the browser and run a regex pack for SSNs, emails, phone numbers, IBANs, credit cards (with Luhn validation to avoid false positives on invoice numbers), and AWS-style API keys. Matches are replaced with [REDACTED:type] tokens, and only the redacted text leaves the browser.

The model can still infer the role of each redacted token ("contact [REDACTED:EMAIL] for pricing"), so usefulness isn't destroyed. The browser console never logs raw PII, even if the user has DevTools open. The Brand Library UI surfaces a green banner showing what was scrubbed before upload: "4 items in Brand_Guide.pdf: 3× email, 1× phone — raw PII never reached the server."
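The redaction logic can be illustrated with a short Python sketch. The production version described here runs in the browser, so this only shows the technique: the regular expressions are simplified assumptions (the phone pattern is US-style only), the token names other than EMAIL are invented, and luhn_ok and redact are hypothetical helper names.

```python
import re


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def _card_valid(match_text: str) -> bool:
    return luhn_ok(re.sub(r"\D", "", match_text))


# (token type, pattern, optional validator). Order matters: cards run before phones.
RULES = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), None),
    ("API_KEY", re.compile(r"\bAKIA[0-9A-Z]{16}\b"), None),
    ("IBAN", re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"), None),
    ("CREDIT_CARD", re.compile(r"\b(?:\d[ -]?){12,18}\d\b"), _card_valid),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), None),
    ("PHONE", re.compile(r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)"), None),
]


def redact(text: str):
    """Return (redacted_text, counts_by_type). Matches failing validation are kept."""
    counts = {}

    def make_sub(kind, validate):
        def sub(match):
            if validate is not None and not validate(match.group(0)):
                return match.group(0)  # e.g. a 16-digit invoice number
            counts[kind] = counts.get(kind, 0) + 1
            return f"[REDACTED:{kind}]"
        return sub

    for kind, pattern, validate in RULES:
        text = pattern.sub(make_sub(kind, validate), text)
    return text, counts
```

The counts dictionary is the kind of summary a banner such as "3× email, 1× phone" could be built from, while the Luhn check is what lets a 16-digit invoice number pass through untouched.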

Logging contract

Every incident across the four layers writes one row to a safety_incidents table with these fields:

kind: input_unsafe / output_unsafe / prompt_injection / pii_redacted
severity: low / medium / high
categories: hazard codes or pattern names
excerpt: first 500 chars
action_taken: blocked / sanitized / flagged / redacted
user_id, model, occurred_at

The admin Safety tab reads this table and renders a KPI strip, a kind filter, and a chronological feed.

Every layer above has the same logging contract. That's the boring part of the work — and it's what enterprise procurement actually validates.
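One way to keep every layer honest about the contract is a single validated record type. The Python sketch below is an assumption about shape only: the field names and allowed values come from the description above, while the field types, the SafetyIncident class name, and the choice to truncate the excerpt in the constructor are illustrative.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

KINDS = {"input_unsafe", "output_unsafe", "prompt_injection", "pii_redacted"}
SEVERITIES = {"low", "medium", "high"}
ACTIONS = {"blocked", "sanitized", "flagged", "redacted"}
EXCERPT_LIMIT = 500  # first 500 chars only


@dataclass(frozen=True)
class SafetyIncident:
    """One row in safety_incidents. Field types are assumptions."""
    kind: str
    severity: str
    categories: tuple  # hazard codes or pattern names
    excerpt: str
    action_taken: str
    user_id: str
    model: str
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self):
        if self.kind not in KINDS:
            raise ValueError(f"unknown kind: {self.kind}")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
        if self.action_taken not in ACTIONS:
            raise ValueError(f"unknown action_taken: {self.action_taken}")
        # Frozen dataclass, so the truncated excerpt is set via object.__setattr__.
        object.__setattr__(self, "excerpt", self.excerpt[:EXCERPT_LIMIT])
```

Calling asdict on an instance yields a plain dictionary that a database layer could insert as one row.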

Failure modes

Every layer fails open. If Llama Guard throws an exception, we treat the message as safe and let it through. The reasoning: a tool that blocks legitimate traffic under load loses more trust than one that occasionally serves a flagged message during an outage. The downside is that incidents occurring during Groq downtime go unlogged. We accept that tradeoff.
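In code, failing open amounts to a small wrapper around each check. The Python sketch below is illustrative: fail_open and its arguments are hypothetical names, and writing the failure to an application logger is an added assumption, separate from the safety_incidents table.

```python
import logging

logger = logging.getLogger("safety")


def fail_open(check, message, default="safe"):
    """Run a safety check; if it raises, treat the message as safe."""
    try:
        return check(message)
    except Exception:
        # Assumption: the outage is recorded in application logs only.
        logger.exception("safety check failed; failing open")
        return default
```

A timeout or network error from the moderation call therefore degrades to "safe" instead of an error page for the user.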

What doesn't go in this framework

Three things we explicitly don't ship:

Watermarking generated content. Marketing copy needs to be presentable as the user's own. Watermarking sabotages the use case.
"AI-generated content" disclaimers in the output. Same reason.
Automated bias auditing. This belongs in the model layer, not the application layer. We trust the base model's RLHF and surface anything our users explicitly flag via thumbs-down.

What buyers ask next

After the four-layer checkbox is satisfied, the next question is always: "Can we see your incident log?" The admin Safety tab gives a clean answer. After that: "What happens when a flagged input belongs to a paying customer?" The honest answer is the same as it would be for any SaaS: log the incident, surface it to the admin, and allow no special-case bypass. Buyers respect the consistency.
