June 3, 2026 · 8 min read · AGENTIC SECURITY · Part 2 of 5 — Agentic AI Security cluster
LLM Guardrails: Your Agent's First Line of Defense
Prompt injection, data poisoning, prompt extraction, prompt leaking, backdoored models: ATLAS catalogs all five. Here's how guardrails close each attack surface before it becomes an incident.
TL;DR
MITRE ATLAS catalogs five LLM-specific attack techniques relevant to agent systems: AML.T0054 (Prompt Injection), AML.T0051 (LLM Data Poisoning), AML.T0056 (LLM Meta-Prompt Extraction), AML.T0057 (LLM Prompt Leaking), and AML.T0062 (Backdoor ML Model). Input guardrails address T0054+T0051; output guardrails address T0056+T0057; model-integrity checks address T0062. All five are addressable with a combination of NeMo Guardrails, content-trap scanning, and NIST AI RMF GOVERN-1.1 governance controls.
The five ATLAS attack surfaces
| Technique | ATLAS ID | Attack vector in agents |
|---|---|---|
| Prompt Injection | AML.T0054 | Malicious instructions embedded in fetched web content, user input, or tool results coerce the agent to execute attacker-controlled actions. |
| LLM Data Poisoning | AML.T0051 | Attacker-controlled data ingested into the agent's context window shifts its behavior. Long-running agents with persistent memory are especially exposed, because poisoned content outlives the session that ingested it. |
| Meta-Prompt Extraction | AML.T0056 | Adversarial prompts elicit the agent's system prompt, revealing instructions, tool configurations, and internal state. |
| Prompt Leaking | AML.T0057 | The model is induced to reproduce previous conversation turns containing sensitive data (API keys, user PII, business logic). |
| Backdoor ML Model | AML.T0062 | A model checkpoint or fine-tune distributed through a supply chain carries a hidden trigger that alters behavior when specific input patterns are detected. |
Input guardrails: the CaMel layer-1 pattern
Every token that enters an agent's context window from an external source is a potential injection vector. The defense is a pre-ingestion scan that runs before the content reaches the LLM. At BlindOracle, we call this the CaMel (Content and Malice Layer) scan. It addresses AML.T0054 and AML.T0051 directly.
A minimal input guardrail checks three things:
- Instruction override patterns: regex + normalized-form matches for "ignore previous instructions", "new system prompt", "you are now" in the fetched content. Normalization covers zero-width chars, ROT13, base64, homoglyphs — the 24 evasion variants from ATLAS T0054 sub-techniques.
- Secret-transmission combos: a secret noun + transmission verb + external destination in the same passage — the compositional exfiltration pattern that bypasses pure-regex defenders.
- Base64 payloads >1KB: large base64 blobs in fetched content are almost never legitimate article content; they're almost always payload delivery.
# Minimal implementation pattern (see scripts/content_trap_scanner.py)
import logging

from content_trap_scanner import ContentTrapScanner

logger = logging.getLogger(__name__)
scanner = ContentTrapScanner()

class SecurityError(Exception):
    """Raised when the scan recommends blocking the content."""

def safe_ingest(raw_content: str) -> str:
    result = scanner.quick_scan(raw_content)
    if result["recommendation"] == "block":
        raise SecurityError(f"Injection detected: {result['findings']}")
    if result["recommendation"] == "warn":
        logger.warning("Suspicious content ingested: %s", result["findings"])
    return raw_content  # pass to LLM only after scan
This pattern aligns with D3FEND Content Validation and Content Filtering countermeasures, and maps to NIST AI RMF GOVERN-1.1 (establish AI risk governance) and MEASURE-2.7 (evaluate guardrail effectiveness).
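The three checks listed above can be sketched with the standard library alone. This is an illustrative sketch, not the contents of `content_trap_scanner.py`: every regex, the word lists, the 16-character minimum for base64 candidates, and the `scan` function name are assumptions made for the example. NFKC normalization only folds compatibility forms such as fullwidth letters; cross-script homoglyphs (for example Cyrillic lookalikes) would need a separate confusables map.

```python
import base64
import codecs
import re
import unicodedata

# All patterns and thresholds below are illustrative assumptions.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
OVERRIDE = re.compile(
    r"ignore\s+(all\s+)?previous\s+instructions|new\s+system\s+prompt|you\s+are\s+now",
    re.IGNORECASE,
)
SECRET = re.compile(r"\b(api[ _-]?key|password|token|credential|secret)s?\b", re.I)
SEND = re.compile(r"\b(send|post|upload|forward|email|transmit)\b", re.I)
DEST = re.compile(r"https?://|\b[\w.+-]+@[\w-]+\.\w+", re.I)
B64_RUN = re.compile(r"[A-Za-z0-9+/]{1366,}={0,2}")  # roughly 1 KB once decoded


def normalize(text: str) -> str:
    """Strip zero-width characters and fold compatibility forms (NFKC)."""
    return unicodedata.normalize("NFKC", text.translate(ZERO_WIDTH))


def candidate_views(text: str) -> list[str]:
    """Return the normalized text plus its ROT13 and base64 decodings."""
    base = normalize(text)
    views = [base, codecs.decode(base, "rot13")]
    for blob in re.findall(r"[A-Za-z0-9+/]{16,}={0,2}", base):
        try:
            views.append(base64.b64decode(blob, validate=True).decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            continue  # not valid base64 or not text; ignore this candidate
    return views


def scan(text: str) -> dict:
    """Run the three checks and return a recommendation with findings."""
    findings = []
    views = candidate_views(text)
    if any(OVERRIDE.search(v) for v in views):
        findings.append("instruction_override")
    if any(SECRET.search(v) and SEND.search(v) and DEST.search(v) for v in views):
        findings.append("secret_transmission_combo")
    if B64_RUN.search(views[0]):
        findings.append("large_base64_payload")
    recommendation = "block" if findings else "allow"
    return {"recommendation": recommendation, "findings": findings}
```

Matching against several decoded views rather than the raw string is the design choice that matters here: an override phrase hidden behind ROT13 or base64 never appears in the raw text, so a single-pass regex would miss it.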
Output guardrails: what leaves the agent matters too
AML.T0056 and AML.T0057 both target the model's outputs. A well-crafted prompt can make an agent reproduce its own system prompt or replay sensitive conversation turns. Output guardrails block this by scanning the LLM's response before it's returned to the caller.
Two output checks that pay for themselves immediately:
- System-prompt echo detection: compare the output against the first 200 tokens of the system prompt using fuzzy matching, and block when Jaccard similarity is ≥ 0.6. If the output looks like the system prompt, it probably is.
- PII pattern scan: email addresses, phone numbers, and API key prefixes in the output trigger a logged warning; in a response to an unauthenticated caller, they are blocked. This addresses NIST AI RMF MANAGE-2.4 (manage PII in AI outputs).
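Both output checks fit in a few dozen lines. The sketch below is a hedged illustration: it treats "tokens" as lowercase word tokens rather than model tokens, and the PII regexes, the key prefixes (`sk-`, `AKIA`, `ghp_`), and the `check_output` function are assumptions for the example, not a production pattern set.

```python
import re

# Thresholds and patterns are assumptions for illustration.
ECHO_THRESHOLD = 0.6
PROMPT_WINDOW = 200  # leading system-prompt tokens to compare against

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
    "api_key": re.compile(r"\b(?:sk-|AKIA|ghp_)[A-Za-z0-9_-]{8,}"),
}


def tokens(text: str) -> list[str]:
    """Lowercase word tokens; a stand-in for real model tokenization."""
    return re.findall(r"\w+", text.lower())


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def echoes_system_prompt(output: str, system_prompt: str) -> bool:
    prompt_set = set(tokens(system_prompt)[:PROMPT_WINDOW])
    return jaccard(set(tokens(output)), prompt_set) >= ECHO_THRESHOLD


def find_pii(output: str) -> list[str]:
    return [name for name, pat in PII_PATTERNS.items() if pat.search(output)]


def check_output(output: str, system_prompt: str, authenticated: bool) -> dict:
    """Block prompt echoes; warn or block on PII depending on the caller."""
    if echoes_system_prompt(output, system_prompt):
        return {"recommendation": "block", "findings": ["system_prompt_echo"]}
    pii = find_pii(output)
    if pii:
        action = "warn" if authenticated else "block"
        return {"recommendation": action, "findings": pii}
    return {"recommendation": "allow", "findings": []}
```

Comparing the whole output against the prompt dilutes the score when a leaked prompt is buried in a long response; scoring a sliding window over the output instead would likely be more robust.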
Model-integrity guardrails against AML.T0062
The AML.T0062 (Backdoor ML Model) threat is supply-chain level: a fine-tuned or quantized model distributed through a third-party channel carries a hidden trigger. The defense is not runtime guardrails — it's model provenance verification:
- Verify SHA-256 checksums of model artifacts against the model provider's signed manifest before loading.
- For self-hosted models: load weights only in safetensors format (which avoids pickle-based payload delivery) and run a behavioral probe battery on the model before routing production traffic to it.
- Map all model sourcing to NIST AI RMF GOVERN-6.1 (third-party AI risk governance) — every model used in production should appear in an approved model registry.
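The checksum step can be sketched as follows. The manifest layout (`{"files": {"<name>": "<sha256 hex>"}}`) and the function names are hypothetical; real providers publish manifests in their own formats, and verifying the manifest's signature is a separate step not shown here.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifacts(model_dir: Path, manifest_path: Path) -> list[str]:
    """Return names of artifacts that are missing or whose hash differs."""
    # Assumed manifest shape: {"files": {"model.safetensors": "<hex digest>"}}
    manifest = json.loads(manifest_path.read_text())
    failures = []
    for name, expected in manifest["files"].items():
        artifact = model_dir / name
        if not artifact.is_file() or sha256_of(artifact) != expected.lower():
            failures.append(name)
    return failures
```

A loader would call `verify_artifacts` before opening any weights and refuse to proceed when the returned list is non-empty.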
NeMo Guardrails: the reference implementation
NVIDIA NeMo Guardrails provides a production-grade Colang-based guardrail layer that covers topical rails, fact-checking rails, and jailbreak prevention. It's worth implementing for agent systems with complex multi-turn conversations. The key configuration blocks:
# nemo_config.yml: minimal agentic guardrail set
rails:
  input:
    flows:
      - check input injection  # AML.T0054
      - check input pii  # NIST MANAGE-2.4
  output:
    flows:
      - check output pii  # AML.T0057
      - check output system prompt echo  # AML.T0056
  dialog:
    single_call:
      enabled: true
Framework mapping
| Guardrail | ATLAS Technique | NIST AI RMF | D3FEND |
|---|---|---|---|
| Input normalization + injection scan | AML.T0054 | GOVERN-1.1, MEASURE-2.7 | Content Validation |
| Data poisoning detection | AML.T0051 | MEASURE-2.5 | Content Filtering |
| System-prompt echo detection | AML.T0056 | MANAGE-2.4 | Content Excision |
| PII output scan | AML.T0057 | MANAGE-2.4 | Content Filtering |
| Model provenance verification | AML.T0062 | GOVERN-6.1 | — |
✅ Pre-ingestion content-trap scan on all external input (WebFetch, WebSearch, tool results)
✅ Normalized + decoded scan (handles zero-width, ROT13, base64, homoglyphs)
✅ Output PII scan before returning to unauthenticated callers
✅ System-prompt echo detection on every LLM response
✅ Model checksum verification against signed provider manifest
✅ All guardrail events logged to audit trail with ATLAS technique ID
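For the last checklist item, one way to attach an ATLAS technique ID to every guardrail event is a small structured-logging helper. This is a sketch under assumptions: the guardrail key names, the record fields, and the `guardrail.audit` logger name are invented for illustration; the ID mapping itself mirrors the framework table above.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("guardrail.audit")

# Mapping taken from the framework table in this article; key names are assumed.
ATLAS_BY_GUARDRAIL = {
    "input_injection_scan": "AML.T0054",
    "data_poisoning_detection": "AML.T0051",
    "system_prompt_echo": "AML.T0056",
    "pii_output_scan": "AML.T0057",
    "model_provenance": "AML.T0062",
}


def audit_event(guardrail: str, action: str, findings: list[str]) -> str:
    """Emit one JSON audit line tagged with the matching ATLAS technique ID."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "guardrail": guardrail,
        "atlas_id": ATLAS_BY_GUARDRAIL.get(guardrail, "unmapped"),
        "action": action,
        "findings": findings,
    }
    line = json.dumps(record, sort_keys=True)
    audit_logger.info(line)
    return line
```

One JSON object per line keeps the trail easy to ship to a log pipeline and to filter by `atlas_id` during an audit.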
BlindOracle runs CaMel L1+L2 guardrails on every agent call
Every task ingested through the BlindOracle marketplace passes an injection scan before any agent acts on it. Every deliverable is output-scanned before it's returned.