June 3, 2026 · 8 min read · AGENTIC SECURITY · Part 2 of 5 — Agentic AI Security cluster

LLM Guardrails: Your Agent's First Line of Defense

Prompt injection, data poisoning, prompt extraction, prompt leaking, model backdoors: ATLAS has catalogued all five. Here's how input, output, and model-integrity guardrails close each attack surface before it becomes an incident.

TL;DR

MITRE ATLAS catalogs five LLM-specific attack techniques relevant to agent systems: AML.T0054 (Prompt Injection), AML.T0051 (LLM Data Poisoning), AML.T0056 (LLM Meta-Prompt Extraction), AML.T0057 (LLM Prompt Leaking), and AML.T0062 (Backdoor ML Model). Input guardrails address T0054+T0051; output guardrails address T0056+T0057; model-integrity checks address T0062. All five are addressable with a combination of NeMo Guardrails, content-trap scanning, and NIST AI RMF GOVERN-1.1 governance controls.

What this covers: production guardrails for LLM-powered agents reading external content (web, user input, tool results). Not covered: model training-time defenses. See also: Part 1 — API Key Security for credential-layer controls, and CaMel security architecture for the BlindOracle multi-layer implementation.

The five ATLAS attack surfaces

Technique	ATLAS ID	Attack vector in agents
Prompt Injection	AML.T0054	Malicious instructions embedded in fetched web content, user input, or tool results coerce the agent to execute attacker-controlled actions.
LLM Data Poisoning	AML.T0051	Attacker-controlled data ingested into the agent's context window shifts its behavior, a particular risk for long-running agents with persistent memory.
Meta-Prompt Extraction	AML.T0056	Adversarial prompts elicit the agent's system prompt, revealing instructions, tool configurations, and internal state.
Prompt Leaking	AML.T0057	The model is induced to reproduce previous conversation turns containing sensitive data (API keys, user PII, business logic).
Backdoor ML Model	AML.T0062	A model checkpoint or fine-tune distributed through a supply chain carries a hidden trigger that alters behavior when specific input patterns are detected.

Input guardrails: the CaMel layer-1 pattern

Every token that enters an agent's context window from an external source is a potential injection vector. The defense is a pre-ingestion scan that runs before the content reaches the LLM. At BlindOracle, we call this the CaMel (Content and Malice Layer) scan. It addresses AML.T0054 and AML.T0051 directly.

A minimal input guardrail checks three things:

Instruction override patterns: regex + normalized-form matches for "ignore previous instructions", "new system prompt", "you are now" in the fetched content. Normalization covers zero-width chars, ROT13, base64, homoglyphs — the 24 evasion variants from ATLAS T0054 sub-techniques.
Secret-transmission combos: a secret noun + transmission verb + external destination in the same passage — the compositional exfiltration pattern that bypasses pure-regex defenders.
Base64 payloads >1KB: large base64 blobs are rare in legitimate article text (inline images aside) and are a common vehicle for payload delivery, so they deserve at least a warning.

# Minimal implementation pattern (see scripts/content_trap_scanner.py)
import logging

from content_trap_scanner import ContentTrapScanner

logger = logging.getLogger(__name__)
scanner = ContentTrapScanner()


class SecurityError(Exception):
    """Raised when scanned content must not reach the LLM."""


def safe_ingest(raw_content: str) -> str:
    # Scan before the content is ever placed in the model's context.
    result = scanner.quick_scan(raw_content)
    if result["recommendation"] == "block":
        raise SecurityError(f"Injection detected: {result['findings']}")
    if result["recommendation"] == "warn":
        logger.warning("Suspicious content ingested: %s", result["findings"])
    return raw_content  # pass to LLM only after scan

This pattern aligns with D3FEND Content Validation and Content Filtering countermeasures, and maps to NIST AI RMF GOVERN-1.1 (establish AI risk governance) and MEASURE-2.7 (evaluate guardrail effectiveness).
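To make the three input checks concrete, here is a rough sketch of what a scan of this kind could look like. It is not the code in scripts/content_trap_scanner.py: the function name quick_scan_sketch, the pattern lists, the 1024-character blob threshold, and the block/warn mapping are all assumptions chosen for illustration, and only the return shape mirrors the snippet above.

```python
import base64
import codecs
import re
import unicodedata

# Illustrative patterns only; assumed, not taken from content_trap_scanner.py.
OVERRIDE_PATTERNS = [
    re.compile(r"ignore\s+(?:all\s+)?previous\s+instructions"),
    re.compile(r"new\s+system\s+prompt"),
    re.compile(r"you\s+are\s+now"),
]
SECRET_NOUN = re.compile(r"\b(?:api[ _-]?key|password|secret|token|credential)s?\b")
SEND_VERB = re.compile(r"\b(?:send|post|upload|forward|email|transmit)\b")
DESTINATION = re.compile(r"https?://|[\w.+-]+@[\w-]+\.\w+")
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}")
B64_BLOB = re.compile(r"[A-Za-z0-9+/]{1024,}")  # roughly 1 KB of encoded text
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize(text: str) -> str:
    """Fold compatibility forms, drop zero-width characters, and casefold.

    NFKC handles fullwidth and similar forms; cross-script homoglyphs
    (for example Cyrillic lookalikes) would need a separate confusables table.
    """
    return unicodedata.normalize("NFKC", text).translate(ZERO_WIDTH).casefold()


def decoded_views(text: str) -> list[str]:
    """Return the text as-is, ROT13-decoded, and with base64 tokens decoded."""
    views = [normalize(text), normalize(codecs.decode(text, "rot_13"))]
    for token in B64_TOKEN.findall(text):
        padded = token + "=" * (-len(token) % 4)
        try:
            views.append(normalize(base64.b64decode(padded, validate=True).decode("utf-8")))
        except ValueError:  # covers invalid base64 and non-UTF-8 payloads
            continue
    return views


def quick_scan_sketch(text: str) -> dict:
    findings = set()
    for view in decoded_views(text):
        if any(p.search(view) for p in OVERRIDE_PATTERNS):
            findings.add("instruction_override")
        # Compositional check: all three elements within the same passage.
        for passage in re.split(r"\n\s*\n", view):
            if (SECRET_NOUN.search(passage) and SEND_VERB.search(passage)
                    and DESTINATION.search(passage)):
                findings.add("secret_transmission_combo")
    if B64_BLOB.search(text):
        findings.add("large_base64_blob")

    if findings & {"instruction_override", "secret_transmission_combo"}:
        recommendation = "block"
    elif findings:
        recommendation = "warn"
    else:
        recommendation = "allow"
    return {"recommendation": recommendation, "findings": sorted(findings)}
```

Scanning several decoded views of the same content is what lets one short pattern list cover the encoded variants without a separate regex for each encoding.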

Output guardrails: what leaves the agent matters too

AML.T0056 and AML.T0057 both target the model's outputs. A well-crafted prompt can make an agent reproduce its own system prompt or replay sensitive conversation turns. Output guardrails block this by scanning the LLM's response before it's returned to the caller.

Two output checks that pay for themselves immediately:

System-prompt echo detection: compare the output against the first 200 tokens of the system prompt using fuzzy matching (Jaccard ≥ 0.6 = block). If the output looks like the system prompt, it probably is.
PII pattern scan: flag email addresses, phone numbers, and API key prefixes in the output. A match logs a warning by default; if the response is going to an unauthenticated caller, it is blocked. This addresses NIST AI RMF MANAGE-2.4 (manage PII in AI outputs).
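Both output checks fit in a few lines. The sketch below is a hedged illustration, not a BlindOracle implementation: it approximates "tokens" with lowercase word tokens rather than model tokens, the PII regexes are deliberately simple examples, and the names echoes_system_prompt and check_output are hypothetical. The 200-token prefix and the 0.6 Jaccard threshold come from the description above.

```python
import re

WORD = re.compile(r"\w+")

# Illustrative patterns; real deployments need broader, locale-aware coverage.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "api_key_prefix": re.compile(r"\b(?:sk-|AKIA|ghp_)[A-Za-z0-9_-]{8,}"),
}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def echoes_system_prompt(output: str, system_prompt: str,
                         threshold: float = 0.6, prefix_tokens: int = 200) -> bool:
    """True when the output's word set overlaps the prompt prefix too closely."""
    prompt_words = set(WORD.findall(system_prompt.casefold())[:prefix_tokens])
    output_words = set(WORD.findall(output.casefold()))
    return jaccard(prompt_words, output_words) >= threshold


def check_output(output: str, system_prompt: str, caller_authenticated: bool) -> dict:
    if echoes_system_prompt(output, system_prompt):
        return {"action": "block", "findings": ["system_prompt_echo"]}
    findings = sorted(name for name, pattern in PII_PATTERNS.items()
                      if pattern.search(output))
    if findings and not caller_authenticated:
        return {"action": "block", "findings": findings}
    if findings:
        return {"action": "warn", "findings": findings}
    return {"action": "allow", "findings": []}
```

One caveat on the design: plain Jaccard drops when a long response contains the prompt only as a fragment, so a containment measure (the share of prompt words found in the output) may be worth evaluating alongside it.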

Model-integrity guardrails against AML.T0062

The AML.T0062 (Backdoor ML Model) threat is supply-chain level: a fine-tuned or quantized model distributed through a third-party channel carries a hidden trigger. The defense is not runtime guardrails — it's model provenance verification:

Verify SHA-256 checksums of model artifacts against the model provider's signed manifest before loading.
For self-hosted models: load weights only in the safetensors format (which avoids pickle-based payload delivery) and run a behavioral probe battery on the model before routing production traffic to it.
Map all model sourcing to NIST AI RMF GOVERN-6.1 (third-party AI risk governance) — every model used in production should appear in an approved model registry.
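The checksum step is the easiest of these to automate. As a minimal sketch, assume the provider's manifest has already had its signature verified and has been parsed into a dict of file name to hex SHA-256 digest; that manifest shape and the function names here are assumptions, since providers publish manifests in different formats.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so multi-gigabyte checkpoints fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, manifest: dict[str, str]) -> bool:
    """Return True only if the file is listed and its digest matches.

    `manifest` is assumed to be trusted already (signature checked elsewhere).
    """
    expected = manifest.get(path.name)
    if expected is None:
        return False  # unlisted artifacts are treated as unapproved
    return hmac.compare_digest(sha256_of(path), expected.lower())
```

A loader would call verify_artifact on every file in the checkpoint directory and refuse to load the model if any call returns False.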

NeMo Guardrails: the reference implementation

NVIDIA NeMo Guardrails provides a production-grade Colang-based guardrail layer that covers topical rails, fact-checking rails, and jailbreak prevention. It's worth implementing for agent systems with complex multi-turn conversations. The key configuration blocks:

# nemo_config.yml — minimal agentic guardrail set
# Flow names below are illustrative; each must resolve to a flow defined in Colang (built in or your own).
rails:
  input:
    flows:
      - check input injection   # AML.T0054
      - check input pii         # NIST MANAGE-2.4
  output:
    flows:
      - check output pii        # AML.T0057
      - check output system prompt echo  # AML.T0056
  dialog:
    single_call:
      enabled: true

Framework mapping

Guardrail	ATLAS Technique	NIST AI RMF	D3FEND
Input normalization + injection scan	AML.T0054	GOVERN-1.1, MEASURE-2.7	Content Validation
Data poisoning detection	AML.T0051	MEASURE-2.5	Content Filtering
System-prompt echo detection	AML.T0056	MANAGE-2.4	Content Excision
PII output scan	AML.T0057	MANAGE-2.4	Content Filtering
Model provenance verification	AML.T0062	GOVERN-6.1	—

Guardrail coverage checklist:
✅ Pre-ingestion content-trap scan on all external input (WebFetch, WebSearch, tool results)
✅ Normalized + decoded scan (handles zero-width, ROT13, base64, homoglyphs)
✅ Output PII scan before returning to unauthenticated callers
✅ System-prompt echo detection on every LLM response
✅ Model checksum verification against signed provider manifest
✅ All guardrail events logged to audit trail with ATLAS technique ID
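For the last checklist item, a structured log line per guardrail decision is usually enough to start with. The sketch below is one possible shape, not a prescribed schema: the field names, the logger name guardrail.audit, and the audit_event helper are all hypothetical, and the ATLAS ID is whatever label your own mapping table assigns to the guardrail.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("guardrail.audit")


def audit_event(guardrail: str, action: str, atlas_id: str, findings: list[str]) -> str:
    """Emit one JSON line per guardrail decision and return it for reuse."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "guardrail": guardrail,   # e.g. "input_injection_scan"
        "action": action,         # "allow", "warn", or "block"
        "atlas_id": atlas_id,     # technique ID from your mapping table
        "findings": findings,
    }
    line = json.dumps(record, sort_keys=True)
    audit_logger.info(line)
    return line
```

Keeping the technique ID as a dedicated field, rather than burying it in a message string, makes it straightforward to count blocked events per technique later.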

Agentic AI Security series: Part 1 — API Key Security · Part 2 — LLM Guardrails (this page) · Part 3 — Supply Chain Attacks · Part 4 — Endpoint Detection · Part 5 — Memory Forensics

BlindOracle runs CaMel L1+L2 guardrails on every agent call

Every task ingested through the BlindOracle marketplace passes an injection scan before any agent acts on it. Every deliverable is output-scanned before it's returned.
