June 3, 2026 · 8 min read · AGENTIC SECURITY · Part 2 of 5 — Agentic AI Security cluster

LLM Guardrails: Your Agent's First Line of Defense

Prompt injection, data poisoning, system-prompt extraction, prompt leaking, backdoored models: ATLAS has catalogued all five. Here's how input guardrails, output guardrails, and model-integrity checks close each attack surface before it becomes an incident.

TL;DR

MITRE ATLAS catalogs five LLM-specific attack techniques relevant to agent systems: AML.T0054 (Prompt Injection), AML.T0051 (LLM Data Poisoning), AML.T0056 (LLM Meta-Prompt Extraction), AML.T0057 (LLM Prompt Leaking), and AML.T0062 (Backdoor ML Model). Input guardrails address AML.T0054 and AML.T0051; output guardrails address AML.T0056 and AML.T0057; model-integrity checks address AML.T0062. All five can be addressed with a combination of NeMo Guardrails, content-trap scanning, and NIST AI RMF GOVERN-1.1 governance controls.

What this covers: production guardrails for LLM-powered agents reading external content (web, user input, tool results). Not covered: model training-time defenses. See also: Part 1 — API Key Security for credential-layer controls, and CaMel security architecture for the BlindOracle multi-layer implementation.

The five ATLAS attack surfaces

Technique | ATLAS ID | Attack vector in agents
Prompt Injection | AML.T0054 | Malicious instructions embedded in fetched web content, user input, or tool results coerce the agent into executing attacker-controlled actions.
LLM Data Poisoning | AML.T0051 | Attacker-controlled data ingested into the agent's context window shifts its behavior. Especially effective against long-running agents with persistent memory.
Meta-Prompt Extraction | AML.T0056 | Adversarial prompts elicit the agent's system prompt, revealing instructions, tool configurations, and internal state.
Prompt Leaking | AML.T0057 | The model is induced to reproduce previous conversation turns containing sensitive data (API keys, user PII, business logic).
Backdoor ML Model | AML.T0062 | A model checkpoint or fine-tune distributed through a supply chain carries a hidden trigger that alters behavior when specific input patterns are detected.

Input guardrails: the CaMel layer-1 pattern

Every token that enters an agent's context window from an external source is a potential injection vector. The defense is a pre-ingestion scan that runs before the content reaches the LLM. At BlindOracle, we call this the CaMel (Content and Malice Layer) scan. It addresses AML.T0054 and AML.T0051 directly.

A minimal input guardrail checks three things:

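# Minimal implementation pattern (see scripts/content_trap_scanner.py)
import logging

from content_trap_scanner import ContentTrapScanner

logger = logging.getLogger(__name__)
scanner = ContentTrapScanner()


class SecurityError(Exception):
    """Raised when the pre-ingestion scan recommends blocking content."""


def safe_ingest(raw_content: str) -> str:
    # Scan before the content can reach the LLM's context window.
    result = scanner.quick_scan(raw_content)
    if result["recommendation"] == "block":
        raise SecurityError(f"Injection detected: {result['findings']}")
    if result["recommendation"] == "warn":
        # Suspicious but not blocked: record the findings and continue.
        logger.warning("Suspicious content ingested: %s", result["findings"])
    return raw_content  # pass to LLM only after scan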

This pattern aligns with D3FEND Content Validation and Content Filtering countermeasures, and maps to NIST AI RMF GOVERN-1.1 (establish AI risk governance) and MEASURE-2.7 (evaluate guardrail effectiveness).
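
The scanner's internals are not shown in this post. As a rough sketch of the normalize-then-scan idea behind a pre-ingestion check, the following strips zero-width characters, applies NFKC folding, and also scans ROT13 and base64-decoded views of the text. Everything here is an assumption for illustration: the function names, the two regex patterns, and the 16-character base64 threshold are hypothetical and are not taken from content_trap_scanner.py. Note that NFKC folds compatibility forms such as fullwidth letters but does not catch cross-script homoglyphs, which need a separate confusables map.

```python
import base64
import binascii
import codecs
import re
import unicodedata

# Hypothetical patterns for illustration only; a production scanner needs a
# much broader, maintained rule set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]

# Translation table that deletes common zero-width characters.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize(text: str) -> str:
    """Strip zero-width characters and fold compatibility forms (NFKC)."""
    return unicodedata.normalize("NFKC", text.translate(ZERO_WIDTH))


def decoded_views(text: str) -> list[str]:
    """Return the normalized text plus ROT13 and base64-decoded variants."""
    clean = normalize(text)
    views = [clean, codecs.decode(clean, "rot13")]
    for token in re.findall(r"[A-Za-z0-9+/]{16,}={0,2}", clean):
        try:
            decoded = base64.b64decode(token, validate=True)
            views.append(decoded.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64-encoded text; ignore this token
    return views


def find_injections(text: str) -> list[str]:
    """List every pattern that matches any decoded view of the text."""
    return [
        pattern.pattern
        for view in decoded_views(text)
        for pattern in INJECTION_PATTERNS
        if pattern.search(view)
    ]
```

A caller would treat a non-empty result from `find_injections` as a reason to block or warn, in the same way `safe_ingest` acts on the scanner's recommendation.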

Output guardrails: what leaves the agent matters too

AML.T0056 and AML.T0057 both target the model's outputs. A well-crafted prompt can make an agent reproduce its own system prompt or replay sensitive conversation turns. Output guardrails block this by scanning the LLM's response before it's returned to the caller.

Two output checks that pay for themselves immediately:
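
The two checks are not itemized at this point in the post. Going by the framework mapping and the checklist further down, a reasonable reading is system-prompt echo detection (AML.T0056) and a PII output scan (AML.T0057). The sketch below works under that assumption. The function names, the `sk-` key prefix, the email regex, and the 8-word echo window are hypothetical choices for illustration, not BlindOracle's implementation.

```python
import re

# Hypothetical PII patterns; real deployments need much broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "api_key": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
}


def echoes_system_prompt(response: str, system_prompt: str, window: int = 8) -> bool:
    """True if any run of `window` consecutive system-prompt words appears in the response."""
    words = system_prompt.lower().split()
    haystack = " ".join(response.lower().split())
    if not words:
        return False
    if len(words) < window:
        return " ".join(words) in haystack
    return any(
        " ".join(words[i:i + window]) in haystack
        for i in range(len(words) - window + 1)
    )


def redact_pii(response: str) -> tuple[str, list[str]]:
    """Replace matches with a placeholder and report which PII kinds were found."""
    found = []
    for kind, pattern in PII_PATTERNS.items():
        response, count = pattern.subn(f"[REDACTED:{kind}]", response)
        if count:
            found.append(kind)
    return response, found


def check_output(response: str, system_prompt: str) -> str:
    """Block system-prompt echoes, then redact PII from whatever remains."""
    if echoes_system_prompt(response, system_prompt):
        raise ValueError("Response echoes the system prompt")
    redacted, _ = redact_pii(response)
    return redacted
```

A verbatim word-window match only catches direct echoes. Paraphrased or translated leaks would need a fuzzier comparison, which is outside this sketch.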

Model-integrity guardrails against AML.T0062

The AML.T0062 (Backdoor ML Model) threat is supply-chain level: a fine-tuned or quantized model distributed through a third-party channel carries a hidden trigger. The defense is not runtime guardrails — it's model provenance verification:
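
One way to approach provenance verification, going by the checklist item on checksum verification against a signed provider manifest, is to hash the model file and compare it with the manifest entry before loading. The sketch below assumes a hypothetical JSON manifest of the form `{"files": {"<filename>": {"sha256": "<hex>"}}}` and assumes the manifest's own signature has already been verified by a separate step. Neither the layout nor the function names come from the original post.

```python
import hashlib
import hmac
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large checkpoints never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(model_path: Path, manifest_path: Path) -> None:
    """Raise RuntimeError if the file's hash differs from the manifest entry."""
    manifest = json.loads(manifest_path.read_text())
    # Hypothetical manifest layout; adapt to whatever the provider publishes.
    expected = manifest["files"][model_path.name]["sha256"]
    actual = sha256_of(model_path)
    if not hmac.compare_digest(actual, expected.lower()):
        raise RuntimeError(f"Checksum mismatch for {model_path.name}")
```

A checksum only proves the file matches what the manifest describes. It says nothing about whether the publisher's model was clean to begin with, so it complements rather than replaces trust in the source.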

NeMo Guardrails: the reference implementation

NVIDIA NeMo Guardrails provides a production-grade Colang-based guardrail layer that covers topical rails, fact-checking rails, and jailbreak prevention. It's worth implementing for agent systems with complex multi-turn conversations. The key configuration blocks:

# nemo_config.yml — minimal agentic guardrail set
rails:
  input:
    flows:
      - check input injection   # AML.T0054
      - check input pii         # NIST MANAGE-2.4
  output:
    flows:
      - check output pii        # AML.T0057
      - check output system prompt echo  # AML.T0056
  dialog:
    single_call:
      enabled: true

Framework mapping

Guardrail | ATLAS Technique | NIST AI RMF | D3FEND
Input normalization + injection scan | AML.T0054 | GOVERN-1.1, MEASURE-2.7 | Content Validation
Data poisoning detection | AML.T0051 | MEASURE-2.5 | Content Filtering
System-prompt echo detection | AML.T0056 | MANAGE-2.4 | Content Excision
PII output scan | AML.T0057 | MANAGE-2.4 | Content Filtering
Model provenance verification | AML.T0062 | GOVERN-6.1 | -

Guardrail coverage checklist:
✅ Pre-ingestion content-trap scan on all external input (WebFetch, WebSearch, tool results)
✅ Normalized + decoded scan (handles zero-width, ROT13, base64, homoglyphs)
✅ Output PII scan before returning to unauthenticated callers
✅ System-prompt echo detection on every LLM response
✅ Model checksum verification against signed provider manifest
✅ All guardrail events logged to audit trail with ATLAS technique ID
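
The last checklist item, logging each guardrail event with its ATLAS technique ID, could be sketched as one JSON line per decision. The field names and the `guardrail.audit` logger name below are hypothetical. The post does not specify an audit schema.

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("guardrail.audit")


def audit_event(guardrail: str, atlas_id: str, action: str, detail: str) -> str:
    """Emit one JSON line per guardrail decision and return it."""
    record = json.dumps(
        {
            "ts": datetime.now(timezone.utc).isoformat(),
            "guardrail": guardrail,
            "atlas_id": atlas_id,  # e.g. "AML.T0054" for an injection scan
            "action": action,      # e.g. "block", "warn", "allow"
            "detail": detail,
        },
        sort_keys=True,
    )
    audit_logger.info(record)
    return record
```
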
Agentic AI Security series: Part 1 — API Key Security · Part 2 — LLM Guardrails (this page) · Part 3 — Supply Chain Attacks · Part 4 — Endpoint Detection · Part 5 — Memory Forensics

BlindOracle runs CaMel L1+L2 guardrails on every agent call

Every task ingested through the BlindOracle marketplace passes an injection scan before any agent acts on it. Every deliverable is output-scanned before it's returned.
