Memory Poisoning: The Attack That Waits

The industry’s mental model of prompt injection is session-scoped: attacker crafts a malicious input, it executes in the current context, the session ends, and the attack ends with it. Defenses are designed around this model—input filtering, system prompt hardening, output validation. Every major framework has a story for it.

Memory-augmented agents break that model entirely.

When your agent writes to long-term memory—and most production agents running in 2026 do—a successful injection doesn’t need to execute immediately. It can plant a record, go dormant, and activate three sessions later when a semantically related query retrieves it. The attacker doesn’t need to be in the session at exploit time. The agent’s own reasoning, presented with a poisoned memory entry it trusts, does the rest. Forensics are brutal: the bad decision looks indistinguishable from the agent’s own learned behavior.

Three research papers published in late 2025 and early 2026—MINJA, MemoryGraft, and eTAMP—have demonstrated this attack class with success rates ranging from 19% to 95% across production-grade models. Palo Alto’s Unit 42 team has already documented a working exploit against Amazon Bedrock Agents with up to 365 days of persistence. The tooling to build memory-augmented agents is ahead of the security tooling to protect them.

This post maps the threat landscape, walks through the attack mechanics in detail, and proposes a concrete layered defense architecture with working code.

Why Memory Poisoning Is Categorically Different

Session-scoped injection, RAG poisoning, and memory poisoning are frequently conflated. They are distinct threat classes with different attack windows, detectability characteristics, and required defenses.

Property	Session Injection	RAG Poisoning	Memory Poisoning
Attack window	Current session only	Whenever doc is retrieved	Persists across all future sessions
Privilege required	User-level input	Write access to knowledge base	Query-only (no write privilege needed)
Detectability	Input monitoring catches it	Doc-level auditing detects it	Extremely hard—looks like agent learning
Effect	Ephemeral behavior change	Factual corruption	Persistent behavioral drift
Attacker presence at exploit	Required	Not required	Not required
Defense gap	Well-studied	Moderately addressed	Largely unaddressed

The key differentiator in the last row is what makes this category dangerous. The defenses that work for session injection—input sanitization, context window monitoring—provide almost no protection against memory poisoning. You can sanitize every user input perfectly and still end up with a poisoned memory store if tool outputs, session summarizers, or retrieved documents flow into memory without controls.

This is, structurally, the same attack class as a supply-chain compromise in software: the malicious code isn’t in your binary, it’s in a dependency you trust. The memory store is the dependency.

The Three Attacks That Prove the Threat

How does MINJA work, and why does 95% success matter?

MINJA—“Memory Injection Attack and Defense on Memory Based LLM-Agents” (Devarangadi Sunil et al., arXiv:2601.05504, January 2026)—is the most systematic study of query-only memory injection to date.

The core insight: an attacker with no write privileges can cause malicious content to be stored in memory simply by crafting queries that generate poisoned outputs. The agent processes the query, produces a response that includes the attacker-controlled content, and stores the interaction in memory. Future queries retrieve it and act on it.

MINJA was validated on EHR (electronic health record) agents running on GPT-4o-mini, Gemini-2.0-Flash, and Llama-3.1-8B-Instruct using MIMIC-III clinical datasets. The results under idealized conditions: 95%+ injection success rate (the malicious content makes it into memory) and 70% attack success rate (subsequent queries retrieve and act on the poisoned content).

The important caveat—which the authors include and you should too: realistic deployments with pre-existing legitimate memories reduce both rates. The poisoned records compete with a large corpus of valid memories; retrieval may surface the real ones instead. This is an argument for populating your memory stores aggressively with high-quality legitimate records, not an argument that the attack is benign.

MINJA also proposes the two core defensive patterns that form the foundation of the architecture later in this post: composite trust scoring for input/output moderation, and memory sanitization with temporal decay.

What is MemoryGraft’s “semantic imitation heuristic” exploit?

MemoryGraft (Saksham Sahai Srivastava, Haoyu He, arXiv:2512.16962, December 2025) targets a different mechanism: the agent’s tendency to replicate patterns from retrieved successful past experiences.

Production agents that use episodic memory to improve over time—learning from successful task completions—are especially vulnerable. MemoryGraft plants fake “successful task templates” into long-term memory as forged experiences. When future benign tasks trigger retrieval of semantically similar experiences, the grafted template surfaces and the agent adopts its embedded behavioral patterns.

The paper’s key finding deserves quoting directly: “A small number of poisoned records can account for a large fraction of retrieved experiences on benign workloads.” You don’t need to flood the memory store. A handful of well-crafted entries, targeting high-frequency task patterns, can dominate retrieval for those patterns indefinitely.

MemoryGraft was validated on MetaGPT’s DataInterpreter agent with GPT-4o. Unlike MINJA (which targets factual injection) or RAG poisoning (which corrupts a knowledge corpus), MemoryGraft creates persistent behavioral drift—the agent’s procedural approach to entire classes of tasks is compromised, not just its factual outputs.

In multi-agent systems, the blast radius compounds: an orchestrator agent whose memory is poisoned with bad coordination templates propagates the pattern to every sub-agent it dispatches.

How does eTAMP cross site boundaries to trigger days later?

eTAMP—“Environment-Injected Trajectory-based Agent Memory Poisoning” (arXiv:2604.02623, April 2026)—is the most operationally realistic of the three attacks. It targets web agents and exploits the gap between where injection happens and where activation occurs.

The attack chain: malicious content embedded in the user-generated content of a web page is encountered during Task A on Site X. The agent stores its observations from that task in memory. Days or sessions later, during a semantically related Task B on Site Y, the poisoned memory activates. The attack is cross-site by design—permission-based defenses that restrict the agent to the current task’s domain are entirely ineffective because injection and activation happen in different sessions, on different sites.

eTAMP documents three injection strategies with increasing sophistication:

Baseline URL-triggered — simple payload in page content, activated by URL pattern matching
Authority Framing — impersonates urgent system recovery messages to override the agent’s default skepticism
Frustration Exploitation — payload activates specifically when the agent encounters obstacles, exploiting a documented behavioral vulnerability where models under “environmental stress” exhibit reduced critical evaluation

That third strategy produces the paper’s most alarming finding: environmental stress creates an 8x vulnerability window. Under normal conditions, GPT-5-mini shows up to 32.5% eTAMP success; GPT-5.2 shows 23.4%; GPT-OSS-120B shows 19.5%. Qwen models show more robustness but reduced general capability—a real tradeoff.

Tool-calling traces are the primary write path for this attack. The web-fetch tool reads the poisoned page, the tool output enters the agent’s observation stream, and the observation gets stored. The attack exploits MCP tool outputs and similar retrieval mechanisms as injection vectors.

The Real Exploit: Unit 42 on Amazon Bedrock

Research papers establish theoretical attack viability. Palo Alto’s Unit 42 team has documented a working exploit against a production platform: Amazon Bedrock Agents with memory enabled. The demo used a travel assistant chatbot running on Amazon Nova Premier v1.

The attack chain is worth walking through in detail because the mechanism—forged XML tags in session summarization—is subtle enough to survive naive defenses.

Step 1: Attacker creates a webpage with a hidden prompt-injection payload, embedded in otherwise normal content.

Step 2: Victim is socially engineered into submitting the URL to the travel assistant (“can you summarize the hotels on this page?”).

Step 3: The agent’s web-fetch tool pulls the page. The tool output enters the context window.

Step 4: This is the critical step. Amazon Bedrock’s memory system uses a session-summarization prompt to condense conversations before writing them to persistent memory. The attacker’s payload is crafted with forged XML tags—the malicious content is placed outside any <conversation> block. The LLM reads content outside the conversation block as system-level instructions rather than user or tool output. The payload tells the model to write a specific crafted summary to memory.

Step 5: The forged summary is written to persistent memory. Retention period: configurable, up to 365 days.

Every subsequent session for that user now has access to the poisoned memory. The consequences Unit 42 documents: silent cross-session data exfiltration, autonomous execution of attacker-directed actions, and zero user awareness. The travel assistant looks normal; it has simply acquired additional persistent instructions.

The forged-XML-tag trick is not a Bedrock-specific bug. Any memory system that uses LLM-based summarization to produce memory writes is potentially vulnerable to the same class of attack—because the summarization prompt trusts the content it’s summarizing to be inert data, and the attacker exploits that trust.

Amazon Bedrock’s named mitigations: Guardrails with prompt-attack policy, content filtering, domain allowlists, Model Invocation Logs with Trace, and anomaly detection. These are the right layers—but they require explicit configuration and represent a defense-in-depth approach, not a single fix.

The Memory Attack Surface

Every path that writes to memory is a potential injection vector. Most systems have more write paths than their builders realize.

MEMORY ATTACK SURFACE --- ALL WRITE PATHS

  User Turns                ──────────────────────────────┐
  Tool Outputs (fetch, DB)  ──────────────────────────────┤
  Session Summarizers       ──────────────────────────────┤──▶  MEMORY STORE
  Retrieved Documents       ──────────────────────────────┤
  Cross-Agent Handoffs      ──────────────────────────────┤
  Inferred Preferences      ──────────────────────────────┘

User turns are the obvious vector—and the one most teams have some controls for. But direct user injection is the least interesting path because it requires the attacker to interact directly.

Tool outputs are the highest-risk write path. A web-fetch tool that returns attacker-controlled page content, a database query result poisoned at the source, a search API returning adversarial snippets—any of these can inject content into the context that subsequently gets stored. eTAMP exploits exactly this path.

Session summarizers are the Bedrock exploit path. If you use an LLM to summarize a session before writing it to memory, that summarization step is a trust boundary where injected content from earlier in the session can be amplified into a memory write.

Retrieved documents create a circular risk: RAG-retrieved content that enters the context window can influence what gets stored as a memory record if the agent “learns” from the retrieval interaction.

Cross-agent handoffs in multi-agent systems are a propagation path. When Agent A hands context to Agent B and Agent B stores that context in its own memory, the chain-of-custody extends back to everything Agent A encountered. A poisoned memory in one agent spreads to others through handoffs.

Inferred preferences are the subtlest path. Agents that learn from interaction build memory from behavioral inference. An attacker who manipulates the interactions that trigger inference can plant persistent preferences.

This is a context engineering problem: everything that enters the context window is a potential memory write, and every memory write is a potential future injection. Treat the memory write path as a trust boundary.

The Trust-Weighted Memory Stack

The defense architecture has five layers. Each layer is independently useful; together they eliminate most of the attack surface. The core principle: trust is a first-class property of every memory record, not something inferred at retrieval time.

THE TRUST-WEIGHTED MEMORY STACK

  ┌─────────────────────────────────────────────────────────────────┐
  │                   RETRIEVAL-TIME AUDITING                       │
  │  Log every retrieval: trace_id + source tags + trust score      │  Layer 5
  │  Enables incident reconstruction when attacks are discovered     │
  ├─────────────────────────────────────────────────────────────────┤
  │                   SANITIZATION PIPELINE                         │
  │  Pattern filters: imperative language, forged tags, role-       │  Layer 4
  │  impersonation markers, URLs in non-URL fields. Reject pre-write│
  ├─────────────────────────────────────────────────────────────────┤
  │                   TEMPORAL DECAY                                │
  │  Diminish retrieval weight of unverified old entries.           │  Layer 3
  │  Quarantine entries with zero reuse. Corroboration resets clock  │
  ├─────────────────────────────────────────────────────────────────┤
  │                   COMPOSITE TRUST SCORING                       │
  │  Score = f(source_trust, content_anomaly, behavioral_drift)     │  Layer 2
  │  Blended with semantic similarity at retrieval time             │
  ├─────────────────────────────────────────────────────────────────┤
  │                   SOURCE PROVENANCE                             │
  │  Tag every entry: user_turn | tool_output | inferred_summary |  │  Layer 1
  │  agent_handoff. Carry provenance through all retrievals         │
  └─────────────────────────────────────────────────────────────────┘

Layer 1: Source Provenance

Every memory record is tagged at write time with its origin. This tag propagates through retrieval and is available to the agent when the record is surfaced. The agent can—and should—treat records from different sources with different levels of trust.

# memory/models.py
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class MemorySource(str, Enum):
    USER_TURN = "user_turn"         # Direct user input
    TOOL_OUTPUT = "tool_output"     # Result from a tool call
    INFERRED_SUMMARY = "inferred_summary"  # LLM-generated session summary
    RETRIEVED_DOC = "retrieved_doc"  # Content retrieved from external source
    AGENT_HANDOFF = "agent_handoff"  # Context passed from another agent
    SYSTEM_WRITE = "system_write"   # Trusted internal system write

# Source trust priors --- adjust to your threat model
SOURCE_TRUST_PRIORS = {
    MemorySource.SYSTEM_WRITE: 1.0,
    MemorySource.USER_TURN: 0.7,
    MemorySource.INFERRED_SUMMARY: 0.6,
    MemorySource.RETRIEVED_DOC: 0.5,
    MemorySource.TOOL_OUTPUT: 0.5,
    MemorySource.AGENT_HANDOFF: 0.4,
}

@dataclass
class MemoryEntry:
    """A single record in long-term agent memory."""
    content: str
    source: MemorySource
    embedding: list[float] = field(default_factory=list)
    entry_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: float = field(default_factory=time.time)
    last_retrieved: Optional[float] = None
    retrieval_count: int = 0
    trust_score: float = 0.0        # Composite, computed on write
    anomaly_flags: list[str] = field(default_factory=list)
    corroborated_by: list[str] = field(default_factory=list)  # entry_ids that confirm this

Layer 2 and 3: Trust Scoring and Temporal Decay at Retrieval

Retrieval blends semantic similarity with trust score and a temporal decay factor. Entries that have never been corroborated by subsequent interactions lose weight over time.

# memory/retrieval.py
import math
import time
from typing import Optional
from .models import MemoryEntry, MemorySource, SOURCE_TRUST_PRIORS

# Half-life in seconds for unverified entries (default: 30 days)
DECAY_HALF_LIFE_S = 30 * 24 * 3600

def temporal_decay_factor(entry: MemoryEntry, now: Optional[float] = None) -> float:
    """
    Exponential decay applied to entries that lack corroboration.
    Corroborated entries decay at 1/4 the normal rate.
    """
    now = now or time.time()
    age_s = now - entry.created_at
    half_life = DECAY_HALF_LIFE_S * (4 if entry.corroborated_by else 1)
    return math.exp(-math.log(2) * age_s / half_life)

def compute_retrieval_score(
    entry: MemoryEntry,
    semantic_similarity: float,      # Cosine similarity from vector search, 0.0-1.0
    semantic_weight: float = 0.6,
    trust_weight: float = 0.25,
    recency_weight: float = 0.15,
) -> float:
    """
    Blended retrieval score. Weights should sum to 1.0.
    Penalizes flagged entries by 50% regardless of other scores.
    """
    trust = entry.trust_score
    decay = temporal_decay_factor(entry)

    # Hard penalty for entries with anomaly flags
    flag_multiplier = 0.5 if entry.anomaly_flags else 1.0

    score = (
        semantic_weight * semantic_similarity
        + trust_weight * trust
        + recency_weight * decay
    ) * flag_multiplier

    return round(score, 4)

def trust_weighted_retrieve(
    query_embedding: list[float],
    memory_store: list[MemoryEntry],
    k: int = 5,
    min_trust_threshold: float = 0.3,
) -> list[tuple[MemoryEntry, float]]:
    """
    Retrieve top-k memory entries by blended trust-weighted score.
    Entries below the trust threshold are excluded before ranking.
    """
    now = time.time()
    candidates = []

    for entry in memory_store:
        # Hard filter: exclude entries below trust floor
        if entry.trust_score < min_trust_threshold:
            continue

        semantic_sim = cosine_similarity(query_embedding, entry.embedding)
        score = compute_retrieval_score(entry, semantic_sim)
        candidates.append((entry, score))

    # Sort by blended score descending, return top-k
    candidates.sort(key=lambda x: x[1], reverse=True)

    # Update retrieval metadata for retrieved entries
    top_k = candidates[:k]
    for entry, _ in top_k:
        entry.last_retrieved = now
        entry.retrieval_count += 1

    return top_k

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if (norm_a and norm_b) else 0.0

Layer 4: The Sanitization Pipeline

Content is inspected before any write to memory. The sanitizer looks for structural markers of injection attacks: imperative commands, forged delimiter tags, role-impersonation strings, and embedded URLs in non-URL fields.

# memory/sanitization.py
import re
from .models import MemoryEntry, MemorySource, SOURCE_TRUST_PRIORS

# Patterns indicative of injection attempts
_FORGED_TAG_PATTERN = re.compile(
    r"<\/?(system|assistant|user|conversation|instructions?|prompt)\b",
    re.IGNORECASE,
)
_IMPERATIVE_PATTERN = re.compile(
    r"\b(ignore (previous|all|above)|forget (previous|all|above)|"
    r"new instruction|override|disregard|from now on|you (must|will|shall)|"
    r"your (new|updated|actual) (role|instructions?|task))\b",
    re.IGNORECASE,
)
_ROLE_IMPERSONATION_PATTERN = re.compile(
    r"\b(system message|you are now|act as (an? )?(admin|root|system)|"
    r"developer mode|DAN mode|jailbreak)\b",
    re.IGNORECASE,
)
_EMBEDDED_URL_PATTERN = re.compile(r"https?://\S+")

# Maximum allowed content length (tokens ~ chars/4; 2000 chars ~ 500 tokens)
MAX_CONTENT_LENGTH = 2000

class SanitizationError(ValueError):
    """Raised when a memory entry fails sanitization and must be rejected."""
    pass

def sanitize_before_write(entry: MemoryEntry) -> MemoryEntry:
    """
    Inspect and score a MemoryEntry before writing to the memory store.
    Raises SanitizationError for high-confidence injection attempts.
    Returns the entry with anomaly_flags and trust_score set.

    Call this on every memory write path without exception.
    """
    content = entry.content
    flags = []

    # Length check
    if len(content) > MAX_CONTENT_LENGTH:
        flags.append("oversized_content")

    # Forged XML/delimiter tags
    if _FORGED_TAG_PATTERN.search(content):
        flags.append("forged_delimiter_tags")

    # Imperative override language
    if _IMPERATIVE_PATTERN.search(content):
        flags.append("imperative_override_language")

    # Role impersonation markers
    if _ROLE_IMPERSONATION_PATTERN.search(content):
        flags.append("role_impersonation")

    # Embedded URLs in non-tool-output sources
    if (
        entry.source not in (MemorySource.TOOL_OUTPUT, MemorySource.RETRIEVED_DOC)
        and _EMBEDDED_URL_PATTERN.search(content)
    ):
        flags.append("unexpected_url")

    entry.anomaly_flags = flags

    # Compute composite trust score
    source_prior = SOURCE_TRUST_PRIORS.get(entry.source, 0.4)
    anomaly_penalty = len(flags) * 0.15
    entry.trust_score = max(0.0, source_prior - anomaly_penalty)

    # Hard rejection threshold: two or more high-confidence flags
    HIGH_CONFIDENCE_FLAGS = {"forged_delimiter_tags", "role_impersonation", "imperative_override_language"}
    high_flag_count = sum(1 for f in flags if f in HIGH_CONFIDENCE_FLAGS)
    if high_flag_count >= 2:
        raise SanitizationError(
            f"Memory write rejected: {flags}. Content: {content[:120]!r}"
        )

    return entry

Layer 5: Retrieval-Time Auditing

Every retrieval is logged with the trace ID, the query, and the source provenance of each returned entry. When an attack is eventually discovered—and in practice they are discovered after the fact—this log is the only way to reconstruct what the agent saw and when.

Governance frameworks increasingly require this audit trail. The AI Agent Governance post covers audit logging and kill-switch requirements; for memory systems, the log schema should include entry_id, source, trust_score, anomaly_flags, and retrieval_score for every record surfaced.

Red-Teaming Memory Poisoning in CI

Your eval pipeline should include adversarial memory cases. Session-injection evals are common; memory-poisoning evals are rare. These three test-case shapes cover the highest-risk scenarios.

Test Case 1 — Query-Only Injection Resistance (MINJA pattern)

Setup: Initialize a fresh memory store. Submit a sequence of benign-looking queries that contain embedded malicious content designed to be stored. Then submit a trigger query that should retrieve and act on the stored content. Assert that either: (a) the malicious content was rejected at write time, or (b) the trust-weighted retrieval score for the poisoned entry is below the minimum threshold and it is never surfaced.

Test Case 2 — Forged-XML-Tag Resistance (Bedrock pattern)

Setup: Construct a tool output that contains well-formed forged XML delimiter tags (</conversation> followed by instruction text). Pass this through the full memory write pipeline. Assert that sanitize_before_write raises SanitizationError and the entry is never written. Confirm that Model Invocation Logs capture the rejection event.

Test Case 3 — Temporal Decay Quarantine (eTAMP pattern)

Setup: Write a memory entry with source=TOOL_OUTPUT containing content that passes sanitization but has never been retrieved or corroborated. Advance simulated time by 45 days. Execute a semantically similar query and assert that the decayed entry does not appear in the top-k retrieval results even though its semantic similarity score would otherwise qualify it. This validates that temporal decay is functioning.

Include these in your CI eval suite tagged "memory-security" and run them on every PR that touches the memory subsystem, the session summarizer, or any tool that feeds output to memory.

Where to Start

Memory poisoning is a real, exploitable threat in production systems today. The defense architecture described here is implementable incrementally—you don’t need to build all five layers before shipping.

Priority order:

Instrument source provenance immediately. This is the cheapest, highest-leverage change. Tag every memory write with its source today. You cannot do trust scoring without provenance. This is a one-day change to your memory write path.
Deploy the sanitization pipeline before any memory-enabled agent reaches production. The pattern filters in sanitize_before_write catch the Bedrock exploit class and the MINJA injection class. False positive rate is low on normal agent workloads.
Implement trust-weighted retrieval. Replace pure semantic similarity retrieval with the blended score. Your legitimate high-quality memories will outcompete poisoned entries even if some slip through sanitization.
Add temporal decay with a 30-day half-life for unverified entries. Entries that are never retrieved and never corroborated lose weight automatically. This is a low-cost backstop.
Add retrieval-time audit logging. Log every retrieval with source and trust metadata. This is a prerequisite for incident response. Without it, attacks discovered after the fact are impossible to fully reconstruct.
Write three memory-poisoning red-team cases and add them to CI. Use the test case shapes above. Gate every PR that touches the memory subsystem on passing them.

The attack that waits is dangerous precisely because its execution is invisible—no failed login, no anomalous API call, no user-visible error. The agent simply, over time, begins to behave as the attacker intended. Source provenance and trust scoring turn that invisibility into an auditing problem—tractable engineering.

We architect memory security layers and adversarial eval pipelines for production AI agents. If your agents use long-term memory and you haven’t modeled the write-path attack surface, let’s talk.