Long-running LLM agents do not fail because their models are wrong. They fail because they forget — or because they try to remember everything at once and collapse under token costs that scale with conversation length. The architectural answer is not a bigger context window. It is a memory hierarchy that treats agent state the same way an operating system treats RAM: tiered, demand-paged, and governed by explicit eviction policy.
The memory problem long-running LLM agents actually hit
Bottom Line: Flat vector recall — stuffing all prior context into either the prompt or a single vector retrieval call — cannot deliver continuity for multi-session agents at production scale. It either exhausts the context window and drives token cost up with session length, or it retrieves semantically adjacent but contextually wrong memories that silently corrupt the agent's reasoning. Hierarchical memory solves both problems by separating what the agent needs right now, what it might need soon, and what it should remember forever, and by loading each layer on demand rather than wholesale.
Agentic workflow state management today is dominated by one of two failure modes: the agent carries its entire conversation history in the prompt (burning tokens at every step), or it discards history between sessions and treats each invocation as stateless. Neither scales.
The Mem0 research team quantified this precisely. On the LOCOMO benchmark, their hierarchical memory architecture reduced p95 latency by over 91% compared to full-context baselines, and cut token costs by more than 90%, while simultaneously improving answer quality — a 5% relative gain on single-hop questions, 11% on temporal questions, and 7% on multi-hop questions, each measured against the best-performing baseline for that question type. The authors frame the core problem directly: "Large Language Models have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues."
Flat RAG architecture, the other common approach, substitutes retrieval for memory but does not constitute memory. A vector similarity search retrieves documents that are semantically close to the current query — it does not preserve task state, goal hierarchies, intermediate plans, or the temporal ordering of user commitments. An agent that only does RAG will retrieve the right facts and still hallucinate a plan, because plan state is not a document; it is a structured object that must survive between turns.
Why the context window behaves like L1 cache, not a database
The context window is the highest-bandwidth, highest-cost, shortest-lived storage in the entire agent memory stack. Every token in it is re-processed at every inference step. That is the defining property of L1 cache: it is fast, expensive, and small relative to the total information the system needs. Current models like GPT-5.4 support up to 1.05 million tokens, but that scale does not make the context window a database. It makes the cost problem worse: per the GPT-5.4 API documentation, prompts above the documented large-prompt threshold trigger higher-rate billing for the session.
A database persists passively between reads. The context window must be explicitly constructed on every call. The architectural implication is that content in the context window should be treated like cache lines — loaded precisely because the current task needs them, evicted as soon as they are no longer hot, and never held just because they might be useful later.
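To make the cache analogy concrete, here is a minimal, framework-agnostic sketch of per-call L1 construction. Every name in it (the `WorkingMemory` dataclass, `count_tokens`, `build_l1`) is illustrative rather than drawn from any library; the point is that the prompt is rebuilt from hot state on every call and checked against an explicit budget.

```python
# Illustrative sketch: assemble L1 fresh each turn from hot state only.
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:                       # L2: compacted task state kept outside the prompt
    goal: str = ""
    open_subtasks: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)

def count_tokens(text: str) -> int:        # placeholder; swap in your provider's tokenizer
    return len(text.split())

def build_l1(system_prompt: str, l2: WorkingMemory, user_msg: str,
             retrieved_facts: list[str], budget: int = 32_000) -> list[dict]:
    """Rebuild the prompt on every call: hot state only, nothing speculative."""
    state_block = (
        f"Goal: {l2.goal}\n"
        f"Open sub-tasks: {l2.open_subtasks}\n"
        f"Decisions so far: {l2.decisions}"
    )
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": f"[working memory]\n{state_block}"},
    ]
    if retrieved_facts:                    # only what a page fault pulled in this turn
        messages.append({"role": "system",
                         "content": "[retrieved facts]\n" + "\n".join(retrieved_facts)})
    messages.append({"role": "user", "content": user_msg})
    total = sum(count_tokens(m["content"]) for m in messages)
    assert total <= budget, "L1 over budget: compact L2 or retrieve fewer facts"
    return messages
```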
The gap in most agent memory discussions is a conflation: teams treat their vector database as the whole memory system, when in fact it maps only to L3 archival storage. The full hierarchy has three tiers:
```mermaid
flowchart LR
    subgraph L1["L1 · Prompt Context (Hot)"]
        A["Active instructions\nCurrent turn input\nTool call results (ephemeral)\nLatest user message"]
    end
    subgraph L2["L2 · Working Memory (Warm)"]
        B["Task plan & goal state\nSession scratchpad\nPending sub-tasks\nRecent turn summaries"]
    end
    subgraph L3["L3 · Archival Retrieval (Cold)"]
        C["Vector store\n(pgvector / Pinecone / FAISS)\nLong-term facts\nUser preference history\nKnowledge archives"]
    end
    L3 -- "demand page into L2 on retrieval hit" --> L2
    L2 -- "promote hot state into L1 on task start" --> L1
    L1 -- "demote completed state to L2 on turn end" --> L2
    L2 -- "evict cold state to L3 on compaction" --> L3
```
Each tier maps to a distinct storage backend, latency profile, and cost model. The boundaries between tiers are not fixed — content moves across them under explicit promotion and eviction rules, exactly as a hardware cache controller manages cache lines.
L1: Immediate prompt context and why it is the most expensive layer
L1 holds only what the model must attend to in the current inference call: the system prompt, the active user message, the output schemas for current tools, and the immediate results of the most recent tool invocations. Everything else is overhead.
The cost pressure at L1 compounds because every token in the prompt is re-processed on every call: a larger prompt raises spend and latency for each step, not just the current one. Under GPT-5.4 pricing, once a session crosses the documented large-prompt threshold, the higher-rate tier applies for that session; that alone makes poor L1 hygiene expensive, without needing to invent an official percentage-of-spend figure for prompt context.
Pro Tip: Ephemeral tool output — the raw JSON from an API call, a scraped HTML body, an intermediate code execution result — must never persist in L1 across turns. Extract the conclusions from tool output, write them to L2 as structured state, and drop the raw payload before the next inference. A single search result page left in context across ten turns costs more than ten fresh retrievals from L3.
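A minimal sketch of that demote-then-drop pattern follows. The `llm_extract_conclusions` helper is hypothetical, standing in for a cheap model call; only its structured output reaches L2, and the raw payload never re-enters the prompt.

```python
# Illustrative sketch: extract conclusions from tool output, persist to L2, drop the payload.
def llm_extract_conclusions(raw_payload: str, task_goal: str) -> dict:
    """Placeholder: condense raw tool output into structured findings,
    e.g. {"findings": [...], "sources": [...], "confidence": "high"}."""
    raise NotImplementedError

def handle_tool_result(raw_payload: str, working_memory: dict, task_goal: str) -> dict:
    conclusions = llm_extract_conclusions(raw_payload, task_goal)
    working_memory.setdefault("tool_findings", []).append(conclusions)  # persist to L2
    # Return only the conclusions for the current turn's L1; the raw payload is never
    # appended to the message history and does not survive past this turn.
    return conclusions
```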
Agentic workflow state management at L1 means treating the prompt as a register file, not a log. What the model needs to act belongs there; what it needs to remember belongs in L2 or L3.
L2: Short-term working memory for task state and scratchpad continuity
L2 holds the agent's active task graph: the decomposed plan, intermediate conclusions, completed sub-tasks, and the current goal hierarchy. This is the scratchpad that persists across turns within a session but does not need to be re-processed from scratch on every call. It is warm memory — not in the prompt by default, but loaded into L1 at the start of each turn and flushed back after.
The design challenge at L2 is compaction. An agent that appends every assistant response and tool result to a growing L2 buffer will exhaust its working memory budget within a few dozen turns. The correct behavior is progressive summarization: when L2 reaches a configured token threshold, the agent issues a compaction pass — a targeted LLM call that condenses the scratchpad into a structured state object representing goals, decisions made, and open sub-tasks.
Production Note: Trigger L2 compaction when working memory approaches your configured budget rather than waiting for overflow. The exact threshold is authorial engineering guidance, not a verified benchmark; a practical starting range is 60–70% of session token budget if your compaction model and backend can tolerate the extra call. Compacting early preserves fidelity, and LangGraph's checkpoint mechanism provides a natural compaction hook: store the compacted state dict as the checkpoint payload, not the full message history.
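The sketch below shows what such a compaction trigger can look like, assuming a hypothetical `call_llm` chat-completion helper and a placeholder token counter. The budget and threshold mirror the policy spec later in this article and are authorial defaults, not verified values.

```python
# Illustrative L2 compaction pass: condense the scratchpad into a structured state object.
import json

L2_BUDGET_TOKENS = 8_000
COMPACTION_THRESHOLD = 0.65        # trigger well before overflow to preserve fidelity

def count_tokens(text: str) -> int:
    return len(text.split())       # placeholder; swap in your provider's tokenizer

def maybe_compact(scratchpad: list[str], call_llm) -> list[str]:
    used = sum(count_tokens(entry) for entry in scratchpad)
    if used < COMPACTION_THRESHOLD * L2_BUDGET_TOKENS:
        return scratchpad          # still within budget; no compaction needed
    prompt = (
        "Condense the following agent scratchpad into a JSON object with keys "
        "'goals', 'decisions', and 'open_subtasks'. Drop raw tool output.\n\n"
        + "\n".join(scratchpad)
    )
    compacted = call_llm(model="gpt-4.1-mini",
                         messages=[{"role": "user", "content": prompt}])
    state = json.loads(compacted)  # structured state object, not message history
    # Persist `state` as the checkpoint payload (e.g. a LangGraph checkpoint),
    # then start a fresh scratchpad seeded with the compacted state.
    return [json.dumps(state)]
```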
Mem0's architecture confirms the value of this approach: their method's over 91% lower p95 latency versus full-context baselines is largely attributable to not re-ingesting full history on every call — the functional analog of keeping L2 compact and loading only relevant state into L1.
L3: Long-term retrieval from vector stores and knowledge archives
L3 is archival storage — facts, preferences, historical decisions, and knowledge that the agent should be able to recall but does not need in every prompt. Vector databases including pgvector, Pinecone, and FAISS are the canonical backends. Retrieval from L3 is triggered on demand, not preloaded — the agent issues an embedding query when it detects a knowledge gap, and the result is loaded into L1 for the current turn.
The critical architectural point is that L3 is not a substitute for L2. A vector store can tell an agent what a user prefers; it cannot tell the agent what sub-task it was in the middle of executing. State and knowledge are different objects with different retrieval semantics.
Limitation Note: The verified sources in this article do not provide a controlled benchmark that compares pgvector, Pinecone, or FAISS directly against in-context state storage under the same task. The supported claim is narrower: archival retrieval is suitable for persistent facts and preferences, while short-term task state belongs in a structured working store.
| Dimension | Archival Recall (L3) | Semantic Search (L3 variant) | State Storage (L2) |
|---|---|---|---|
| Backend | Vector DB (Pinecone, pgvector) | Embedding + reranker | Structured store (Redis, Postgres) |
| Query type | ANN similarity | Semantic relevance | Key-value or graph lookup |
| Typical latency (authorial estimate) | 10–100 ms | 50–300 ms (with reranker) | < 5 ms |
| Content type | Facts, preferences, documents | Passages, conversation chunks | Goals, plans, task state |
| Session scope | Persistent across sessions | Persistent across sessions | Single session or short-term |
| Evictable? | Yes, with TTL or semantic eviction | Yes | Yes, on task completion |
Mem0's evaluation on LOCOMO demonstrates that an L3-backed retrieval layer, properly integrated, improves factual quality on multi-hop and temporal questions while keeping latency well below full-context approaches — because only the retrieved facts enter the prompt, not the entire history.
Demand paging for agent memory: what gets loaded, what gets evicted, what gets rebuilt
Demand paging in operating systems loads a memory page into RAM only when a process faults on it — not proactively. Agent memory should work the same way. The agent does not preload all user history into L1 at session start. It starts with a minimal L1 (system prompt, current message, active task state from L2), and triggers retrieval from L3 only when it detects a gap.
The Mem0 architecture operationalizes this directly: dynamic extraction, consolidation, and retrieval of salient memories across sessions, activated by the current task context rather than by time or position in history. Their results — over 90% token-cost savings versus full-context methods — are the empirical case for demand paging over wholesale context loading.
A concrete policy specification for a three-tier paging system, presented here as authorial guidance rather than a verified benchmark, looks like this:
```yaml
memory_paging_policy:
  l1_prompt_budget_tokens: 32000
  l2_working_memory_budget_tokens: 8000
  l2_compaction_trigger:
    threshold_pct: 0.65            # authorial default, not a verified benchmark
    compaction_model: "gpt-4.1-mini"
    output_format: "structured_state_json"
  l3_retrieval:
    trigger: "page_fault"          # only retrieve when agent signals missing context
    top_k: 5
    rerank: true
    min_score_threshold: 0.72      # authorial default, not a verified benchmark
  promotion_rules:
    l3_to_l2: "on retrieval hit with score >= 0.85 AND recency < 7 days"
    l2_to_l1: "on task start, load active goal + last 3 compacted states"
  eviction_rules:
    l1_eviction: "end of turn — demote all non-sticky content to L2"
    l2_eviction: "on compaction — move completed sub-tasks to L3 archive"
    l3_eviction: "LRU with semantic override — see eviction policy section"
```
The key design decision in this spec is that retrieval from L3 is pull-only. The agent explicitly signals a page fault — a missing fact, a missing state object — and the memory controller loads it. This prevents speculative loading that fills L1 with context that never gets attended to.
Page faults in agent terms: missing facts, missing state, and missing plans
An agent page fault occurs when the model's response signals a knowledge gap that should have been resolvable from memory. Three categories exist: missing facts (the agent does not know a user's stated preference), missing state (the agent does not know where it was in a multi-step task), and missing plans (the agent cannot reconstruct its goal hierarchy after a session gap).
Detecting page faults in production requires instrumentation, not hope. A practical pattern is to have the agent emit a structured signal — for example, a retrieve_memory tool call — when it detects uncertainty about a fact it believes it should know. That signal is the page fault; the retrieval call is the page load.
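One way to wire that signal is a dedicated tool the model can call whenever it believes a fact, state object, or plan should exist in memory but is absent from context. The schema below follows the common JSON function-calling shape; the tool name, fields, and score threshold are illustrative, not taken from any specific SDK.

```python
# Illustrative page-fault tool definition plus the corresponding page-load handler.
RETRIEVE_MEMORY_TOOL = {
    "name": "retrieve_memory",
    "description": (
        "Call this when you need a fact, preference, or prior decision that you "
        "believe was established earlier but is not present in the current context."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string",
                      "description": "Natural-language description of the missing information"},
            "kind": {"type": "string", "enum": ["fact", "state", "plan"],
                     "description": "Which page-fault category this is"},
        },
        "required": ["query", "kind"],
    },
}

def handle_page_fault(query: str, kind: str, l3_search, top_k: int = 5) -> list[dict]:
    """The page load: run the L3 retrieval and return records to splice into L1 for this turn."""
    hits = l3_search(query=query, top_k=top_k)       # embedding + ANN search
    return [h for h in hits if h["score"] >= 0.72]   # authorial threshold, see policy spec
```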
Watch Out: Do not confuse retrieval miss rate with true memory loss. A retrieval miss (L3 returns no result above the similarity threshold) means the fact was never stored, or was stored under an incompatible embedding. True memory loss means the fact was stored and should have been retrieved but was not — a precision/recall failure in the vector index. Conflating the two leads to over-provisioning L3 storage when the actual problem is embedding quality or chunk granularity.
Mem0's LOCOMO results — 5% improvement on single-hop, 11% on temporal, 7% on multi-hop questions — represent the measurable benefit of correctly handling page faults for each question type: single-hop faults are simple fact retrievals, temporal faults require time-ordered state, and multi-hop faults require chained retrievals across L2 and L3.
Eviction policies: LRU, LFU, and semantic eviction compared
No single eviction policy dominates all agent memory regimes. The choice depends on access pattern, content type, and whether temporal recency or semantic relevance predicts future utility.
| Policy | Trigger | Benefit | Failure mode | Best fit |
|---|---|---|---|---|
| LRU (Least Recently Used) | Time since last access | Simple to implement; handles bursty access well | Evicts rare but critical facts that haven't been accessed recently | Session scratchpad; short-horizon agents |
| LFU (Least Frequently Used) | Access count over window | Retains frequently referenced preferences and facts | Newly added critical facts are vulnerable before they accumulate hits | User preference stores; domain knowledge archives |
| Semantic Eviction | Embedding distance to current task context | Retains contextually relevant memories regardless of access pattern | Computationally expensive; requires embedding the current task state on each eviction pass | Long-horizon research agents; multi-domain task agents |
Engineering guidance, not a verified benchmark conclusion: LRU is often the pragmatic default for L2 working memory because task state access is temporally clustered — you need the last few plan steps, rarely the ones from ten turns ago. LFU is more appropriate for L3 preference stores where a user's standing preferences should survive long gaps between sessions. Semantic eviction is the strongest policy for L3 archival retrieval in agents with variable task domains, but the embedding cost of re-scoring every candidate on each eviction pass is non-trivial at scale; batch the eviction pass rather than running it per-turn.
Mem0's consolidation mechanism — extracting and merging salient memories rather than storing raw conversation chunks — implicitly functions as a semantic eviction filter at write time. Facts that are not salient enough to survive extraction are never written to L3, which is a form of pre-eviction at ingest.
Promotion and demotion rules across memory tiers
Content moves up the hierarchy (toward L1) when the current task context assigns it high utility, and down (toward L3) when it becomes cold. Defining promotion and demotion mathematically prevents ad-hoc decisions that corrupt agent continuity.
A utility function for promotion scoring:
$$U(m, t) = \alpha \cdot \text{sem\_sim}(m, q_t) + \beta \cdot \text{recency}(m, t) + \gamma \cdot \text{freq}(m, t)$$
Where:
- $m$ is a memory record
- $q_t$ is the embedding of the current task context at time $t$
- $\text{sem\_sim}$ is the cosine similarity between the memory embedding and the task context
- $\text{recency}(m, t) = e^{-\lambda(t - t_m)}$ — exponential decay since the last access at $t_m$
- $\text{freq}(m, t)$ is normalized access frequency over a rolling window
- $\alpha, \beta, \gamma$ are weights summing to 1 (typical starting point: 0.5, 0.3, 0.2)
Records with $U > \theta_{\text{promote}}$ (for example, 0.75) are candidates for promotion from L3 to L2 at the start of the next session. Records with $U < \theta_{\text{demote}}$ (for example, 0.25) after a compaction pass are evicted from L2 to L3. Those thresholds are illustrative design defaults, not verified benchmark cutoffs.
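A direct implementation of this scoring, including the batch pass recommended in the eviction section above, might look like the following. Weights, thresholds, and the frequency normalization window are the illustrative defaults from the text, not tuned or verified values.

```python
# Illustrative utility scoring plus a batched promotion/demotion pass.
import math
import time

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def utility(mem: dict, task_embedding: list[float], now: float | None = None,
            alpha: float = 0.5, beta: float = 0.3, gamma: float = 0.2,
            decay_lambda: float = 1 / (7 * 86_400)) -> float:
    now = now or time.time()
    sem = cosine(mem["embedding"], task_embedding)
    recency = math.exp(-decay_lambda * (now - mem["last_accessed_at"]))
    freq = min(mem["access_count"] / 50.0, 1.0)      # normalized over a rolling window
    return alpha * sem + beta * recency + gamma * freq

def batch_tiering_pass(memories: list[dict], task_embedding: list[float],
                       promote_at: float = 0.75, demote_at: float = 0.25):
    """Score all candidates in one batch pass (run periodically, not per turn)."""
    promote, demote = [], []
    for mem in memories:
        u = utility(mem, task_embedding)
        if u > promote_at:
            promote.append(mem)      # L3 -> L2 candidate at next session start
        elif u < demote_at:
            demote.append(mem)       # L2 -> L3 eviction after compaction
    return promote, demote
```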
The RAG architecture connection: when a record is loaded from L3 via vector retrieval, its (\text{recency}) and (\text{freq}) terms should be updated in the L3 metadata store — not just the embedding index. This metadata update is what enables LFU and recency-weighted eviction at L3; pure vector databases that store only embeddings and payload cannot implement this without a parallel metadata table (e.g., a pgvector table with an accessed_at column and an access_count integer).
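As a sketch of that metadata update, assuming a psycopg connection to a pgvector-backed Postgres table with `accessed_at` and `access_count` columns (table and column names are illustrative):

```python
# Illustrative L3 metadata update on a retrieval hit, so LFU and recency-weighted
# eviction have current data to work with.
import psycopg

def record_retrieval_hit(conn: psycopg.Connection, memory_ids: list[str]) -> None:
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE agent_memories
               SET access_count = access_count + 1,
                   accessed_at  = now()
             WHERE id::text = ANY(%s)
            """,
            (memory_ids,),
        )
    conn.commit()
```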
How the hierarchy changes token cost, latency, and answer continuity
The empirical case for hierarchical memory is anchored in Mem0's LOCOMO evaluation. A full-context baseline — carrying the entire conversation history in the prompt — achieves the highest theoretical completeness but pays for it with latency and cost that scale with conversation length. Mem0's hierarchical approach cuts p95 latency by over 91% and token cost by more than 90% while improving answer quality on three of four question categories.
| Dimension | Full-context baseline | Flat RAG | Hierarchical memory (Mem0-style) | Verified status |
|---|---|---|---|---|
| Token cost per turn | O(session length) | O(retrieval chunk size) | O(L1 budget) — bounded | Verified as qualitative scaling behavior |
| p95 latency | Scales with total tokens | Moderate (retrieval + generation) | >91% lower than full-context (LOCOMO) | Verified on LOCOMO for Mem0-style system |
| Single-hop QA accuracy | Baseline | Near-baseline | +5% relative (LOCOMO) | Verified on LOCOMO |
| Temporal QA accuracy | Baseline | Degrades without temporal ordering | +11% relative (LOCOMO) | Verified on LOCOMO |
| Multi-hop QA accuracy | Baseline | Degrades without state chaining | +7% relative (LOCOMO) | Verified on LOCOMO |
| Session continuity | High (all history present) | Low (stateless between sessions) | High (L2 preserves state; L3 preserves facts) | Supported by architecture, not a direct benchmark metric |
| Retrieval overhead | Not applicable | Retrieval + rerank required | Present, but no direct verified measurement in the cited sources | Limitation: no direct retrieval-overhead number in the sources |
The retrieval overhead of L3 queries — the latency cost of embedding and ANN search — is one dimension where hierarchical memory can pay a per-turn cost relative to a single small prompt. For a single-turn query where all relevant history fits comfortably in the context window, flat context can be faster. The hierarchy pays for itself by session 5, 10, or 50, when full-context latency has grown in proportion to history while the retrieval overhead remains a fixed per-turn constant.
Where flat RAG wins and where it fails
Flat RAG — embed the query, retrieve top-k documents, stuff into prompt, generate — is a practical architecture for a specific regime: single-session, low-stakes, stateless question-answering over a fixed document corpus.
It fails the moment the task requires any of the following: tracking what was decided in a prior turn, maintaining a goal hierarchy across sessions, updating a belief based on new information from the user, or correlating facts across multiple retrieved documents in a temporal sequence. These are state operations, and RAG has no state.
Pro Tip: Flat RAG is sufficient for agents that field lookup-style queries against a document corpus (internal knowledge base search, product catalog Q&A, policy lookup) where no personalization or session continuity is required. The inflection point where you need L2 working memory is when the agent must plan across more than one turn or remember a user's prior stated preferences. Add L2 before adding L3 — the absence of working memory hurts more than the absence of archival retrieval.
The GPT-5.4 pricing model reinforces this: a 1M-token context window sounds like a solution to the memory problem, but prompts above the documented large-prompt threshold are billed at a higher rate for the session. Architectures that rely on context-window scale as a substitute for memory hierarchy will hit pricing cliffs at exactly the volume where they need reliability most.
Where hierarchical memory pays off in customer support and research agents
The two canonical use cases for hierarchical memory are per-account support agents and long-horizon research agents. Their memory requirements differ significantly by tier.
| Use case | Required L1 content | Required L2 content | Required L3 content | Primary retrieval pattern |
|---|---|---|---|---|
| Customer support | Current issue, account tier, open ticket | Prior resolution history (session), stated preferences | Account history (all sessions), product knowledge | Scoped by tenant ID; temporal ordering matters |
| Research agent | Current hypothesis, active sources | Research plan, cited claims, open questions | Literature archive, prior findings | Semantic similarity; multi-hop chaining |
| Coding assistant | Active file, current error | Project structure, design decisions | Code history, API docs | Hybrid: semantic + exact-match |
| Personalized assistant | Current request | Short-term task state | Long-term preferences, life context | Recency-weighted; preference consolidation |
Customer support agents need L3 scoped strictly to the account — a fact about one user must never surface in another user's context. Research agents need L3 to support multi-hop retrieval chains: finding paper A leads to citing author B leads to retrieving paper C. Mem0's architecture, designed explicitly for prolonged multi-session dialogues, aligns most directly with customer support continuity; its LOCOMO benchmark covers the temporal and multi-hop patterns that research agents require.
Production failure modes that break agent memory systems
Watch Out: The three most common failure modes in production agent memory systems are (1) stale memories that contradict updated user state — the agent recalls a preference the user changed six months ago; (2) prompt bloat from over-loading L1 — the agent retrieves too many L3 results and saturates the context budget, crowding out the actual task; and (3) cross-user memory contamination in multi-tenant deployments — one user's retrieved facts appear in another user's context due to absent or misconfigured namespace isolation. All three are silent failures: the agent produces a response, but the response is wrong.
"Fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues," as Mem0's authors note — but the failure modes above are not window failures, they are architecture failures that any window size will fail to fix.
Stale or contradictory memories
Stale memories accumulate when the memory system stores facts as immutable records. A user changes their city; the old city record persists in L3 with a high utility score accumulated from prior sessions; the new city record scores lower on frequency because it was just written. The agent retrieves the wrong city.
The architectural fix is explicit versioning and invalidation: every L3 record can carry a created_at timestamp, an optional valid_until timestamp, and a superseded_by pointer when your implementation needs them. Treat this as a proposed schema, not a verified requirement. The retrieval path must filter out superseded records before scoring.
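A possible shape for that proposed schema, written as plain SQL held in a Python module and assuming pgvector is installed (all table and column names are illustrative):

```python
# Illustrative versioned L3 schema with explicit invalidation; requires CREATE EXTENSION vector.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agent_memories (
    id            UUID PRIMARY KEY,
    tenant_id     TEXT NOT NULL,
    content       TEXT NOT NULL,
    embedding     vector(1536),             -- pgvector column
    confidence    REAL DEFAULT 0.5,
    source        TEXT DEFAULT 'inferred',  -- 'user_stated' records merit higher confidence / longer TTL
    created_at    TIMESTAMPTZ DEFAULT now(),
    valid_until   TIMESTAMPTZ,              -- NULL means no expiry
    superseded_by UUID REFERENCES agent_memories(id),
    accessed_at   TIMESTAMPTZ DEFAULT now(),
    access_count  INTEGER DEFAULT 0
);
"""

# Retrieval must filter out superseded and expired records *before* scoring:
ACTIVE_FILTER = "superseded_by IS NULL AND (valid_until IS NULL OR valid_until > now())"
```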
Pro Tip: Attach a confidence score and a `source` label to every L3 record at write time. Facts derived from direct user statements should carry higher confidence and a longer TTL than facts the model inferred from conversation. When two records conflict and neither is explicitly superseded, surface both to the model with their confidence scores rather than silently arbitrating — the model can usually resolve the conflict given both options and ask the user to confirm.
Mem0's consolidation step — dynamically merging salient memories across sessions — is a write-time deduplication mechanism that partially addresses this, but it does not replace explicit invalidation for user-corrected facts.
Multi-tenant memory isolation
A memory system that serves more than a handful of users must enforce strict namespace isolation at the retrieval layer. Vector similarity search, by default, is global across the index. Without a tenant-scoped filter applied at query time, a retrieved memory from user A can appear in a query response for user B if the embeddings are similar enough.
Production Note: Implement isolation at the retrieval layer and in your data model; treat the following as production guidance, not as an authoritative source claim. First, scope the data model: in Pinecone, namespaces are the usual mechanism; in pgvector, many teams add a `tenant_id` column and a composite index on `(tenant_id, embedding)`. Second, enforce the tenant filter in every retrieval call as a hard pre-filter, not a post-filter on results. Post-filtering allows the ANN search to surface cross-tenant candidates before they are discarded; pre-filtering prevents cross-tenant candidates from entering the candidate set at all. For deployments beyond a few thousand tenants, evaluate whether a single shared index with namespace filtering or per-tenant dedicated index shards is more operationally sustainable — the latter trades management overhead for absolute isolation guarantees.
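A sketch of a hard pre-filtered retrieval against pgvector via psycopg follows; the table and columns are illustrative and match the proposed schema earlier in this article.

```python
# Illustrative tenant-scoped pre-filtered retrieval: the tenant predicate sits inside the
# WHERE clause, so cross-tenant rows never enter the candidate set. The query vector is
# passed as a text literal and cast to ::vector.
import psycopg

def retrieve_for_tenant(conn: psycopg.Connection, tenant_id: str,
                        query_embedding: list[float], top_k: int = 5) -> list[tuple]:
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, content, 1 - (embedding <=> %s::vector) AS score
              FROM agent_memories
             WHERE tenant_id = %s                               -- hard pre-filter
               AND superseded_by IS NULL
               AND (valid_until IS NULL OR valid_until > now())
             ORDER BY embedding <=> %s::vector
             LIMIT %s
            """,
            (vec_literal, tenant_id, vec_literal, top_k),
        )
        return cur.fetchall()
```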
Agentic workflow state management in multi-tenant systems also requires that L2 working memory be scoped per session and per user: a LangGraph checkpoint store keyed by (user_id, session_id) prevents state bleed across concurrent user sessions on the same agent instance.
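Whether the backing store is a LangGraph checkpointer, Redis, or Postgres, the key discipline is the same; a minimal framework-agnostic sketch of a checkpoint store scoped by (user_id, session_id):

```python
# Illustrative checkpoint store keyed by (user_id, session_id) to prevent state bleed
# across concurrent sessions on the same agent instance.
import json

class ScopedCheckpointStore:
    def __init__(self, backend: dict | None = None):
        self._backend = backend if backend is not None else {}   # swap for Redis/Postgres

    @staticmethod
    def _key(user_id: str, session_id: str) -> str:
        return f"checkpoint:{user_id}:{session_id}"

    def save(self, user_id: str, session_id: str, state: dict) -> None:
        self._backend[self._key(user_id, session_id)] = json.dumps(state)

    def load(self, user_id: str, session_id: str) -> dict | None:
        raw = self._backend.get(self._key(user_id, session_id))
        return json.loads(raw) if raw else None
```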
Framework reality check: A-MEM, Mem0, and adjacent prototypes
The agent memory framework space in 2026 is still early-stage. Most implementations are research prototypes or thin wrappers over vector databases. Only Mem0 has published a peer-reviewed benchmark with verifiable numbers under a standard evaluation harness.
| Framework | Maturity | Extensibility | Verified benchmark | Scale limits | Verdict |
|---|---|---|---|---|---|
| Mem0 | Production-ready (per authors) | High — pluggable backends, custom extractors | LOCOMO: +5/11/7% QA, >91% latency reduction, >90% token savings (arXiv:2504.19413) | Not officially published per-tenant scaling limits | Most evidence-backed option in this article's source set |
| A-MEM | Research prototype | Moderate — zettelkasten-style note graph | No peer-reviewed benchmark retrieved | Unknown at scale | Interesting architecture; not production-validated |
| LangGraph state | Stable | High — integrates with any Python backend | No memory-specific benchmark; checkpoint latency depends on backend | Scales with checkpoint store (Redis, Postgres) | Solid L2 working memory; not a full memory hierarchy solution |
| Custom pgvector | As mature as your implementation | Unlimited | No published benchmark; depends on implementation | Postgres scaling limits apply | Viable L3 backend; requires you to build L1/L2 yourself |
The honest assessment: no framework in the current ecosystem provides a fully implemented, production-validated three-tier hierarchy with demand paging and eviction policies out of the box. Mem0 comes closest on the archival retrieval and consolidation side. LangGraph provides the scaffolding for L2 state management. Engineers building production systems will need to compose these tools and implement eviction and promotion logic themselves, using the policy specification in this article as a starting point.
FAQ
Bottom Line: Agent memory is not just RAG. RAG is the L3 retrieval mechanism; real agent memory also needs L2 working memory for task state and an eviction/promotion policy that keeps continuity intact across turns and sessions.
What is the memory hierarchy in LLMs?
The memory hierarchy maps agent storage to three tiers analogous to CPU cache: L1 (the active prompt context — high cost, small capacity, re-processed on every call), L2 (session working memory — structured task state and scratchpad that persists across turns but lives outside the prompt by default), and L3 (vector-based long-term archival retrieval — facts and preferences that persist across sessions and are loaded on demand).
How does demand paging work in AI agents?
The agent starts each turn with a minimal L1 containing only the current task state and active instructions. When the agent detects a knowledge gap — a missing fact, a missing plan, a missing user preference — it issues a retrieval call to L3. The retrieved record is loaded into L1 for the current turn. At turn end, non-sticky content is demoted back to L2 or L3. Nothing is preloaded speculatively.
What is the difference between RAG and agent memory?
Bottom Line: RAG is a retrieval mechanism — it fetches semantically similar documents from a vector store and injects them into a prompt. Agent memory is a state management system — it tracks what the agent knows, what it is doing, what it has decided, and what it must remember across sessions. RAG is one component of L3 retrieval. It is not a substitute for L2 working memory or for the eviction and promotion rules that govern the full hierarchy. An agent that only does RAG has no session continuity, no plan state, and no mechanism to update or invalidate stale facts.
Which eviction policy is best for long-term agent memory?
No single policy dominates. LRU is the pragmatic default for L2 session state. LFU is more appropriate for L3 preference stores where frequently accessed facts should outlive inactive ones. Semantic eviction — scoring candidates against the current task embedding — produces the best precision for long-horizon multi-domain agents but carries higher compute cost. In practice, a hybrid utility function combining recency, frequency, and semantic similarity (as defined in the promotion/demotion section above) outperforms any single policy.
How do you keep context across sessions in an LLM agent?
L2 compacted state (a structured JSON object representing the agent's goals, decisions, and open tasks) is persisted to a durable store (Postgres, Redis) at session end and reloaded at session start. L3 long-term facts are written to a vector store with tenant-scoped namespacing and retrieved on demand. The combination means the agent resumes with its plan intact (L2) and can retrieve specific historical facts as needed (L3) without reconstructing the entire conversation history.
Sources & References
Production Note: The primary benchmark source for this article is the Mem0 paper (arXiv:2504.19413), which provides the only peer-reviewed, numerically verifiable results on hierarchical agent memory under the LOCOMO benchmark as of this writing. OpenAI GPT-5.4 pricing and context-window specifications are drawn from official OpenAI documentation and are subject to change.
- Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory (arXiv:2504.19413) — Primary benchmark source; LOCOMO evaluation, latency and token-cost results
- GPT-5.4 Model Launch — OpenAI — 1M-token context window announcement and positioning
- GPT-5.4 API Documentation — OpenAI — Context window limits and pricing tier structure
- pgvector — Open-source vector similarity for Postgres — L3 vector backend reference
- Pinecone Vector Database — Managed vector store; L3 backend with namespace isolation
- FAISS — Facebook AI Similarity Search — Open-source ANN library for L3 retrieval



