
Memory: Short, Long, Episodic

Ravinder · 6 min read
Agents · AI · LLM · Memory · RAG

Most agent bugs are not reasoning bugs; they are memory bugs. The agent forgot what it said three turns ago, retrieved the wrong document, or replayed an action it already completed. Treating "memory" as a monolith is the root cause. There are three fundamentally different stores, each with a different latency, cost, and scope profile. Conflate them and you get a system that hallucinates its own history.

The Three Layers

Short-term memory is the context window. Long-term memory is a vector store or structured DB persisted between sessions. Episodic memory is the structured log of what the agent actually did — tool calls, observations, decisions — stored so it can reason about its own history.

flowchart TD
    A[Incoming Task] --> B[Working Memory\nContext Window]
    B --> C{Needs past context?}
    C -- yes --> D[Long-Term Store\nVector / KV]
    C -- no --> E[Execute Step]
    D --> E
    E --> F[Episodic Log\nStructured event store]
    F --> G{Task done?}
    G -- no --> B
    G -- yes --> H[Summarize → Long-Term Store]

Each arrow represents a design decision. Most teams implement the happy path (A→B→E→G) and ship. Then the agent starts duplicating work, contradicting itself, or failing on anything longer than a single session.
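Wired together, the diagram reduces to a single loop. The sketch below is a preview that uses the Message, build_context, LongTermStore, EpisodicLog, and compress_episode_to_ltm pieces built in the rest of this post; needs_past_context, execute_step, and is_done are hypothetical stand-ins for your planner and tool layer:

def run_task(task: str, ltm, log, llm_client,
             needs_past_context, execute_step, is_done):
    """One pass through the flowchart: working memory, optional recall,
    execute, episodic log, looped until the task completes."""
    history = [Message(role="user", content=task, importance=1.0)]  # pin the goal
    while True:
        context = build_context(history)                   # B: working memory
        if needs_past_context(context):                    # C: decision point
            for fact in ltm.recall(task):                  # D: long-term recall
                history.append(Message("tool", fact["text"], importance=0.5))
            context = build_context(history)
        step = execute_step(context)                       # E: one agent step
        log.record("tool_call", {"tool": step.tool,        # F: episodic log
                                 "args": step.args,
                                 "result": step.output})
        history.append(Message("tool", str(step.output), importance=0.3))
        if is_done(step):                                  # G: task done?
            compress_episode_to_ltm(log, ltm, llm_client)  # H: summarize to LTM
            return step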

Short-Term: Context Window as Working Memory

The context window is fast and cheap per-token but bounded and ephemeral. Treat it like CPU registers — only what the current step needs.

The common mistake is dumping everything in: full conversation history, all retrieved docs, all tool results. Token bloat hurts latency, cost, and model quality (attention dilution is real).

A sliding-window strategy with importance-weighted pruning works well:

from dataclasses import dataclass, field
from typing import Literal
 
@dataclass
class Message:
    role: Literal["user", "assistant", "tool"]
    content: str
    importance: float = 1.0  # 0.0 = purgeable, 1.0 = critical
 
def build_context(
    history: list[Message],
    max_tokens: int = 8000,
    token_estimator=lambda m: len(m.content) // 4,
) -> list[Message]:
    """Keep the most important recent messages within budget."""
    # Keep everything marked critical (goals, constraints, system prompt)
    pinned = [m for m in history if m.importance >= 1.0]
    candidates = [m for m in history if m.importance < 1.0]
    # Sort by recency, drop oldest when over budget
    candidates.sort(key=lambda m: history.index(m), reverse=True)
    budget = max_tokens - sum(token_estimator(m) for m in pinned)
    kept = []
    for msg in candidates:
        cost = token_estimator(msg)
        if budget - cost >= 0:
            kept.append(msg)
            budget -= cost
    return pinned + sorted(kept, key=lambda m: history.index(m))

Mark tool observations as low-importance once summarized. Mark user constraints and task goals as critical.
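Concretely, with illustrative weights (the task and numbers here are made up):

history = [
    Message(role="user", content="Book a flight to Berlin under $400", importance=1.0),   # goal: pinned
    Message(role="assistant", content="Searching flights...", importance=0.4),
    Message(role="tool", content="<12 KB raw JSON flight results>", importance=0.2),      # purgeable
    Message(role="tool", content="Summary: 3 options, cheapest $312 nonstop", importance=0.8),
]
context = build_context(history, max_tokens=8000)  # the raw blob is first to go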

Long-Term: Vector Recall

Long-term memory survives sessions. The canonical implementation is embeddings in a vector store (Pinecone, pgvector, Chroma). The design surface people underestimate is write policy: when do facts enter the store, and how do you avoid polluting it with stale or wrong information?

import hashlib
import openai
from datetime import datetime
 
class LongTermStore:
    def __init__(self, vector_db_client):
        self.db = vector_db_client
        self.embed_model = "text-embedding-3-small"
 
    def _embed(self, text: str) -> list[float]:
        resp = openai.embeddings.create(model=self.embed_model, input=text)
        return resp.data[0].embedding
 
    def remember(self, fact: str, confidence: float, source: str):
        """Only write high-confidence, sourced facts."""
        if confidence < 0.8:
            return  # don't poison the store with guesses
        vec = self._embed(fact)
        # hashlib gives stable IDs across processes; built-in hash() is salted per run
        fact_id = hashlib.sha1(fact.encode()).hexdigest()[:16]
        self.db.upsert(
            id=f"{source}-{fact_id}",
            vector=vec,
            metadata={"text": fact, "source": source,
                      "ts": datetime.utcnow().isoformat(),
                      "confidence": confidence},
        )
 
    def recall(self, query: str, top_k: int = 5) -> list[dict]:
        vec = self._embed(query)
        results = self.db.query(vector=vec, top_k=top_k, include_metadata=True)
        return [m.metadata for m in results.matches]

Confidence gating is a first-class concern. An agent that writes its own hallucinations to the long-term store will compound errors across sessions.
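The same gate can apply on the read side. Here is a sketch that filters recalled facts by stored confidence and age; the thresholds are illustrative, not prescriptive:

from datetime import datetime, timedelta

def recall_trusted(ltm: LongTermStore, query: str,
                   min_confidence: float = 0.85,
                   max_age: timedelta = timedelta(days=90)) -> list[dict]:
    """Recall, then drop anything low-confidence or stale."""
    now = datetime.utcnow()
    trusted = []
    for meta in ltm.recall(query, top_k=20):  # over-fetch, then filter
        age = now - datetime.fromisoformat(meta["ts"])
        if meta["confidence"] >= min_confidence and age <= max_age:
            trusted.append(meta)
    return trusted[:5]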

Episodic Memory: The Action Log

Episodic memory is the layer agents most often lack entirely. It is the structured record of what happened: which tools were called, what they returned, what decisions were made. Without it, the agent cannot answer "did I already send that email?" or "what was the error three steps ago?"

import json
from pathlib import Path
from datetime import datetime
 
class EpisodicLog:
    def __init__(self, session_id: str, log_dir: str = "/tmp/agent_episodes"):
        self.session_id = session_id
        self.path = Path(log_dir) / f"{session_id}.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)
 
    def record(self, event_type: str, payload: dict):
        entry = {
            "ts": datetime.utcnow().isoformat(),
            "session": self.session_id,
            "type": event_type,
            **payload,
        }
        with self.path.open("a") as f:
            f.write(json.dumps(entry) + "\n")
 
    def replay(self) -> list[dict]:
        if not self.path.exists():
            return []
        return [json.loads(line) for line in self.path.read_text().splitlines()]
 
    def already_done(self, tool: str, args: dict) -> bool:
        """Idempotency check — did we already call this tool with these args?"""
        for event in self.replay():
            if (event.get("type") == "tool_call"
                    and event.get("tool") == tool
                    and event.get("args") == args):
                return True
        return False

The already_done check gives you cheap idempotency without distributed locks. For actions with side effects (sending emails, writing files), check before executing.
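In practice the guard wraps the tool call. A sketch, where send_fn stands in for your real email tool:

def send_email_once(log: EpisodicLog, send_fn, to: str, subject: str, body: str):
    """Side-effecting tool call that runs at most once per session."""
    args = {"to": to, "subject": subject, "body": body}
    if log.already_done("send_email", args):
        return {"status": "skipped", "reason": "already sent in this session"}
    result = send_fn(**args)  # hypothetical email tool
    log.record("tool_call", {"tool": "send_email", "args": args, "result": result})
    return result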

Summarization as Memory Compression

When a session ends, don't discard the episodic log — compress it into the long-term store. A summary LLM call costs pennies and buys you durable context.

def compress_episode_to_ltm(log: EpisodicLog, ltm: LongTermStore, llm_client):
    events = log.replay()
    narrative = "\n".join(
        f"[{e['type']}] {e.get('tool','')} {e.get('result','')[:200]}"
        for e in events
    )
    prompt = (
        "Summarize this agent session into ≤5 key facts "
        "that would be useful for future sessions:\n\n" + narrative
    )
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    summary = response.choices[0].message.content
    ltm.remember(fact=summary, confidence=0.9, source=log.session_id)

This closes the loop: short-term feeds the episodic log, episodic log compresses to long-term, long-term seeds short-term on the next session.
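The last leg, seeding the next session, is a few lines: recall facts keyed on the new task and pin them into working memory before the first model call. A sketch reusing the classes above:

def seed_session(task: str, ltm: LongTermStore) -> list[Message]:
    """Bootstrap working memory for a new session from long-term facts."""
    history = [Message(role="user", content=task, importance=1.0)]  # pin the goal
    for meta in ltm.recall(task, top_k=3):
        history.append(Message(
            role="tool",
            content=f"(from memory, {meta['source']}): {meta['text']}",
            importance=0.7,  # useful, but evictable under token pressure
        ))
    return history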

Key Takeaways

  • Context window, vector store, and episodic log solve different problems — conflating them causes silent reliability failures.
  • Prune short-term memory by importance, not just recency; pin goals and constraints, drop stale observations.
  • Gate long-term writes by confidence — an agent that writes its own guesses to persistent storage compounds errors across sessions.
  • Episodic logs enable idempotency: check whether a side-effecting tool was already called before executing it again.
  • Summarize finished sessions into long-term store; this makes future sessions context-aware without inflating the context window.
  • Memory architecture should be designed before tool architecture — it determines the entire reliability envelope of a multi-turn agent.