Memory: Short, Long, Episodic
Most agent bugs are not reasoning bugs — they are memory bugs. The agent forgot what it said three turns ago, retrieved the wrong document, or replayed an action it already completed. Treating "memory" as a monolith is the root cause. There are three fundamentally different stores, each with different latency, cost, and scope profiles. Conflate them and you get a system that hallucinates its own history.
The Three Layers
Short-term memory is the context window. Long-term memory is a vector store or structured DB persisted between sessions. Episodic memory is the structured log of what the agent actually did — tool calls, observations, decisions — stored so it can reason about its own history.
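As a minimal sketch, the separation can be made explicit in the agent's state. The names here are illustrative, not tied to any framework:

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Protocol

class VectorStore(Protocol):
    """Whatever vector DB client you use; persisted across sessions."""
    def upsert(self, **kwargs: Any) -> None: ...
    def query(self, **kwargs: Any) -> Any: ...

@dataclass
class AgentMemory:
    # Three stores, three fields: never one undifferentiated blob.
    context: list[str] = field(default_factory=list)    # short-term working set
    long_term: Optional[VectorStore] = None             # durable cross-session facts
    episode: list[dict] = field(default_factory=list)   # structured action log
```

Keeping the stores as separate fields forces every write to pick a layer deliberately.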
Each handoff between these layers represents a design decision. Most teams implement only the happy path — user input in, response out — and ship. Then the agent starts duplicating work, contradicting itself, or failing on anything longer than a single session.
Short-Term: Context Window as Working Memory
The context window is fast and cheap per-token but bounded and ephemeral. Treat it like CPU registers — only what the current step needs.
The common mistake is dumping everything in: full conversation history, all retrieved docs, all tool results. Token bloat hurts latency, cost, and model quality (attention dilution is real).
A sliding-window strategy with importance-weighted pruning works well:
```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Message:
    role: Literal["user", "assistant", "tool"]
    content: str
    importance: float = 1.0  # 0.0 = purgeable, 1.0 = critical

def build_context(
    history: list[Message],
    max_tokens: int = 8000,
    token_estimator=lambda m: len(m.content) // 4,
) -> list[Message]:
    """Keep the most important recent messages within budget."""
    # Pin anything marked critical (system prompt, user constraints, goals)
    pinned = [m for m in history if m.importance >= 1.0]
    candidates = [m for m in history if m.importance < 1.0]
    # Walk the rest newest-first, dropping whatever exceeds the budget
    candidates.sort(key=lambda m: history.index(m), reverse=True)
    budget = max_tokens - sum(token_estimator(m) for m in pinned)
    kept = []
    for msg in candidates:
        cost = token_estimator(msg)
        if budget - cost >= 0:
            kept.append(msg)
            budget -= cost
    return pinned + sorted(kept, key=lambda m: history.index(m))
```

Mark tool observations as low-importance once summarized. Mark user constraints and task goals as critical.
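One way to implement that note is a demotion pass after each summarization step. This restates the `Message` shape so the snippet runs standalone; `demote_summarized` is an illustrative name, not from any library:

```python
from dataclasses import dataclass, replace
from typing import Literal

@dataclass(frozen=True)
class Message:  # same shape as above, restated so this runs on its own
    role: Literal["user", "assistant", "tool"]
    content: str
    importance: float = 1.0

def demote_summarized(history: list[Message], summary: str) -> list[Message]:
    """After summarizing tool output, mark the raw tool observations as
    purgeable and append the summary as a critical message."""
    demoted = [
        replace(m, importance=0.1) if m.role == "tool" else m
        for m in history
    ]
    return demoted + [Message("assistant", summary, importance=1.0)]
```

The raw dumps then become the first thing `build_context` prunes, while the summary stays pinned.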
Long-Term: Vector Recall
Long-term memory survives sessions. The canonical implementation is embeddings in a vector store (Pinecone, pgvector, Chroma). The design surface people underestimate is write policy: when do facts enter the store, and how do you avoid polluting it with stale or wrong information?
```python
import hashlib
import openai
from datetime import datetime, timezone

class LongTermStore:
    def __init__(self, vector_db_client):
        self.db = vector_db_client
        self.embed_model = "text-embedding-3-small"

    def _embed(self, text: str) -> list[float]:
        resp = openai.embeddings.create(model=self.embed_model, input=text)
        return resp.data[0].embedding

    def remember(self, fact: str, confidence: float, source: str):
        """Only write high-confidence, sourced facts."""
        if confidence < 0.8:
            return  # don't poison the store with guesses
        vec = self._embed(fact)
        # sha1 is stable across processes, unlike Python's randomized hash()
        digest = hashlib.sha1(fact.encode()).hexdigest()[:12]
        self.db.upsert(
            id=f"{source}-{digest}",
            vector=vec,
            metadata={
                "text": fact,
                "source": source,
                "ts": datetime.now(timezone.utc).isoformat(),
                "confidence": confidence,
            },
        )

    def recall(self, query: str, top_k: int = 5) -> list[dict]:
        vec = self._embed(query)
        results = self.db.query(vector=vec, top_k=top_k, include_metadata=True)
        return [m.metadata for m in results.matches]
```

Confidence gating is a first-class concern. An agent that writes its own hallucinations to the long-term store will compound errors across sessions.
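Read policy matters too: a high-similarity match from six months ago may be stale. A common heuristic (my addition, not from the store above) is exponential time decay at re-rank time, assuming your vector DB returns a similarity `score` alongside the stored metadata:

```python
import math
from datetime import datetime, timezone

def staleness_weighted(matches: list[dict], half_life_days: float = 30.0) -> list[dict]:
    """Re-rank recall hits so old memories decay. Each match carries a
    similarity 'score' plus the 'ts' timestamp written at remember() time."""
    now = datetime.now(timezone.utc)

    def adjusted(m: dict) -> float:
        ts = datetime.fromisoformat(m["ts"])
        if ts.tzinfo is None:  # tolerate naive timestamps
            ts = ts.replace(tzinfo=timezone.utc)
        age_days = (now - ts).total_seconds() / 86400
        return m["score"] * math.exp(-age_days / half_life_days)

    return sorted(matches, key=adjusted, reverse=True)
```

The exact result shape depends on your vector DB client; the decay half-life is a tuning knob, not a constant.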
Episodic Memory: The Action Log
Episodic memory is where agents most often have no implementation at all. It is the structured record of what happened: which tools were called, what they returned, what decisions were made. Without it, the agent cannot answer "did I already send that email?" or "what was the error three steps ago?"
```python
import json
from datetime import datetime, timezone
from pathlib import Path

class EpisodicLog:
    def __init__(self, session_id: str, log_dir: str = "/tmp/agent_episodes"):
        self.session_id = session_id
        self.path = Path(log_dir) / f"{session_id}.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def record(self, event_type: str, payload: dict):
        entry = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "session": self.session_id,
            "type": event_type,
            **payload,
        }
        with self.path.open("a") as f:
            f.write(json.dumps(entry) + "\n")

    def replay(self) -> list[dict]:
        if not self.path.exists():
            return []
        return [json.loads(line) for line in self.path.read_text().splitlines()]

    def already_done(self, tool: str, args: dict) -> bool:
        """Idempotency check — did we already call this tool with these args?"""
        for event in self.replay():
            if (event.get("type") == "tool_call"
                    and event.get("tool") == tool
                    and event.get("args") == args):
                return True
        return False
```

The already_done check gives you cheap idempotency without distributed locks. For actions with side effects (sending emails, writing files), check before executing.
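In use, the check-before-execute pattern wraps every side-effecting call. This sketch restates a trimmed `EpisodicLog` so it runs standalone; `send_email` and `run_tool` are hypothetical names introduced here for illustration:

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

class EpisodicLog:  # trimmed copy of the class above
    def __init__(self, session_id: str, log_dir: str = "/tmp/agent_episodes"):
        self.session_id = session_id
        self.path = Path(log_dir) / f"{session_id}.jsonl"
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def record(self, event_type: str, payload: dict):
        entry = {"ts": datetime.now(timezone.utc).isoformat(),
                 "session": self.session_id, "type": event_type, **payload}
        with self.path.open("a") as f:
            f.write(json.dumps(entry) + "\n")

    def replay(self) -> list[dict]:
        if not self.path.exists():
            return []
        return [json.loads(line) for line in self.path.read_text().splitlines()]

    def already_done(self, tool: str, args: dict) -> bool:
        return any(e.get("type") == "tool_call" and e.get("tool") == tool
                   and e.get("args") == args for e in self.replay())

def send_email(to: str, subject: str) -> str:  # hypothetical side-effecting tool
    return f"sent to {to}"

def run_tool(log: EpisodicLog, tool_name: str, fn, **args):
    """Consult the episodic log before executing, record after."""
    if log.already_done(tool_name, args):
        return "skipped: already executed"
    result = fn(**args)
    log.record("tool_call", {"tool": tool_name, "args": args, "result": result})
    return result

log = EpisodicLog("demo", log_dir=tempfile.mkdtemp())
first = run_tool(log, "send_email", send_email, to="a@b.com", subject="hi")
second = run_tool(log, "send_email", send_email, to="a@b.com", subject="hi")
```

The second call is skipped because the log already holds an identical tool_call entry: a replayed plan no longer means a duplicated email.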
Summarization as Memory Compression
When a session ends, don't discard the episodic log — compress it into the long-term store. A summary LLM call costs pennies and buys you durable context.
```python
def compress_episode_to_ltm(log: EpisodicLog, ltm: LongTermStore, llm_client):
    events = log.replay()
    narrative = "\n".join(
        f"[{e['type']}] {e.get('tool', '')} {e.get('result', '')[:200]}"
        for e in events
    )
    prompt = (
        "Summarize this agent session into ≤5 key facts "
        "that would be useful for future sessions:\n\n" + narrative
    )
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    summary = response.choices[0].message.content
    ltm.remember(fact=summary, confidence=0.9, source=log.session_id)
```

This closes the loop: short-term feeds the episodic log, episodic log compresses to long-term, long-term seeds short-term on the next session.
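The seeding step at the start of the next session can be as simple as folding recall results into a pinned preamble. A minimal sketch, where `seed_context` is an illustrative helper operating on the metadata dicts that `recall` returns:

```python
def seed_context(recall_results: list[dict], max_facts: int = 3) -> str:
    """Turn long-term recall output into a preamble to pin in the
    new session's context window."""
    facts = [r["text"] for r in recall_results[:max_facts]]
    if not facts:
        return ""
    return "Known from previous sessions:\n" + "\n".join(f"- {f}" for f in facts)
```

Pin the returned string at maximum importance so the pruning logic never drops it.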
Key Takeaways
- Context window, vector store, and episodic log solve different problems — conflating them causes silent reliability failures.
- Prune short-term memory by importance, not just recency; pin goals and constraints, drop stale observations.
- Gate long-term writes by confidence — an agent that writes its own guesses to persistent storage compounds errors across sessions.
- Episodic logs enable idempotency: check whether a side-effecting tool was already called before executing it again.
- Summarize finished sessions into long-term store; this makes future sessions context-aware without inflating the context window.
- Memory architecture should be designed before tool architecture — it determines the entire reliability envelope of a multi-turn agent.