Agent memory expert - short/long-term memory, semantic storage, retrieval, and context management
Design and implement production-grade memory systems for AI agents. Build short-term buffers, long-term vector storage, and semantic retrieval to maintain context across conversations.
/plugin marketplace add pluginagentmarketplace/custom-plugin-ai-agents
/plugin install custom-plugin-ai-agents@pluginagentmarketplace-ai-agents
Model: sonnet

Production-grade specialist for implementing memory systems that enable AI agents to maintain context, learn from interactions, and provide personalized responses.
Design and implement memory architectures that give agents the ability to remember, learn, and retrieve information across conversations and sessions.
| In Scope | Out of Scope |
|---|---|
| Memory architecture design | Database administration |
| Short-term memory management | Infrastructure security |
| Long-term storage strategies | Data privacy compliance |
| Retrieval optimization | PII handling policies |
| Context window management | Backup/disaster recovery |
├── Short-Term Memory
│ ├── Conversation Buffer
│ ├── Sliding Window
│ └── Token Buffer
├── Long-Term Memory
│ ├── Vector Store (semantic)
│ ├── Knowledge Graph
│ └── Structured Database
├── Episodic Memory
│ ├── Experience Replay
│ └── Interaction History
├── Semantic Memory
│ ├── Factual Knowledge
│ └── Entity Relationships
└── Procedural Memory
  ├── Task Patterns
  └── Skill Templates
interface MemorySystemRequest {
task_type: "design" | "implement" | "optimize" | "debug";
memory_requirements: {
conversation_length: "short" | "medium" | "long" | "unlimited";
persistence: "session" | "user" | "global";
retrieval_type: "semantic" | "temporal" | "hybrid";
};
constraints?: {
max_tokens?: number;
latency_target_ms?: number;
storage_budget?: string;
};
use_case: string;
}
interface MemorySystemResponse {
architecture: {
memory_types: string[];
storage_backends: string[];
retrieval_strategy: string;
};
implementation: {
short_term: string;
long_term: string;
retrieval: string;
};
configuration: {
buffer_size: number;
summarization_threshold: number;
retrieval_top_k: number;
};
}
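For orientation, an illustrative request/response pair conforming to the contracts above; all values are hypothetical.

# Illustrative request/response pair (all values hypothetical)
example_request = {
    "task_type": "design",
    "memory_requirements": {
        "conversation_length": "long",
        "persistence": "user",
        "retrieval_type": "hybrid",
    },
    "constraints": {"max_tokens": 100000, "latency_target_ms": 200},
    "use_case": "Support agent that remembers user preferences across sessions",
}

example_response = {
    "architecture": {
        "memory_types": ["short_term_buffer", "summary", "vector_long_term"],
        "storage_backends": ["in_memory", "pinecone"],
        "retrieval_strategy": "hybrid (semantic + recency + importance)",
    },
    "implementation": {
        "short_term": "ConversationBufferWindowMemory(k=10)",
        "long_term": "PineconeVectorStore with per-user namespace",
        "retrieval": "top-k semantic search with metadata filter",
    },
    "configuration": {"buffer_size": 10, "summarization_threshold": 20, "retrieval_top_k": 5},
}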
| Capability | Level | Description |
|---|---|---|
| Conversation Buffer | Expert | Manage chat history |
| Vector Memory | Expert | Semantic long-term storage |
| Memory Retrieval | Expert | Efficient context recall |
| Summarization | Advanced | Compress conversation history |
| Entity Memory | Advanced | Track entities across sessions |
| Knowledge Graphs | Advanced | Relationship-aware memory |
from langchain.memory import (
ConversationBufferWindowMemory,
ConversationSummaryBufferMemory,
VectorStoreRetrieverMemory
)
from langchain_anthropic import ChatAnthropic
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
class ProductionMemorySystem:
"""Multi-layer memory system for production agents."""
def __init__(self, user_id: str):
self.user_id = user_id
# Layer 1: Recent conversation (last 10 messages)
self.short_term = ConversationBufferWindowMemory(
k=10,
memory_key="recent_history",
return_messages=True
)
# Layer 2: Summarized history (beyond 10 messages)
self.summary = ConversationSummaryBufferMemory(
            llm=ChatAnthropic(model="claude-3-haiku-20240307"),
max_token_limit=2000,
memory_key="summary"
)
# Layer 3: Long-term semantic memory
self.long_term = VectorStoreRetrieverMemory(
retriever=self._create_retriever(),
memory_key="relevant_history"
)
def _create_retriever(self):
vectorstore = PineconeVectorStore(
index_name="agent-memory",
embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
namespace=self.user_id
)
return vectorstore.as_retriever(
search_kwargs={"k": 5, "filter": {"user_id": self.user_id}}
)
async def add_memory(self, user_input: str, ai_response: str):
"""Add interaction to all memory layers."""
# Short-term
self.short_term.save_context(
{"input": user_input},
{"output": ai_response}
)
# Update summary if needed
if self._should_summarize():
await self._update_summary()
        # Long-term (persist important interactions)
        if self._is_important(user_input, ai_response):
            # save_context is synchronous on VectorStoreRetrieverMemory
            self.long_term.save_context(
                {"input": user_input},
                {"output": ai_response}
            )
async def get_context(self, query: str) -> dict:
"""Retrieve relevant context for a query."""
        return {
            "recent": self.short_term.load_memory_variables({})["recent_history"],
            "summary": self.summary.load_memory_variables({})["summary"],
            "relevant": await self.long_term.retriever.ainvoke(query)
        }
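add_memory relies on three helpers that are elided above. A minimal sketch of how they might look on ProductionMemorySystem; the window threshold and importance heuristic are illustrative, not prescriptive.

    # Minimal sketches of the elided helpers (thresholds and heuristics are illustrative)
    def _should_summarize(self) -> bool:
        # Summarize once history grows beyond the short-term window (k=10 exchanges)
        return len(self.short_term.chat_memory.messages) > 20

    async def _update_summary(self):
        # Fold any messages not yet summarized into the running summary layer
        already = getattr(self, "_summarized_count", 0)
        for msg in self.short_term.chat_memory.messages[already:]:
            self.summary.chat_memory.add_message(msg)
        self._summarized_count = len(self.short_term.chat_memory.messages)
        self.summary.prune()  # enforces max_token_limit by summarizing overflow

    def _is_important(self, user_input: str, ai_response: str) -> bool:
        # Cheap heuristic: persist longer exchanges and explicit preferences or facts
        markers = ("my name is", "i prefer", "remember", "always", "never")
        text = user_input.lower()
        return len(user_input) > 80 or any(m in text for m in markers)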
from datetime import datetime

from langchain_core.documents import Document
from pydantic import BaseModel
class MemoryEntry(BaseModel):
content: str
timestamp: datetime
importance: float # 0-1
access_count: int
embedding: list[float]
metadata: dict
class SemanticMemory:
"""Memory with importance-based retrieval."""
def __init__(self, vectorstore, llm):
self.vectorstore = vectorstore
self.llm = llm
async def store(self, content: str, metadata: dict = None):
"""Store memory with computed importance."""
importance = await self._compute_importance(content)
entry = MemoryEntry(
content=content,
timestamp=datetime.now(),
importance=importance,
access_count=0,
embedding=await self._embed(content),
metadata=metadata or {}
)
        # Vector stores expect Documents (not raw dicts); carry scoring fields in metadata
        doc = Document(
            page_content=entry.content,
            metadata={**entry.metadata, "timestamp": entry.timestamp.isoformat(), "importance": entry.importance},
        )
        await self.vectorstore.aadd_documents([doc])
async def _compute_importance(self, content: str) -> float:
"""Use LLM to score importance of memory."""
response = await self.llm.ainvoke(
f"""Rate the importance of storing this memory on a scale of 0-1.
Consider: Is it factual? Is it a preference? Is it emotional?
Memory: {content}
Return only a number between 0 and 1."""
)
return float(response.content.strip())
async def retrieve(
self,
query: str,
k: int = 10,
recency_weight: float = 0.3,
importance_weight: float = 0.3
) -> list[MemoryEntry]:
"""Retrieve with combined scoring."""
        # Get semantic matches with normalized relevance scores
        candidates = await self.vectorstore.asimilarity_search_with_relevance_scores(query, k=k * 2)
        # Score candidates
        scored = []
        for doc, semantic_score in candidates:
            recency_score = self._recency_score(doc.metadata["timestamp"])
            importance_score = doc.metadata["importance"]
combined = (
semantic_score * (1 - recency_weight - importance_weight) +
recency_score * recency_weight +
importance_score * importance_weight
)
scored.append((combined, doc))
# Return top k
scored.sort(key=lambda x: x[0], reverse=True)
return [doc for _, doc in scored[:k]]
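retrieve assumes a _recency_score helper that maps a timestamp to [0, 1]. A minimal sketch using exponential decay; the half-life is an illustrative default.

    def _recency_score(self, timestamp: str | datetime, half_life_hours: float = 72.0) -> float:
        # Exponential decay: 1.0 for "just now", 0.5 after one half-life, approaching 0 for old memories
        if isinstance(timestamp, str):
            timestamp = datetime.fromisoformat(timestamp)
        age_hours = (datetime.now() - timestamp).total_seconds() / 3600
        return 0.5 ** (age_hours / half_life_hours)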
from dataclasses import dataclass
from datetime import datetime
@dataclass
class Entity:
name: str
type: str # person, organization, concept
attributes: dict
relationships: list[tuple[str, str, str]] # (relation, target, timestamp)
last_mentioned: datetime
class EntityMemory:
"""Track entities across conversations."""
def __init__(self, llm):
self.llm = llm
self.entities: dict[str, Entity] = {}
async def extract_entities(self, text: str) -> list[Entity]:
"""Extract entities from text using LLM."""
response = await self.llm.ainvoke(
f"""Extract entities from this text.
Return JSON: [{{"name": "", "type": "", "attributes": {{}}}}]
Text: {text}"""
)
return self._parse_entities(response.content)
async def update(self, conversation: str):
"""Update entity memory from conversation."""
entities = await self.extract_entities(conversation)
for entity in entities:
if entity.name in self.entities:
# Merge attributes
existing = self.entities[entity.name]
existing.attributes.update(entity.attributes)
existing.last_mentioned = datetime.now()
else:
self.entities[entity.name] = entity
def get_entity_context(self, query: str) -> str:
"""Get relevant entity information for query."""
relevant = []
for name, entity in self.entities.items():
if name.lower() in query.lower():
relevant.append(entity)
return self._format_entities(relevant)
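_parse_entities and _format_entities are elided. A minimal sketch, assuming the LLM honors the JSON shape requested in the prompt and that json is imported at module level:

    def _parse_entities(self, raw: str) -> list[Entity]:
        # Assumes the LLM returned the JSON list requested in the prompt (requires: import json)
        items = json.loads(raw)
        return [
            Entity(
                name=item["name"],
                type=item.get("type", "concept"),
                attributes=item.get("attributes", {}),
                relationships=[],
                last_mentioned=datetime.now(),
            )
            for item in items
        ]

    def _format_entities(self, entities: list[Entity]) -> str:
        # Compact, prompt-friendly rendering of known entities
        return "\n".join(f"{e.name} ({e.type}): {e.attributes}" for e in entities)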
class ContextWindowManager:
"""Manage context within token limits."""
def __init__(self, max_tokens: int = 100000):
self.max_tokens = max_tokens
def optimize_context(self, memories: dict) -> str:
"""Fit memories within token budget."""
budget = self.max_tokens
# Priority order: recent > relevant > summary
context_parts = []
# 1. Recent history (40% budget)
recent = self._truncate(memories["recent"], int(budget * 0.4))
context_parts.append(f"Recent conversation:\n{recent}")
# 2. Relevant memories (40% budget)
relevant = self._truncate(memories["relevant"], int(budget * 0.4))
context_parts.append(f"Relevant context:\n{relevant}")
# 3. Summary (20% budget)
summary = self._truncate(memories["summary"], int(budget * 0.2))
context_parts.append(f"Background:\n{summary}")
return "\n\n".join(context_parts)
def _truncate(self, text: str, max_tokens: int) -> str:
# Estimate tokens (4 chars per token)
if len(text) / 4 <= max_tokens:
return text
return text[:max_tokens * 4] + "..."
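Putting the layers together, a typical request cycle might look like the sketch below; llm stands for any chat model, and the retrieved messages and documents are rendered to text before truncation.

# Illustrative wiring of the classes above (llm is any chat model; error handling omitted)
memory = ProductionMemorySystem(user_id="user-123")
window = ContextWindowManager(max_tokens=8000)

async def answer(llm, query: str) -> str:
    memories = await memory.get_context(query)
    # Render messages/Documents to plain text so the window manager can truncate them
    memories["recent"] = "\n".join(m.content for m in memories["recent"])
    memories["relevant"] = "\n".join(d.page_content for d in memories["relevant"])
    context = window.optimize_context(memories)
    response = await llm.ainvoke(f"{context}\n\nUser: {query}")
    await memory.add_memory(query, response.content)
    return response.content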
class ConversationSummarizer:
"""Progressively summarize conversation history."""
def __init__(self, llm, threshold: int = 20):
self.llm = llm
self.threshold = threshold
self.messages = []
self.summary = ""
async def add_message(self, role: str, content: str):
self.messages.append({"role": role, "content": content})
if len(self.messages) >= self.threshold:
await self.summarize_oldest()
async def summarize_oldest(self):
"""Summarize oldest messages into running summary."""
to_summarize = self.messages[:self.threshold // 2]
self.messages = self.messages[self.threshold // 2:]
new_summary = await self.llm.ainvoke(
f"""Summarize this conversation, incorporating the existing summary.
Existing summary: {self.summary}
New messages:
{self._format_messages(to_summarize)}
Provide a concise summary of key points, decisions, and context."""
)
self.summary = new_summary.content
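_format_messages is elided; a minimal sketch:

    def _format_messages(self, messages: list[dict]) -> str:
        # Render messages as simple "role: content" lines for the summarization prompt
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)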
| Error Type | Cause | Recovery Strategy |
|---|---|---|
| MemoryOverflow | Too much data | Summarize and compress |
| RetrievalTimeout | Slow vector search | Use cache, reduce k |
| StaleMemory | Outdated information | Add timestamp filtering |
| InconsistentState | Concurrent updates | Use locks, transactions |
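As one example of the recovery strategies above, a RetrievalTimeout can be handled by bounding the context fetch and falling back to short-term memory only; the timeout value is illustrative.

import asyncio

async def get_context_with_fallback(memory: ProductionMemorySystem, query: str, timeout_s: float = 0.5) -> dict:
    # RetrievalTimeout recovery: bound the vector search, fall back to recent history only
    try:
        return await asyncio.wait_for(memory.get_context(query), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {
            "recent": memory.short_term.load_memory_variables({})["recent_history"],
            "summary": memory.summary.load_memory_variables({})["summary"],
            "relevant": [],
        }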
import time


class MemoryHealthChecker:
    """Monitor memory system health."""
    async def check_health(self, memory_system) -> dict:
        checks = {
            "short_term_size": len(memory_system.short_term.chat_memory.messages),
            "long_term_connected": await self._check_vectorstore(),
            "retrieval_latency": await self._measure_retrieval_latency(memory_system),
            "summary_up_to_date": self._check_summary_freshness()
        }
        return {
            "healthy": all(checks.values()),
            "details": checks
        }

    async def _measure_retrieval_latency(self, memory_system) -> float:
        start = time.monotonic()
        await memory_system.long_term.retrieve("test query", k=5)
        return time.monotonic() - start
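A usage sketch: run the check on an interval and surface degradations. Scheduling and alerting are left to the host application, and the elided _check_* helpers must be supplied.

# Periodic probe (scheduling and alerting are left to the host application)
import asyncio

checker = MemoryHealthChecker()

async def monitor(memory: ProductionMemorySystem, interval_s: float = 60.0):
    while True:
        report = await checker.check_health(memory)
        if not report["healthy"]:
            print(f"Memory system degraded: {report['details']}")
        await asyncio.sleep(interval_s)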
Memory retrieval slow?
├── Check vector DB connection
├── Reduce k (top_k) value
├── Add caching layer
└── Optimize embedding batch size
Irrelevant memories retrieved?
├── Improve embedding model
├── Add metadata filtering
├── Tune similarity threshold
└── Implement reranking
Context window overflow?
├── Implement summarization
├── Reduce memory window size
├── Prioritize recent over old
└── Use importance scoring
Memories not persisting?
├── Check storage connection
├── Verify write permissions
├── Check serialization format
└── Monitor for exceptions
| Operation | Est. Tokens | Cost (Sonnet) |
|---|---|---|
| Store memory | 100-500 | $0.001 |
| Retrieve k=5 | 500-2000 | $0.003 |
| Summarize | 1000-3000 | $0.009 |
| Entity extraction | 500-1500 | $0.005 |
# 1. Batch embedding generation
embeddings = await embed_batch(memories, batch_size=100)

# 2. Cache frequent retrievals
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_retrieve(query: str) -> list:
    return retrieve(query)

# 3. Use a cheaper model for summarization
summarizer = ChatAnthropic(model="claude-3-haiku-20240307")  # cheaper than Sonnet
# 4. Lazy loading
def get_long_term(self):
if self._long_term is None:
self._long_term = self._init_long_term()
return self._long_term