AI Memory & Context Retention: Make In-Product AI Remember Users Across Sessions
AI Memory Strategy for Your New SaaS
Goal: Give your in-product AI features (chat, agents, copilots) durable memory so they remember who the user is, what they're working on, and what they've said before — across sessions, across days, across context-window resets — without leaking PII, ballooning your token bill, or producing stale/wrong recall. Pick the right memory shape for the use case (working memory vs. semantic recall vs. structured profile), define write triggers and decay rules explicitly, and treat memory accuracy as a measurable quality metric, not a vibes-based feature.
Avoid the founder failure modes: every conversation starts cold ("hi, who are you?"); memory is just "stuff the entire chat history into the next prompt" (works until session 5, then breaks); the AI confidently recalls things the user never said (false memories from poor retrieval); or you ship a memory feature and never measure whether it actually improves outcomes.
Process: Follow this chat pattern with your AI coding tool, such as Claude or v0.app. Pay attention to the notes in [brackets] and replace the bracketed text with your own content.
Timeframe: Basic working memory + summarization in 2-3 days. Long-term semantic memory with a vector store in week 2. Structured profile + preferences in week 3. Memory eval harness + privacy controls in week 4.
Why Most Founder AI Memory Implementations Break
Four failure modes show up consistently:
- No memory at all — every session is amnesia. Founder ships an AI assistant that resets every conversation. User: "Yes, I told you yesterday I work in finance." AI: "I don't have memory of previous conversations." Power users churn within two weeks because the AI is dumber than their chat history with a friend.
- "Memory = full chat history in the prompt." Works for the first few turns. By turn 30 the prompt is 50K tokens, latency triples, costs explode, and the AI starts contradicting itself because the early context drowns out the recent context. No summarization, no selection, no decay.
- Vector-only memory with naive recall. Founder ships embeddings of every user message into a vector store and retrieves top-5 on each turn. Two problems: (1) similarity search surfaces semantically-related-but-irrelevant memories ("they once mentioned Stripe" — irrelevant to the current question); (2) there's no way to recall structured facts ("user's company is Acme") because embeddings are bad at exact-fact retrieval. Hallucinated recall is worse than no recall.
- No privacy controls, no expiry, no user-visible memory. Memory accumulates forever. The AI eventually surfaces something the user wishes it had forgotten ("you mentioned a custody dispute"). User has no UI to inspect or delete memories. GDPR/CCPA delete-my-data requests are a nightmare. PII leaks across tenant boundaries because nobody scoped memories to the right key.
The version that works is layered: a small fast working memory for the active session, a structured profile/preferences store for facts the AI must always know, a semantic long-term store for past conversations and content, explicit write triggers (don't store everything), explicit decay/forget rules, user-visible inspection and delete UI, and an eval harness that measures whether memory actually improves task outcomes.
This guide assumes you have already done In-Product AI Agent Implementation, In-Product AI Search & Q&A (RAG over content, not memory), RAG Implementation, LLM Cost Optimization (memory drives token cost), LLM Quality Monitoring (memory recall quality is part of this), and AI Features Implementation. Cross-reference Multi-Tenancy — memory MUST be tenant-scoped — and Account Deletion / Data Export — memory must respect deletion requests.
1. Decide What Kind of Memory Your Product Needs
Memory is not one thing. Different memory shapes need different storage and retrieval.
Help me decide what kinds of memory my product actually needs.
The four memory layers:
**Layer 1: Working memory (current session/conversation)**
- The last N messages of the active conversation
- Lives in the prompt, refreshed each turn
- Disappears when session ends OR is rolled into long-term memory
- Use case: every conversational AI feature needs this
**Layer 2: Structured profile / preferences**
- A small, schema'd KV store of facts the AI must always know
- Examples: user's name, role, timezone, default project, tone preference, "always answer in metric units"
- Lives in your relational DB (Postgres), not a vector store
- Looked up by user_id, injected into every system prompt
- Use case: any product where the AI needs stable user context
**Layer 3: Semantic long-term memory**
- Embeddings of past conversations, content the user wrote, things they did
- Vector store; retrieved by semantic similarity to current query
- Use case: "remind me what I told you about X last month," "summarize my recent
questions about Y"
**Layer 4: Episodic memory / event log**
- An append-only log of what the user did and what the AI did with timestamps
- Used for "what did we work on yesterday?" / undo-able actions / audit
- Lives in Postgres or a time-series DB; queried by time + user_id
- Use case: agentic products where the AI takes actions on the user's behalf
My product:
- What the AI does in my product: [describe — chat assistant? agent that takes
actions? content suggestions? in-context Q&A?]
- The user's main job-to-be-done with the AI: [...]
- Sessions per user per week: [...]
- Avg turns per session: [...]
- Stakes of getting memory wrong: [low / medium / high — e.g., wrong recall
about a medical condition is high stakes]
Tell me:
1. Which of the 4 layers I actually need (be aggressive — most products do not
need all 4 on day one)
2. Which layer is the highest-leverage starting point given my use case
3. Which layer I should explicitly NOT build yet and why
4. The minimum viable memory implementation for my v1
Decision heuristic: Start with Layer 1 (working memory) + Layer 2 (structured profile). Add Layer 3 (semantic) when users explicitly ask "do you remember when I told you...". Add Layer 4 (episodic) only if you ship an agent that takes actions.
2. Build Layer 1 — Working Memory With Summarization
The simplest mistake: append every message to the prompt forever. The simplest fix: summarize old turns once they exit a sliding window.
I want to ship working memory for an in-product chat assistant. The pattern:
- Keep the last N=[12] turns verbatim in the prompt (recent context stays full-fidelity)
- When turn count exceeds N, take the oldest M=[8] turns and summarize them into
a single "session summary so far" block
- Prepend the summary to the prompt; drop the M raw turns
- Continue rolling the window: each time the prompt exceeds budget, summarize
the oldest non-summarized turns into the existing summary
System constraints:
- Database: [Postgres / Supabase / ...]
- LLM API: [via Vercel AI Gateway / OpenAI direct / Anthropic direct]
- I want streaming responses, so summarization must be async (not block the
user's next message)
- Token budget per request: [target 8K tokens of conversation context max]
Build me:
1. A `conversation_messages` table schema:
- id, conversation_id, user_id, role (user/assistant/system), content,
created_at, summary_id (nullable — null = unsummarized; non-null = rolled
into this summary)
2. A `conversation_summaries` table:
- id, conversation_id, summary_text, covers_messages_from, covers_messages_to,
created_at
3. A function `loadConversationContext(conversationId)` that returns:
- the latest summary (if any)
- the last N unsummarized messages
- formatted as a messages array ready for the LLM API
4. A background job `summarizeOldTurns(conversationId)` that:
- finds unsummarized messages older than the last N
- calls the LLM with a "summarize this conversation segment" prompt that
emphasizes preserving facts, decisions, user preferences (not just topics)
- writes the summary, marks messages as covered
- is triggered after each user message (debounced, not blocking)
5. A summarization prompt template that explicitly preserves:
- Decisions made
- User-stated facts ("I work at X", "my deadline is Y")
- Action items / TODOs
- Disagreements or corrections (when the AI was wrong)
- The current open question/topic
Critical: do NOT summarize away facts. Bad summaries lose specifics ("they
discussed pricing"); good summaries preserve them ("user said they need pricing
under $50/seat for 200 seats by Q3").
Show me the schema, the load function, and the summarization prompt.
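For reference, a minimal sketch of what `loadConversationContext` can look like, assuming the two tables above and the node-postgres (`pg`) client; the exact query shape is illustrative, not prescriptive:

```ts
// Sketch only: assumes the conversation_messages / conversation_summaries
// tables above and the node-postgres (pg) client.
import { Pool } from "pg";

const pool = new Pool(); // connection config from env vars
const RECENT_TURNS = 12; // N: turns kept verbatim

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

export async function loadConversationContext(
  conversationId: string
): Promise<ChatMessage[]> {
  // Latest rolling summary, if one exists.
  const summary = await pool.query(
    `SELECT summary_text FROM conversation_summaries
      WHERE conversation_id = $1
      ORDER BY created_at DESC LIMIT 1`,
    [conversationId]
  );

  // Last N messages not yet rolled into a summary, returned oldest-first.
  const recent = await pool.query(
    `SELECT role, content FROM (
       SELECT role, content, created_at FROM conversation_messages
        WHERE conversation_id = $1 AND summary_id IS NULL
        ORDER BY created_at DESC LIMIT $2
     ) t ORDER BY created_at ASC`,
    [conversationId, RECENT_TURNS]
  );

  const messages: ChatMessage[] = [];
  if (summary.rows.length > 0) {
    messages.push({
      role: "system",
      content: `Summary of the conversation so far:\n${summary.rows[0].summary_text}`,
    });
  }
  for (const row of recent.rows) {
    messages.push({ role: row.role, content: row.content });
  }
  return messages;
}
```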
Trap to flag: Some teams use the LLM provider's built-in conversation memory (e.g., long context windows, prompt caching). Long context is NOT the same as memory — it's just a bigger window. You still need summarization to fit within a sane token budget per request, and you still need long-term storage for cross-session memory.
3. Build Layer 2 — Structured Profile & Preferences
A schema'd, user-visible KV store. The AI reads from it on every prompt; the AI writes to it when the user states a clear preference.
I want a structured user-profile / preferences store that:
- Holds a small set of high-value facts about each user
- Is injected into the system prompt of every AI request as a compact block
- Is updated by the AI when the user states a clear preference
- Is user-visible and user-editable in a Settings → AI Memory page
Facts to capture (start narrow):
- name, preferred_name
- role, company, industry
- timezone, locale
- default_project_id (the thing they're usually working on)
- tone_preference (formal/casual/concise/detailed)
- expertise_level (beginner / intermediate / expert) — affects how much
jargon the AI uses
- topics_of_interest (free-text list)
- explicit_dont_repeat (things the user has asked the AI to stop saying)
Schema:
- `user_ai_profile` table: user_id PK, jsonb `facts`, updated_at
- All facts in a single jsonb column (not 12 columns) so adding new fact types
doesn't require a migration
- `user_ai_profile_changes` table: append-only audit (user_id, fact_key,
old_value, new_value, source ('user' | 'inferred'), changed_at)
Read path:
- On each AI request, load the profile, render a "What I know about you" block:
User context:
- Name: [Jane]
- Role: [Senior Data Analyst at Acme]
- Timezone: [America/New_York]
- Tone preference: [concise]
- Don't: ["explain what Python is", "use the word 'simply'"]
- Inject this into the system prompt before the conversation messages
- Total budget: keep the profile block under 500 tokens
Write path (two separate sources):
1. **User-driven**: The Settings page lets users edit any fact directly. Simple
form, validates, writes. This is the primary write path.
2. **AI-inferred**: After each conversation, a separate "fact extraction" job
runs that scans the conversation for stated preferences ("I'm a senior
engineer", "please be more concise"). Extracted facts go into a STAGING
table (`profile_fact_proposals`) — NOT the live profile. The user sees
proposed facts in their Settings page and approves them.
Why staged proposals: the AI WILL extract wrong facts. Auto-applying them
poisons the profile. Forcing user approval surfaces both good extractions
(wins) and bad ones (catches drift early). Once your extraction quality is
validated against an eval set, you can flip to auto-apply with a "review my
AI memory" prompt every 30 days.
Build me:
1. The schema
2. The render function (profile → system prompt block)
3. The fact-extraction prompt (input: conversation transcript; output:
structured proposals as JSON)
4. The Settings → AI Memory page UI:
- Current facts (editable)
- Proposed facts (approve/reject buttons)
- "Forget this" button on every fact
- "Wipe all AI memory" destructive action with confirm
Privacy hard rule: NEVER auto-write to the live profile. Always stage. The cost of one wrong fact ("user is unmarried" — they're widowed) is much higher than the cost of a one-extra-click approval flow.
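A minimal sketch of the profile render step, assuming the jsonb `facts` shape above; the chars/4 budget check is a rough stand-in for a real tokenizer:

```ts
// Sketch only: profile facts -> compact system-prompt block.
type ProfileFacts = {
  preferred_name?: string;
  role?: string;
  company?: string;
  timezone?: string;
  tone_preference?: string;
  expertise_level?: string;
  explicit_dont_repeat?: string[];
};

const PROFILE_TOKEN_BUDGET = 500;

export function renderProfileBlock(facts: ProfileFacts): string {
  const lines: string[] = ["User context:"];
  if (facts.preferred_name) lines.push(`- Name: ${facts.preferred_name}`);
  if (facts.role || facts.company)
    lines.push(`- Role: ${[facts.role, facts.company].filter(Boolean).join(" at ")}`);
  if (facts.timezone) lines.push(`- Timezone: ${facts.timezone}`);
  if (facts.tone_preference) lines.push(`- Tone preference: ${facts.tone_preference}`);
  if (facts.expertise_level) lines.push(`- Expertise: ${facts.expertise_level}`);
  if (facts.explicit_dont_repeat?.length)
    lines.push(`- Don't: ${facts.explicit_dont_repeat.map((d) => `"${d}"`).join(", ")}`);

  let block = lines.join("\n");
  // Enforce the 500-token budget with a rough 4-chars-per-token estimate;
  // swap in a real tokenizer before relying on this in production.
  const maxChars = PROFILE_TOKEN_BUDGET * 4;
  if (block.length > maxChars) block = block.slice(0, maxChars);
  return block;
}
```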
4. Build Layer 3 — Semantic Long-Term Memory
Once you have working memory and a structured profile, the next layer is semantic recall: "what did I tell you about X last month?"
I want semantic long-term memory across conversations. The pattern:
- After each conversation ends (or every N turns), extract "memory candidates"
— short atomic statements worth remembering long-term
- Embed each candidate, store in pgvector with metadata
- On each new AI request, do a semantic search of memories scoped to user_id,
retrieve top K=[5], inject as a "things you've said before" block
What to store as a memory candidate (write triggers — be selective):
- User-stated facts NOT already in the structured profile ("my dog's name is
Rex" — too narrow for the profile, too useful to forget)
- Decisions the user made ("we decided to go with option B for the migration")
- Open questions / unresolved threads ("user wanted to revisit pricing in Q3")
- Pieces of context the AI generated that the user explicitly approved
("the strategy doc we wrote together")
What NOT to store:
- Routine pleasantries
- Questions the AI answered with general knowledge (not user-specific)
- Things the user explicitly said to forget
- Anything in the structured profile (no double-storage)
Schema:
- `user_memory` table:
- id, user_id, conversation_id (origin), content (text), embedding (vector),
importance_score (1-5, how confident we are this matters),
created_at, last_accessed_at, access_count, expires_at (nullable),
user_confirmed (bool — did the user approve this memory?)
Retrieval:
- On each AI request, embed the latest user message, ANN search the vector store
filtered by user_id
- Apply a hybrid score: similarity * importance * recency_decay
- Recency decay: memories from >90 days ago weighted at 0.5, >180 days at 0.25
- Take top K, inject as:
Things you've told me before that may be relevant:
- [Memory 1] (from [date])
- [Memory 2] (from [date])
- ...
Decay & forgetting:
- Memories with access_count=0 and age >180 days → archive (move out of
active retrieval, keep for audit)
- User can mark any memory as "forget this" → soft-delete
- DSAR / account deletion → hard-delete all memories
Privacy:
- Memory MUST be scoped by user_id in the vector store filter (never share
across users, even within the same workspace, unless the workspace explicitly
opts into shared memory — separate decision)
- PII detection on write: if a memory contains a SSN, credit card, etc., refuse
to store and surface an error
- Encryption at rest if you handle regulated data (HIPAA / financial)
Build me:
1. The pgvector schema
2. The "extract memory candidates" prompt + the staging review flow
(don't auto-store; show user a "save these to memory?" prompt every N
conversations until extraction quality is proven)
3. The retrieval function with hybrid scoring
4. A user-visible "Memory" page that lists all memories, lets the user
search them, edit them, mark them important, or forget them
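A sketch of the retrieval function with the hybrid score above, assuming pgvector's cosine-distance operator (`<=>`) and a hypothetical `embed()` helper for the embedding call:

```ts
// Sketch only: hybrid-scored retrieval over pgvector, assuming the
// user_memory schema above. embed() is a hypothetical helper that calls
// your embedding model and returns number[].
import { Pool } from "pg";

declare function embed(text: string): Promise<number[]>;

const pool = new Pool();

export async function retrieveMemories(userId: string, query: string, k = 5) {
  const queryEmbedding = await embed(query);

  // similarity * importance * recency_decay, hard-scoped to this user.
  // For large stores, ANN-prefilter (ORDER BY embedding <=> $2 LIMIT 50)
  // in a subquery first, then rescore; this version rescans for clarity.
  const { rows } = await pool.query(
    `SELECT content, created_at,
            (1 - (embedding <=> $2::vector))           -- cosine similarity
          * (importance_score / 5.0)                   -- importance 1-5 -> 0.2-1.0
          * CASE
              WHEN created_at > now() - interval '90 days'  THEN 1.0
              WHEN created_at > now() - interval '180 days' THEN 0.5
              ELSE 0.25
            END AS score
       FROM user_memory
      WHERE user_id = $1
        AND (expires_at IS NULL OR expires_at > now())
      ORDER BY score DESC
      LIMIT $3`,
    [userId, JSON.stringify(queryEmbedding), k]
  );
  return rows as { content: string; created_at: Date; score: number }[];
}
```

Note the `user_id` filter lives in the SQL, not in the prompt, which is what makes the tenant-scoping guarantee in section 7 enforceable.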
Mem0 / Zep / Letta vs. DIY: Several hosted memory products exist. Mem0 (open-source + hosted), Zep (hosted, conversation-focused), Letta (open-source, agent-focused). They handle the orchestration above. They're worth using if you want to ship in days instead of weeks; DIY is worth it if you have unusual privacy/residency constraints or need deep customization. See https://www.vibereference.com/ai-development/mem0-memory-integration for the comparison.
5. Build Layer 4 — Episodic / Event Memory (For Agents)
Skip this section if your AI doesn't take actions. If it does (an agent), you need an event log.
My AI can [take actions on the user's behalf — e.g., send emails, create
tickets, edit documents]. I need an episodic memory layer so:
- The user can ask "what did you do yesterday?"
- The user can undo recent agent actions
- Auditors can review what the agent did and why
Schema:
- `agent_action_log`:
- id, user_id, conversation_id, action_type, action_payload (jsonb),
reasoning (text — short explanation), tool_called, tool_result (jsonb),
started_at, completed_at, status (success/failed/undone),
undone_by_action_id (nullable — links to the undo action),
user_initiated (bool — did the user explicitly approve this action?)
Reads:
- "What did you do yesterday?" → agent reads its own log filtered by
user_id + date_range, summarizes
- "Undo the last thing" → agent reads the most recent successful action with
undone_by_action_id IS NULL, calls the corresponding undo handler, writes
the undo as a new action linked to the original
- Audit: simple SQL query, exportable to CSV
Critical guarantees:
- Every action MUST be logged BEFORE it's executed (write-ahead). If the
process crashes mid-action, the log shows what was attempted.
- Sensitive actions (send email, charge card, delete data) log the FULL payload
and reasoning; the user can review them in a "what your AI did" timeline UI.
- Failed actions are logged with the failure reason; the agent surfaces them
to the user instead of silently retrying.
Build me:
1. The schema
2. A `logAndExecute(action)` wrapper that every tool call must go through
3. A `userActionTimeline(userId, dateRange)` query
4. The "What your AI did" UI: chronological feed, success/fail/undone status,
one-click undo where supported
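A sketch of the write-ahead wrapper, assuming the `agent_action_log` schema above; the `'pending'` status value is an assumed addition for in-flight actions:

```ts
// Sketch only: write-ahead logging wrapper for agent tool calls.
import { Pool } from "pg";

const pool = new Pool();

type AgentAction = {
  userId: string;
  conversationId: string;
  actionType: string;
  payload: unknown;
  reasoning: string;
  tool: string;
  execute: () => Promise<unknown>; // the actual tool call
};

export async function logAndExecute(action: AgentAction): Promise<unknown> {
  // 1. Write-ahead: log the attempt BEFORE executing, so a crash mid-action
  //    still leaves a record of what was attempted. 'pending' is an assumed
  //    extra status value alongside success/failed/undone.
  const { rows } = await pool.query(
    `INSERT INTO agent_action_log
       (user_id, conversation_id, action_type, action_payload, reasoning,
        tool_called, started_at, status)
     VALUES ($1, $2, $3, $4, $5, $6, now(), 'pending')
     RETURNING id`,
    [action.userId, action.conversationId, action.actionType,
     JSON.stringify(action.payload), action.reasoning, action.tool]
  );
  const logId = rows[0].id;

  try {
    const result = await action.execute();
    await pool.query(
      `UPDATE agent_action_log
          SET status = 'success', tool_result = $2, completed_at = now()
        WHERE id = $1`,
      [logId, JSON.stringify(result)]
    );
    return result;
  } catch (err) {
    // 2. Failures are recorded, not swallowed; the agent surfaces them.
    await pool.query(
      `UPDATE agent_action_log
          SET status = 'failed', tool_result = $2, completed_at = now()
        WHERE id = $1`,
      [logId, JSON.stringify({ error: String(err) })]
    );
    throw err;
  }
}
```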
6. The Memory Eval Harness — Measure Whether Memory Actually Helps
Memory is the easiest AI feature to ship and the hardest to get right. You need an eval set.
I want a memory eval harness that catches regressions when I change retrieval,
summarization, or extraction.
The eval set has three test types:
**Test 1: Recall accuracy**
- Setup: a synthetic user has stated 20 specific facts across past conversations
- Question: "what did I tell you about [topic]?"
- Score: did the AI surface the correct fact? (exact match / fuzzy match)
- Pass threshold: [85%]
**Test 2: No false memory**
- Setup: same user, but the question references something they NEVER said
- Question: "remind me what I said about [thing they never said]"
- Score: did the AI correctly say "you didn't tell me that" instead of
hallucinating?
- Pass threshold: [95% — false memories are worse than missed memories]
**Test 3: Stale memory handling**
- Setup: user said "I'm at Company A" 6 months ago, "I'm at Company B" last week
- Question: "where do I work?"
- Score: did the AI surface the most recent fact, not the old one?
- Pass threshold: [90%]
Run the eval after every change to:
- Summarization prompts
- Extraction prompts
- Retrieval scoring weights
- The underlying LLM model (e.g., when upgrading to a new Claude/GPT version)
Build me:
1. A `memory_eval_cases` JSONL file format with 30 hand-written cases
2. A test runner that, for each case:
- Seeds a fresh test user with the past-conversation history
- Runs the question through the full memory + LLM pipeline
- Scores the response (LLM-as-judge for fuzzy match; deterministic for
exact match where possible)
3. A regression report comparing this run to the last run, flagging any case
that newly failed
4. CI integration: fail the build if recall drops more than 5% or false
memory rate rises at all
Hand-write the first 30 cases yourself; do NOT generate them with an LLM
(if the model both writes and grades the cases, the eval tells you nothing).
Source: real conversation examples from your own product, anonymized.
Pass/fail discipline: every memory-related PR runs the eval. Drops in recall accuracy block the PR. Drops in false-memory rate block the PR HARDER (false memories destroy trust faster than missed memories).
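For concreteness, one possible case shape and runner loop; the field names and the `seedTestUser` / `runFullPipeline` / `llmJudge` hooks are assumptions to adapt to your own harness:

```ts
// Sketch only: eval case shape + minimal runner. Adapt field names to taste.
type MemoryEvalCase = {
  id: string;
  type: "recall" | "false_memory" | "stale_memory";
  seedConversations: { daysAgo: number; messages: string[] }[];
  question: string;
  expected: string;           // fact to surface, or "NOT_TOLD" for false-memory cases
  scoring: "exact" | "fuzzy"; // fuzzy uses an LLM-as-judge comparison
};

// Example false-memory case: the user never mentioned a budget.
const example: MemoryEvalCase = {
  id: "fm-004",
  type: "false_memory",
  seedConversations: [
    { daysAgo: 30, messages: ["I work at Acme as a data analyst."] },
  ],
  question: "Remind me what budget I told you about?",
  expected: "NOT_TOLD", // pass = the AI says it was never told, no invention
  scoring: "fuzzy",
};

export async function runEvalCase(c: MemoryEvalCase): Promise<boolean> {
  const userId = await seedTestUser(c.seedConversations);   // fresh user per case
  const answer = await runFullPipeline(userId, c.question); // full memory + LLM path
  return c.scoring === "exact"
    ? answer.includes(c.expected)
    : await llmJudge(answer, c.expected); // judge model compares meaning
}

// Harness hooks you supply (hypothetical signatures):
declare function seedTestUser(convs: MemoryEvalCase["seedConversations"]): Promise<string>;
declare function runFullPipeline(userId: string, q: string): Promise<string>;
declare function llmJudge(answer: string, expected: string): Promise<boolean>;
```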
7. Privacy, Tenancy, and the "Forget Me" Path
Memory makes privacy harder. Build the controls before you ship the feature.
I'm shipping AI memory and I need the privacy/control layer right from v1.
What I need:
1. **Tenant scoping**
- Every memory record is keyed by user_id AND workspace_id (where
applicable)
- Vector retrieval ALWAYS filters by these keys; never trust the LLM to
"remember" who it's talking to
- Add a unit test that seeds two users in the same workspace, queries
user A's memory, asserts none of user B's memories surface
- Add a unit test for cross-workspace isolation too
2. **PII detection on write**
- Run a regex/heuristic check on every memory candidate before storing:
- SSN pattern
- Credit card pattern (Luhn check)
- Phone numbers (when not the user's own)
- Email addresses (when not the user's own)
- On detect: never store the raw text. Either drop the candidate or store a
    redacted version, and tell the user: "this looks like sensitive info — I
    saved a redacted version"
3. **User-visible memory UI**
- Settings → AI Memory page lists every memory the AI has about the user
- Each memory: text, source conversation, date created, "forget this"
button, "this is wrong, fix it" button (opens edit form)
- Search box across all memories
- "Wipe all AI memory" destructive action with typed-confirmation modal
4. **Account deletion / DSAR**
- When a user requests deletion: hard-delete all memories, profile,
conversation summaries, and event log entries scoped to their user_id
- Vector store entries: must actually be removed from the index, not just
soft-deleted (semantic search would still surface them)
- Provide an export endpoint: returns the user's full memory store as JSON
- Coordinate with [Account Deletion / Data Export](account-deletion-data-export-chat.md)
so AI memory is part of the standard deletion path, not a separate flow
someone has to remember to add
5. **Memory expiry / decay**
- Default: memories older than 1 year auto-archive (out of active retrieval)
- User setting: "keep memory forever" / "expire after [30/90/365] days"
- On expiry: archive (audit-keepable) by default; hard-delete if user opted in
6. **Audit log**
- Every memory write/read/delete is logged with timestamp, source
(conversation_id, action), and user_id
- SOC 2 / HIPAA workspaces need this; build it from day one even if you
don't need it yet — retrofitting is painful
Build me:
- The privacy controls UI (Settings → AI Memory)
- The PII detection function
- The unit tests for tenant isolation
- The deletion path that wipes all memory layers in a single transaction
- The export endpoint
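A sketch of the write-path PII gate; the regexes are illustrative minimums (real PII detection needs a broader pattern set), and the Luhn check is the standard card checksum:

```ts
// Sketch only: PII check run on every memory candidate before storage.
const SSN_RE = /\b\d{3}-\d{2}-\d{4}\b/;
const CARD_RE = /\b(?:\d[ -]?){13,19}\b/;

function luhnValid(digits: string): boolean {
  // Standard Luhn checksum: double every second digit from the right.
  let sum = 0;
  let double = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (double) { d *= 2; if (d > 9) d -= 9; }
    sum += d;
    double = !double;
  }
  return sum % 10 === 0;
}

export function detectPII(text: string): string | null {
  if (SSN_RE.test(text)) return "ssn";
  const cardMatch = text.match(CARD_RE);
  if (cardMatch) {
    const digits = cardMatch[0].replace(/[ -]/g, "");
    // Luhn filter cuts false positives from arbitrary digit runs.
    if (luhnValid(digits)) return "credit_card";
  }
  return null;
}

// Write path: refuse (or redact) instead of storing the raw candidate.
export function gateMemoryWrite(candidate: string): { ok: boolean; reason?: string } {
  const hit = detectPII(candidate);
  return hit ? { ok: false, reason: hit } : { ok: true };
}
```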
See Also: Customer-Managed Encryption Keys (BYOK) for regulated workspaces; Audit Logs for the broader audit pattern; Customer-Facing Audit Logs for the user-visible side.
8. The Token Budget — Don't Let Memory Eat Your Margin
Memory is also a cost vector. Each layer adds tokens to every request.
I want to model and cap the token cost of memory across all four layers.
Per-request budget (target ~6K tokens of memory context max):
- Layer 1 (working memory summary + recent turns): up to 3000 tokens
- Layer 2 (structured profile block): up to 500 tokens
- Layer 3 (top-K semantic memories): up to 1500 tokens
- Layer 4 (recent agent actions, if applicable): up to 1000 tokens
Cost controls:
1. **Per-request hard cap**: count tokens before sending; if over budget,
trim the lowest-priority layer first (semantic memories before profile)
2. **Per-user monthly cap**: total tokens spent on memory loading per user per
month; alert ops if any user exceeds [10x median] (likely abuse or a stuck
loop)
3. **Per-conversation cost trace**: log how many tokens each layer contributed
per request; review the heavy hitters in [LLM Cost Optimization](llm-cost-optimization-chat.md)
4. **Prompt caching**: structured profile + summary blocks are stable across
turns; cache them at the provider (Anthropic prompt caching, OpenAI
prompt caching) so repeat reads are cheap
5. **Model tier choice**: use a smaller/cheaper model for memory operations
(summarization, extraction, retrieval scoring); reserve the big model for
user-facing generation. Via the [Vercel AI Gateway](https://www.vibereference.com/ai-development/ai-gateways)
you can route memory ops to a cheap model and generation to a premium one
from the same codebase.
Build me:
- An `assembleMemoryContext(userId, conversationId)` function that returns the
memory block AND the token count, with priority-based trimming if over budget
- A dashboard query showing per-user monthly memory token spend
- Prompt-caching configuration for the profile + summary layers
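A sketch of `assembleMemoryContext` with priority-based trimming, assuming the per-layer budgets above and hypothetical per-layer loader functions; the chars/4 token estimate is a placeholder for a real tokenizer:

```ts
// Sketch only: assemble memory layers, trim lowest-priority first.
type MemoryLayer = { name: string; text: string; priority: number }; // lower = trimmed first

const TOTAL_BUDGET = 6000;
const estTokens = (s: string) => Math.ceil(s.length / 4); // rough heuristic

export async function assembleMemoryContext(
  userId: string,
  conversationId: string
): Promise<{ block: string; tokens: number }> {
  // Layer loaders are your own functions (hypothetical names); priorities
  // mirror the plan above: semantic memories get cut before the profile.
  const layers: MemoryLayer[] = [
    { name: "working",  text: await loadWorkingMemory(conversationId), priority: 4 },
    { name: "profile",  text: await loadProfileBlock(userId),          priority: 3 },
    { name: "episodic", text: await loadEpisodicBlock(userId),         priority: 2 },
    { name: "semantic", text: await loadSemanticBlock(userId),         priority: 1 },
  ];

  // Drop whole layers, lowest priority first, until the total fits.
  // (Partial truncation within a layer is a reasonable refinement.)
  let total = layers.reduce((n, l) => n + estTokens(l.text), 0);
  const byPriority = [...layers].sort((a, b) => a.priority - b.priority);
  for (const layer of byPriority) {
    if (total <= TOTAL_BUDGET) break;
    total -= estTokens(layer.text);
    layer.text = "";
  }

  const block = layers.filter((l) => l.text).map((l) => l.text).join("\n\n");
  return { block, tokens: estTokens(block) };
}

// Hypothetical loader signatures:
declare function loadWorkingMemory(conversationId: string): Promise<string>;
declare function loadProfileBlock(userId: string): Promise<string>;
declare function loadSemanticBlock(userId: string): Promise<string>;
declare function loadEpisodicBlock(userId: string): Promise<string>;
```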
9. What Done Looks Like
You have shipped real AI memory when:
- A returning user starts a new conversation and the AI knows their name, role, and current focus without them re-introducing themselves.
- A user can ask "what did I tell you about X last month?" and get back the actual fact, not a hallucination.
- A user can open Settings → AI Memory, see every fact the AI has stored about them, edit any of it, and delete any of it.
- "Wipe all AI memory" works in one click and is verifiable (the AI now greets them like a stranger).
- An account-deletion request removes all memory layers, including vector embeddings, in the same transaction as the rest of the user's data.
- The memory eval harness runs on every PR; recall accuracy ≥85%, false memory rate ≤5%.
- Per-user memory token spend is bounded; no user can blow up your bill via runaway memory accumulation.
- Tenant isolation tests prove cross-user and cross-workspace memory leakage is impossible.
- A new product team member can read this doc + your memory schema and explain when each layer is read and when each is written.
Mistakes to Avoid
- Storing every message as a memory. Most messages are not memories. Use write triggers; default to NOT storing.
- Auto-applying AI-extracted facts to the live profile. Stage and require user approval until extraction quality is proven against an eval set.
- Using vector similarity alone for "what did I tell you about X." Hybrid (semantic + recency + importance) beats pure cosine similarity.
- Treating long-context windows as memory. Long context is a bigger window per request, not durable memory across requests.
- Cross-tenant memory leaks. Always filter by user_id (and workspace_id) at the storage layer. Never trust the LLM to keep contexts separate.
- No "forget this" button. Users WILL want to remove specific memories. Without this, they'll wipe everything or churn.
- Skipping the eval harness. Memory quality regresses silently; you only notice when a customer ragequits.
- Storing PII without redaction. Run detection on every write; refuse or redact sensitive patterns.
- Forgetting account deletion. AI memory must be part of the DSAR path, not a separate cleanup someone has to remember.
- Not measuring whether memory helps. Run an A/B: cohort A has memory, cohort B doesn't. Measure retention, satisfaction, task completion. If memory doesn't move metrics, debug the implementation before adding more layers.
See Also
- In-Product AI Agent Implementation — agentic AI that needs episodic memory
- In-Product AI Search & Q&A — RAG over content (not memory)
- RAG Implementation — vector retrieval foundation
- LLM Cost Optimization — memory is a cost vector
- LLM Quality Monitoring — memory recall is part of quality
- AI Features Implementation — broader pattern
- Account Deletion / Data Export — DSAR must include memory
- Multi-Tenancy — memory must be tenant-scoped
- Audit Logs / Customer-Facing Audit Logs — memory operations are audit-relevant
- Customer-Managed Encryption Keys (BYOK) — for regulated memory storage
- VibeReference: Mem0 Memory Integration — managed memory provider comparison
- VibeReference: LLM Observability Providers — instrument memory operations
- VibeReference: AI Gateways — route memory ops to cheap models, generation to premium