LLMs can answer. They cannot remember.
That gap matters more than most teams expect. A chatbot can look strong in a demo while still failing the first real product test: does it remember what the user told it last week, in a different session, using different wording?
For teams building AI products, memory is where the work gets real. It stops being only a prompting problem and becomes a systems problem involving storage, retrieval, latency, token budgets, and product design. When it is handled poorly, the chatbot feels shallow. When it is overbuilt, the system becomes expensive, noisy, and hard to reason about.
This is the problem that led to the design of Ark Chatbot Context Engine: a memory layer for LLM applications that need continuity across sessions without turning every prompt into a transcript dump.
This article describes the design as it exists in ark-chatbot. It is meant to be concrete rather than abstract. The repository can be pulled locally, the code paths can be inspected directly, and the implementation details can be traced back to the storage model, retrieval logic, API routes, and background workers.
The design is implemented directly in ark-chatbot without relying on an external memory framework. The memory model, extraction pipeline, retrieval logic, and context assembly live in the application stack itself, which keeps the system easier to inspect, debug, and adapt to product-specific needs.
Why Naive Chat Memory Breaks
The first version of chatbot memory is almost always the same: keep the last N messages and stuff them back into the prompt.
That works for short conversations. It stops working when the product starts to resemble actual use.
Three things break quickly.
Context windows are finite
As conversations grow, the system either drops older information or spends more money and time pushing increasingly irrelevant history into every request.
Raw history is not memory
A transcript contains everything, but it does not distinguish between what matters and what does not. The model can read the words, but the application still has not decided which facts should persist.
Sessions reset continuity
The moment the user starts a new chat, the illusion disappears. The assistant does not know their preferences, relationships, projects, or longer-term context unless the application has explicitly stored and retrieved it.
This is why a useful chatbot needs a context engine, not just a chat log.
What a Chat Context Engine Actually Does
A chat context engine sits between conversation storage and model inference.
Its job is simple to describe and hard to implement well:
- save conversation data
- extract the pieces worth remembering
- store them in a form that can be retrieved later
- rank what is relevant to the current turn
- fit that context into a strict prompt budget
- do all of this fast enough for a real product
The key point is that memory is not just persistence. It is selective persistence plus selective retrieval.
That distinction drives the architecture.
The Design Requirements
When designing a memory layer for a production LLM application, a few constraints matter more than almost anything else.
1. Cross-session continuity
The system has to work beyond a single conversation thread. If a user starts a new chat, the assistant should still be able to recall stable facts and recent context that matter.
2. Fast response path
The user-facing message flow has to stay simple and fast. Heavy extraction or summarization work should happen in the background when possible.
3. Structured memory, not just transcripts
A memory system should understand that "River is the user's dog" is more useful than "there was a sentence mentioning River somewhere in a thread."
4. Relevance under token limits
Even perfect memory storage is useless if the retrieval step sends too much context or sends the wrong context.
5. Configurable cost profile
Different products want different tradeoffs. A startup prototype and a scaled product should not be forced into the same extraction cadence, retrieval budget, or provider setup.
Cost matters here because context engines can easily become the hidden tax on every message. Selective retrieval, batching, and prompt budgeting are not just quality features. They materially reduce token processing cost compared with transcript-heavy designs.
6. Flexible memory scoping
Not every product wants memory at the same boundary.
Some applications want memory per chat. Others want memory per user, team, workspace, product, project, or organization. A useful memory engine should not hardcode one idea of identity or one idea of ownership.
That means the storage model needs a flexible scope concept so memory can be grouped however the product needs:
- chat-scoped
- user-scoped
- project-scoped
- product-scoped
- organization-scoped
In Ark Chatbot Context Engine, long-term memory is grouped by scope rather than by a single fixed user model. That keeps the engine usable across very different product shapes.
These requirements lead naturally to a layered design.
The Architecture: Three Flows, Not One
The cleanest way to build memory for chat is to separate fast-path inference from background memory work.
In Ark Chatbot Context Engine, the system is split into three flows.
Message flow
This is the synchronous request path:
- save the incoming user message
- assemble prompt context from recent conversation and stored memory
- call the chat model
- stream or return the response
- save the assistant reply
This path should read memory, but it should not be doing expensive memory-generation work inline.
Extraction flow
This runs asynchronously after a configurable number of user turns.
Instead of extracting on every message, the system batches messages and asks the model to produce structured memory:
- entities
- relationships
- recap summaries
This improves cost efficiency and reduces prompt churn while still capturing the information that is worth keeping. In practice, that is one of the main reasons the engine can stay fast on the active session path without paying to reprocess unnecessary history on every turn.
Just as important, the extractor should not work from raw messages alone. In ark-chatbot, the extraction prompt is given its own working context: new messages, existing entities, existing relationships, and recent recaps. That richer extraction window makes it easier to avoid duplicate entities, avoid duplicate relationships, and reinforce memory that has been mentioned again.
Episode flow
When a chat ends, or when a new chat begins inside the same scope, the system can generate a more compressed long-term memory artifact: an episode.
An episode is a narrative summary of what the user shared in that conversation, stored with an embedding for later retrieval.
This creates a natural separation between current-chat memory and cross-chat memory. A practical episodic pattern here is to keep one final episode per chat. That is usually enough to preserve cross-session narrative continuity without fragmenting long-term memory into too many overlapping summaries.
The Memory Model: Short-Term and Long-Term Memory
A memory system becomes easier to reason about when it mirrors how products actually use memory.
The architecture uses two layers.
Short-Term Memory (STM)
Short-term memory is scoped to a single chat.
It stores granular memory units:
- entities with attributes: people, pets, places, objects, events, and user facts, along with structured details such as roles, preferences, breeds, locations, and other attached properties
- relationships: user owns Milo, Alice lives in Seattle
- recaps: compact summaries of recent conversation segments
This layer is useful because it preserves detail. During an active conversation, the system should be able to reason over fresh structured facts without waiting for a longer-term summarization step.
Long-Term Memory (LTM)
Long-term memory is scoped across chats.
It stores:
- episodes: higher-level narrative summaries of past chats
- promoted entities: cross-chat facts worth keeping
- promoted relationships: persistent connections between entities
This layer is useful because it preserves continuity without dragging entire transcripts forward forever.
The product advantage is straightforward: the current chat gets detail, while future chats get continuity.
The important architectural point is that "scope" is a product-level choice, not a fixed rule inside the engine. Depending on the application, long-term memory can represent a single end user, a shared project, a customer account, a workspace, a product surface, or an organization.
Why Structured Extraction Matters
For an LLM application to remember users well, it needs a representation that is closer to meaning than to raw text.
This is also where many context engines succeed or fail. Retrieval gets most of the attention, but extraction quality usually determines whether retrieval has anything useful to work with at all.
Take a simple example:
"I have a dog named River and we usually go to the park on Sundays."
A transcript stores the sentence.
A memory engine can store:
- entity: River, type: pet, subtype: dog
- relationship: user owns River
- user preference or habit: park on Sundays
- recap: user discussed their dog River and a weekly routine
That structure creates options later:
- exact retrieval of the dog
- semantic retrieval when the user later says "my puppy"
- recap-based retrieval when the user refers to "our Sunday routine"
It also helps the application reinforce memory over time. If River is mentioned repeatedly, confidence and mention count can increase instead of duplicating records.
The Extractor Is Usually Harder Than the Retriever
In practice, the hardest problems in a context engine often sit upstream of retrieval.
This is not a problem that gets "solved once" and then disappears. For real products, it is usually an ongoing battle. LLM applications generate a large volume of conversational material, and selecting the right information to keep is far more important than keeping more information by default.
The more critical the product is, the more careful this planning has to be. A casual companion product, a support assistant, and a domain-specific assistant for a high-trust workflow should not all extract, reinforce, decay, and retrieve memory in the same way.
If the system extracts at the wrong level, retrieval becomes much harder. If it creates duplicate entities, splits one preference into several partial facts, misses reinforcement opportunities, or assigns poor confidence scores, the retrieval layer is left ranking flawed memory.
That is why this is often a garbage-in, garbage-out problem.
The hard parts usually include:
- deciding what is important enough to extract at all
- choosing the right level of granularity for facts, preferences, and relationships
- deduplicating entities and relationships reliably
- reinforcing frequently mentioned information instead of re-inserting it as new memory
- handling conflicting facts without polluting memory with incompatible versions of the same truth
- assigning useful confidence scores
- applying decay without losing durable but less recent facts
ark-chatbot handles several of these directly in the extraction and STM paths. The extractor sees existing memory context before producing new structured output. The STM layer deduplicates entities, reinforces mention counts, and carries confidence forward as memory is re-mentioned. That combination makes retrieval easier because the memory store is cleaner before ranking even begins.
The remaining challenge is tuning. Confidence and decay are not one-time decisions. They affect the extraction prompt, storage behavior, and retrieval weighting at the same time. Getting them right requires thinking through what should persist in the product, what should fade, how conflicting facts should be handled, and how strongly reinforced facts should compete with fresher but weaker signals.
This is also why long-term memory often becomes stale in weak implementations. If old facts are never challenged, if conflicting updates are not resolved, or if decay is too weak or too strong, the assistant starts surfacing outdated memory and the user experience degrades quickly.
Retrieval Still Determines the Prompt
Saving memory is only half the problem.
The more important question is what should be sent back to the model on the next turn.
Too little memory and the assistant feels forgetful.
Too much memory and the assistant becomes expensive, noisy, and less accurate.
This is why the context engine has to perform context assembly, not just retrieval.
For ark-chatbot, this is also where most of the performance benefit comes from. In-session responses stay particularly fast because the engine leans heavily on recent-message context and only supplements it with memory when it is actually useful. Cross-session recall matters for quality as well; internal evaluation of the memory system has measured cross-session accuracy above 93 percent when relevant prior context exists and is properly scoped.
A Practical Retrieval Strategy
A practical retrieval pattern is layered.
1. Start with recent messages
Always start with recent conversation history. In ark-chatbot, a small recent-message window has worked well in practice: 10 recent messages total across both user and assistant turns is usually enough to preserve the active thread without bloating the prompt. The current chat matters most, and recent turns are usually the strongest signal for what the model needs right now. That number is not universal, but it has been a practical default for this system.
2. Retrieve structured memory
Search stored entities, relationships, recaps, and episodes for items relevant to the current message.
3. Enforce a token budget
Split the prompt budget intentionally across:
- system instruction
- recent conversation
- retrieved memory
This prevents the memory layer from flooding the prompt and ensures the live turn still dominates.
4. Deduplicate aggressively
If a fact already appears in the recent message window, do not waste budget re-injecting it from memory.
This sounds small, but it matters. A memory system should add missing context, not echo what the model can already see.
5. Show memory age
Tag each memory with a relative time prefix — "today", "1 day ago", "5 days ago". Day-level granularity is enough. The LLM uses it to weight recent vs old information and can reference age naturally in responses.
Why Hybrid Retrieval Beats a Single Method
A hybrid approach works better because no single retrieval strategy works well by itself.
Keyword search is good at exact matches.
Vector search is good at semantic matches.
Recency and confidence are good at ranking what should matter more.
Together, these signals are much more useful than any one alone.
For example:
- keyword matching helps when the user repeats an exact name
- vector similarity helps when the user changes wording
- recency helps prioritize fresh context
- confidence helps promote facts that have been reinforced over time
This is the difference between "the system found something related" and "the system found the memory that should actually be in the prompt."
Confidence and decay are especially important here. They are not just storage metadata. They become active ranking inputs during retrieval. In a production context engine, confidence should be shaped by the extractor and reinforcement logic, while decay should keep stale memory from overwhelming the prompt without erasing durable knowledge too aggressively.
When You Don't Need Hybrid
Hybrid is the default in ark-chatbot, but you can ship with FTS only. Lexical search is fast, cheap, and accurate enough when entity names are exact and queries are short. Vector search earns its weight when users describe things indirectly ("my puppy" → "Max the dog"). A reasonable starting point: FTS-only for STM, hybrid only for long-term episodes.
Why Not Just Store Everything
This is one of the biggest design mistakes in memory systems.
More storage is not the same as better memory.
Storing everything without deciding what matters creates three new problems:
- retrieval quality gets worse
- prompt assembly gets noisier
- debugging gets harder
A useful memory system needs compression at multiple stages:
- extraction compresses conversation into structured facts
- recaps compress chunks of messages
- episodes compress full chats into long-term memory artifacts
That layered compression is what makes persistent memory manageable instead of chaotic.
It is also what keeps token spend under control. A context engine that stores and re-injects everything eventually pays for the same information over and over. A context engine that compresses, scopes, and ranks memory can reduce token processing cost significantly while keeping responses fast and accurate.
Extensibility: Memory Should Not Stop at the Chat Log
Conversation memory is only one source of useful context.
In real products, the chatbot often needs access to additional data:
- user profile data
- product catalog or product state
- project-specific context
- organization or workspace memory
- uploaded documents
- internal knowledge base content
- external systems such as CRM, tickets, tasks, or calendars
A useful context engine should make it straightforward to combine these sources rather than forcing everything into one giant transcript or one special-purpose store.
The practical pattern is:
- retrieve recent conversation
- retrieve conversation memory
- retrieve external or domain-specific context
- merge everything into the same prompt budget
That lets the application combine previously learned user context with system knowledge from elsewhere.
For example, a support assistant might use:
- chat-scoped short-term memory for the active issue
- user-scoped memory for preferences and history
- org-scoped memory for account-level context
- document retrieval for product docs and policies
A B2B copilot might use:
- user-scoped memory for the operator
- project-scoped memory for the current engagement
- organization-scoped memory for shared team context
- product or customer data pulled from internal systems
The key idea is that the context engine should not care whether context came from a chat, a document index, or a product data source. It should care about relevance, scope, and prompt budget.
That extensibility is one of the main reasons to separate the memory layer from the chatbot itself.
The same separation also makes ark-chatbot a reasonable base for light agentic workflows. Linear tool use fits naturally into the architecture: retrieve memory, retrieve documents, call a small number of tools, and assemble the result into the same bounded context. This works well for patterns like RAG over documents, simple tool integration, user data lookup, organization data lookup, and product-state retrieval.
Heavier tasking agents are a different category. Long-running planners, multi-step execution loops, deep tool chains, and autonomous task graphs usually need additional architecture for planning state, execution control, retries, and observability. A context engine like this is useful for designing the context and memory layer inside agentic systems, but it is not a full agent runtime.
A Production System Also Needs to Say No
Memory retrieval should not only decide what to include. It should also help decide when not enough support exists.
A production chatbot needs an explicit insufficient-context path:
- when retrieval confidence is weak, avoid inventing memory
- when supporting data is missing, say so directly
- when the question is answerable with a tool or document fetch, ask for that lookup or trigger it
- when the current scope is wrong, avoid leaking unrelated scoped memory into the answer
This matters for trust just as much as successful recall. A system that remembers well but hallucinates when memory is absent is still unreliable.
In practical terms, this means the context engine should pass support signals into the prompt assembly step. If the engine did not retrieve enough evidence, the model should be instructed to say it does not know, ask a clarifying question, or defer until better context is available. That pattern has already proven useful in production and is a natural extension for the open-source ark-chatbot project as well.
The Tradeoffs That Matter
The architecture is shaped by a set of practical tradeoffs.
Extract every message vs batch extraction
Per-message extraction feels cleaner in theory, but it is expensive and often unnecessary. Batching extraction every few user turns is usually a better balance between freshness and cost.
Single memory store vs layered memory
A single store is simpler, but it blurs "what is happening now" with "what should persist across sessions." The STM/LTM split keeps those jobs separate.
Bigger prompts vs better retrieval
Larger context windows help, but they do not remove the need for retrieval discipline. If the system does not know what matters, more context just means more noise.
Inline memory generation vs background processing
Doing memory work inline increases latency and makes the message path fragile. Moving extraction and episode generation to background flows keeps the user path simpler.
These are the kinds of tradeoffs that separate a demo from a production-capable system.
What Makes This Production-Usable
A memory engine has to fit into a real product stack, not just look good in an architecture diagram.
The decisions that make this architecture practical are not flashy, but they matter:
- the API surface is simple enough to integrate into an existing LLM application
- provider interfaces are explicit, so model vendors can be swapped
- token budgets are configurable rather than buried in prompt logic
- memory generation is background-oriented
- retrieval is structured enough to debug
- the system is self-hostable and testable
- the code is concrete enough to pull from the repository and inspect end to end
That last point matters for teams building with LLMs. Many do not need an entire agent platform. They need a memory layer that is understandable, controllable, and adaptable.
Evaluation Starts with Risk
Evaluation is deliberately not covered in depth in this article.
That is not because it is unimportant. It is because evaluation for systems like this is a broad topic in its own right, and it starts with the product's risk profile.
In practice, evaluation usually combines several methods:
- deterministic tests for stable behaviors and regressions
- model-based or AI-as-a-judge evaluation for fuzzier quality checks
- human-in-the-loop review for product judgment, edge cases, and trust
Some parts of testing strategy can be standardized. Most of it still depends on the product, the failure modes that matter, and the level of risk the system is allowed to carry.
An eval harness is often a good idea, but it should not be copied blindly. Each team needs to define what good behavior means for its product, what failure is unacceptable, and what evidence is strong enough to trust a release.
A separate article covers evaluation strategy for systems like this in more detail. For the purposes of this article, the focus stays on context-engine design rather than the full testing methodology around it.
Why Open Source It
Ark Chatbot Context Engine was open sourced simply because it can be.
The system was designed in a way that could be shared publicly, and the hope is that it is useful to other teams building with LLMs. It should also serve as a concrete example of how to think about memory, retrieval, and context assembly in real products.
This is an ongoing problem, not a finished one. Open sourcing the engine makes it possible to improve it over time, learn from how others approach the same challenge, and give developers a better starting point than building the entire memory layer from scratch.
Closing
For teams building LLM products, the question is not whether conversation history should be stored.
The real question is whether the system knows how to turn conversation into usable memory.
That is what a context engine is for.
It decides what matters, stores it in forms that are useful later, retrieves it under real constraints, and keeps the assistant coherent across sessions.
In ark-chatbot, that design is implemented directly in the application stack without relying on an external memory framework. The result is a context engine that is concrete, inspectable, fast in session, accurate across sessions, and efficient enough to reduce unnecessary token spend.
That matters because better context design does not just improve quality. It also saves time, reduces cost, and makes LLM products easier to ship and improve.
Ark Chatbot Context Engine is now open source for teams working on that problem.