Agent Memory Toolkit stores conversation data as memory documents and builds higher-level artifacts on top of those documents for retrieval and reuse.
Every memory uses the same base shape:
| Field | Description |
|---|---|
| id | Unique identifier |
| user_id | User the memory belongs to |
| thread_id | Conversation thread |
| role | user, agent, tool, or system |
| type | turn, summary, fact, or user_summary |
| content | Main text payload |
| embedding | Vector used for semantic search |
| metadata | Extra context (e.g. tool_name, tool_call_id, edit reasons) |
| created_at | ISO 8601 timestamp |
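As a rough illustration of the base shape, the fields in the table can be mirrored in a small dataclass. The class name `MemoryDoc`, the defaults, and the UUID-based `id` are assumptions made for this sketch; the toolkit may represent documents differently (for example as plain dictionaries).

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class MemoryDoc:
    """Hypothetical mirror of the base shape; not a class exported by the toolkit."""

    user_id: str
    thread_id: str
    role: str      # "user", "agent", "tool", or "system"
    type: str      # "turn", "summary", "fact", or "user_summary"
    content: str
    # Assumed default: a random UUID (derived memories use deterministic IDs instead).
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    embedding: Optional[list[float]] = None
    metadata: dict[str, Any] = field(default_factory=dict)
    # ISO 8601 timestamp in UTC.
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


turn = MemoryDoc(
    user_id="u1",
    thread_id="t1",
    role="user",
    type="turn",
    content="I prefer aisle seats.",
)
doc = asdict(turn)  # plain dict, ready to serialize as JSON
```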
Type: turn
Turn memories are raw conversation records. They are created by add_local(), add_cosmos(), and push_to_cosmos() (which bulk-uploads local memories to Cosmos DB). They act as the source material for summaries and facts.
Use for: full conversation history and short-term context.
Type: summary
A summary compresses one thread into a compact record of the main topic, key decisions, open issues, and next steps. It is created by generate_thread_summary() and stored with deterministic ID summary_{user_id}_{thread_id}.
If a summary already exists, the pipeline reads it, loads only newer memories, and merges them into an updated summary instead of rebuilding from scratch.
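The deterministic ID and the "load only newer memories" step might look roughly like the sketch below. Both helper names are hypothetical, and using the existing summary's created_at as the cutoff is an assumption; the pipeline may track its high-water mark differently.

```python
from typing import Optional


def summary_id(user_id: str, thread_id: str) -> str:
    """Build the deterministic thread-summary ID described above."""
    return f"summary_{user_id}_{thread_id}"


def select_new_memories(existing_summary: Optional[dict], memories: list[dict]) -> list[dict]:
    """Return the memories to merge, oldest first.

    Assumes created_at values are ISO 8601 strings in the same format and
    timezone, so plain string comparison matches chronological order.
    """
    if existing_summary is None:
        newer = list(memories)  # no summary yet: process everything
    else:
        cutoff = existing_summary["created_at"]
        newer = [m for m in memories if m["created_at"] > cutoff]
    return sorted(newer, key=lambda m: m["created_at"])


existing = {"id": summary_id("u1", "t1"), "created_at": "2024-05-01T10:00:00+00:00"}
thread = [
    {"id": "m1", "created_at": "2024-05-01T09:00:00+00:00"},
    {"id": "m2", "created_at": "2024-05-01T09:30:00+00:00"},
    {"id": "m3", "created_at": "2024-05-01T11:00:00+00:00"},
]
to_merge = select_new_memories(existing, thread)  # only m3 is newer than the summary
```

Because the ID is derived from user_id and thread_id, an upsert overwrites the previous summary rather than creating a second document.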
Use for: long conversations that need a compact recap.
Type: fact
Facts are discrete assertions extracted from a thread, such as preferences, requirements, or confirmed decisions. extract_facts() stores each fact as its own document with its own embedding.
Use for: fine-grained semantic retrieval across threads.
Type: user_summary
A user summary is a cross-thread profile for one user. It captures stable context such as preferences, current work, environment details, and constraints. It is created by generate_user_summary() and stored with ID user_summary_{user_id} and thread_id="__user_summary__".
Like thread summaries, user summaries update incrementally by merging the existing profile with only the new memories.
Use for: cross-session personalization and onboarding context.
| | Short-Term | Long-Term |
|---|---|---|
| What | Turn messages | Summaries, facts, user summaries |
| Granularity | Per message | Per thread, per fact, or per user |
| Created by | add_local() / add_cosmos() / push_to_cosmos() | generate_thread_summary() / extract_facts() / generate_user_summary() |
| Purpose | Replay recent context | Compact recall and semantic retrieval |
Common pattern: keep turns during an active conversation, then generate summaries or facts when the thread gets long or is complete.
A thread is the unit of conversation. get_thread() returns the memories for one thread_id, optionally limited to the most recent k entries. get_memories() also supports a thread_id filter to retrieve memories from a specific thread.
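To make the "most recent k entries" behavior concrete, here is an in-memory sketch of the same idea. It is not the implementation of get_thread(); the function name `most_recent` and the oldest-first return order are assumptions.

```python
from typing import Optional


def most_recent(memories: list[dict], thread_id: str, k: Optional[int] = None) -> list[dict]:
    """Return memories for one thread, oldest first, optionally only the last k."""
    in_thread = sorted(
        (m for m in memories if m["thread_id"] == thread_id),
        key=lambda m: m["created_at"],
    )
    if k is None:
        return in_thread
    # Guard k <= 0 explicitly: in_thread[-0:] would return the whole list.
    return in_thread[-k:] if k > 0 else []


store = [
    {"id": "a", "thread_id": "t1", "created_at": "2024-05-01T09:00:00+00:00"},
    {"id": "b", "thread_id": "t2", "created_at": "2024-05-01T09:05:00+00:00"},
    {"id": "c", "thread_id": "t1", "created_at": "2024-05-01T09:10:00+00:00"},
    {"id": "d", "thread_id": "t1", "created_at": "2024-05-01T09:20:00+00:00"},
]
last_two = most_recent(store, "t1", k=2)
```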
| Role | Meaning |
|---|---|
| user | Human message |
| agent | Assistant message |
| tool | Tool output (metadata can include tool_name and tool_call_id) |
| system | Generated artifacts such as summaries and facts |
Memories stored in Cosmos DB include embeddings generated by Microsoft AI Foundry (e.g. text-embedding-3-large). This enables semantic search: a query is embedded, then Cosmos returns the closest matching memories via vector distance. search_cosmos also supports hybrid search (vector + full-text ranking via RRF).
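Reciprocal Rank Fusion (RRF) combines several ranked lists by summing 1 / (k + rank) for each document across the lists. Cosmos DB performs this fusion server-side; the sketch below only illustrates the scoring idea. The function name `rrf_fuse` is hypothetical, and k = 60 is a commonly used constant, not a confirmed Cosmos DB setting.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion, best match first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break exact ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["fact_a", "fact_b", "fact_c"]  # ordered by vector distance
text_hits = ["fact_b", "fact_d", "fact_a"]    # ordered by full-text rank
fused = rrf_fuse([vector_hits, text_hits])
```

Documents that rank well in both lists (here fact_b and fact_a) rise above documents that appear in only one.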
Facts work especially well for vector search because each fact is stored as a small, self-contained document.
Derived memories are generated by an Azure Durable Functions pipeline:
read existing doc if present -> query source memories -> call LLM -> embed output -> upsert to Cosmos DB
Thread summaries and user summaries support:
- deterministic IDs
- incremental updates
- recent_k limits to restrict how much history is processed
Prompts for summarization and fact extraction live in azure_functions/prompts/ and can be edited without changing pipeline code.
The reconcile_memories(user_id, n=50) pipeline step reads up to the n most recent active facts for a user and asks the LLM to identify two independent outcomes in one pass:
- Duplicates: two or more facts that restate the same claim in different words. Resolution: collapse them into one merged fact; the originals are soft-deleted with supersede_reason="duplicate" and superseded_by set to the merged fact's id.
- Contradictions: two facts that assert opposing claims about the same subject. Resolution: keep the winner (more recent first, higher confidence as tiebreaker) and soft-delete the loser with supersede_reason="contradict" and superseded_by set to the winner.
Detecting contradictions semantically requires the LLM to see the candidate pool as a whole — paraphrased ("user prefers aisle seats") and contradictory ("user is vegetarian" vs "user loves steak") facts often have very different embedding vectors and would never co-occur in any cosine cluster. Putting all N candidates into one prompt lets the LLM do the semantic reasoning across both axes simultaneously. The pipeline returns {"kept": int, "merged": int, "contradicted": int}.
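Once the LLM has flagged a contradicting pair, applying the resolution rule is mechanical. The sketch below shows one way to pick a winner and soft-delete the loser. The helper names and the `confidence` field name are assumptions; only supersede_reason, superseded_by, and superseded_at come from the description in this document.

```python
from datetime import datetime, timezone
from typing import Optional


def resolve_contradiction(fact_a: dict, fact_b: dict, now: Optional[str] = None) -> tuple[dict, dict]:
    """Soft-delete the losing fact in place and return (winner, loser)."""
    now = now or datetime.now(timezone.utc).isoformat()
    # More recent created_at wins; confidence (assumed field) breaks ties.
    winner, loser = sorted(
        (fact_a, fact_b),
        key=lambda f: (f["created_at"], f.get("confidence", 0.0)),
        reverse=True,
    )
    loser["supersede_reason"] = "contradict"
    loser["superseded_by"] = winner["id"]
    loser["superseded_at"] = now
    return winner, loser


def active(facts: list[dict]) -> list[dict]:
    """Keep only facts that have not been superseded."""
    return [f for f in facts if f.get("superseded_by") is None]


old = {"id": "f1", "content": "user is vegetarian", "created_at": "2024-01-01T00:00:00+00:00"}
new = {"id": "f2", "content": "user loves steak", "created_at": "2024-06-01T00:00:00+00:00"}
winner, loser = resolve_contradiction(old, new)
remaining = active([old, new])
```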
Soft-deleted facts stay in the container with their supersede_reason, superseded_at, and superseded_by fields populated. Default reads (get_memories, search_cosmos) filter them out via superseded_by IS NULL. To inspect the audit trail (e.g. "show everything that ever applied to this user"), opt out of the filter at the query level.
Each fact written by extract_memories carries a content_hash (SHA-256 of normalized content, truncated to 32 hex chars; lowercase, whitespace-collapsed). Before upserting a freshly-extracted fact, the pipeline checks the hash against existing active facts and short-circuits if a match exists, incrementing the exact_dedup_skipped metric. This catches identical re-extractions cheaply without an LLM call.
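A minimal sketch of the hash and the short-circuit check follows. The function names, the UTF-8 encoding, the stripping of leading and trailing whitespace, and the shape of the metrics dictionary are assumptions; the description above only fixes SHA-256, lowercase, collapsed whitespace, and truncation to 32 hex characters.

```python
import hashlib


def content_hash(content: str) -> str:
    """SHA-256 of lowercased, whitespace-collapsed content, first 32 hex chars."""
    normalized = " ".join(content.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:32]


def should_upsert(content: str, active_hashes: set[str], metrics: dict[str, int]) -> bool:
    """Return False (and count the skip) when an identical active fact exists."""
    digest = content_hash(content)
    if digest in active_hashes:
        metrics["exact_dedup_skipped"] = metrics.get("exact_dedup_skipped", 0) + 1
        return False
    active_hashes.add(digest)
    return True


metrics: dict[str, int] = {}
seen: set[str] = set()
first = should_upsert("User prefers aisle seats", seen, metrics)
second = should_upsert("  user   prefers\naisle SEATS ", seen, metrics)  # same after normalization
```

This check only catches exact matches after normalization; paraphrases still need the LLM-based reconcile step.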
DEDUP_EVERY_N (default 5) controls how often reconcile_memories runs in the auto-trigger path. Set to 0 to disable. The candidate cap n (default 50) is tunable per call; larger values give the LLM a wider view at higher token cost.
Indexing note. The reconcile pool query orders by created_at (matching the prompt's "more recent first" tiebreaker). Cosmos's default indexing policy includes every property, so this works out of the box. If you customize the indexing policy to reduce write RU, ensure /created_at/? remains indexed or the query will fail with a 400 (Order-by over a non-indexed path).
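For illustration, a trimmed-down indexing policy that still supports the ORDER BY on created_at could be expressed as the Python dictionary below. Every included path other than /created_at/? is a guess at what the toolkit's queries filter on, and the fragment omits any vector or full-text index settings, so treat it as a starting point rather than a drop-in policy.

```python
# Hypothetical custom policy: exclude everything by default, then opt back in
# to the paths that queries filter or sort on.
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [
        {"path": "/created_at/?"},     # required by the reconcile ORDER BY
        {"path": "/type/?"},           # assumed filter path
        {"path": "/superseded_by/?"},  # assumed filter path
    ],
    "excludedPaths": [
        {"path": "/*"},
    ],
}

included = {entry["path"] for entry in indexing_policy["includedPaths"]}
```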
In addition to on-demand processing via the SDK, the toolkit includes a Cosmos DB change feed trigger that automatically starts processing orchestrations when enough new turns have been written.
```
memories container
    │  (change feed)
    ▼
on_memory_change trigger
    │
    ├── count turns per (user_id, thread_id)
    │       └── crosses threshold? ──► start thread_summary / extract_facts
    │
    └── count turns per user_id
            └── crosses threshold? ──► start user_summary
```
- The change feed trigger watches the memories container for new documents.
- Only documents with type == "turn" are counted (summaries, facts, and user summaries are ignored).
- Documents in the dedicated counter container track how many turns have been seen per scope using ETag-based optimistic concurrency.
- When a counter crosses a configured threshold, the corresponding Durable Functions orchestration is started automatically.
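ETag-based optimistic concurrency means a writer reads the counter together with its ETag, then replaces the document only if the ETag is unchanged, retrying on conflict. The in-memory simulation below illustrates that read-modify-write loop. All class and function names are hypothetical, and a real implementation would issue a conditional replace against Cosmos DB instead of using a dictionary.

```python
import uuid


class PreconditionFailed(Exception):
    """Raised when the stored ETag no longer matches the caller's ETag."""


class InMemoryCounterStore:
    """Stand-in for the counter container, keyed by scope."""

    def __init__(self) -> None:
        self._docs: dict[str, dict] = {}

    def read(self, key: str) -> dict:
        doc = self._docs.get(key)
        return dict(doc) if doc else {"count": 0, "_etag": None}

    def replace(self, key: str, count: int, if_match) -> None:
        current = self._docs.get(key)
        current_etag = current["_etag"] if current else None
        if current_etag != if_match:
            raise PreconditionFailed(key)  # someone else wrote first
        self._docs[key] = {"count": count, "_etag": uuid.uuid4().hex}


def increment(store: InMemoryCounterStore, key: str, delta: int = 1, max_retries: int = 5) -> int:
    """Add delta to the counter, retrying when a concurrent write wins."""
    for _ in range(max_retries):
        doc = store.read(key)
        try:
            store.replace(key, doc["count"] + delta, if_match=doc["_etag"])
            return doc["count"] + delta
        except PreconditionFailed:
            continue  # re-read the latest value and try again
    raise RuntimeError("too many concurrent updates")


store = InMemoryCounterStore()
increment(store, "u1|t1")
total = increment(store, "u1|t1", delta=3)
```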
| Setting | Scope | Default |
|---|---|---|
| THREAD_SUMMARY_EVERY_N | Per (user_id, thread_id) | 0 (disabled) |
| FACT_EXTRACTION_EVERY_N | Per (user_id, thread_id) | 0 (disabled) |
| USER_SUMMARY_EVERY_N | Per user_id (across all threads) | 0 (disabled) |
Set any value to 0 to disable that processing type. For example, setting THREAD_SUMMARY_EVERY_N=5 generates a thread summary every 5 new turns in each thread.
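One plausible reading of "crosses a threshold" is that a trigger fires whenever the running count passes a multiple of N, including when a change feed batch delivers several turns at once. The sketch below implements that interpretation. Both function names are hypothetical, and the toolkit's actual crossing logic may differ.

```python
import os
from typing import Mapping


def every_n(name: str, environ: Mapping[str, str] = os.environ) -> int:
    """Read an *_EVERY_N setting; missing or 0 means disabled."""
    return max(0, int(environ.get(name, "0")))


def crossed_threshold(previous_count: int, new_count: int, n: int) -> bool:
    """True when the count passed at least one multiple of n in this update."""
    if n <= 0:
        return False  # processing type disabled
    return new_count // n > previous_count // n


settings = {"THREAD_SUMMARY_EVERY_N": "5"}
n = every_n("THREAD_SUMMARY_EVERY_N", settings)
fires_at_fifth_turn = crossed_threshold(4, 5, n)
fires_at_sixth_turn = crossed_threshold(5, 6, n)
```

Comparing integer quotients instead of testing `new_count % n == 0` avoids missing a trigger when a single batch jumps over the multiple (for example from 3 to 11 with N = 5).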
| Container | Partition Key | Purpose |
|---|---|---|
| memories | /user_id, /thread_id (hierarchical) | Durable derived memories (fact, episodic, procedural) |
| memories_turns | /user_id, /thread_id (hierarchical) | Raw conversation turns (turn); append-only, TTL-pruned |
| memories_summaries | /user_id, /thread_id (hierarchical) | Thread + user summaries (thread_summary, user_summary) |
| counter | /user_id, /thread_id (hierarchical) | Message count tracking for automatic processing |
| leases | /id | Change feed checkpointing container created by create_memory_store() |
The toolkit provisions all required Cosmos containers under one shared throughput mode:
- serverless is the default. The toolkit creates the memories, memories_turns, memories_summaries, counter, and leases containers without specifying RU/s.
- autoscale applies the shared COSMOS_DB_AUTOSCALE_MAX_RU cap to all five containers.
This keeps the change feed dependencies aligned with the main memory store instead of letting the Functions trigger create the lease container independently.
| Mode | Trigger | Use case |
|---|---|---|
| On-demand (pull) | SDK call (generate_thread_summary(), etc.) | Explicit control over when processing happens |
| Automatic (push) | Change feed trigger | Fire-and-forget: processing happens in the background as turns are written |
Both modes use the same Durable Functions orchestrator and activities, so prompts, incremental update logic, and stored outputs are identical.
| Backend | Use Case | Persistence |
|---|---|---|
| Local (in-memory) | Development and quick testing | Lost on process exit |
| Cosmos DB | Production, shared access, semantic search | Durable |
Local storage is enough for CRUD testing. Cosmos DB is required for persistence, vector search, and the processing pipeline.
Cosmos DB uses a hierarchical partition key (user_id, thread_id) for efficient queries scoped to a user or thread.