Concepts

Agent Memory Toolkit stores conversation data as memory documents and builds higher-level artifacts on top of those documents for retrieval and reuse.

Memory Schema

Every memory uses the same base shape:

Field	Description
`id`	Unique identifier
`user_id`	User the memory belongs to
`thread_id`	Conversation thread
`role`	`user`, `agent`, `tool`, or `system`
`type`	`turn`, `summary`, `fact`, or `user_summary`
`content`	Main text payload
`embedding`	Vector used for semantic search
`metadata`	Extra context (e.g. `tool_name`, `tool_call_id`, edit reasons)
`created_at`	ISO 8601 timestamp
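
As a rough illustration of the base shape, the fields in the table above could be modeled as a small dataclass. This is only a sketch: the class name `MemoryDocument`, the UUID default for `id`, and the validation in `__post_init__` are assumptions made for the example, not the toolkit's actual types.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional
import uuid

VALID_ROLES = {"user", "agent", "tool", "system"}
VALID_TYPES = {"turn", "summary", "fact", "user_summary"}


@dataclass
class MemoryDocument:
    """Hypothetical container mirroring the schema table (not the toolkit's class)."""

    user_id: str
    thread_id: str
    role: str
    type: str
    content: str
    embedding: Optional[list[float]] = None  # filled in once the content is embedded
    metadata: dict[str, Any] = field(default_factory=dict)
    # Assumption: a random UUID; derived memories use deterministic IDs instead.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        # Reject values outside the sets listed in the schema table.
        if self.role not in VALID_ROLES:
            raise ValueError(f"unknown role: {self.role!r}")
        if self.type not in VALID_TYPES:
            raise ValueError(f"unknown type: {self.type!r}")


turn = MemoryDocument(
    user_id="u1",
    thread_id="t1",
    role="user",
    type="turn",
    content="I prefer aisle seats.",
)
```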

Memory Types

Turn

Type: turn

Turn memories are raw conversation records. They are created by add_local(), add_cosmos(), and push_to_cosmos() (which bulk-uploads local memories to Cosmos DB). They act as the source material for summaries and facts.

Use for: full conversation history and short-term context.

Summary

Type: summary

A summary compresses one thread into a compact record of the main topic, key decisions, open issues, and next steps. It is created by generate_thread_summary() and stored under the deterministic ID summary_{user_id}_{thread_id}.

If a summary already exists, the pipeline reads it, loads only newer memories, and merges them into an updated summary instead of rebuilding from scratch.
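
The deterministic ID and the "only newer memories" step can be sketched as follows. The helper names are hypothetical, and the cutoff rule (comparing each memory's `created_at` with the existing summary's `created_at`) is an assumption; the pipeline may track what it has already summarized in a different way.

```python
from datetime import datetime
from typing import Optional


def thread_summary_id(user_id: str, thread_id: str) -> str:
    """Build the deterministic ID following summary_{user_id}_{thread_id}."""
    return f"summary_{user_id}_{thread_id}"


def memories_newer_than(
    memories: list[dict], existing_summary: Optional[dict]
) -> list[dict]:
    """Return the memories that the existing summary has not covered yet.

    Assumption: "newer" means created_at is later than the summary's created_at.
    """
    if existing_summary is None:
        return list(memories)  # first run: everything is new
    cutoff = datetime.fromisoformat(existing_summary["created_at"])
    return [
        m for m in memories
        if datetime.fromisoformat(m["created_at"]) > cutoff
    ]


turns = [
    {"id": "a", "created_at": "2024-05-01T10:00:00+00:00"},
    {"id": "b", "created_at": "2024-05-01T11:00:00+00:00"},
]
summary = {
    "id": thread_summary_id("u1", "t1"),
    "created_at": "2024-05-01T10:30:00+00:00",
}
new_turns = memories_newer_than(turns, summary)  # only turn "b" is newer
```

Because the ID is derived from the user and thread, re-running the pipeline overwrites the same document rather than creating a second summary.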

Use for: long conversations that need a compact recap.

Fact

Type: fact

Facts are discrete assertions extracted from a thread, such as preferences, requirements, or confirmed decisions. extract_facts() stores each fact as its own document with its own embedding.

Use for: fine-grained semantic retrieval across threads.

User Summary

Type: user_summary

A user summary is a cross-thread profile for one user. It captures stable context such as preferences, current work, environment details, and constraints. It is created by generate_user_summary() and stored with ID user_summary_{user_id} and thread_id="__user_summary__".

Like thread summaries, user summaries update incrementally by merging the existing profile with only the new memories.

Use for: cross-session personalization and onboarding context.

Short-Term vs. Long-Term Memory

	Short-Term	Long-Term
What	Turn messages	Summaries, facts, user summaries
Granularity	Per message	Per thread, per fact, or per user
Created by	`add_local()` / `add_cosmos()` / `push_to_cosmos()`	`generate_thread_summary()` / `extract_facts()` / `generate_user_summary()`
Purpose	Replay recent context	Compact recall and semantic retrieval

Common pattern: keep turns during an active conversation, then generate summaries or facts when the thread gets long or is complete.

Threads and Roles

A thread is the unit of conversation. get_thread() returns the memories for one thread_id, optionally limited to the most recent k entries. get_memories() also supports a thread_id filter to retrieve memories from a specific thread.
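
The "most recent k entries" behavior can be pictured with an in-memory stand-in. This is not the toolkit's get_thread(); the function name, the chronological sort on `created_at`, and the handling of non-positive `k` are assumptions for illustration.

```python
from typing import Optional


def get_thread_sketch(
    memories: list[dict], thread_id: str, k: Optional[int] = None
) -> list[dict]:
    """Return one thread's memories in chronological order, optionally the last k."""
    # ISO 8601 timestamps with the same UTC offset sort correctly as strings.
    thread = sorted(
        (m for m in memories if m["thread_id"] == thread_id),
        key=lambda m: m["created_at"],
    )
    if k is None:
        return thread
    return thread[-k:] if k > 0 else []


memories = [
    {"thread_id": "t1", "created_at": "2024-05-01T10:02:00+00:00", "content": "third"},
    {"thread_id": "t1", "created_at": "2024-05-01T10:00:00+00:00", "content": "first"},
    {"thread_id": "t2", "created_at": "2024-05-01T10:01:00+00:00", "content": "other"},
    {"thread_id": "t1", "created_at": "2024-05-01T10:01:00+00:00", "content": "second"},
]
recent = get_thread_sketch(memories, "t1", k=2)
```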

Role	Meaning
`user`	Human message
`agent`	Assistant message
`tool`	Tool output (metadata can include `tool_name` and `tool_call_id`)
`system`	Generated artifacts such as summaries and facts

Embeddings and Search

Memories stored in Cosmos DB include embeddings generated by Microsoft AI Foundry (e.g. text-embedding-3-large). This enables semantic search: a query is embedded, then Cosmos returns the closest matching memories via vector distance. search_cosmos also supports hybrid search (vector + full-text ranking via RRF).

Facts work especially well for vector search because each fact is stored as a small, self-contained document.
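
To make the hybrid ranking concrete, here is a minimal sketch of reciprocal rank fusion (RRF) over two ranked lists of document IDs. It shows the general technique only: the function name is hypothetical, the constant `k=60` is the value commonly used for RRF, and Cosmos DB performs its own fusion server-side with details that may differ.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]     # ranked by vector distance
full_text_hits = ["b", "c", "d"]  # ranked by full-text score
fused = reciprocal_rank_fusion([vector_hits, full_text_hits])
```

Documents that appear in both lists ("b" and "c" here) rise above documents that rank well in only one of them.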

Processing Pipeline

Derived memories are generated by an Azure Durable Functions pipeline:

read existing doc if present -> query source memories -> call LLM -> embed output -> upsert to Cosmos DB

Thread summaries and user summaries support:

deterministic IDs
incremental updates
recent_k limits to restrict how much history is processed

Prompts for summarization and fact extraction live in azure_functions/prompts/ and can be edited without changing pipeline code.
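
The flow above can be sketched as a single function with each stage passed in as a callable. Everything here is hypothetical scaffolding for illustration (the function name, the callables, and the shape of the returned document); the real pipeline runs as Durable Functions activities.

```python
from typing import Callable, Optional


def run_derived_memory_pipeline(
    doc_id: str,
    read_doc: Callable[[str], Optional[dict]],
    query_sources: Callable[[Optional[dict]], list[dict]],
    call_llm: Callable[[Optional[str], list[dict]], str],
    embed: Callable[[str], list[float]],
    upsert: Callable[[dict], None],
    recent_k: Optional[int] = None,
) -> Optional[dict]:
    """Read existing doc -> query sources -> call LLM -> embed -> upsert."""
    existing = read_doc(doc_id)            # 1. read existing doc if present
    sources = query_sources(existing)      # 2. query source memories
    if recent_k is not None and recent_k > 0:
        sources = sources[-recent_k:]      #    limit how much history is processed
    if not sources:
        return existing                    # nothing new: keep the current doc
    previous = existing["content"] if existing else None
    content = call_llm(previous, sources)  # 3. call LLM (create or merge)
    doc = {"id": doc_id, "content": content, "embedding": embed(content)}  # 4. embed
    upsert(doc)                            # 5. upsert under the same deterministic ID
    return doc
```

Passing the stages in as callables keeps the sketch testable without Cosmos DB or an LLM; stub functions are enough to exercise the control flow.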

Memory Reconciliation

The reconcile_memories(user_id, n=50) pipeline step reads up to the n most recent active facts for a user and asks the LLM to identify two independent kinds of conflict in a single pass:

Duplicates — two or more facts that restate the same claim in different words. Resolution: collapse into one merged fact; the originals are soft-deleted with supersede_reason="duplicate" and superseded_by set to the merged fact's id.
Contradictions — two facts that assert opposing claims about the same subject. Resolution: keep the winner (more recent first, higher confidence as tiebreaker), soft-delete the loser with supersede_reason="contradict" and superseded_by set to the winner.
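
The contradiction rule (more recent first, higher confidence as tiebreaker) can be sketched as below. The function name and the `confidence` field name are assumptions, and in the real pipeline the LLM identifies the conflicting pair; this only illustrates how a winner could be chosen and the loser soft-deleted.

```python
from datetime import datetime, timezone


def resolve_contradiction(fact_a: dict, fact_b: dict) -> tuple[dict, dict]:
    """Return (winner, superseded_loser) for two contradicting facts."""

    def rank(fact: dict) -> tuple:
        # More recent wins; confidence (assumed field name) breaks ties.
        return (
            datetime.fromisoformat(fact["created_at"]),
            fact.get("confidence", 0.0),
        )

    winner, loser = sorted([fact_a, fact_b], key=rank, reverse=True)
    # Soft delete: the loser is kept, annotated with the supersede fields.
    superseded = {
        **loser,
        "supersede_reason": "contradict",
        "superseded_by": winner["id"],
        "superseded_at": datetime.now(timezone.utc).isoformat(),
    }
    return winner, superseded


older = {"id": "f1", "content": "user is vegetarian",
         "created_at": "2024-01-01T00:00:00+00:00", "confidence": 0.9}
newer = {"id": "f2", "content": "user loves steak",
         "created_at": "2024-06-01T00:00:00+00:00", "confidence": 0.6}
winner, loser = resolve_contradiction(older, newer)
```

Here the newer fact wins despite its lower confidence, because confidence is only consulted when the timestamps are equal.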

Why one pass

Detecting duplicates and contradictions semantically requires the LLM to see the candidate pool as a whole. Paraphrases of a fact (such as "user prefers aisle seats") and contradictory pairs ("user is vegetarian" vs. "user loves steak") often have very different embedding vectors and would not reliably fall into the same cosine-similarity cluster. Putting all n candidates into one prompt lets the LLM reason across both axes simultaneously. The pipeline returns {"kept": int, "merged": int, "contradicted": int}.

Loser preservation

Soft-deleted facts stay in the container with their supersede_reason, superseded_at, and superseded_by fields populated. Default reads (get_memories, search_cosmos) filter them out via superseded_by IS NULL. To inspect the audit trail (e.g. "show everything that ever applied to this user"), opt out of the filter at the query level.
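
The effect of the default filter can be illustrated with an in-memory stand-in. The function name and the `include_superseded` flag are hypothetical; they only mimic what opting out of the filter would return.

```python
def visible_memories(docs: list[dict], include_superseded: bool = False) -> list[dict]:
    """Mimic default reads: hide soft-deleted docs unless the caller opts out."""
    if include_superseded:
        return list(docs)  # audit view: everything that ever applied
    # A doc is active when superseded_by is missing or null.
    return [d for d in docs if d.get("superseded_by") is None]


facts = [
    {"id": "f1", "content": "user is vegetarian",
     "superseded_by": "f2", "supersede_reason": "contradict"},
    {"id": "f2", "content": "user loves steak", "superseded_by": None},
    {"id": "f3", "content": "user prefers aisle seats"},
]
active = visible_memories(facts)
audit_trail = visible_memories(facts, include_superseded=True)
```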

Write-time exact dedup

Each fact written by extract_memories carries a content_hash: the SHA-256 digest of the normalized content (lowercased, with whitespace collapsed), truncated to 32 hex characters. Before upserting a freshly extracted fact, the pipeline checks the hash against existing active facts and skips the write if a match exists, incrementing the exact_dedup_skipped metric. This catches identical re-extractions cheaply, without an LLM call.
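
A minimal sketch of the hash and the skip check, assuming normalization means lowercasing and collapsing runs of whitespace to single spaces. The helper names and the in-memory set of active hashes are hypothetical; the pipeline's exact normalization and lookup may differ.

```python
import hashlib


def content_hash(content: str) -> str:
    """SHA-256 of lowercased, whitespace-collapsed text, cut to 32 hex chars."""
    normalized = " ".join(content.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:32]


def should_write_fact(
    content: str, active_hashes: set[str], metrics: dict[str, int]
) -> bool:
    """Return False (and count the skip) when an identical active fact exists."""
    digest = content_hash(content)
    if digest in active_hashes:
        metrics["exact_dedup_skipped"] = metrics.get("exact_dedup_skipped", 0) + 1
        return False
    active_hashes.add(digest)
    return True


active_hashes: set[str] = set()
metrics: dict[str, int] = {}
first = should_write_fact("User prefers aisle seats", active_hashes, metrics)
second = should_write_fact("  user   prefers\taisle seats ", active_hashes, metrics)
```

The second call is skipped because both strings normalize to the same text. A paraphrase such as "user likes aisle seats" would hash differently, which is why semantic duplicates are left to reconcile_memories.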

Tunable

DEDUP_EVERY_N (default 5) controls how often reconcile_memories runs in the auto-trigger path. Set to 0 to disable. The candidate cap n (default 50) is tunable per call; larger values give the LLM a wider view at higher token cost.

Indexing note. The reconcile pool query orders by created_at (matching the prompt's "more recent first" tiebreaker). Cosmos's default indexing policy includes every property, so this works out of the box. If you customize the indexing policy to reduce write RU, ensure /created_at/? remains indexed or the query will fail with a 400 (Order-by over a non-indexed path).
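
For example, a trimmed indexing policy that excludes everything by default must still opt `/created_at/?` back in. The dictionary below follows the Cosmos DB indexing policy JSON shape; which other paths belong in `includedPaths` depends on the queries you run, so the set shown here is an assumption, not the toolkit's shipped policy.

```python
# Hypothetical opt-in indexing policy: exclude all paths, then include only
# the properties used in filters and ORDER BY.
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [
        {"path": "/user_id/?"},
        {"path": "/thread_id/?"},
        {"path": "/type/?"},
        {"path": "/created_at/?"},  # required by the reconcile pool's ORDER BY
    ],
    "excludedPaths": [
        {"path": "/*"},  # everything else is left unindexed to reduce write RU
    ],
}

included = {entry["path"] for entry in indexing_policy["includedPaths"]}
created_at_indexed = "/created_at/?" in included
```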

Automatic Processing (Change Feed)

In addition to on-demand processing via the SDK, the toolkit includes a Cosmos DB change feed trigger that automatically starts processing orchestrations when enough new turns have been written.

memories container
      │  (change feed)
      ▼
on_memory_change trigger
      │
      ├── count turns per (user_id, thread_id)
      │   └── crosses threshold? ──► start thread_summary / extract_facts
      │
      └── count turns per user_id
          └── crosses threshold? ──► start user_summary

How it works

The change feed trigger watches the memories container for new documents.
Only documents with type == "turn" are counted (summaries, facts, and user summaries are ignored).
Documents in the dedicated counter container track how many turns have been seen per scope. Counter updates use ETag-based optimistic concurrency, so concurrent writers do not overwrite each other's increments.
When a counter crosses a configured threshold, the corresponding Durable Functions orchestration is started automatically.
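
The ETag-based counter update can be sketched with an in-memory store that rejects writes carrying a stale ETag. The class and function names are hypothetical; against Cosmos DB the same read, conditional replace, and retry loop would rely on the service's ETag precondition instead.

```python
import uuid


class EtagConflict(Exception):
    """Raised when a write carries an ETag that no longer matches the stored doc."""


class InMemoryCounterStore:
    """Stand-in for the counter container (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._docs: dict = {}

    def read(self, key):
        doc = self._docs.get(key)
        return (doc["count"], doc["etag"]) if doc else (0, None)

    def replace(self, key, count: int, if_match) -> None:
        current = self._docs.get(key)
        current_etag = current["etag"] if current else None
        if current_etag != if_match:
            raise EtagConflict(key)  # someone else wrote first
        self._docs[key] = {"count": count, "etag": uuid.uuid4().hex}


def increment(store: InMemoryCounterStore, key, delta: int, max_retries: int = 5):
    """Add delta to a counter, retrying on conflict; returns (previous, new)."""
    for _ in range(max_retries):
        count, etag = store.read(key)
        try:
            store.replace(key, count + delta, if_match=etag)
            return count, count + delta
        except EtagConflict:
            continue  # re-read the latest value and try again
    raise RuntimeError("too many ETag conflicts")


store = InMemoryCounterStore()
thread_scope = ("u1", "t1")  # per (user_id, thread_id) counter
first = increment(store, thread_scope, 3)
second = increment(store, thread_scope, 2)
```

Returning both the previous and the new count lets the caller decide whether this batch of turns crossed a threshold.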

Threshold settings

Setting	Scope	Default
`THREAD_SUMMARY_EVERY_N`	Per `(user_id, thread_id)`	`0` (disabled)
`FACT_EXTRACTION_EVERY_N`	Per `(user_id, thread_id)`	`0` (disabled)
`USER_SUMMARY_EVERY_N`	Per `user_id` (across all threads)	`0` (disabled)

Set any value to 0 to disable that processing type. For example, setting THREAD_SUMMARY_EVERY_N=5 generates a thread summary every 5 new turns in each thread.
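
One way to express "crosses a threshold" when a change feed batch can contain several turns is to compare how many multiples of N lie at or below the old and new counts. This is a sketch with a hypothetical function name; whether the toolkit starts one orchestration or several when a single batch spans more than one multiple is not specified here.

```python
def crosses_threshold(previous_count: int, new_count: int, every_n: int) -> bool:
    """True when the counter passed at least one multiple of every_n."""
    if every_n <= 0:
        return False  # 0 disables this processing type
    return new_count // every_n > previous_count // every_n


# With THREAD_SUMMARY_EVERY_N=5:
fifth_turn = crosses_threshold(4, 5, 5)      # fifth turn arrives
sixth_turn = crosses_threshold(5, 6, 5)      # next turn, no new multiple
large_batch = crosses_threshold(3, 11, 5)    # one batch jumps past 5 and 10
disabled = crosses_threshold(4, 5, 0)        # setting is 0
```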

Required containers

Container	Partition Key	Purpose
`memories`	`/user_id`, `/thread_id` (hierarchical)	Durable derived memories (`fact`, `episodic`, `procedural`)
`memories_turns`	`/user_id`, `/thread_id` (hierarchical)	Raw conversation turns (`turn`) — append-only, TTL-pruned
`memories_summaries`	`/user_id`, `/thread_id` (hierarchical)	Thread + user summaries (`thread_summary`, `user_summary`)
`counter`	`/user_id`, `/thread_id` (hierarchical)	Message count tracking for automatic processing
`leases`	`/id`	Change feed checkpointing container created by `create_memory_store()`

Throughput configuration

The toolkit provisions all required Cosmos containers under one shared throughput mode:

serverless is the default. The toolkit creates the memories, memories_turns, memories_summaries, counter, and leases containers without specifying RU/s.
autoscale applies the shared COSMOS_DB_AUTOSCALE_MAX_RU cap to all five containers.

This keeps the change feed dependencies aligned with the main memory store instead of letting the Functions trigger create the lease container independently.

Push vs. pull

Mode	Trigger	Use case
On-demand (pull)	SDK call (`generate_thread_summary()`, etc.)	Explicit control over when processing happens
Automatic (push)	Change feed trigger	Fire-and-forget — processing happens in the background as turns are written

Both modes use the same Durable Functions orchestrator and activities, so prompts, incremental update logic, and stored outputs are identical.

Local vs. Cloud Storage

Backend	Use Case	Persistence
Local (in-memory)	Development and quick testing	Lost on process exit
Cosmos DB	Production, shared access, semantic search	Durable

Local storage is enough for CRUD testing. Cosmos DB is required for persistence, vector search, and the processing pipeline.

Cosmos DB uses a hierarchical partition key (user_id, thread_id) for efficient queries scoped to a user or thread.
