Concepts

Agent Memory Toolkit stores conversation data as memory documents and builds higher-level artifacts on top of those documents for retrieval and reuse.


Memory Schema

Every memory uses the same base shape:

| Field | Description |
| --- | --- |
| id | Unique identifier |
| user_id | User the memory belongs to |
| thread_id | Conversation thread |
| role | user, agent, tool, or system |
| type | turn, summary, fact, or user_summary |
| content | Main text payload |
| embedding | Vector used for semantic search |
| metadata | Extra context (e.g. tool_name, tool_call_id, edit reasons) |
| created_at | ISO 8601 timestamp |
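
As a rough illustration, the base shape could be modelled as a Python dataclass. The class name `MemoryDocument`, the defaults, and the example values are assumptions made for this sketch; they are not the toolkit's actual types.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional
import uuid


@dataclass
class MemoryDocument:
    """Hypothetical model of the base memory shape (not the toolkit's own class)."""

    user_id: str
    thread_id: str
    role: str      # "user", "agent", "tool", or "system"
    type: str      # "turn", "summary", "fact", or "user_summary"
    content: str
    embedding: Optional[list[float]] = None  # set once the content has been embedded
    metadata: dict[str, Any] = field(default_factory=dict)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


# Example: a tool turn carrying tool_name and tool_call_id in its metadata.
turn = MemoryDocument(
    user_id="u1",
    thread_id="t1",
    role="tool",
    type="turn",
    content="Flight search returned 3 results.",
    metadata={"tool_name": "flight_search", "tool_call_id": "call_123"},
)
```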

Memory Types

Turn

Type: turn

Turn memories are raw conversation records. They are created by add_local(), add_cosmos(), and push_to_cosmos() (which bulk-uploads local memories to Cosmos DB). They act as the source material for summaries and facts.

Use for: full conversation history and short-term context.

Summary

Type: summary

A summary compresses one thread into a compact record of the main topic, key decisions, open issues, and next steps. It is created by generate_thread_summary() and stored with deterministic ID summary_{user_id}_{thread_id}.

If a summary already exists, the pipeline reads it, loads only newer memories, and merges them into an updated summary instead of rebuilding from scratch.
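
The incremental update can be pictured with a small sketch. The ID format comes from the text above; the helper names and the rule "newer means a later created_at than the existing summary" are assumptions, since the exact cutoff the pipeline uses is not described here.

```python
def thread_summary_id(user_id: str, thread_id: str) -> str:
    # Deterministic ID format stated in this document.
    return f"summary_{user_id}_{thread_id}"


def select_new_memories(existing_summary, turns):
    """Return the turns that still need to be merged into the summary.

    Assumption: a turn is "newer" when its created_at is later than the
    existing summary's created_at.
    """
    if existing_summary is None:
        return list(turns)  # no summary yet, so every turn is new
    cutoff = existing_summary["created_at"]
    # ISO 8601 timestamps in one consistent UTC format sort correctly as strings.
    return [t for t in turns if t["created_at"] > cutoff]
```

Because the ID is deterministic, an upsert keyed on it replaces the previous summary for the thread rather than adding a second document.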

Use for: long conversations that need a compact recap.

Fact

Type: fact

Facts are discrete assertions extracted from a thread, such as preferences, requirements, or confirmed decisions. extract_facts() stores each fact as its own document with its own embedding.

Use for: fine-grained semantic retrieval across threads.

User Summary

Type: user_summary

A user summary is a cross-thread profile for one user. It captures stable context such as preferences, current work, environment details, and constraints. It is created by generate_user_summary() and stored with ID user_summary_{user_id} and thread_id="__user_summary__".

Like thread summaries, user summaries update incrementally by merging the existing profile with only the new memories.

Use for: cross-session personalization and onboarding context.


Short-Term vs. Long-Term Memory

| | Short-Term | Long-Term |
| --- | --- | --- |
| What | Turn messages | Summaries, facts, user summaries |
| Granularity | Per message | Per thread, per fact, or per user |
| Created by | add_local() / add_cosmos() / push_to_cosmos() | generate_thread_summary() / extract_facts() / generate_user_summary() |
| Purpose | Replay recent context | Compact recall and semantic retrieval |

Common pattern: keep turns during an active conversation, then generate summaries or facts when the thread gets long or is complete.


Threads and Roles

A thread is the unit of conversation. get_thread() returns the memories for one thread_id, optionally limited to the most recent k entries. get_memories() also supports a thread_id filter to retrieve memories from a specific thread.
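
The "most recent k" behaviour can be illustrated on plain dictionaries. `recent_thread_memories` is a hypothetical stand-in written for this sketch; it does not reproduce the real get_thread() signature, which is not shown in this document.

```python
from typing import Optional


def recent_thread_memories(memories, thread_id: str, k: Optional[int] = None):
    """Filter memories to one thread, oldest first, keeping only the last k."""
    in_thread = sorted(
        (m for m in memories if m["thread_id"] == thread_id),
        key=lambda m: m["created_at"],
    )
    if k is None:
        return in_thread
    # Guard k <= 0 explicitly: in_thread[-0:] would return the whole list.
    return in_thread[-k:] if k > 0 else []
```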

| Role | Meaning |
| --- | --- |
| user | Human message |
| agent | Assistant message |
| tool | Tool output (metadata can include tool_name and tool_call_id) |
| system | Generated artifacts such as summaries and facts |

Embeddings and Search

Memories stored in Cosmos DB include embeddings generated by Microsoft AI Foundry (e.g. text-embedding-3-large). This enables semantic search: a query is embedded, then Cosmos returns the closest matching memories via vector distance. search_cosmos also supports hybrid search (vector + full-text ranking via RRF).
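
Cosmos DB performs the rank fusion server-side, so the following is only an illustration of the idea behind RRF (reciprocal rank fusion): each document scores 1 / (k + rank) in every ranked list it appears in, and the scores are summed. The function name and the constant k = 60 (a commonly used default) are assumptions for this sketch.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["fact_3", "fact_1", "fact_7"]  # ordered by vector distance
text_hits = ["fact_1", "fact_9", "fact_3"]    # ordered by full-text rank
fused = reciprocal_rank_fusion([vector_hits, text_hits])
```

A document that ranks reasonably well in both lists (here fact_1) can overtake one that is first in only a single list.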

Facts work especially well for vector search because each fact is stored as a small, self-contained document.


Processing Pipeline

Derived memories are generated by an Azure Durable Functions pipeline:

read existing doc if present -> query source memories -> call LLM -> embed output -> upsert to Cosmos DB

Thread summaries and user summaries support:

  • deterministic IDs
  • incremental updates
  • recent_k limits to restrict how much history is processed

Prompts for summarization and fact extraction live in azure_functions/prompts/ and can be edited without changing pipeline code.


Memory Reconciliation

The reconcile_memories(user_id, n=50) pipeline step reads up to n of a user's most recent active facts and asks the LLM to identify two independent outcomes in one pass:

  • Duplicates — two or more facts that restate the same claim in different words. Resolution: collapse into one merged fact; the originals are soft-deleted with supersede_reason="duplicate" and superseded_by set to the merged fact's id.
  • Contradictions — two facts that assert opposing claims about the same subject. Resolution: keep the winner (more recent first, higher confidence as tiebreaker), soft-delete the loser with supersede_reason="contradict" and superseded_by set to the winner.
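
A minimal sketch of the contradiction rule, assuming facts are plain dictionaries. The `confidence` field name and the helper name are hypothetical; the supersede_* fields are the ones named in this section. In the real pipeline the LLM decides which facts contradict each other; this only shows how a winner could be picked and the loser soft-deleted.

```python
from datetime import datetime, timezone


def resolve_contradiction(fact_a: dict, fact_b: dict, now=None):
    """Keep the newer fact (higher confidence on a tie) and soft-delete the other."""

    def rank(fact):
        # Tuples compare element by element: recency first, then confidence.
        return (fact["created_at"], fact.get("confidence", 0.0))

    if rank(fact_a) >= rank(fact_b):
        winner, loser = fact_a, fact_b
    else:
        winner, loser = fact_b, fact_a

    superseded = dict(loser)  # copy so the caller's dictionary is not mutated
    superseded.update(
        supersede_reason="contradict",
        superseded_by=winner["id"],
        superseded_at=(now or datetime.now(timezone.utc)).isoformat(),
    )
    return winner, superseded
```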

Why one pass

Detecting duplicates and contradictions semantically requires the LLM to see the candidate pool as a whole. Paraphrased facts ("user prefers aisle seats") and contradictory facts ("user is vegetarian" vs "user loves steak") often have very different embedding vectors, so they would never co-occur in any cosine cluster. Putting all N candidates into one prompt lets the LLM reason across both axes simultaneously. The pipeline returns {"kept": int, "merged": int, "contradicted": int}.

Loser preservation

Soft-deleted facts stay in the container with their supersede_reason, superseded_at, and superseded_by fields populated. Default reads (get_memories, search_cosmos) filter them out via superseded_by IS NULL. To inspect the audit trail (e.g. "show everything that ever applied to this user"), opt out of the filter at the query level.
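
The default filter can be mimicked in memory as follows. The function and its `include_superseded` flag are hypothetical; the text above only states that the filter can be opted out of at the query level, not how.

```python
def active_only(facts, include_superseded: bool = False):
    """Drop soft-deleted facts unless the caller asks for the full audit trail."""
    if include_superseded:
        return list(facts)
    # A fact is active when superseded_by is missing or null.
    return [f for f in facts if f.get("superseded_by") is None]
```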

Write-time exact dedup

Each fact written by extract_memories carries a content_hash: the SHA-256 digest of the normalized content (lowercased, whitespace-collapsed), truncated to 32 hex characters. Before upserting a freshly extracted fact, the pipeline checks the hash against existing active facts and skips the write if a match exists, incrementing the exact_dedup_skipped metric. This catches identical re-extractions cheaply, without an LLM call.
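
The hashing step could look roughly like this. The function names are hypothetical, and the exact normalization (for example, whether leading and trailing whitespace is stripped) is an assumption based on the description above.

```python
import hashlib
import re


def content_hash(content: str) -> str:
    """SHA-256 of lowercased, whitespace-collapsed text, cut to 32 hex characters."""
    normalized = re.sub(r"\s+", " ", content.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:32]


def is_exact_duplicate(new_content: str, active_hashes: set) -> bool:
    """True when an active fact with identical normalized content already exists."""
    return content_hash(new_content) in active_hashes
```

Two extractions that differ only in casing or spacing hash to the same value, so the second one can be skipped before any LLM call is made.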

Tunable

DEDUP_EVERY_N (default 5) controls how often reconcile_memories runs in the auto-trigger path. Set to 0 to disable. The candidate cap n (default 50) is tunable per call; larger values give the LLM a wider view at higher token cost.

Indexing note. The reconcile pool query orders by created_at (matching the prompt's "more recent first" rule). Cosmos's default indexing policy indexes every property, so this works out of the box. If you customize the indexing policy to reduce write RU, ensure /created_at/? remains indexed or the query will fail with a 400 (Order-by over a non-indexed path).
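
For example, an opt-in indexing policy that excludes everything by default would need to list /created_at/? explicitly. The policy below is a sketch written as a Python dictionary in the Cosmos DB indexing-policy shape; apart from /created_at/?, the included paths are guesses about which properties the toolkit filters on, and any vector index settings are omitted.

```python
# Sketch of a trimmed-down indexing policy (assumed paths, adjust to your queries).
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [
        {"path": "/created_at/?"},     # required by the reconcile ORDER BY
        {"path": "/user_id/?"},        # partition key paths are not indexed
        {"path": "/thread_id/?"},      # automatically once the root is excluded
        {"path": "/type/?"},           # assumed filter property
        {"path": "/superseded_by/?"},  # assumed filter property
    ],
    "excludedPaths": [
        {"path": "/*"},                # exclude everything not listed above
    ],
}
```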


Automatic Processing (Change Feed)

In addition to on-demand processing via the SDK, the toolkit includes a Cosmos DB change feed trigger that automatically starts processing orchestrations when enough new turns have been written.

memories container
      │  (change feed)
      ▼
on_memory_change trigger
      │
      ├── count turns per (user_id, thread_id)
      │   └── crosses threshold? ──► start thread_summary / extract_facts
      │
      └── count turns per user_id
          └── crosses threshold? ──► start user_summary

How it works

  1. The change feed trigger watches the memories container for new documents.
  2. Only documents with type == "turn" are counted (summaries, facts, and user summaries are ignored).
  3. Documents in the dedicated counter container track how many turns have been seen per scope using ETag-based optimistic concurrency.
  4. When a counter crosses a configured threshold, the corresponding Durable Functions orchestration is started automatically.
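
The counting steps above can be sketched with an in-memory stand-in for the counter container. Everything here is hypothetical: the class and function names, the integer ETag, and the interpretation that a threshold is "crossed" each time the running count passes a multiple of N.

```python
class ETagMismatch(Exception):
    """Raised when a counter changed between the read and the write."""


class InMemoryCounterStore:
    """Toy stand-in for the counter container, keyed by scope."""

    def __init__(self):
        self._docs = {}  # scope -> (count, etag)

    def read(self, scope):
        return self._docs.get(scope, (0, 0))

    def replace(self, scope, count, if_match):
        _, current_etag = self.read(scope)
        if current_etag != if_match:
            raise ETagMismatch(scope)
        self._docs[scope] = (count, current_etag + 1)


def record_turns(store, scope, new_turns, every_n, max_retries=5):
    """Add new_turns to the counter; return True if a multiple of every_n was crossed."""
    if every_n <= 0:
        return False  # 0 disables this processing type
    for _ in range(max_retries):
        count, etag = store.read(scope)
        updated = count + new_turns
        try:
            store.replace(scope, updated, if_match=etag)
        except ETagMismatch:
            continue  # another writer won the race: re-read and retry
        return updated // every_n > count // every_n
    raise RuntimeError("counter update kept conflicting")
```

Comparing the integer quotients before and after the update also handles batches: if one change feed batch moves a counter from 4 to 7 with every_n=5, the threshold is still detected once.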

Threshold settings

| Setting | Scope | Default |
| --- | --- | --- |
| THREAD_SUMMARY_EVERY_N | Per (user_id, thread_id) | 0 (disabled) |
| FACT_EXTRACTION_EVERY_N | Per (user_id, thread_id) | 0 (disabled) |
| USER_SUMMARY_EVERY_N | Per user_id (across all threads) | 0 (disabled) |

Set any value to 0 to disable that processing type. For example, setting THREAD_SUMMARY_EVERY_N=5 generates a thread summary every 5 new turns in each thread.

Required containers

| Container | Partition Key | Purpose |
| --- | --- | --- |
| memories | /user_id, /thread_id (hierarchical) | Durable derived memories (fact, episodic, procedural) |
| memories_turns | /user_id, /thread_id (hierarchical) | Raw conversation turns (turn); append-only, TTL-pruned |
| memories_summaries | /user_id, /thread_id (hierarchical) | Thread + user summaries (thread_summary, user_summary) |
| counter | /user_id, /thread_id (hierarchical) | Message count tracking for automatic processing |
| leases | /id | Change feed checkpointing container created by create_memory_store() |

Throughput configuration

The toolkit provisions all required Cosmos containers under one shared throughput mode:

  • serverless is the default. The toolkit creates the memories, memories_turns, memories_summaries, counter, and leases containers without specifying RU/s.
  • autoscale applies the shared COSMOS_DB_AUTOSCALE_MAX_RU cap to all five containers.

This keeps the change feed dependencies aligned with the main memory store instead of letting the Functions trigger create the lease container independently.

Push vs. pull

| Mode | Trigger | Use case |
| --- | --- | --- |
| On-demand (pull) | SDK call (generate_thread_summary(), etc.) | Explicit control over when processing happens |
| Automatic (push) | Change feed trigger | Fire-and-forget: processing happens in the background as turns are written |

Both modes use the same Durable Functions orchestrator and activities, so prompts, incremental update logic, and stored outputs are identical.


Local vs. Cloud Storage

| Backend | Use Case | Persistence |
| --- | --- | --- |
| Local (in-memory) | Development and quick testing | Lost on process exit |
| Cosmos DB | Production, shared access, semantic search | Durable |

Local storage is enough for CRUD testing. Cosmos DB is required for persistence, vector search, and the processing pipeline.

Cosmos DB uses a hierarchical partition key (user_id, thread_id) for efficient queries scoped to a user or thread.