diff --git a/python/samples/agentchat_behavioral_monitor/README.md b/python/samples/agentchat_behavioral_monitor/README.md
Lines changed: 40 additions & 25 deletions
@@ -1,16 +1,19 @@
 # agentchat_behavioral_monitor
 
-Detect vocabulary and topic drift in long AutoGen conversations.
+Detect vocabulary drift across repeated AgentChat runs.
 
-When an AutoGen conversation runs long enough to trigger history summarization
-or truncation, earlier task context can vanish without any visible signal. The
-agent keeps responding, but its answers may silently ignore facts it established
-in earlier turns — different tool choices, forgotten constraints, missing domain
-vocabulary.
+When a long-running agent shifts away from earlier task vocabulary, the failure
+often shows up first as a change in outputs rather than an explicit error. This
+sample shows how to watch for that drift on the public AgentChat surface.
 
-This sample shows how to detect that drift using **Ghost Consistency Score
-(CCS)**: the fraction of vocabulary from the earliest turns still present in
-the most recent turns. A score below 0.40 indicates likely behavioral drift.
+The demo is deterministic: it uses `ReplayChatCompletionClient` together with a
+real `AssistantAgent`, then monitors the resulting `TaskResult.messages`
+history. In production, replace the replay model with a real model client and
+keep the same monitor.
+
+This sample detects drift using **Ghost Consistency Score (CCS)**: the fraction
+of vocabulary from the earliest runs still present in the most recent runs. A
+score below 0.40 indicates likely behavioral drift.
 
 ## How it works
 
@@ -20,7 +23,8 @@ Current window   = last 25% of conversation turns
 CCS              = |vocab(baseline) ∩ vocab(current)| / |vocab(baseline)|
 ```
 
-A "ghost term" is a task-relevant word (jwt, bcrypt, foreign_key, redis, etc.)
+A "ghost term" is a task-relevant word (`jwt`, `bcrypt`, `foreign_key`,
+`redis`, etc.)
 that appeared in the baseline window but has disappeared from the current
 window. Ghost terms are the most direct signal of forgotten context.
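
Reviewer note: the CCS formula and the ghost-term definition above can be sketched in a few lines of Python. This is a minimal illustration, not the sample's actual `main.py`. The function names `vocab` and `ccs`, the tokenizer regex, the `LEXICON` set, the choice to restrict vocabulary to lexicon terms, and the 1.0 returned for an empty baseline are all assumptions. The baseline window definition is not visible in this hunk, so the sketch takes both windows as explicit arguments.

```python
import re

# Hypothetical lexicon for illustration; the sample ships its own built-in list.
LEXICON = {"jwt", "bcrypt", "foreign_key", "redis"}


def vocab(messages, lexicon):
    """Return the lexicon terms that occur in any of the given messages."""
    tokens = set()
    for text in messages:
        # Assumed tokenizer: lowercase runs of letters, digits, and underscores.
        tokens.update(re.findall(r"[a-z0-9_]+", text.lower()))
    return tokens & set(lexicon)


def ccs(baseline_msgs, current_msgs, lexicon):
    """Compute |vocab(baseline) ∩ vocab(current)| / |vocab(baseline)| and the ghost terms."""
    base = vocab(baseline_msgs, lexicon)
    if not base:
        # Assumption: with no baseline vocabulary there is nothing to lose.
        return 1.0, []
    cur = vocab(current_msgs, lexicon)
    ghosts = sorted(base - cur)  # in the baseline window, missing from the current one
    return len(base & cur) / len(base), ghosts


baseline = [
    "Use jwt and bcrypt for auth; cache sessions in redis.",
    "Add a foreign_key from sessions to users.",
]
current = [
    "Keep jwt auth intact for the profile endpoint.",
    "Add endpoint rate limiting.",
]

score, ghosts = ccs(baseline, current, LEXICON)
print(score, ghosts)
```

With these inputs only `jwt` survives into the current window, so CCS is 1/4 = 0.25, which falls below the 0.40 threshold, and the three missing terms are reported as ghost terms.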
 
@@ -33,38 +37,49 @@ python main.py
 Expected output:
 
 ```
-=== AutoGen Behavioral Monitor demo ===
-Turn          : 8
-CCS           : 0.333  (1.0 = no drift, <0.4 = significant drift)
-Ghost terms   : ['bcrypt', 'foreign_key', 'redis', 'jwt']
-Unretrieved   : True
-Drift detected: True
+=== AutoGen AgentChat behavioral monitor demo ===
+
+Turn 1
+CCS: 1.0
+Ghost terms: []
+Drift detected: False
 
-Interpretation: the agent has drifted from its early task vocabulary.
-If ghost terms include task-critical facts, consider re-injecting context
-or prompting the agent to recall those items from its memory store.
+Turn 3
+CCS: 0.25
+Ghost terms: ['bcrypt', 'foreign_key', 'jwt', 'redis']
+Drift detected: True
 ```
 
 ## Integrating into your agent loop
 
 ```python
+from autogen_agentchat.agents import AssistantAgent
+from autogen_ext.models.replay import ReplayChatCompletionClient
 from main import BehavioralMonitor
 
 monitor = BehavioralMonitor(
     ccs_threshold=0.40,
-    min_messages=6,
+    min_messages=3,
 )
 
 history = []
+agent = AssistantAgent(
+    "assistant",
+    model_client=ReplayChatCompletionClient([
+        "Use jwt and bcrypt for auth.",
+        "Keep jwt auth intact for the profile endpoint.",
+        "Add endpoint rate limiting.",
+    ]),
+)
 
 # Check after each public AgentChat run
-task_result = await assistant_agent.run(task="Use jwt and bcrypt for auth")
+task_result = await agent.run(task="Use jwt and bcrypt for auth", output_task_messages=False)
 result = monitor.observe_result(history, task_result)
 if result["drift_detected"]:
     print("Drift at turn", result["turn"], "ghost:", result["ghost_terms"])
 
 # Later runs keep extending the same external history
-task_result = await assistant_agent.run(task="Now add a profile endpoint")
+task_result = await agent.run(task="Now add a profile endpoint", output_task_messages=False)
 result = monitor.observe_result(history, task_result)
 ```
 
@@ -73,15 +88,15 @@ result = monitor.observe_result(history, task_result)
 | Parameter | Default | Description |
 |---|---|---|
 | `ccs_threshold` | 0.40 | Flag drift when CCS drops below this value |
-| `min_messages` | 6 | Minimum conversation length before checks run |
+| `min_messages` | 3 | Minimum number of tracked AgentChat results before checks run |
 | `ghost_lexicon` | built-in list | Domain terms to watch for disappearance |
 
 ## Connection to AutoGen issue #7265
 
 This sample addresses the production reliability pattern discussed in
 https://github.com/microsoft/autogen/issues/7265 — specifically the
-ghost-lexicon + behavioral footprint pattern for detecting when long-running
-agents silently lose task context to compression.
+ghost-lexicon pattern for detecting when long-running agent outputs silently
+drift away from earlier task vocabulary.
 
 ## Related