
Commit caf83e4

rolandpg and claude authored
feat(telemetry): RFC-007 operational telemetry — TelemetryCollector, MemoryManager integration, aggregator, sampler, dashboard (#85)
* feat(telemetry): commit US-001 TelemetryCollector + tests (squash from ralph/rfc-007-operational-telemetry)

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(telemetry): integrate TelemetryCollector into MemoryManager (RFC-007 / US-002)

  Adds automatic per-query telemetry capture to MemoryManager.recall() and .synthesize() with zero agent code changes. Telemetry activates when ZETTELFORGE_LOG_LEVEL=DEBUG.

  Key changes:
  - MemoryManager.__init__ wires in TelemetryCollector singleton + correlation slots (_telemetry_query_id, _telemetry_retrieved_notes)
  - recall() gains actor= kwarg, wraps retriever/graph_retriever with timing, calls start_query + log_recall with vector/graph latency breakdown
  - synthesize() gains actor= kwarg, reuses recall's query_id (or starts fresh), calls log_synthesis + auto_feedback_from_synthesis
  - OCSF events extended with telemetry_query_id and telemetry_actor fields
  - 6 integration tests verify telemetry capture paths

  Mypy: 3 new errors (float→int latency_ms) fixed, 23 pre-existing errors remain.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(telemetry): add telemetry_aggregator daily report script (RFC-007 / US-003)

  CLI tool that reads per-day telemetry JSONL and produces actionable operational metrics: total queries, synthesis count, latency averages, confidence, tier distribution, feedback stats, top utility notes, unused notes count.

  Includes 5 unit tests covering: missing day handling, recall+synthesis aggregation, tier distribution merging, feedback utility calculation, and unused notes detection.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(telemetry): add human evaluation rubric and sampling script (RFC-007 / US-004)

  6-question rubric covering recall relevance, synthesis value, critical gaps, unsupported claims, latency perception, and overall trust (1-5 scale).

  scripts/human_eval_sampler.py selects 20 random synthesis briefings from telemetry JSONL files and formats them as a structured Markdown template for Roland's monthly review. Includes scoring summary table and human_eval event entry schema.

  9 unit tests verify telemetry reading, briefing formatting, rubric template, and main() behavior across edge cases.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(telemetry): Streamlit telemetry dashboard (RFC-007 / US-005)

  Closes out RFC-007. Optional visualization layer over the telemetry JSONL: query volume, latency p50/p95, tier distribution, utility trend, unused notes warning. Runs locally at :8501.

  Design notes:
  - pandas is a module-load dep (needed to build DataFrames feeding Streamlit charts). streamlit is imported LAZILY inside render() so the pure compute helpers (daily_volume, latency_percentiles, tier_distribution, utility_trend, unused_notes, load_events, to_dataframe) stay testable without Streamlit installed.
  - No database dependency: reads the same ~/.amem/telemetry/*.jsonl files the aggregator and sampler use.
  - unused_notes surfaces retrieval quality issues: a note that shows up in recall results but never earns utility >= 4 in feedback is a likely false-positive candidate. Threshold mirrors the auto_feedback_from_synthesis cited=4/uncited=2 contract from US-001.
  - Run: streamlit run src/zettelforge/scripts/telemetry_dashboard.py (optionally ZF_TELEMETRY_DIR=/path/to/data).

  Tests: 16 unit tests over the pure compute layer. They cover load_events (missing dir, multi-day, corrupt-line tolerance), daily_volume (per-day per-type counts, excludes feedback, empty-df handling), latency percentiles (p50/p95/max, missing event type), tier_distribution (summation across events, missing-column defense), utility_trend (daily mean, empty-feedback case), unused_notes (retrieved-but-not-cited detection, all-cited case, empty-df case). Skips gracefully if pandas is missing. ruff clean.

  Dashboard is documented as optional; tests run under CI environments without streamlit installed.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: RFC-007 design doc + telemetry section in troubleshoot guide

  - docs/rfcs/RFC-007-operational-telemetry.md: full design doc for the operational telemetry system shipped in US-001 through US-005. Includes the four DD-1..DD-4 design decisions (caller-opt-in query_id correlation, narrow-scope latency instrumentation, OCSF unmapped extension, hybrid __new__-bypass integration tests) resolved by subagent review before implementation.
  - docs/how-to/troubleshoot.md: adds a "Logs and diagnostics" telemetry subsection so operators know:
    1. telemetry JSONL lives at ~/.amem/telemetry/telemetry_YYYY-MM-DD.jsonl (parallel to OCSF log at ~/.amem/zettelforge.log)
    2. aggregator CLI: python -m zettelforge.scripts.telemetry_aggregator
    3. sampler CLI: python -m zettelforge.scripts.human_eval_sampler
    4. optional dashboard: streamlit run telemetry_dashboard.py
    5. privacy contract: raw note content never persisted, query text truncated at 200/500 chars by mode, local-only.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(lint): address ruff findings in US-003/US-004 telemetry scripts

  Auto-fixes + one manual rename (l → ln to avoid E741 ambiguous single-letter variable). Applies ruff --fix clean across all 10 files in the PR. Local CI ruff was satisfied earlier but the repo's CI runs a stricter ruleset on PR; this reconciles them.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(format): ruff format pass across RFC-007 telemetry files

  CI's format step (ruff format --check) is stricter than the lint step and was failing on 8 of the 10 PR files. Ran ruff format across all modified sources; tests still pass (57/57). No logic changes.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 157ec5a commit caf83e4

13 files changed

Lines changed: 2610 additions & 3 deletions

docs/how-to/troubleshoot.md

Lines changed: 24 additions & 2 deletions
@@ -157,8 +157,9 @@ data directory (never to stdout by design — see GOV-012). Typical
 locations:
 
 ```bash
-tail -f ~/.amem/zettelforge.log
-tail -f ~/.amem/audit.log
+tail -f ~/.amem/zettelforge.log # OCSF structured events (API activity, auth, file I/O)
+tail -f ~/.amem/audit.log # Security-relevant events only (GOV-012)
+tail -f ~/.amem/telemetry/telemetry_$(date +%F).jsonl # Operational telemetry (RFC-007)
 ```
 
 Useful log events to grep:
@@ -173,6 +174,27 @@ Useful log events to grep:
 
 Set `logging.level: DEBUG` in `config.yaml` for verbose output.
 
+### Operational telemetry (RFC-007)
+
+Every `MemoryManager.recall()` and `.synthesize()` call also emits a
+per-query event to `~/.amem/telemetry/telemetry_YYYY-MM-DD.jsonl`
+(parallel to the main OCSF log). In INFO mode this is aggregated
+counts plus latency; at DEBUG level it adds per-note metadata, tier
+distribution, vector/graph latency breakdown, and citation-based
+utility feedback.
+
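As a quick check that events are being written, the per-day file described above can be tallied by event type. This is a minimal sketch, not the project's aggregator: it assumes each line is a JSON object with an `event_type` field (the field name is borrowed from the `human_eval` entry schema and is an assumption for recall and synthesis events), and the helper names `todays_telemetry_path` and `count_event_types` are hypothetical.

```python
import json
from collections import Counter
from datetime import date
from pathlib import Path


def todays_telemetry_path(base: Path = Path.home() / ".amem" / "telemetry") -> Path:
    """Build the path of today's telemetry file (telemetry_YYYY-MM-DD.jsonl)."""
    return base / f"telemetry_{date.today().isoformat()}.jsonl"


def count_event_types(path: Path) -> Counter:
    """Count events per `event_type` in one telemetry JSONL file.

    Assumption: every event is a JSON object carrying an `event_type` key.
    A missing file yields an empty Counter; blank or corrupt lines are skipped.
    """
    counts: Counter = Counter()
    if not path.exists():
        return counts
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a partially written or corrupt line
            if isinstance(event, dict):
                counts[event.get("event_type", "unknown")] += 1
    return counts
```

Calling `count_event_types(todays_telemetry_path())` returns an empty `Counter` when no telemetry has been written yet, which is itself a useful troubleshooting signal.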
+Tooling:
+
+| Script | Purpose |
+|--------|---------|
+| `python -m zettelforge.scripts.telemetry_aggregator --date YYYY-MM-DD` | Daily summary report (latency averages, tier distribution, unused notes, top utility notes) |
+| `python -m zettelforge.scripts.human_eval_sampler` | Sample 20 random synthesis briefings for the monthly human evaluation rubric (see `docs/human-evaluation-rubric.md`) |
+| `streamlit run src/zettelforge/scripts/telemetry_dashboard.py` | Optional visualization (query volume, latency p50/p95, tier/utility trends, unused notes warning) |
+
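The "unused notes" figure in the table follows the rule stated in the commit message: a note that appears in recall results but never earns utility >= 4 in feedback. A plain-Python sketch of that rule is below. The function name and the input shapes (an iterable of note IDs and an iterable of `(note_id, utility)` pairs) are assumptions for illustration; the real dashboard helper works on pandas DataFrames and its signature is not shown in this commit page.

```python
from typing import Iterable, Set, Tuple


def find_unused_notes(
    retrieved_ids: Iterable[str],
    feedback: Iterable[Tuple[str, int]],
    threshold: int = 4,
) -> Set[str]:
    """Return note IDs that were retrieved but never scored utility >= threshold.

    The default threshold of 4 mirrors the cited=4 / uncited=2 contract
    described in the commit message.
    """
    useful = {note_id for note_id, utility in feedback if utility >= threshold}
    return set(retrieved_ids) - useful
```

A note with no feedback at all is also reported, since it was retrieved and never reached the threshold.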
+Raw note content is never persisted in telemetry, only IDs, tiers,
+source types, and domains. Query text is truncated to 200 chars at INFO
+and 500 at DEBUG. All data stays local.
+
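The truncation rule in the paragraph above can be pictured as follows. This is a sketch under stated assumptions, not the shipped implementation: the names `QUERY_TEXT_LIMITS` and `truncate_query` are hypothetical, and the fallback to the stricter INFO limit for unrecognized levels is an assumption. Only the 200/500 limits come from the text.

```python
# Character limits per log level, as stated in the troubleshoot guide.
QUERY_TEXT_LIMITS = {"INFO": 200, "DEBUG": 500}


def truncate_query(text: str, level: str = "INFO") -> str:
    """Cut query text to the limit for the given log level.

    Assumption: unknown levels fall back to the stricter INFO limit.
    """
    limit = QUERY_TEXT_LIMITS.get(level.upper(), QUERY_TEXT_LIMITS["INFO"])
    return text[:limit]
```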
 ## Related
 
 - [Configuration Reference](../reference/configuration.md)

docs/human-evaluation-rubric.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
+# Human Evaluation Rubric for ZettelForge Briefings
+
+**Purpose:** Structured monthly review process for random agent briefings to qualitatively assess ZettelForge's impact on CTI analysis quality.
+
+**Frequency:** Monthly. Use `scripts/human_eval_sampler.py` to select 20 random briefings from the past month's telemetry.
+
+**Scale:** 1 (poor) to 5 (excellent) for each criterion.
+
+---
+
+## Evaluation Criteria
+
+### 1. Recall Relevance (1-5)
+
+Did the recall step surface relevant, high-quality notes?
+
+- **5**: All retrieved notes are directly relevant to the query; no noise
+- **4**: Most notes are relevant; 1-2 tangential results
+- **3**: Mix of relevant and somewhat tangential notes
+- **2**: Few relevant notes; mostly noise
+- **1**: No useful notes retrieved
+
+**What to check:**
+- Are the retrieved notes about the right threat actors, TTPs, infrastructure?
+- Does tier assignment make sense (tier A notes should be most relevant)?
+- Are there obvious missed sources (e.g., missing MITRE techniques)?
+
+### 2. Synthesis Value (1-5)
+
+Was the synthesized briefing useful and actionable for CTI analysis?
+
+- **5**: Briefing is clear, actionable, and well-structured
+- **4**: Briefing is useful but could be tighter
+- **3**: Briefing covers basics but lacks depth
+- **2**: Briefing is superficial or poorly organized
+- **1**: Briefing is misleading or unusable
+
+**What to check:**
+- Is the narrative coherent and logically structured?
+- Does it answer the original query?
+- Are key findings highlighted appropriately?
+
+### 3. Critical Notes Missing (1-5)
+
+Were any important notes not retrieved?
+
+- **5**: No critical gaps; all known important notes were present
+- **4**: Minor gaps only
+- **3**: Some notable notes missing but not game-changing
+- **2**: Several important notes missing
+- **1**: Key evidence completely absent
+
+**What to check:**
+- Based on your knowledge of the topic, what should have been found?
+- Would additional recall rounds have helped?
+- Was the gap due to retrieval or due to no source notes existing?
+
+### 4. Unsupported Claims (1-5)
+
+Did the synthesis make claims not backed by the retrieved notes?
+
+- **5**: All claims are directly supported by cited notes
+- **4**: One minor unsupported inference
+- **3**: Some claims stretch beyond source material
+- **2**: Several unsupported claims; hallucination likely
+- **1**: Briefing contains fabricated or misleading claims
+
+**What to check:**
+- Cross-reference each key claim against its cited note
+- Look for "hallucination patterns": overly specific details not in source, contradictory attributions
+- Check for conflation of different sources
+
+### 5. Latency Perception (1-5)
+
+Was the response time acceptable given the depth of analysis?
+
+- **5**: Response was instant; no waiting
+- **4**: Short wait (< 10s); acceptable for depth
+- **3**: Moderate wait (10-30s); noticeable but tolerable
+- **2**: Long wait (30-60s); frustrating for simple queries
+- **1**: Very long wait (> 60s); feels broken
+
+**What to check:**
+- Is the latency reasonable for the query complexity?
+- Where did most time go (recall, graph, synthesis)?
+- Would parallelization or caching help?
+
+### 6. Overall Trust (1-5)
+
+Would you trust this briefing for operational CTI analysis?
+
+- **5**: Would use verbatim in an operational report
+- **4**: Would use with minor edits
+- **3**: Would use as a draft requiring significant verification
+- **2**: Would not trust without manual verification
+- **1**: Cannot trust; would discard
+
+**What to check:**
+- Overall quality of the complete pipeline output
+- Confidence vs. actual quality correlation
+- Would you stake your reputation on this briefing?
+
+---
+
+## Scoring Summary
+
+| Metric | Formula | Interpretation |
+|--------|---------|---------------|
+| **Quality Score** | avg(1-4) | >4.0 = excellent, >3.0 = usable, <3.0 = needs work |
+| **Trust Rate** | count(6 >= 4) / N | % of briefings you'd trust operationally |
+| **Hallucination Rate** | count(4 = 1 or 2) / N | % with unsupported claims |
+| **Gap Rate** | count(3 <= 2) / N | % with critical missing notes |
+
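The formulas in the table refer to criteria by number, with N the number of briefings reviewed. One way to compute them from a list of score dictionaries is sketched below. The key names match the `scores` object in the entry schema of this rubric. Reading `avg(1-4)` as "mean of criteria 1 through 4, then averaged across briefings" is an interpretation, and the function name `summarize` is hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Criteria 1-4 feed the Quality Score (interpretation of "avg(1-4)").
QUALITY_KEYS = (
    "recall_relevance",
    "synthesis_value",
    "critical_notes_missing",
    "unsupported_claims",
)


def summarize(evals: List[Dict[str, int]]) -> Dict[str, float]:
    """Compute the four summary metrics over N reviewed briefings."""
    n = len(evals)
    if n == 0:
        raise ValueError("no evaluations to summarize")
    return {
        # Mean of criteria 1-4 per briefing, then averaged across briefings.
        "quality_score": mean(mean(s[k] for k in QUALITY_KEYS) for s in evals),
        # Criterion 6 scored 4 or 5.
        "trust_rate": sum(s["overall_trust"] >= 4 for s in evals) / n,
        # Criterion 4 scored 1 or 2.
        "hallucination_rate": sum(s["unsupported_claims"] <= 2 for s in evals) / n,
        # Criterion 3 scored 1 or 2.
        "gap_rate": sum(s["critical_notes_missing"] <= 2 for s in evals) / n,
    }
```

The rates come back as fractions between 0 and 1; multiply by 100 to get the percentages named in the Interpretation column.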
+---
+
+## Human Evaluation Entry Schema
+
+When Roland completes the review, append the evaluation to `~/.amem/telemetry/human_eval.jsonl`:
+
+```json
+{
+  "event_type": "human_eval",
+  "evaluated_at": "2026-04-23T14:30:00",
+  "source_query": "APT28 infrastructure in Eastern Europe",
+  "source_ts": "2026-04-22T08:15:00",
+  "scores": {
+    "recall_relevance": 4,
+    "synthesis_value": 5,
+    "critical_notes_missing": 4,
+    "unsupported_claims": 5,
+    "latency_perception": 3,
+    "overall_trust": 5
+  },
+  "notes": "Excellent briefing. Minor latency issue but content was spot-on."
+}
+```
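Appending an entry in this shape can be scripted. The helper below is a hedged sketch rather than part of the commit: the function name `append_human_eval`, the validation rules, and the use of local time without a timezone for `evaluated_at` (chosen to match the example timestamp format) are all assumptions. The field names come from the schema above.

```python
import json
from datetime import datetime
from pathlib import Path

CRITERIA = (
    "recall_relevance",
    "synthesis_value",
    "critical_notes_missing",
    "unsupported_claims",
    "latency_perception",
    "overall_trust",
)


def append_human_eval(path, source_query, source_ts, scores, notes=""):
    """Validate one evaluation and append it as a single JSONL line."""
    if set(scores) != set(CRITERIA):
        raise ValueError(f"scores must contain exactly: {', '.join(CRITERIA)}")
    for name, value in scores.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    entry = {
        "event_type": "human_eval",
        # Assumption: naive local time, second precision, like the example.
        "evaluated_at": datetime.now().isoformat(timespec="seconds"),
        "source_query": source_query,
        "source_ts": source_ts,
        "scores": {name: scores[name] for name in CRITERIA},
        "notes": notes,
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

To write to the location named in the rubric, pass `Path.home() / ".amem" / "telemetry" / "human_eval.jsonl"` as `path`.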
