Commit 6cc2a5b

feat(claude-plugin): record guideline usage per session in audit.log (#239)
* feat(claude-plugin): record guideline usage per session in audit.log

  Adds the reverse provenance direction: which sessions used a given guideline. Complements the existing source-trajectory stamp on each entity.

  - recall/retrieve_entities.py now appends one `recall` event per UserPromptSubmit to .evolve/audit.log, listing the served entity slugs and the session_id derived from transcript_path. Failures are swallowed so logging cannot break the user-visible recall path.
  - learn/SKILL.md gains a Step 4 that reads audit.log, reconstructs the set of guidelines served to this session, and emits per-entity verdicts (followed | contradicted | not_applicable) with a short evidence line.
  - A new log_influence.py script validates and writes those verdicts back into audit.log.
  - The e2e test asserts both event types land correctly after session 2.

* style(tests): apply ruff format to sandbox learn+recall test

  Fixes failing CI check: check-formatting

* fix(learn): guard payload and assessment types in log_influence.py

  Validate that the JSON payload is a mapping before calling payload.get(), and skip non-dict assessment items instead of letting a.get() raise AttributeError. Logs each skipped item for traceability.

  Addresses CodeRabbit review finding: Guard payload and assessment item types before calling `.get()`

* fix(learn): resolve recalled slugs across subscribed entity trees

  Step 4 previously hard-coded .evolve/entities/guideline/<slug>.md, which misses entities served from subscribed repositories at .evolve/entities/subscribed/<source>/guideline/<slug>.md. Switch the instructions to a recursive search under .evolve/entities/ so the influence assessment can reach the actual file wherever it lives.

  Addresses CodeRabbit review finding: Resolve recalled slugs beyond only `.evolve/entities/guideline/`

* test(learn): add unit tests for log_influence.py

  Covers happy path (single/multiple assessments, default evidence), per-item skip paths (invalid verdict, missing entity, non-dict item), and top-level input validation (non-dict payload, missing session_id, non-list assessments, invalid JSON). Complements the e2e sandbox test with fast, deterministic coverage of the helper's input-validation surface.

  Addresses CodeRabbit review finding: Add unit tests for log_influence.py

* fix(claude-plugin): qualify recall audit ids across local/subscribed trees

  The recall audit log stored bare filename stems, so the same slug from a local entity and a subscribed repo collapsed into one entry and the influence step couldn't tell which guideline actually fired.

  Switch the stored id to the path relative to .evolve/entities/ (without .md): "guideline/foo" for local entries, "subscribed/alice/guideline/foo" for subscribed ones. The id is unambiguous, names exactly one file, and lets learn open it directly (no recursive search).

  SKILL.md Step 4 is simplified accordingly: no more find / multi-tree resolver; just Read .evolve/entities/<id>.md. The e2e test matches session 1's guidelines against the new qualified ids, and the existing log_influence unit tests pass unchanged.

  Addresses review feedback from visahak

* fix(learn): renumber influence step to Step 5 after merge

  Merging public/main brought in #243's SKILL.md restructure, which inserted a "Review Existing Guidelines" step and shifted Save Entities to Step 4, colliding with the "Step 4: Assess Influence" section this branch added. Rename the influence section to Step 5 and update its sub-steps to reference Step 4 (save) and derive session_id from the saved_trajectory_path variable (the post-#243 name) instead of the removed transcript_path.

* test(learn): switch log_influence tests to temp_project_dir fixture

  Align with the rest of tests/platform_integrations/ which use the temp_project_dir fixture (a thin wrapper over tmp_path that creates an isolated test_project/ subdir). All 11 tests still pass.

  Addresses CodeRabbit review finding: Replace the use of the tmp_path fixture with the temp_project_dir fixture

* test(learn): assert no audit writes on invalid JSON input

  Mirrors the "no audit side effect" assertion present in the other reject-path tests so we catch any regression where invalid JSON would sneak a partial write into audit.log.

  Addresses CodeRabbit review finding: The test_rejects_invalid_json case is missing the "no audit side effect" assertion
1 parent d8b9ac0 commit 6cc2a5b

5 files changed

Lines changed: 370 additions & 1 deletion

platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md

Lines changed: 32 additions & 0 deletions
@@ -116,6 +116,38 @@ The script will:
 - Deduplicate against existing entities
 - Display confirmation with the total count
 
+### Step 5: Assess Influence of Recalled Entities
+
+Regardless of whether Step 4 saved new entities, judge whether the guidelines the recall hook served to *this* session were actually followed, contradicted, or simply irrelevant. This closes the provenance loop: the recall hook records *what* was served; this step records *what effect* it had.
+
+1. Derive this session's `session_id` from the `saved_trajectory_path` extracted in Step 0: strip the directory prefix and the `claude-transcript_` / `.jsonl` affixes. For `.evolve/trajectories/claude-transcript_abc-123.jsonl` the `session_id` is `abc-123`.
+
+2. Read `.evolve/audit.log` (JSONL, one object per line). Find every line where `event == "recall"` and `session_id` matches. Take the union of their `entities` arrays — that is the set of guideline identifiers served to this session. Each identifier is a relative path from `.evolve/entities/` without the `.md` suffix (e.g. `guideline/foo` for a local entity, or `subscribed/alice/guideline/foo` for a subscribed one), so it unambiguously names one file. If the set is empty, skip this step.
+
+3. For each identifier, open `.evolve/entities/<id>.md` with the Read tool. Read its content + trigger — that is the guideline's intent. Skip the identifier (log it as an assessment-less entry) if the file is not found.
+
+4. Compare against the transcript loaded in Step 0. For each identifier, pick one verdict:
+   - `followed` — the agent's actual actions are consistent with the guideline's recommendation.
+   - `contradicted` — the guideline's trigger matched the task but the agent did the opposite, or hit the dead end the guideline would have prevented.
+   - `not_applicable` — the guideline's trigger didn't match what this session was about.
+
+   Keep `evidence` to one short sentence citing a specific action or tool call from the transcript.
+
+5. Emit one JSON payload and pipe it to the helper:
+
+   ```bash
+   echo '{
+     "session_id": "<session-id>",
+     "assessments": [
+       {"entity": "guideline/<slug>", "verdict": "followed", "evidence": "Agent imported struct and parsed APP1 directly"}
+     ]
+   }' | python3 ${CLAUDE_PLUGIN_ROOT}/skills/learn/scripts/log_influence.py
+   ```
+
+   The `entity` value must match exactly what appeared in the recall event — include the `subscribed/<source>/` prefix if the entity came from a subscribed repo.
+
+   Emit zero assessments (empty `assessments` list) when no recall events exist for this session.
+
 ## Quality Gate
 
 Before saving, review each entity against this checklist:
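Sub-steps 1 and 2 of Step 5 are instructions to the agent rather than code, but they can be sketched in Python to make the intended behaviour concrete. This is a minimal illustration, not part of the commit: the function names `session_id_from_trajectory` and `recalled_entity_ids` are hypothetical, and the record fields (`event`, `session_id`, `entities`) are assumed from the keyword arguments passed to `audit.append` and from the e2e test.

```python
import json
from pathlib import Path

# Prefix named in Step 5, sub-step 1 (the `.jsonl` suffix is removed by `.stem`).
TRAJECTORY_PREFIX = "claude-transcript_"


def session_id_from_trajectory(saved_trajectory_path: str) -> str:
    """Strip the directory, the `.jsonl` suffix, and the transcript prefix."""
    return Path(saved_trajectory_path).stem.removeprefix(TRAJECTORY_PREFIX)


def recalled_entity_ids(audit_log_text: str, session_id: str) -> set[str]:
    """Union of the `entities` arrays over every recall event for one session."""
    served: set[str] = set()
    for line in audit_log_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        event = json.loads(line)
        if event.get("event") == "recall" and event.get("session_id") == session_id:
            served.update(event.get("entities", []))
    return served
```

An empty result corresponds to the "skip this step" case in sub-step 2. `str.removeprefix` needs Python 3.9 or later, which the e2e test in this commit already relies on.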
log_influence.py (new file; SKILL.md Step 5 invokes it as ${CLAUDE_PLUGIN_ROOT}/skills/learn/scripts/log_influence.py)

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
+#!/usr/bin/env python3
+"""Append post-hoc influence assessments to .evolve/audit.log.
+
+Reads JSON from stdin of the form:
+    {
+      "session_id": "<transcript stem>",
+      "assessments": [
+        {"entity": "<slug>", "verdict": "followed|contradicted|not_applicable",
+         "evidence": "<short justification>"},
+        ...
+      ]
+    }
+"""
+
+import json
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent.parent / "lib"))
+from entity_io import get_evolve_dir, log as _log  # noqa: E402
+import audit  # noqa: E402
+
+
+_ALLOWED_VERDICTS = {"followed", "contradicted", "not_applicable"}
+
+
+def log(message):
+    _log("influence", message)
+
+
+def main():
+    try:
+        payload = json.load(sys.stdin)
+    except json.JSONDecodeError as exc:
+        log(f"Invalid JSON input: {exc}")
+        print(f"Error: invalid JSON input - {exc}", file=sys.stderr)
+        sys.exit(1)
+
+    if not isinstance(payload, dict):
+        log(f"Bad payload type: {type(payload).__name__}")
+        print("Error: payload must be a JSON object.", file=sys.stderr)
+        sys.exit(1)
+
+    session_id = payload.get("session_id")
+    assessments = payload.get("assessments", [])
+    if not session_id or not isinstance(assessments, list):
+        log(f"Bad payload shape: session_id={session_id!r} assessments_type={type(assessments).__name__}")
+        print("Error: payload must include `session_id` and a list `assessments`.", file=sys.stderr)
+        sys.exit(1)
+
+    project_root = str(get_evolve_dir().resolve().parent)
+
+    written = 0
+    for a in assessments:
+        if not isinstance(a, dict):
+            log(f"Skipping non-dict assessment item: {a!r}")
+            continue
+        entity = a.get("entity")
+        verdict = a.get("verdict")
+        evidence = a.get("evidence", "")
+        if not entity or verdict not in _ALLOWED_VERDICTS:
+            log(f"Skipping invalid assessment: {a}")
+            continue
+        audit.append(
+            project_root=project_root,
+            event="influence",
+            session_id=session_id,
+            entity=entity,
+            verdict=verdict,
+            evidence=evidence,
+        )
+        written += 1
+
+    log(f"Wrote {written} influence record(s) for session {session_id}")
+    print(f"Recorded {written} influence assessment(s).")
+
+
+if __name__ == "__main__":
+    main()
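The validation rules in `main()` fall into two tiers: top-level problems reject the whole payload, while a bad individual assessment is skipped and the rest are still written. The sketch below restates those rules as a pure function so the two tiers are easy to see and test in isolation. It is an illustration only: `select_valid_assessments` is a hypothetical name, it raises `ValueError` where the script calls `sys.exit(1)`, and it returns tuples instead of calling `audit.append` (whose implementation is not shown in this diff).

```python
ALLOWED_VERDICTS = {"followed", "contradicted", "not_applicable"}


def select_valid_assessments(payload):
    """Return the (entity, verdict, evidence) tuples the script would record.

    Raises ValueError in the cases where the script exits with status 1.
    """
    # Tier 1: reject the whole payload.
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    session_id = payload.get("session_id")
    assessments = payload.get("assessments", [])
    if not session_id or not isinstance(assessments, list):
        raise ValueError("payload must include `session_id` and a list `assessments`")

    # Tier 2: skip individual bad items, keep the rest.
    kept = []
    for item in assessments:
        if not isinstance(item, dict):
            continue
        entity = item.get("entity")
        verdict = item.get("verdict")
        if not entity or verdict not in ALLOWED_VERDICTS:
            continue
        kept.append((entity, verdict, item.get("evidence", "")))
    return kept
```

Note that a missing `assessments` key defaults to an empty list, so a payload with only a `session_id` is accepted and records nothing, which matches the "emit zero assessments" case in SKILL.md.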

platform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.py

Lines changed: 27 additions & 1 deletion
@@ -8,7 +8,8 @@
 
 # Add lib to path so we can import entity_io
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent.parent / "lib"))
-from entity_io import find_recall_entity_dirs, markdown_to_entity, log as _log
+from entity_io import find_recall_entity_dirs, get_evolve_dir, markdown_to_entity, log as _log
+import audit
 
 
 def log(message):
@@ -82,6 +83,13 @@ def load_entities_with_source(entities_dir):
         entity = markdown_to_entity(md)
         if not entity.get("content"):
             continue
+        # Record the on-disk path relative to entities_dir (without the
+        # .md suffix) as a qualified identifier. This distinguishes
+        # same-named entities in different trees — e.g.
+        # "guideline/foo" (local) vs "subscribed/alice/guideline/foo"
+        # (from a subscribed repo) — so downstream auditing doesn't
+        # collapse them into one.
+        entity["_id"] = str(md.relative_to(entities_dir).with_suffix(""))
         # Detect subscribed entities by path: .../entities/subscribed/{name}/...
         parts = md.parts
         try:
@@ -129,6 +137,24 @@ def main():
     print(output)
     log(f"Output {len(output)} chars to stdout")
 
+    # Audit: record which entities were served to which session. Must not
+    # fail the hook if logging errors — recall is the user-visible path.
+    try:
+        transcript_path = input_data.get("transcript_path", "")
+        session_id = Path(transcript_path).stem if transcript_path else None
+        entity_ids = sorted({e["_id"] for e in entities if e.get("_id")})
+        if session_id and entity_ids:
+            project_root = get_evolve_dir().resolve().parent
+            audit.append(
+                project_root=str(project_root),
+                event="recall",
+                session_id=session_id,
+                entities=entity_ids,
+            )
+            log(f"Audit: recall session_id={session_id} entities={len(entity_ids)}")
+    except Exception as exc:
+        log(f"Audit append failed (non-fatal): {exc}")
+
 
 
 if __name__ == "__main__":
     main()
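The qualified identifier introduced in `load_entities_with_source` is the entity's path relative to the entities directory with the `.md` suffix removed. A small sketch shows why the bare stem was ambiguous and the qualified id is not. The helper name `qualified_id` and the sample paths are illustrative assumptions; `PurePosixPath` is used here only so the result is the same on every OS, whereas the commit applies the same `relative_to(...).with_suffix("")` call to concrete `Path` objects.

```python
from pathlib import PurePosixPath


def qualified_id(md_path: PurePosixPath, entities_dir: PurePosixPath) -> str:
    """Path relative to entities_dir, without the .md suffix."""
    return str(md_path.relative_to(entities_dir).with_suffix(""))


entities_dir = PurePosixPath(".evolve/entities")
local = entities_dir / "guideline" / "foo.md"
subscribed = entities_dir / "subscribed" / "alice" / "guideline" / "foo.md"

# The bare stems collide, which is the problem the commit message describes.
assert local.stem == subscribed.stem == "foo"

# The qualified ids stay distinct and each maps back to exactly one file.
assert qualified_id(local, entities_dir) == "guideline/foo"
assert qualified_id(subscribed, entities_dir) == "subscribed/alice/guideline/foo"
```

Because the id is just a relative path, the learn step can reopen the file as `.evolve/entities/<id>.md` without any search.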

tests/e2e/test_sandbox_learn_recall.py

Lines changed: 33 additions & 0 deletions
@@ -164,3 +164,36 @@ def test_learn_then_recall_flow(sandbox_ready, sandbox_workspace):
     # pip-installed). Other libraries (PIL, piexif, exifread) may appear in a
     # valid guideline as "install via pip and use", so we don't ban them.
     assert not re.search(r"\bexiftool\b", joined), "session 2 invoked exiftool despite recall guideline:\n" + "\n".join(commands)
+
+    # --- Usage provenance: audit.log should record recall + influence ---
+    audit_log = sandbox_workspace / ".evolve" / "audit.log"
+    assert audit_log.is_file(), f"{audit_log} was not created — recall did not append audit events"
+
+    events = []
+    for line in audit_log.read_text().splitlines():
+        line = line.strip()
+        if not line:
+            continue
+        events.append(json.loads(line))
+
+    session2_id = session2_transcript.stem.removeprefix("claude-transcript_")
+    # Recall audit records qualified ids — path relative to .evolve/entities/
+    # without the .md suffix — so we match session 1's entities the same way.
+    session1_ids = {str(p.relative_to(entities_dir).with_suffix("")) for p in entity_files}
+
+    recall_events = [e for e in events if e.get("event") == "recall" and e.get("session_id") == session2_id]
+    assert recall_events, f"no recall audit event for session 2 ({session2_id}). all events: {events}"
+    recalled_ids = {eid for e in recall_events for eid in e.get("entities", [])}
+    assert recalled_ids & session1_ids, f"recall event entities {recalled_ids} did not include any id from session 1 ({session1_ids})"
+    log.info(f"session 2: audit recorded recall of {recalled_ids}")
+
+    influence_events = [e for e in events if e.get("event") == "influence" and e.get("session_id") == session2_id]
+    assert influence_events, (
+        f"no influence audit event for session 2 ({session2_id}). recall events exist but learn did not emit assessments."
+    )
+    for ie in influence_events:
+        assert ie.get("verdict") in {"followed", "contradicted", "not_applicable"}, f"influence event has invalid verdict: {ie}"
+    log.info(
+        f"session 2: audit recorded {len(influence_events)} influence assessment(s): "
+        f"{[(e['entity'], e['verdict']) for e in influence_events]}"
+    )
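The commit message frames audit.log as the reverse provenance direction: given a guideline, which sessions used it. The commit itself does not ship a query tool, so the following is only a sketch of how such a lookup might be written over the two event types the test asserts on. The function name `usage_by_guideline` is hypothetical, and the record fields are assumed to be exactly the keyword arguments passed to `audit.append`; real records may carry additional fields, which this code ignores.

```python
import json
from collections import Counter, defaultdict


def usage_by_guideline(audit_lines):
    """Return (sessions, verdicts) keyed by qualified entity id.

    sessions[entity_id] -> sorted list of session ids that were served it
    verdicts[entity_id] -> {verdict: count} from influence events
    """
    sessions = defaultdict(set)
    verdicts = defaultdict(Counter)
    for raw in audit_lines:
        raw = raw.strip()
        if not raw:
            continue
        event = json.loads(raw)
        session_id = event.get("session_id")
        if not session_id:
            continue  # both event types are expected to carry a session id
        if event.get("event") == "recall":
            for entity_id in event.get("entities", []):
                sessions[entity_id].add(session_id)
        elif event.get("event") == "influence":
            verdicts[event.get("entity")][event.get("verdict")] += 1
    return (
        {eid: sorted(ids) for eid, ids in sessions.items()},
        {eid: dict(counts) for eid, counts in verdicts.items()},
    )
```

Keying both maps by the qualified id means a local `guideline/foo` and a subscribed `subscribed/alice/guideline/foo` are tallied separately.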
