feat(claude-plugin): record guideline usage per session in audit.log (#239)

vinodmut · web-flow · commit 6cc2a5b6d050 · 2026-05-01T22:07:56.000-04:00
* feat(claude-plugin): record guideline usage per session in audit.log

Adds the reverse provenance direction: which sessions used a given guideline. Complements the existing source-trajectory stamp on each entity.

- recall/retrieve_entities.py now appends one `recall` event per UserPromptSubmit to .evolve/audit.log, listing the served entity slugs and the session_id derived from transcript_path. Failures are swallowed so logging cannot break the user-visible recall path.
- learn/SKILL.md gains a Step 4 that reads audit.log, reconstructs the set of guidelines served to this session, and emits per-entity verdicts (followed | contradicted | not_applicable) with a short evidence line.
- A new log_influence.py script validates and writes those verdicts back into audit.log.
- The e2e test asserts both event types land correctly after session 2.

* style(tests): apply ruff format to sandbox learn+recall test

Fixes failing CI check: check-formatting

* fix(learn): guard payload and assessment types in log_influence.py

Validate that the JSON payload is a mapping before calling payload.get(), and skip non-dict assessment items instead of letting a.get() raise AttributeError. Logs each skipped item for traceability.

Addresses CodeRabbit review finding: Guard payload and assessment item types before calling `.get()`

* fix(learn): resolve recalled slugs across subscribed entity trees

Step 4 previously hard-coded .evolve/entities/guideline/<slug>.md, which misses entities served from subscribed repositories at .evolve/entities/subscribed/<source>/guideline/<slug>.md. Switch the instructions to a recursive search under .evolve/entities/ so the influence assessment can reach the actual file wherever it lives.
Addresses CodeRabbit review finding: Resolve recalled slugs beyond only `.evolve/entities/guideline/`

* test(learn): add unit tests for log_influence.py

Covers happy path (single/multiple assessments, default evidence), per-item skip paths (invalid verdict, missing entity, non-dict item), and top-level input validation (non-dict payload, missing session_id, non-list assessments, invalid JSON). Complements the e2e sandbox test with fast, deterministic coverage of the helper's input-validation surface.

Addresses CodeRabbit review finding: Add unit tests for log_influence.py

* fix(claude-plugin): qualify recall audit ids across local/subscribed trees

The recall audit log stored bare filename stems, so the same slug from a local entity and a subscribed repo collapsed into one entry and the influence step couldn't tell which guideline actually fired.

Switch the stored id to the path relative to .evolve/entities/ (without .md): "guideline/foo" for local entries, "subscribed/alice/guideline/foo" for subscribed ones. The id is unambiguous, names exactly one file, and lets learn open it directly (no recursive search).

SKILL.md Step 4 is simplified accordingly: no more find / multi-tree resolver, just Read .evolve/entities/<id>.md. The e2e test matches session 1's guidelines against the new qualified ids, and the existing log_influence unit tests pass unchanged.

Addresses review feedback from visahak

* fix(learn): renumber influence step to Step 5 after merge

Merging public/main brought in #243's SKILL.md restructure, which inserted a "Review Existing Guidelines" step and shifted Save Entities to Step 4, colliding with the "Step 4: Assess Influence" section this branch added. Rename the influence section to Step 5 and update its sub-steps to reference Step 4 (save) and derive session_id from the saved_trajectory_path variable (the post-#243 name) instead of the removed transcript_path.
* test(learn): switch log_influence tests to temp_project_dir fixture

Align with the rest of tests/platform_integrations/, which use the temp_project_dir fixture (a thin wrapper over tmp_path that creates an isolated test_project/ subdir). All 11 tests still pass.

Addresses CodeRabbit review finding: Replace the use of the tmp_path fixture with the temp_project_dir fixture

* test(learn): assert no audit writes on invalid JSON input

Mirrors the "no audit side effect" assertion present in the other reject-path tests so we catch any regression where invalid JSON would sneak a partial write into audit.log.

Addresses CodeRabbit review finding: The test_rejects_invalid_json case is missing the "no audit side effect" assertion
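
The reverse-provenance question from the commit message ("which sessions used a given guideline") can be answered by folding over `audit.log`. The following is a minimal sketch, not code from this PR. It assumes each log line is a JSON object carrying the `event`, `session_id`, `entities`, `entity`, and `verdict` fields that appear in the diff, and it ignores any other fields `audit.append` may write. The function name `sessions_by_guideline` and the sample lines are hypothetical.

```python
import json
from collections import defaultdict


def sessions_by_guideline(lines):
    """Map each guideline id to the sessions it was served to, with verdicts.

    Returns {entity_id: {session_id: verdict_or_None}}. None means the
    guideline was recalled but no influence verdict was recorded.
    """
    usage = defaultdict(dict)
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        event = json.loads(raw)
        session_id = event.get("session_id")
        if event.get("event") == "recall":
            for entity_id in event.get("entities", []):
                # Keep an existing verdict if influence was seen first.
                usage[entity_id].setdefault(session_id, None)
        elif event.get("event") == "influence":
            usage[event["entity"]][session_id] = event.get("verdict")
    return {entity_id: dict(sessions) for entity_id, sessions in usage.items()}


# Hypothetical audit.log contents, for illustration only.
sample_log = [
    '{"event": "recall", "session_id": "abc-123", '
    '"entities": ["guideline/foo", "subscribed/alice/guideline/foo"]}',
    '{"event": "influence", "session_id": "abc-123", '
    '"entity": "guideline/foo", "verdict": "followed", "evidence": "..."}',
]

usage = sessions_by_guideline(sample_log)
print(usage["guideline/foo"])                   # {'abc-123': 'followed'}
print(usage["subscribed/alice/guideline/foo"])  # {'abc-123': None}
```

Because the two ids are stored in qualified form, the local and subscribed `foo` guidelines stay separate keys in the result.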
diff --git a/platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md b/platform-integrations/claude/plugins/evolve-lite/skills/learn/SKILL.md
@@ -116,6 +116,38 @@ The script will:
 - Deduplicate against existing entities
 - Display confirmation with the total count
 
+### Step 5: Assess Influence of Recalled Entities
+
+Regardless of whether Step 4 saved new entities, judge whether the guidelines the recall hook served to *this* session were actually followed, contradicted, or simply irrelevant. This closes the provenance loop: the recall hook records *what* was served; this step records *what effect* it had.
+
+1. Derive this session's `session_id` from the `saved_trajectory_path` extracted in Step 0: strip the directory prefix and the `claude-transcript_` / `.jsonl` affixes. For `.evolve/trajectories/claude-transcript_abc-123.jsonl` the `session_id` is `abc-123`.
+
+2. Read `.evolve/audit.log` (JSONL, one object per line). Find every line where `event == "recall"` and `session_id` matches. Take the union of their `entities` arrays — that is the set of guideline identifiers served to this session. Each identifier is a relative path from `.evolve/entities/` without the `.md` suffix (e.g. `guideline/foo` for a local entity, or `subscribed/alice/guideline/foo` for a subscribed one), so it unambiguously names one file. If the set is empty, skip this step.
+
+3. For each identifier, open `.evolve/entities/<id>.md` with the Read tool. Read its content and trigger; together they express the guideline's intent. If the file is not found, skip the identifier (log it as an assessment-less entry).
+
+4. Compare against the transcript loaded in Step 0. For each identifier, pick one verdict:
+   - `followed` — the agent's actual actions are consistent with the guideline's recommendation.
+   - `contradicted` — the guideline's trigger matched the task but the agent did the opposite, or hit the dead end the guideline would have prevented.
+   - `not_applicable` — the guideline's trigger didn't match what this session was about.
+
+   Keep `evidence` to one short sentence citing a specific action or tool call from the transcript.
+
+5. Emit one JSON payload and pipe it to the helper:
+
+```bash
+echo '{
+  "session_id": "<session-id>",
+  "assessments": [
+    {"entity": "guideline/<slug>", "verdict": "followed", "evidence": "Agent imported struct and parsed APP1 directly"}
+  ]
+}' | python3 "${CLAUDE_PLUGIN_ROOT}/skills/learn/scripts/log_influence.py"
+```
+
+The `entity` value must match exactly what appeared in the recall event — include the `subscribed/<source>/` prefix if the entity came from a subscribed repo.
+
+Emit zero assessments (empty `assessments` list) when no recall events exist for this session.
+
 ## Quality Gate
 
 Before saving, review each entity against this checklist:
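
Sub-steps 1 and 2 of the Step 5 instructions above are carried out by the agent rather than by a script, but the logic can be sketched in Python for clarity. This is an illustrative sketch under stated assumptions: the helper names `session_id_from_trajectory` and `recalled_ids` are hypothetical, the log lines are invented samples, and a real `audit.log` may carry extra fields.

```python
import json
from pathlib import Path


def session_id_from_trajectory(saved_trajectory_path):
    """Strip the directory, the `claude-transcript_` prefix, and `.jsonl`."""
    name = Path(saved_trajectory_path).name
    return name.removeprefix("claude-transcript_").removesuffix(".jsonl")


def recalled_ids(audit_lines, session_id):
    """Union of `entities` across this session's `recall` events."""
    ids = set()
    for raw in audit_lines:
        if not raw.strip():
            continue
        event = json.loads(raw)
        if event.get("event") == "recall" and event.get("session_id") == session_id:
            ids.update(event.get("entities", []))
    return sorted(ids)


sid = session_id_from_trajectory(".evolve/trajectories/claude-transcript_abc-123.jsonl")
print(sid)  # abc-123

# Hypothetical log lines: two recall events for this session, one for another.
lines = [
    '{"event": "recall", "session_id": "abc-123", "entities": ["guideline/foo"]}',
    '{"event": "recall", "session_id": "abc-123", "entities": ["guideline/foo", "guideline/bar"]}',
    '{"event": "recall", "session_id": "other", "entities": ["guideline/baz"]}',
]
print(recalled_ids(lines, sid))  # ['guideline/bar', 'guideline/foo']
```

An empty result from `recalled_ids` corresponds to the case where no guidelines were served to the session.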
diff --git a/platform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.py b/platform-integrations/claude/plugins/evolve-lite/skills/learn/scripts/log_influence.py
@@ -0,0 +1,79 @@
+#!/usr/bin/env python3
+"""Append post-hoc influence assessments to .evolve/audit.log.
+
+Reads JSON from stdin of the form:
+  {
+    "session_id": "<transcript stem>",
+    "assessments": [
+      {"entity": "<id, e.g. guideline/foo>", "verdict": "followed|contradicted|not_applicable",
+       "evidence": "<short justification>"},
+      ...
+    ]
+  }
+"""
+
+import json
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent.parent / "lib"))
+from entity_io import get_evolve_dir, log as _log  # noqa: E402
+import audit  # noqa: E402
+
+
+_ALLOWED_VERDICTS = {"followed", "contradicted", "not_applicable"}
+
+
+def log(message):
+    _log("influence", message)
+
+
+def main():
+    try:
+        payload = json.load(sys.stdin)
+    except json.JSONDecodeError as exc:
+        log(f"Invalid JSON input: {exc}")
+        print(f"Error: invalid JSON input - {exc}", file=sys.stderr)
+        sys.exit(1)
+
+    if not isinstance(payload, dict):
+        log(f"Bad payload type: {type(payload).__name__}")
+        print("Error: payload must be a JSON object.", file=sys.stderr)
+        sys.exit(1)
+
+    session_id = payload.get("session_id")
+    assessments = payload.get("assessments", [])
+    if not session_id or not isinstance(assessments, list):
+        log(f"Bad payload shape: session_id={session_id!r} assessments_type={type(assessments).__name__}")
+        print("Error: payload must include `session_id` and a list `assessments`.", file=sys.stderr)
+        sys.exit(1)
+
+    project_root = str(get_evolve_dir().resolve().parent)
+
+    written = 0
+    for a in assessments:
+        if not isinstance(a, dict):
+            log(f"Skipping non-dict assessment item: {a!r}")
+            continue
+        entity = a.get("entity")
+        verdict = a.get("verdict")
+        evidence = a.get("evidence", "")
+        if not entity or not isinstance(verdict, str) or verdict not in _ALLOWED_VERDICTS:
+            log(f"Skipping invalid assessment: {a}")
+            continue
+        audit.append(
+            project_root=project_root,
+            event="influence",
+            session_id=session_id,
+            entity=entity,
+            verdict=verdict,
+            evidence=evidence,
+        )
+        written += 1
+
+    log(f"Wrote {written} influence record(s) for session {session_id}")
+    print(f"Recorded {written} influence assessment(s).")
+
+
+if __name__ == "__main__":
+    main()
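
The per-item validation in `log_influence.py` can be exercised in isolation. The sketch below restates those checks as a pure function so the skip paths are easy to see; it is not part of the PR and does not call `audit.append`. The name `partition_assessments` is hypothetical. The `isinstance` check is there because a set-membership test on an unhashable verdict (for example a list) raises `TypeError`.

```python
ALLOWED_VERDICTS = {"followed", "contradicted", "not_applicable"}


def partition_assessments(assessments):
    """Split items into (accepted, skipped) using the per-item rules."""
    accepted, skipped = [], []
    for item in assessments:
        if not isinstance(item, dict):
            skipped.append(item)  # e.g. a bare string or number
            continue
        entity = item.get("entity")
        verdict = item.get("verdict")
        # Guard the type first: `[...] in a_set` would raise TypeError.
        if not entity or not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
            skipped.append(item)
            continue
        accepted.append(
            {"entity": entity, "verdict": verdict, "evidence": item.get("evidence", "")}
        )
    return accepted, skipped


accepted, skipped = partition_assessments(
    [
        {"entity": "guideline/foo", "verdict": "followed"},      # valid
        {"entity": "guideline/bar", "verdict": "ignored"},       # unknown verdict
        {"verdict": "followed"},                                 # missing entity
        {"entity": "guideline/baz", "verdict": ["followed"]},    # unhashable verdict
        "not-a-dict",                                            # wrong item type
    ]
)
print(len(accepted), len(skipped))  # 1 4
print(accepted[0])  # {'entity': 'guideline/foo', 'verdict': 'followed', 'evidence': ''}
```

Invalid items are dropped one by one rather than failing the whole payload, so a single malformed assessment does not discard the valid ones next to it.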
diff --git a/platform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.py b/platform-integrations/claude/plugins/evolve-lite/skills/recall/scripts/retrieve_entities.py
@@ -8,7 +8,8 @@
 
 # Add lib to path so we can import entity_io
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent.parent / "lib"))
-from entity_io import find_recall_entity_dirs, markdown_to_entity, log as _log
+from entity_io import find_recall_entity_dirs, get_evolve_dir, markdown_to_entity, log as _log
+import audit
 
 
 def log(message):
@@ -82,6 +83,13 @@ def load_entities_with_source(entities_dir):
             entity = markdown_to_entity(md)
             if not entity.get("content"):
                 continue
+            # Record the on-disk path relative to entities_dir (without the
+            # .md suffix) as a qualified identifier. This distinguishes
+            # same-named entities in different trees — e.g.
+            # "guideline/foo" (local) vs "subscribed/alice/guideline/foo"
+            # (from a subscribed repo) — so downstream auditing doesn't
+            # collapse them into one.
+            entity["_id"] = str(md.relative_to(entities_dir).with_suffix(""))
             # Detect subscribed entities by path: .../entities/subscribed/{name}/...
             parts = md.parts
             try:
@@ -129,6 +137,24 @@ def main():
     print(output)
     log(f"Output {len(output)} chars to stdout")
 
+    # Audit: record which entities were served to which session. Must not
+    # fail the hook if logging errors — recall is the user-visible path.
+    try:
+        transcript_path = input_data.get("transcript_path", "")
+        session_id = Path(transcript_path).stem if transcript_path else None
+        entity_ids = sorted({e["_id"] for e in entities if e.get("_id")})
+        if session_id and entity_ids:
+            project_root = get_evolve_dir().resolve().parent
+            audit.append(
+                project_root=str(project_root),
+                event="recall",
+                session_id=session_id,
+                entities=entity_ids,
+            )
+            log(f"Audit: recall session_id={session_id} entities={len(entity_ids)}")
+    except Exception as exc:
+        log(f"Audit append failed (non-fatal): {exc}")
+
 
 if __name__ == "__main__":
     main()
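
The reason for storing a qualified `_id` rather than the bare filename stem can be shown with two same-named files in different trees. This is a small sketch built in a temporary directory, with hypothetical helper names `bare_id` and `qualified_id`. One deliberate difference from the diff: the sketch uses `as_posix()` so the id always has forward slashes, whereas the hook stores `str(...)` of the relative path.

```python
import tempfile
from pathlib import Path


def bare_id(md):
    """The previous scheme: filename stem only."""
    return md.stem


def qualified_id(md, entities_dir):
    """Path relative to the entities dir, without the .md suffix."""
    return md.relative_to(entities_dir).with_suffix("").as_posix()


with tempfile.TemporaryDirectory() as tmp:
    entities_dir = Path(tmp) / ".evolve" / "entities"
    local = entities_dir / "guideline" / "foo.md"
    subscribed = entities_dir / "subscribed" / "alice" / "guideline" / "foo.md"
    for md in (local, subscribed):
        md.parent.mkdir(parents=True, exist_ok=True)
        md.write_text("example guideline\n", encoding="utf-8")

    files = sorted(entities_dir.rglob("*.md"))
    bare = {bare_id(p) for p in files}
    qualified = sorted(qualified_id(p, entities_dir) for p in files)

print(bare)       # {'foo'}
print(qualified)  # ['guideline/foo', 'subscribed/alice/guideline/foo']
```

Two files collapse to one bare id but keep two distinct qualified ids, and each qualified id maps back to exactly one path as `entities_dir / f"{id}.md"`.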
diff --git a/tests/e2e/test_sandbox_learn_recall.py b/tests/e2e/test_sandbox_learn_recall.py
@@ -164,3 +164,36 @@ def test_learn_then_recall_flow(sandbox_ready, sandbox_workspace):
     # pip-installed). Other libraries (PIL, piexif, exifread) may appear in a
     # valid guideline as "install via pip and use", so we don't ban them.
     assert not re.search(r"\bexiftool\b", joined), "session 2 invoked exiftool despite recall guideline:\n" + "\n".join(commands)
+
+    # --- Usage provenance: audit.log should record recall + influence ---
+    audit_log = sandbox_workspace / ".evolve" / "audit.log"
+    assert audit_log.is_file(), f"{audit_log} was not created — recall did not append audit events"
+
+    events = []
+    for line in audit_log.read_text().splitlines():
+        line = line.strip()
+        if not line:
+            continue
+        events.append(json.loads(line))
+
+    session2_id = session2_transcript.stem.removeprefix("claude-transcript_")
+    # Recall audit records qualified ids — path relative to .evolve/entities/
+    # without the .md suffix — so we match session 1's entities the same way.
+    session1_ids = {str(p.relative_to(entities_dir).with_suffix("")) for p in entity_files}
+
+    recall_events = [e for e in events if e.get("event") == "recall" and e.get("session_id") == session2_id]
+    assert recall_events, f"no recall audit event for session 2 ({session2_id}). all events: {events}"
+    recalled_ids = {eid for e in recall_events for eid in e.get("entities", [])}
+    assert recalled_ids & session1_ids, f"recall event entities {recalled_ids} did not include any id from session 1 ({session1_ids})"
+    log.info(f"session 2: audit recorded recall of {recalled_ids}")
+
+    influence_events = [e for e in events if e.get("event") == "influence" and e.get("session_id") == session2_id]
+    assert influence_events, (
+        f"no influence audit event for session 2 ({session2_id}). recall events exist but learn did not emit assessments."
+    )
+    for ie in influence_events:
+        assert ie.get("verdict") in {"followed", "contradicted", "not_applicable"}, f"influence event has invalid verdict: {ie}"
+    log.info(
+        f"session 2: audit recorded {len(influence_events)} influence assessment(s): "
+        f"{[(e['entity'], e['verdict']) for e in influence_events]}"
+    )
diff --git a/tests/platform_integrations/test_log_influence.py b/tests/platform_integrations/test_log_influence.py