Skip to content

Commit 78c1378

Browse files
committed
Validate committed base analysis SHA
1 parent 880710d commit 78c1378

5 files changed

Lines changed: 117 additions & 14 deletions

File tree

action.yml

Lines changed: 13 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -352,15 +352,24 @@ runs:
352352
working-directory: target-repo
353353
env:
354354
BASE_SHA: ${{ steps.guard.outputs.base_sha }}
355+
ACTION_PATH: ${{ github.action_path }}
355356
run: |
356357
BASE_DIR="${RUNNER_TEMP}/cb-base"
357358
HEAD_DIR="${RUNNER_TEMP}/cb-head"
358359
mkdir -p "$BASE_DIR" "$HEAD_DIR"
359360
echo "base_dir=$BASE_DIR" >> $GITHUB_OUTPUT
360361
echo "head_dir=$HEAD_DIR" >> $GITHUB_OUTPUT
361362
if git show "${BASE_SHA}:.codeboarding/analysis.json" > "${BASE_DIR}/analysis.json" 2>/dev/null; then
362-
echo "committed=true" >> $GITHUB_OUTPUT
363-
echo "Using committed .codeboarding/analysis.json at ${BASE_SHA}."
363+
if python3 "$ACTION_PATH/scripts/cb_engine.py" validate-base \
364+
--analysis "${BASE_DIR}/analysis.json" \
365+
--expected-sha "$BASE_SHA"; then
366+
echo "committed=true" >> $GITHUB_OUTPUT
367+
echo "Using committed .codeboarding/analysis.json at ${BASE_SHA}."
368+
else
369+
rm -f "${BASE_DIR}/analysis.json"
370+
echo "committed=false" >> $GITHUB_OUTPUT
371+
echo "Committed baseline at ${BASE_SHA} is stale; will generate a fresh base analysis."
372+
fi
364373
else
365374
rm -f "${BASE_DIR}/analysis.json"
366375
echo "committed=false" >> $GITHUB_OUTPUT
@@ -373,7 +382,7 @@ runs:
373382
uses: actions/cache/restore@v4
374383
with:
375384
path: ${{ runner.temp }}/cb-base
376-
key: cb-base-${{ runner.os }}-${{ steps.guard.outputs.base_sha }}-d${{ inputs.depth_level }}-${{ inputs.engine_ref }}-${{ inputs.llm_provider }}-${{ inputs.agent_model }}-${{ inputs.parsing_model }}
385+
key: cb-base-v2-${{ runner.os }}-${{ steps.guard.outputs.base_sha }}-d${{ inputs.depth_level }}-${{ inputs.engine_ref }}-${{ inputs.llm_provider }}-${{ inputs.agent_model }}-${{ inputs.parsing_model }}
377386

378387
# A committed analysis.json gives the head analysis stable component ids,
379388
# but the engine's incremental path ALSO needs the base static_analysis.pkl
@@ -478,7 +487,7 @@ runs:
478487
uses: actions/cache/save@v4
479488
with:
480489
path: ${{ runner.temp }}/cb-base
481-
key: cb-base-${{ runner.os }}-${{ steps.guard.outputs.base_sha }}-d${{ inputs.depth_level }}-${{ inputs.engine_ref }}-${{ inputs.llm_provider }}-${{ inputs.agent_model }}-${{ inputs.parsing_model }}
490+
key: cb-base-v2-${{ runner.os }}-${{ steps.guard.outputs.base_sha }}-d${{ inputs.depth_level }}-${{ inputs.engine_ref }}-${{ inputs.llm_provider }}-${{ inputs.agent_model }}-${{ inputs.parsing_model }}
482491

483492
- name: Analyze PR head (incremental from base)
484493
if: steps.guard.outputs.skip != 'true'

docs/COMMIT_STRATEGY.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,7 @@ The engine writes these under `.codeboarding/`:
4343
The warm-start — and the engine's incremental path itself — needs the pkl **and** its `.sha`: the cluster baseline that drives incremental lives only inside the pkl, so a committed `analysis.json` alone forces the head run into a full (LLM) fallback. The review action therefore guarantees the pair exists for the base SHA:
4444

4545
- **No committed baseline:** the generated base analysis writes the pkl as a side effect; the artifact dir is saved in `actions/cache` keyed by base SHA / depth / engine ref.
46+
- **Committed baseline:** the action first requires `analysis.json.metadata.commit_hash` to equal the PR base SHA. A stale committed diagram is treated like no baseline, so the base is regenerated at the PR base commit before diffing.
4647
- **Committed baseline, cache miss:** the action *seeds* the pkl deterministically (`cb_engine.py seed`: LSP indexing + the same clustering call a full run makes — **no LLM calls**), then saves it to the same cache. Seeding is fail-open: if it fails, the head run falls back to a full analysis.
4748

4849
Either way the head analysis is seeded from that directory and runs incrementally. This keeps the repo clean — the pkl never enters git — while the cache + seeding make incremental work from the first PR run.

scripts/cb_engine.py

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -42,6 +42,40 @@ def _clear_dir(path: Path) -> None:
4242
child.unlink()
4343

4444

45+
def validate_base_analysis(analysis_path: Path, expected_sha: str) -> tuple[bool, str]:
46+
"""Return whether ``analysis.json`` was generated for ``expected_sha``.
47+
48+
The PR action can only reuse a committed baseline when the diagram's own
49+
source commit matches the PR base commit. Otherwise the diff would be
50+
computed from the PR base while mutating an older diagram snapshot.
51+
"""
52+
try:
53+
data = json.loads(analysis_path.read_text(encoding="utf-8"))
54+
except FileNotFoundError:
55+
return False, f"Baseline analysis is missing: {analysis_path}"
56+
except (OSError, json.JSONDecodeError) as exc:
57+
return False, f"Baseline analysis is unreadable: {exc}"
58+
59+
if not isinstance(data, dict):
60+
return False, "Baseline analysis root is not a JSON object."
61+
62+
metadata = data.get("metadata")
63+
if not isinstance(metadata, dict):
64+
return False, "Baseline analysis metadata is missing."
65+
66+
actual_sha = metadata.get("commit_hash")
67+
if not isinstance(actual_sha, str) or not actual_sha:
68+
return False, "Baseline analysis metadata.commit_hash is missing."
69+
70+
if actual_sha != expected_sha:
71+
return (
72+
False,
73+
f"Baseline analysis was generated for {actual_sha}, expected PR base {expected_sha}.",
74+
)
75+
76+
return True, f"Baseline analysis commit matches PR base {expected_sha}."
77+
78+
4579
def run_base(repo: str, out: str, name: str, run_id: str, depth: int, source_sha: str) -> None:
4680
from codeboarding_workflows.analysis import run_full
4781

@@ -199,6 +233,10 @@ def main(argv=None) -> int:
199233
for a in ("--artifact-dir", "--repo", "--name", "--issues-out"):
200234
hc.add_argument(a, required=True)
201235

236+
vb = sub.add_parser("validate-base")
237+
vb.add_argument("--analysis", required=True)
238+
vb.add_argument("--expected-sha", required=True)
239+
202240
args = p.parse_args(argv)
203241
if args.cmd == "base":
204242
run_base(args.repo, args.out, args.name, args.run_id, args.depth, args.source_sha)
@@ -208,6 +246,10 @@ def main(argv=None) -> int:
208246
run_head(args.repo, args.out, args.name, args.run_id, args.depth, args.base_ref, args.target_ref, args.source_sha)
209247
elif args.cmd == "health":
210248
Path(args.issues_out).write_text(str(run_health(args.artifact_dir, args.repo, args.name)))
249+
elif args.cmd == "validate-base":
250+
ok, message = validate_base_analysis(Path(args.analysis), args.expected_sha)
251+
print(message)
252+
return 0 if ok else 1
211253
return 0
212254

213255

scripts/run_local.sh

Lines changed: 13 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -84,11 +84,14 @@ else
8484
[ -d "$ENGINE" ] || { echo "Engine not found at $ENGINE (set --engine or \$ENGINE)." >&2; exit 2; }
8585
[ -n "${OPENROUTER_API_KEY:-}" ] || { echo "Export OPENROUTER_API_KEY for the full pipeline." >&2; exit 2; }
8686
REPO="$(cd "$REPO" && pwd)"
87+
BASE_SHA="$(git -C "$REPO" rev-parse "$BASE_REF^{commit}")"
88+
HEAD_SHA="$(git -C "$REPO" rev-parse "$HEAD_REF^{commit}")"
8789
BASE_DIR="$OUT/base"; HEAD_DIR="$OUT/head"
8890
rm -rf "$BASE_DIR" "$HEAD_DIR"; mkdir -p "$BASE_DIR" "$HEAD_DIR"
8991

90-
echo "== Resolving base analysis at $BASE_REF =="
91-
if git -C "$REPO" show "$BASE_REF:.codeboarding/analysis.json" > "$BASE_DIR/analysis.json" 2>/dev/null; then
92+
echo "== Resolving base analysis at $BASE_SHA =="
93+
if git -C "$REPO" show "$BASE_SHA:.codeboarding/analysis.json" > "$BASE_DIR/analysis.json" 2>/dev/null \
94+
&& run_engine validate-base --analysis "$BASE_DIR/analysis.json" --expected-sha "$BASE_SHA"; then
9295
echo " using committed baseline"
9396
# Mirror action.yml: a committed analysis.json alone can't drive incremental —
9497
# the engine needs the base static_analysis.pkl with its cluster baseline.
@@ -97,8 +100,8 @@ else
97100
git -C "$REPO" worktree remove --force "$BASE_SRC" 2>/dev/null || true
98101
git -C "$REPO" worktree prune
99102
rm -rf "$BASE_SRC"
100-
git -C "$REPO" worktree add --detach "$BASE_SRC" "$BASE_REF" >/dev/null
101-
if run_engine seed --repo "$BASE_SRC" --out "$BASE_DIR" --source-sha "$BASE_REF"; then
103+
git -C "$REPO" worktree add --detach "$BASE_SRC" "$BASE_SHA" >/dev/null
104+
if run_engine seed --repo "$BASE_SRC" --out "$BASE_DIR" --source-sha "$BASE_SHA"; then
102105
echo " seeded static-analysis baseline (no LLM)"
103106
else
104107
rm -f "$BASE_DIR/static_analysis.pkl" "$BASE_DIR/static_analysis.sha"
@@ -112,29 +115,29 @@ else
112115
git -C "$REPO" worktree remove --force "$BASE_SRC" 2>/dev/null || true
113116
git -C "$REPO" worktree prune
114117
rm -rf "$BASE_SRC"
115-
git -C "$REPO" worktree add --detach "$BASE_SRC" "$BASE_REF" >/dev/null
118+
git -C "$REPO" worktree add --detach "$BASE_SRC" "$BASE_SHA" >/dev/null
116119
run_engine base \
117120
--repo "$BASE_SRC" \
118121
--out "$BASE_DIR" \
119122
--name "$(basename "$REPO")" \
120123
--run-id local-base \
121124
--depth "$DEPTH" \
122-
--source-sha "$BASE_REF"
125+
--source-sha "$BASE_SHA"
123126
git -C "$REPO" worktree remove --force "$BASE_SRC" >/dev/null 2>&1 || true
124127
[ -f "$BASE_DIR/analysis.json" ] || { echo "Base full analysis ran but analysis.json is missing." >&2; exit 1; }
125128
fi
126129

127-
echo "== Analyzing head at $HEAD_REF (incremental from base) =="
130+
echo "== Analyzing head at $HEAD_SHA (incremental from base) =="
128131
cp -a "$BASE_DIR"/. "$HEAD_DIR"/ 2>/dev/null || true
129132
run_engine head \
130133
--repo "$REPO" \
131134
--out "$HEAD_DIR" \
132135
--name "$(basename "$REPO")" \
133136
--run-id local-head \
134137
--depth "$DEPTH" \
135-
--base-ref "$BASE_REF" \
136-
--target-ref "$HEAD_REF" \
137-
--source-sha "$HEAD_REF"
138+
--base-ref "$BASE_SHA" \
139+
--target-ref "$HEAD_SHA" \
140+
--source-sha "$HEAD_SHA"
138141
[ -f "$HEAD_DIR/analysis.json" ] || { echo "Head analysis ran but analysis.json is missing." >&2; exit 1; }
139142
BASE_ANALYSIS="$BASE_DIR/analysis.json"
140143
HEAD_ANALYSIS="$HEAD_DIR/analysis.json"

tests/test_cb_engine.py

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
"""Smoke tests for scripts/cb_engine.py — verify it calls the engine API correctly,
22
using stub modules so no real engine venv is needed."""
33

4+
import json
45
import os
56
import sys
67
import tempfile
@@ -158,6 +159,53 @@ def test_head_falls_back_to_full_on_baseline_unavailable(self):
158159
self.assertEqual(len(rf.calls), 1) # BaselineUnavailableError also triggers the full re-run
159160

160161

162+
class TestValidateBase(_Base):
163+
def test_validate_base_accepts_matching_commit(self):
164+
with tempfile.TemporaryDirectory() as tmp:
165+
path = Path(tmp) / "analysis.json"
166+
path.write_text(json.dumps({"metadata": {"commit_hash": "abc123"}}), encoding="utf-8")
167+
168+
ok, message = cb_engine.validate_base_analysis(path, "abc123")
169+
170+
self.assertTrue(ok)
171+
self.assertIn("matches", message)
172+
173+
def test_validate_base_rejects_mismatched_commit(self):
174+
with tempfile.TemporaryDirectory() as tmp:
175+
path = Path(tmp) / "analysis.json"
176+
path.write_text(json.dumps({"metadata": {"commit_hash": "old"}}), encoding="utf-8")
177+
178+
ok, message = cb_engine.validate_base_analysis(path, "new")
179+
180+
self.assertFalse(ok)
181+
self.assertIn("old", message)
182+
self.assertIn("new", message)
183+
184+
def test_validate_base_rejects_missing_commit(self):
185+
with tempfile.TemporaryDirectory() as tmp:
186+
path = Path(tmp) / "analysis.json"
187+
path.write_text(json.dumps({"metadata": {}}), encoding="utf-8")
188+
189+
ok, message = cb_engine.validate_base_analysis(path, "abc123")
190+
191+
self.assertFalse(ok)
192+
self.assertIn("commit_hash", message)
193+
194+
def test_main_validate_base_exit_codes(self):
195+
with tempfile.TemporaryDirectory() as tmp:
196+
path = Path(tmp) / "analysis.json"
197+
path.write_text(json.dumps({"metadata": {"commit_hash": "abc123"}}), encoding="utf-8")
198+
199+
self.assertEqual(
200+
cb_engine.main(["validate-base", "--analysis", str(path), "--expected-sha", "abc123"]),
201+
0,
202+
)
203+
self.assertEqual(
204+
cb_engine.main(["validate-base", "--analysis", str(path), "--expected-sha", "def456"]),
205+
1,
206+
)
207+
208+
161209
class TestSeed(_Base):
162210
"""run_seed must analyze, cluster, then save — in that order, same results object.
163211

0 commit comments

Comments
 (0)