refactor(ci): let tracer-validation gate reuse benchmark artifact by carlos-alm · Pull Request #1171 · optave/ops-codegraph-tool

carlos-alm · 2026-05-19T23:23:47Z

Summary

The pre-publish-benchmark job runs the per-language tracer once in Run resolution benchmark (telemetry into resolution-result.json) and a second time in Run tracer validation (same-file recall). This PR deduplicates the second run by reusing the artifact.
Spun off from follow-up: deduplicate resolution benchmark work between script and gate test #1052 / refactor(ci): let resolution gate reuse benchmark artifact #1167 as a follow-up — couldn't fold into the parent PR because runDynamicTracer and runTracer had different toolchain-missing semantics (empty array vs null), so the artifact needed a status field to round-trip the "skipped" signal.

Changes

scripts/resolution-benchmark.ts — Extend the per-language LangResult with a new required tracer: { status: 'ok' | 'skipped', edges } block. runDynamicTracer now mirrors tracer-validation.test.ts's skip semantics (error && edges.length === 0 → skipped). Existing dynamicEdges / dynamicConfirmed count fields are preserved for the report scripts.
tests/benchmarks/resolution/tracer/tracer-validation.test.ts — Add loadArtifact / tracerFromArtifact consuming RESOLUTION_RESULT_JSON when set. In artifact mode the suite drives the language list from artifact keys (so a language the script reported can never be silently skipped). When the env var is unset, fall back to spawning run-tracer.mjs so local npx vitest run … still works standalone.
.github/workflows/publish.yml — Wire RESOLUTION_RESULT_JSON into the Run tracer validation step, mirroring the pattern already in place for the resolution gate.

Stacked on

This branch is stacked on #1167 (refactor/1052-dedupe-resolution-bench) — base will need to retarget main after that lands.

Test plan

Lint clean (npm run lint)
Fallback mode: npx vitest run tests/benchmarks/resolution/tracer/tracer-validation.test.ts -t "javascript" passes (spawns run-tracer.mjs like before)
Artifact mode (status=ok): synthetic resolution-result.json with tracer.status="ok" exercises the recall computation and threshold check
Artifact mode (status=skipped): tracer.status="skipped" → test reports status: skipped row, no failure
Artifact mode (missing tracer block): clear regenerate with the current scripts/resolution-benchmark.ts error
CI: Run tracer validation step picks up RESOLUTION_RESULT_JSON from the workspace and reuses the artifact instead of re-spawning the tracer subprocess

Closes #1166

The pre-publish-benchmark job's `Run resolution benchmark` step builds codegraphs for ~34 language fixtures, computes precision/recall, and writes resolution-result.json. The `Gate on resolution thresholds` step that follows then ran the same vitest suite which independently copied every fixture and rebuilt the graphs again — doubling the most expensive slice of the publish pipeline. Extend the script's per-language LangResult to include falsePositiveEdges and falseNegativeEdges so the gate test has everything it needs for the existing precision/recall threshold assertions and failure messages. Refactor the gate test to consume that artifact when RESOLUTION_RESULT_JSON is set, falling back to the build-from-fixtures path when unset so devs can still run `npx vitest run tests/benchmarks/resolution/...` standalone. Wire the env var through the workflow's Gate step. Verified locally: gate test in artifact mode passes 170/170 in ~0.5s against an artifact produced by scripts/resolution-benchmark.ts, and the legacy build-from-fixtures path still passes for the javascript fixture. Closes #1052

…acts (#1167) - scripts/resolution-benchmark.ts: stop rounding precision/recall to 3 decimals before writing the artifact. The rounding let a near-miss like 0.8497 round up to 0.850 and silently clear a 0.85 threshold in CI artifact mode while failing in fixture mode. - tests/benchmarks/resolution/resolution-benchmark.test.ts: validate numeric fields in metricsFromArtifact so a stale or malformed artifact surfaces a clear 'regenerate' error instead of a confusing TypeError at the threshold assertions. - tests/benchmarks/resolution/resolution-benchmark.test.ts: reject an empty artifact in loadArtifact. Without this guard, an empty {} would register zero describe blocks and vitest would exit 0 with '0 tests', silently passing the gate.

The pre-publish-benchmark job's `Run resolution benchmark` step already runs the per-language tracer once for every fixture (the script writes dynamic-edge counts into resolution-result.json for telemetry). The `Run tracer validation` step that follows then re-runs the same tracer subprocess once more per language to compute same-file recall — doubling tracer cost on the publish path. Spun off from #1052 / #1167, which deduped the resolution gate the same way. The tracer side wasn't trivial to fold into that PR because runDynamicTracer and runTracer had different toolchain-missing semantics (empty array vs null), so the artifact needed a status field to round-trip the "skipped" signal. - Extend the script's TracerArtifact to `{ status: 'ok' | 'skipped', edges }`, attached to every LangResult as the new `tracer` block. `skipped` is set when run-tracer.mjs reports an error and produced no edges — mirroring tracer-validation.test.ts's skip semantics so it round-trips through the artifact. - Refactor tracer-validation.test.ts to consume the artifact when RESOLUTION_RESULT_JSON is set (driving the language list from the artifact keys so we never silently drop a language), falling back to spawning run-tracer.mjs when unset so local `npx vitest run …` runs still work standalone. - Wire RESOLUTION_RESULT_JSON into the workflow's tracer validation step, same pattern as the resolution gate. Verified locally: artifact mode correctly distinguishes status=ok (runs recall computation, hits threshold check) from status=skipped (graceful skip), and the missing-tracer-block error path surfaces a clear "regenerate" message. Fallback mode still drives the suite from the filesystem when the env var is unset. Closes #1166

claude · 2026-05-19T23:24:00Z

Claude finished @carlos-alm's task —— View job

PR Review: Tracer Validation Deduplication

Review Progress

Read and analyze changed files
Phase 1: Root cause analysis
Phase 2: Configuration change detection
Phase 3: Test integrity review
Phase 4: Code quality assessment
Phase 5: Backlog compliance evaluation
Final assessment and recommendation

Phase 1: Root Cause Analysis ✅

Problem Being Solved: The pre-publish-benchmark job was running per-language tracers twice - once in "Run resolution benchmark" and again in "Run tracer validation". This PR eliminates the duplicate work by having the tracer validation reuse the benchmark artifact.

Why This Approach: The approach is well-justified. The duplication was identified as inefficiency in issue #1166. Rather than re-spawning tracer subprocesses, the validation now reads pre-captured edges from the resolution benchmark artifact, cutting CI time in half for this step.

Risk Assessment: Low risk. This is a pure optimization that preserves all existing semantics. The fallback mode ensures local development workflows remain unchanged.

Phase 2: Configuration Change Detection ✅

No high-risk changes detected:

No Biome/linting rules modified
No test timeouts changed or coverage thresholds lowered
No TypeScript strict flags disabled
No module resolution changes
Workflow change is purely additive (sets env var for artifact reuse)

Phase 3: Test Integrity Assessment ✅

The changes preserve test integrity well:

✅ Preserved semantics: The tracerFromArtifact function maintains the same null-return contract for skipped languages as the original runTracer function

✅ Robust fallback: Local development unaffected - when RESOLUTION_RESULT_JSON is unset, falls back to spawning run-tracer.mjs

✅ Artifact validation: Strong validation logic rejects malformed artifacts and provides clear error messages

✅ Improved safety: Artifact mode drives test discovery from artifact keys instead of filesystem, preventing silent skips when a language the script reported is missing

Phase 4: Code Quality Assessment ✅

Implementation Quality: Excellent

Clean type definitions with clear documentation
Proper error handling with descriptive messages
Maintains backward compatibility seamlessly
Good separation of concerns (artifact loading vs execution logic)

Specific Strengths:

Line 184: scripts/resolution-benchmark.ts:184 - Excellent error messaging for missing tracer blocks
Line 203: scripts/resolution-benchmark.ts:203 - Proper handling of edge case where error occurs but edges exist
Line 188: tests/.../tracer-validation.test.ts:188 - Preserves null-as-skipped contract consistently
Line 249: Test suite drives from artifact keys in CI mode, preventing silent language omissions

No significant concerns found

Phase 5: Backlog Compliance ✅

Backlog Compliance

Zero-dep: ✓ — No new runtime dependencies added. Builds entirely on existing infrastructure.
Foundation-aligned: ✓ — Fully aligned with all core principles. Improves P1 (always-current) by making CI faster. Maintains P4 (zero-cost core) by eliminating duplicate work. No cloud APIs or external services.
Problem-fit: 3/5 — This is a CI optimization that doesn't directly address the core "AI agents waste tokens" problem. However, it's valuable infrastructure work that makes the benchmark pipeline more efficient.
Breaking: No — This is purely additive. No CLI output changes, API changes, or schema changes. Fallback mode preserves all existing behavior.
Tier: 1 — Zero-dep ✓ + Foundation-aligned ✓

Critical Concerns

None identified. This is a well-executed infrastructure optimization with thorough error handling and backward compatibility.

Final Assessment

## Root Cause Analysis
**Problem Being Solved**: Pre-publish CI was running per-language tracers twice (benchmark + validation), causing inefficiency
**Why This Approach**: Clean artifact reuse with robust fallback, eliminates duplicate tracer spawning while preserving all semantics
**Risk Assessment**: Very low risk - pure optimization with comprehensive backward compatibility

## Backlog Compliance  
- Zero-dep: ✓ — No new runtime dependencies
- Foundation-aligned: ✓ — Improves P1 (always-current) by optimizing CI pipeline
- Problem-fit: 3 — Infrastructure optimization, doesn't directly solve core AI agent problem but valuable
- Breaking: No — Purely additive with complete fallback preservation
- Tier: 1

## Critical Concerns
None

## Final Recommendation
- Rating: ⭐⭐⭐⭐ (4/5)
- Action: APPROVE
- Reasoning: Excellent technical implementation of a valuable CI optimization. Clean code, thorough error handling, preserved semantics, zero breaking changes. The artifact reuse pattern could be a model for other CI optimizations.

greptile-apps · 2026-05-19T23:27:10Z

Greptile Summary

This PR eliminates a redundant per-language tracer subprocess invocation in CI by storing the tracer edges captured during Run resolution benchmark into resolution-result.json and having the Run tracer validation step read them back via RESOLUTION_RESULT_JSON, mirroring the pattern already in place for the resolution gate.

scripts/resolution-benchmark.ts: LangResult gains a required tracer: { status, edges } field; runDynamicTracer returns TracerArtifact instead of a raw edge array, with status: 'skipped' when the toolchain is missing and status: 'ok' otherwise.
tests/benchmarks/resolution/tracer/tracer-validation.test.ts: New loadArtifact / tracerFromArtifact helpers consume the artifact in CI; the language list is driven from artifact keys (preventing silent skips); local runs fall back to spawning run-tracer.mjs unchanged.
.github/workflows/publish.yml: RESOLUTION_RESULT_JSON env var added to the tracer-validation step.

Confidence Score: 5/5

Safe to merge; the deduplication logic is well-guarded and the local fallback is preserved.

All three changed files have narrow, well-scoped changes. The artifact-read path has robust error handling for missing files, malformed JSON, and empty results. The existing workflow step ordering guarantees resolution-result.json is written before it is consumed. The local fallback via runTracer is untouched.

No files require special attention beyond the note about artifact-mode language discovery in tracer-validation.test.ts.

Important Files Changed

Filename	Overview
scripts/resolution-benchmark.ts	Adds `TracerArtifact` type and extends `LangResult` with a required `tracer` field; changes `runDynamicTracer` return type from `DynamicEdge[]` to `TracerArtifact`.
tests/benchmarks/resolution/tracer/tracer-validation.test.ts	Adds `loadArtifact` / `tracerFromArtifact` helpers and switches the language list to artifact keys in CI mode; JSON parse errors are now wrapped with a helpful message.
.github/workflows/publish.yml	Wires `RESOLUTION_RESULT_JSON` into the tracer-validation step using the workspace-relative path already used by the resolution gate.

Sequence Diagram

sequenceDiagram
    participant RB as Run resolution benchmark
    participant FS as resolution-result.json
    participant RG as Gate on resolution thresholds
    participant TV as Run tracer validation

    RB->>RB: runDynamicTracer(lang)
    RB->>FS: write JSON with tracer block per lang
    RG->>FS: read via RESOLUTION_RESULT_JSON
    RG->>RG: validate thresholds
    TV->>FS: read via RESOLUTION_RESULT_JSON
    TV->>TV: loadArtifact() returns artifact map
    TV->>TV: tracerFromArtifact per lang
    TV->>TV: "computeSameFileRecall assert >= threshold"
    note over RB,TV: Before this PR TV spawned run-tracer.mjs per language. After TV reuses edges from RB.

_{Reviews (4): Last reviewed commit: "fix(test): wrap JSON.parse with helpful ..." | Re-trigger Greptile}

greptile-apps · 2026-05-19T23:27:17Z

+function loadArtifact(artifactPath: string): Record<string, ArtifactLangResult> {
+  if (!fs.existsSync(artifactPath)) {
+    throw new Error(
+      `RESOLUTION_RESULT_JSON=${artifactPath} not found — run scripts/resolution-benchmark.ts first.`,
+    );
+  }
+  const parsed = JSON.parse(fs.readFileSync(artifactPath, 'utf-8')) as Record<
+    string,
+    ArtifactLangResult
+  >;
+  // Refuse to proceed on an empty artifact: with zero languages, vitest would
+  // register no test cases and exit 0, silently passing the gate.
+  if (!parsed || typeof parsed !== 'object' || Object.keys(parsed).length === 0) {
+    throw new Error(
+      `RESOLUTION_RESULT_JSON=${artifactPath} contains no language results — regenerate with scripts/resolution-benchmark.ts.`,
+    );
+  }
+  return parsed;
+}


JSON parse error not wrapped with a helpful message

If resolution-result.json exists but contains malformed JSON, JSON.parse will throw a raw SyntaxError with no reference to RESOLUTION_RESULT_JSON or what to do next. The existing helpful error messages ("not found", "contains no language results") cover the empty-file and missing-file cases, but a corrupt-file scenario produces a confusing traceback. Wrapping JSON.parse in a try/catch and re-throwing with the same "regenerate with scripts/resolution-benchmark.ts" context message would keep the DX consistent.

Fixed — wrapped JSON.parse in try/catch and re-throw with the 'regenerate with scripts/resolution-benchmark.ts' guidance, matching the helpful-error pattern used for the missing-file and empty-file cases. Verified locally that a malformed resolution-result.json now produces a clear error message instead of a raw SyntaxError traceback.

github-actions · 2026-05-19T23:30:24Z

Codegraph Impact Analysis

6 functions changed → 4 callers affected across 2 files

TracerArtifact.status in scripts/resolution-benchmark.ts:62 (0 transitive callers)
TracerArtifact.edges in scripts/resolution-benchmark.ts:63 (0 transitive callers)
LangResult.tracer in scripts/resolution-benchmark.ts:82 (0 transitive callers)
computeMetrics in scripts/resolution-benchmark.ts:114 (3 transitive callers)
runDynamicTracer in scripts/resolution-benchmark.ts:187 (1 transitive callers)
dynamicConfirmed in scripts/resolution-benchmark.ts:314 (0 transitive callers)

…tifact loader (#1171) When RESOLUTION_RESULT_JSON points to a file with malformed JSON, the artifact loader previously threw a raw SyntaxError with no context. Mirror the existing 'regenerate with scripts/resolution-benchmark.ts' guidance so the DX is consistent with the missing-file and empty-file cases.

carlos-alm · 2026-05-20T07:26:42Z

@greptileai

carlos-alm added 3 commits May 19, 2026 01:54

greptile-apps Bot reviewed May 19, 2026

View reviewed changes

Base automatically changed from refactor/1052-dedupe-resolution-bench to main May 20, 2026 00:29

carlos-alm and others added 3 commits May 19, 2026 22:51

fix: resolve merge conflicts with main

8641195

Merge branch 'main' into refactor/1166-dedupe-tracer-validation

66a059c

carlos-alm merged commit 4d8df7b into main May 20, 2026
25 checks passed

carlos-alm deleted the refactor/1166-dedupe-tracer-validation branch May 20, 2026 08:09

github-actions Bot locked and limited conversation to collaborators May 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor(ci): let tracer-validation gate reuse benchmark artifact#1171

refactor(ci): let tracer-validation gate reuse benchmark artifact#1171
carlos-alm merged 6 commits into
mainfrom
refactor/1166-dedupe-tracer-validation

carlos-alm commented May 19, 2026

Uh oh!

claude Bot commented May 19, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 19, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot May 19, 2026

Uh oh!

carlos-alm May 20, 2026

Uh oh!

github-actions Bot commented May 19, 2026 •

edited

Loading

Uh oh!

carlos-alm commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

carlos-alm commented May 19, 2026

Summary

Changes

Stacked on

Test plan

Uh oh!

claude Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Tracer Validation Deduplication

Review Progress

Phase 1: Root Cause Analysis ✅

Phase 2: Configuration Change Detection ✅

Phase 3: Test Integrity Assessment ✅

Phase 4: Code Quality Assessment ✅

Phase 5: Backlog Compliance ✅

Backlog Compliance

Critical Concerns

Final Assessment

Uh oh!

greptile-apps Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

carlos-alm May 20, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codegraph Impact Analysis

Uh oh!

carlos-alm commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented May 19, 2026 •

edited

Loading

greptile-apps Bot commented May 19, 2026 •

edited

Loading

github-actions Bot commented May 19, 2026 •

edited

Loading