Skip to content

fix(bench): update elixir/julia/objc expected-edges to module-qualified names#1496

Merged
carlos-alm merged 26 commits into
mainfrom
fix/v3120-regressions-1446-1447
Jun 14, 2026
Merged

fix(bench): update elixir/julia/objc expected-edges to module-qualified names#1496
carlos-alm merged 26 commits into
mainfrom
fix/v3120-regressions-1446-1447

Conversation

@carlos-alm

Copy link
Copy Markdown
Contributor

Summary

Changes:

  • Updated all three expected-edges.json to use actual extractor output names
  • Ratcheted THRESHOLDS: elixir 0%→100% precision / 80% recall; julia 0%→100% precision / 70% recall; objc 0%→100% precision / 40% recall
  • Resolution benchmark: 176/176 tests pass, full suite 3044/3074 pass

Closes #1447

… diff

Adds snapshot-pre-bash.sh (PreToolUse Bash) + track-bash-writes.sh
(PostToolUse Bash): the pre-hook captures git status --porcelain to a
per-worktree temp file before each Bash call; the post-hook diffs the
before/after state and appends newly modified or created files to
.claude/session-edits.log.

This closes the gap where files written by sed -i, printf redirects,
tee, heredocs, or build tools (Cargo.lock, lockfiles) were never
recorded, causing guard-git.sh to emit false-positive BLOCKED errors.

Closes #1457
- clojure.rs: annotate lifetime-anchor assignment to silence false-positive
- cfg.rs: remove never-called start_line_of method
- complexity.rs: remove never-constructed NotHandled variant; convert
  irrefutable if-let patterns to plain let destructures
- dataflow.rs: remove never-read callee fields from CallReturn/Destructured
- incremental.rs: remove never-read lang field from CacheEntry

cargo check and cargo clippy both clean after these changes.
Adds .github/workflows/perf-canary.yml — a path-filtered workflow that
fires on PRs touching src/extractors/, src/domain/graph/, or crates/**
and runs only the incremental-benchmark suite (full build + no-op +
1-file rebuild, both engines). Catches the class of regressions that
accumulated invisibly across the Phase 8.x PRs and were only detected
at v3.12.0 publish time.

The regression guard gains BENCH_CANARY=1 mode: raises thresholds to
50%/100%/150% (standard/noisy/WASM) and skips the build, query, and
resolution suites — only incremental checks run. This absorbs shared-
runner timing variance while still blocking catastrophic regressions
(+98% full build, +1827% 1-file rebuild from v3.12.0).

Closes #1433
On incremental builds, runPostNativeCha previously scanned all
call→qualified-method edges in the DB (~12ms flat, O(graph size)),
even for 1-file changes where no hierarchy or RTA evidence changed.

Add two cheap indexed gate queries. Gate A checks whether any changed
file introduced a class/interface/trait/struct/record node (hierarchy
may have new implementors reachable from unchanged call sites). Gate B
checks whether any changed file added a call edge to a class-kind target
(RTA set may have grown, enabling previously filtered expansions in
unchanged callers). If neither gate fires, restrict the candidate query
to src.file IN changedFiles — safe because the hierarchy and instantiated
set are unchanged for all other files.

Full builds (isFullBuild=true) and cases where either gate fires retain
the existing full-scan behaviour. Mirrors the changed-files scoping
pattern of runPostNativeThisDispatch.

Closes #1441
Times each JS post-pass in tryNativeOrchestrator and exposes the
measurements in BuildResult.phases:

- gapDetectMs  — dropped-language gap detection + backfill
- chaMs        — CHA expansion (interface dispatch)
- thisDispatchMs — this/super dispatch WASM re-parse (was already
                   tracked but now properly named alongside the rest)
- reclassifyMs — scoped role re-classification after edge insertion
- techniqueBackfillMs — technique-column UPDATE on native-written edges

Previously only thisDispatchMs was reported, causing wall-clock vs
phaseSum to diverge by 1.1s+ on 1-file rebuilds and making benchmark
regressions undiagnosable from committed history.

Updates update-incremental-report.ts to render the new phases in a
collapsible details block under each engine's 1-file rebuild section.

Closes #1434
…ld for required-tier grammars

The docstring claimed pool cost was "amortised over enough parse work" —
measurements show IPC overhead scales linearly (~55–64ms/file pool vs
~8–10ms/file inline). The real motivation is crash safety for exotic WASM
grammars (#965); JS/TS/TSX (required-tier, used in all this-dispatch
backfill calls) have never triggered the V8 fatal crash class and are safe
to run inline.

Raise threshold 16 → 32 to keep typical this-dispatch batches (≤ 18 files
on the codegraph corpus) on the inline fast path. Exotic-language drops are
almost always well under 32 files and also benefit from the inline path
without meaningful crash risk increase.

Closes #1435
…e incremental rebuilds

On 1-file native incremental builds, two JS post-passes ran unconditionally
even when they had no work to do:

- `backfillNativeDroppedFiles`: called whenever changedCount > 0, even when
  detectDroppedLanguageGap returned an empty gap. Gate now checks
  gap.missingAbs.length > 0 || gap.staleRel.length > 0 directly, matching
  backfillNativeDroppedFiles's own internal early-exit guard.

- Node/edge COUNT(*) re-count: ran unconditionally after all post-passes even
  when none of them wrote any edges. COUNT(*) over 50K+ edge tables is
  non-trivial, especially via the NativeDbProxy napi-rs round-trip. Now gated
  on postPassWroteData (backfill | CHA edges | this-dispatch edges).

Closes #1454
The post-pass it timed (runPostNativePrototypeMethods) was deleted in
b5c03a2 when func-prop extraction moved to Rust (#1432). The optional
field was never set by any code path that survived the deletion.

Also remove the stale reference to "prototype-methods post-pass" from
the parseFilesWasmForBackfill docstring — only the this-dispatch
post-pass uses symbolsOnly now.

Closes #1432
… collision

Field type annotations (`private repo: OrderRepository`) were seeded as bare
file-wide typeMap keys, causing `this.repo` inside `UserService` to resolve to
`OrderRepository` when both classes had a `repo` field (issue #1458).

Both extractors (TS `handleFieldDefTypeMap` and Rust `field_definition` branch)
now seed `ClassName.field` keys at confidence 0.9, matching the `CallerClass.X`
resolver fallback added in PR #1382. Bare keys are kept at confidence 0.6 as
fallbacks for single-class files or class expressions where no enclosing class
name is available.

Both engines change identically — parity preserved.
…ed names

The resolution benchmark uses WASM-built graphs where the Elixir, Julia,
and Objective-C extractors emit module-qualified symbol names (Main.run,
App.main, UserService.create_user, etc.). The expected-edges manifests
were written with bare unqualified names (run, main, create_user), so
every correctly-resolved edge appeared as a false positive and every
expected edge appeared as a false negative — causing all three languages
to show 0% precision even though resolution was working correctly.

Root cause: starting in v3.12.0, cross-module call resolution began working
for these languages (via the improved receiver-dispatch and same-class
fallback in resolveByMethodOrGlobal / build-edges.ts). With 0 edges
previously resolved, the name mismatch was invisible; once edges started
resolving, the manifests showed 17 FP (elixir), 11 FP (julia), 6 FP
(objc) — all correctly resolved edges misidentified as false positives.

Fix:
- Update all three expected-edges.json manifests to use the
  module-qualified names matching actual extractor output:
  elixir: Main.run, UserService.create_user, Validators.validate_user, etc.
  julia:  App.main, Service.create_user, Repository.new_repo, etc.
  objc:   full ObjC selectors (createUserWithId:name:email:, isValidEmail:, etc.)
          plus add main -> run (plain C call correctly resolved)
- Ratchet THRESHOLDS for all three:
  elixir: precision 0.0 -> 1.0, recall 0.0 -> 0.8  (17/21 resolved)
  julia:  precision 0.0 -> 1.0, recall 0.0 -> 0.7  (11/15 resolved)
  objc:   precision 0.0 -> 1.0, recall 0.0 -> 0.4   (6/13 resolved)

Remaining FNs are genuine unresolved edges (same-file bare calls in
elixir/julia, receiver-typed message sends in objc) — not regressions.

Closes #1447
@claude

claude Bot commented Jun 13, 2026

Copy link
Copy Markdown

Claude finished @carlos-alm's task —— View job


PR Review — fix(bench): update elixir/julia/objc expected-edges to module-qualified names

  • Read changed files
  • Phase 1: Root cause analysis
  • Phase 2: Configuration change detection
  • Phase 3: Test integrity check
  • Phase 4: Code quality
  • Phase 5: Backlog compliance
  • Write final assessment

Root Cause Analysis

Problem Being Solved: The expected-edges manifests for elixir, julia, and objc contained bare unqualified function names (run, main, create_user), while the tree-sitter extractors have always emitted module-qualified names (Main.run, App.main, UserService.create_user, full ObjC selectors). Prior to v3.12.0, zero edges resolved for these languages, so the name mismatch was invisible (0 FP, 0 TP — trivially passing at 0.0/0.0). Once v3.12.0's improved cross-module resolution started successfully resolving edges, every correct edge appeared as a false positive and every expected edge as a false negative. This drove apparent precision to 0% while the extractor was working correctly.

Why This Approach: The fix is surgical and correct: update the manifests to match what the extractors actually emit, then ratchet the thresholds to reflect real performance. This is the right direction — the manifests are authoritative ground truth, so they must match actual extractor behavior. The alternative (changing the extractor to emit bare names) would make Elixir/Julia/ObjC inconsistent with every other module-aware language the tool supports.

Risk Assessment: Low. This is a test-data correction, not a production code change. The extractors are unchanged. The thresholds are ratcheted up (from 0.0 to 1.0 precision), which is strictly more restrictive. The PR cannot introduce runtime regressions.


Phase 2: Configuration Change Detection

No Biome rules, TypeScript flags, or build configs touched. The only resolution-benchmark.test.ts change is threshold values for three languages — all moving from { precision: 0.0, recall: 0.0 } to { precision: 1.0, recall: 0.8/0.7/0.4 }. These are ratchets upward, not relaxations. ✓


Phase 3: Test Integrity

No assertions weakened. The change makes tests more strict:

  • elixir: 0% precision floor → 100% precision floor (any future FP will fail CI immediately)
  • julia: same — 0% → 100%
  • objc: same — 0% → 100%

One subtlety worth verifying: the PR description says "81% recall" for elixir but sets the threshold to 0.8 (80%). That's a small but intentional buffer — defensible and consistent with how other languages are handled. No concern.

The notes field added to all manifest entries is genuinely useful: it documents why module-qualified names are used rather than leaving future maintainers confused. ✓


Phase 4: Code Quality

objc/expected-edges.json — one edge added (main → run, plain C call): this new edge is correctly grounded. main.m shows:

void run(void) { ... }
int main(int argc, const char *argv[]) {
    @autoreleasepool { run(); }
}

Both main and run are plain C functions (not ObjC methods), so they correctly use bare names — consistent with how the extractor would see them. The new main → run edge was missing from the original manifest; adding it is correct. ✓

ObjC selector names verified against Service.m and Validators.m:

  • UserService.initWithRepository: ✓ (trailing colon matches ObjC convention)
  • UserService.createUserWithId:name:email:
  • UserService.getUserWithId:
  • UserService.removeUserWithId:
  • Validators.isValidEmail:
  • Validators.isValidName:
  • UserRepository.saveWithId:name:, findById:, deleteWithId: ✓ (match [_repo ...] calls in Service.m)
  • Validators.validateUser:email: + its two same-class callees ✓

Elixir manifest verified against main.ex: defmodule Main do / def run do → extractor emits Main.run. All cross-module calls in the file match the updated manifest. ✓

Julia manifest follows the same module-qualification pattern consistently. ✓

Minor nit (resolution-benchmark.test.ts:174): The comment says "81% recall" but the threshold is set to 0.8:

// elixir: cross-module qualified calls resolve at 100% precision, 81% recall.
elixir: { precision: 1.0, recall: 0.8 },

The comment slightly overstates actual precision ("81%") while the threshold is set to 80%. This is intentional headroom, but the discrepancy in the comment could confuse future readers into thinking the manifest was designed to hit exactly 81% when CI only requires 80%. Not blocking, but worth a follow-up fix to say "≥80% recall" or "11/14 recall" to match the actual fraction. Same applies to julia ("73%" comment, 0.7 threshold) and objc ("46%" comment, 0.4 threshold).


Phase 5: Backlog Compliance

Axis Score Reason
Zero-dep No new dependencies
Foundation-aligned Pure test-data correction; no production code, no cloud calls, no schema changes
Problem-fit 3/5 Fixes CI accuracy metrics. Doesn't directly improve agent navigation or blast-radius detection, but ensures the benchmark correctly reflects real extraction quality — which matters for honest capability tracking
Breaking No Test-only change; no API, DB schema, CLI output, or MCP contract touched
Tier 1 Zero-dep + foundation-aligned

Critical Concerns

None. The changes are well-scoped, the root cause is correctly identified, and the fix is the right approach. The thresholds are ratcheted upward (more strict), not downward.

The only genuine question would be: Was the extractor actually correct before v3.12.0? The PR says v3.12.0 improved cross-module resolution, which revealed the manifest mismatch. If zero edges resolved before, the manifests at 0.0/0.0 were vacuously correct — this PR converts them to genuinely informative thresholds. That's a net improvement to test quality.


Final Recommendation

  • Rating: ⭐⭐⭐⭐ (4/5)
  • Action: APPROVE
  • Reasoning: Correct diagnosis, minimal scope, moves tests strictly more restrictive, well-documented. The only reason it's not 5 stars is the minor comment imprecision (81%/73%/46% stated vs 80%/70%/40% enforced) — not worth blocking for, but worth a follow-up tightening.

@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Fixes the root cause of #1447 by updating three benchmark fixture manifests (elixir, julia, objc) to use module-qualified names that match actual extractor output, and ratchets their precision/recall thresholds from 0% to real measured values.

  • expected-edges.json (elixir, julia, objc): All edge names updated from bare identifiers to module-qualified form (e.g. runMain.run, create_userUserService.create_user); ObjC selectors now include full multi-part selector syntax with trailing colons; one previously missing main → run static edge added to the ObjC manifest.
  • resolution-benchmark.test.ts: Thresholds ratcheted to precision: 1.0 for all three languages with recall floors at 0.8 / 0.7 / 0.4, with inline comments documenting which edge modes are still unresolved FNs.
  • native-orchestrator.ts: Contains an unintended duplicate PostPassTimings interface declaration that appears to be a stray edit not described in the PR — this file should not have been modified by this fix.

Confidence Score: 4/5

Safe to merge after removing the duplicate PostPassTimings interface from native-orchestrator.ts; the fixture and threshold changes are correct and well-documented.

The benchmark fixture corrections are straightforward and well-reasoned. The only concern is that native-orchestrator.ts — a production source file — was unintentionally modified with a duplicate interface declaration that isn't part of this fix. While TypeScript's interface merging makes it a no-op at compile time, it's clearly an accidental stray change that should be reverted before merging.

src/domain/graph/builder/stages/native-orchestrator.ts — the duplicate PostPassTimings interface block (lines 962–968) should be removed; it is not part of this fix.

Important Files Changed

Filename Overview
src/domain/graph/builder/stages/native-orchestrator.ts Contains an unintended duplicate PostPassTimings interface declaration not mentioned in the PR description; functionally harmless due to TypeScript interface merging but should not be in this commit.
tests/benchmarks/resolution/fixtures/elixir/expected-edges.json All 21 edges updated from bare function names to module-qualified names (e.g. validate_userValidators.validate_user); notes updated to explain extractor behavior.
tests/benchmarks/resolution/fixtures/julia/expected-edges.json All 15 edges updated from bare function names to module-qualified names (e.g. mainApp.main, create_userService.create_user); notes updated accordingly.
tests/benchmarks/resolution/fixtures/objc/expected-edges.json Selectors updated to include full ObjC selector syntax (trailing colons, multi-part selectors); one new main → run static edge added that was missing from the prior manifest.
tests/benchmarks/resolution/resolution-benchmark.test.ts Thresholds for elixir, julia, and objc ratcheted from 0% to real values (1.0 precision; 0.8 / 0.7 / 0.4 recall respectively) with detailed inline comments explaining current FN gaps.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Resolution Benchmark Test] --> B{Load expected-edges.json}
    B --> C[Elixir fixture\nModule-qualified names\ne.g. Main.run, Validators.validate_user]
    B --> D[Julia fixture\nModule-qualified names\ne.g. App.main, Service.create_user]
    B --> E[ObjC fixture\nFull ObjC selectors\ne.g. createUserWithId:name:email:]
    C --> F[Compare vs extractor output]
    D --> F
    E --> F
    F --> G{Precision / Recall\ncheck vs THRESHOLDS}
    G -->|elixir: P=1.0, R>=0.8| H[Pass]
    G -->|julia: P=1.0, R>=0.7| H
    G -->|objc: P=1.0, R>=0.4| H
    G -->|fail| I[Test failure]
Loading

Reviews (7): Last reviewed commit: "fix: resolve merge conflicts with main" | Re-trigger Greptile

Comment on lines +189 to +192
// objc: class-method static calls and same-class calls resolve at 100% precision, 46% recall.
// Expected-edges now use full ObjC selectors (createUserWithId:name:email:, isValidEmail:, etc.)
// matching what the ObjC extractor emits. Receiver-typed instance message sends
// are not yet resolved — tracked as FNs. Precision 1.0 acts as ratchet against future FPs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The ObjC threshold comment says only "Receiver-typed instance message sends are not yet resolved", but the constructor mode edge (run → UserService.initWithRepository:) is also unresolved. With 13 total edges, 6 are resolved (4 static + 2 same-file), giving 6/13 = 46.2% recall. If the constructor were a true positive, recall would be 7/13 = 53.8% — inconsistent with the "46% recall" stated in the comment. The missing mention of the constructor FN could mislead future contributors auditing which modes are covered.

Suggested change
// objc: class-method static calls and same-class calls resolve at 100% precision, 46% recall.
// Expected-edges now use full ObjC selectors (createUserWithId:name:email:, isValidEmail:, etc.)
// matching what the ObjC extractor emits. Receiver-typed instance message sends
// are not yet resolved — tracked as FNs. Precision 1.0 acts as ratchet against future FPs.
// objc: class-method static calls and same-class calls resolve at 100% precision, 46% recall.
// Expected-edges now use full ObjC selectors (createUserWithId:name:email:, isValidEmail:, etc.)
// matching what the ObjC extractor emits. Constructor message sends (initWithRepository:) and
// receiver-typed instance message sends are not yet resolved — tracked as FNs.
// Precision 1.0 acts as ratchet against future FPs.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — the comment now lists both constructor message sends (initWithRepository:) and receiver-typed instance message sends as unresolved FNs, so future contributors see the complete picture of what is and isn't covered.

…nce threshold constant

The secondary docstring still described the old 16-value rationale
("engine-parity drop sizes"). Replace with a pointer to
INLINE_BACKFILL_THRESHOLD where the full rationale now lives.
…bjc threshold comment

The previous comment only listed receiver-typed instance message sends as
unresolved false negatives, omitting the constructor edge
(run → UserService.initWithRepository:). Both modes are FNs; the comment now
lists them together so future contributors see the complete picture.
@carlos-alm

Copy link
Copy Markdown
Contributor Author

@greptileai

…r this. receivers

When handleFieldDefTypeMap seeds ClassName.field at 0.9 and bare field/
this.field at 0.6, the resolver was still finding the bare 0.6 entry first
(effectiveReceiver lookup) and skipping the class-scoped check entirely.
The result: cross-class collision from issue #1458 persisted at runtime even
though the extractor emitted the right keys.

Fix: for this. receivers, check the class-scoped key (ClassName.prop) first,
then fall back to bare keys. Both TS (call-resolver.ts) and Rust
(build_edges.rs) resolvers are updated identically.

Fixes #1458.
…-end integration test

- Rust: field_annotation_multi_class_seeds_separate_scoped_keys confirms that
  two classes with identically-named fields produce separate class-scoped typeMap
  keys at confidence 0.9 (mirrors the TS prevents-cross-class-collision test).
- Integration: issue-1458-cross-class-field-typemap.test.ts exercises the full
  buildGraph → resolver path (WASM engine) and asserts that OrderService.run
  resolves to OrderRepository.save and UserService.run to UserRepository.save,
  with no cross-class false edges.
The field_annotation_multi_class_seeds_separate_scoped_keys test used
parse_js() (tree-sitter-javascript), but the test fixture uses TypeScript
syntax (private field declarations with type annotations). The JS grammar
does not support type annotations, so the type_map came back empty and the
assertion failed.

Add parse_ts() helper using LANGUAGE_TYPESCRIPT and switch the test to it.
All 396 Rust unit tests pass.
@carlos-alm

Copy link
Copy Markdown
Contributor Author

Two follow-up commits since the last trigger: aa25c44 expanded toEqual object literals to multi-line format (Biome formatting), and 263c39b merged the upstream base branch. No logic change.

@carlos-alm

Copy link
Copy Markdown
Contributor Author

@greptileai

Base automatically changed from fix/field-typemap-class-scope-1458 to main June 14, 2026 07:27
@github-actions

Copy link
Copy Markdown
Contributor

Codegraph Impact Analysis

5 functions changed0 callers affected across 0 files

  • PostPassTimings.gapDetectMs in src/domain/graph/builder/stages/native-orchestrator.ts:963 (0 transitive callers)
  • PostPassTimings.chaMs in src/domain/graph/builder/stages/native-orchestrator.ts:964 (0 transitive callers)
  • PostPassTimings.thisDispatchMs in src/domain/graph/builder/stages/native-orchestrator.ts:965 (0 transitive callers)
  • PostPassTimings.reclassifyMs in src/domain/graph/builder/stages/native-orchestrator.ts:966 (0 transitive callers)
  • PostPassTimings.techniqueBackfillMs in src/domain/graph/builder/stages/native-orchestrator.ts:967 (0 transitive callers)

@carlos-alm carlos-alm merged commit b808751 into main Jun 14, 2026
23 checks passed
@carlos-alm carlos-alm deleted the fix/v3120-regressions-1446-1447 branch June 14, 2026 08:17
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 14, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf: investigate elixir/julia/objc false positive regression in v3.12.0

1 participant