Skip to content

Commit 57a6df4

Browse files
authored
fix(parity): log per-file reasons for native orchestrator drops (#1011) (#1024)
* fix(parity): log per-file reasons for native orchestrator drops (#1011) The native orchestrator's WASM backfill warning was a single line — "dropped N file(s); backfilling via WASM" — which made it impossible to tell which files were dropped or why. With 49 drops on the 3.9.5 self- build, users could not distinguish legitimate parser limits (languages with no Rust extractor, e.g. .fs/.gleam/.clj/.jl/.R/.sol/.cu) from real extractor regressions in natively-supported languages. Add a `classifyNativeDrops` helper in `src/domain/parser.ts` that buckets each missing file by extension and reason: - `unsupported-by-native`: extension has no Rust extractor — legitimate parser limit per the issue's acceptance criteria. Logged at info level. - `native-extractor-failure`: extension IS in the Rust addon yet the file was still dropped — points at a real native bug. Logged at warn level with explicit "likely a Rust extractor bug" framing so it stays loud. `backfillNativeDroppedFiles` in `pipeline.ts` now uses the helper to emit per-extension counts and up to 3 sample paths per extension (capped at 6 extensions to keep the log line readable). The classifier set `NATIVE_SUPPORTED_EXTENSIONS` mirrors `LanguageKind::from_extension` in `crates/codegraph-core/src/parser_registry.rs` and carries a comment flagging the sync requirement — the native addon ships as a separate npm package, so JS has no runtime way to discover its language coverage. Adds unit tests covering the classification, sample paths, case-insensitive extensions, and empty input. Note: the Node 24 vitest config has a pre-existing incompatibility (`--strip-types is not allowed in NODE_OPTIONS`); test logic was verified locally via a standalone Node script that mirrors the unit tests, and the PR's CI on Node 22 will run the vitest suite normally. docs check acknowledged: internal logging refactor; no user-visible language support, command, or roadmap deliverable changes. * test(parity): cover formatDropExtensionSummary and guard NATIVE_SUPPORTED_EXTENSIONS drift (#1024) Address Greptile P2 review feedback on #1024: - Move `formatDropExtensionSummary` from `pipeline.ts` to `parser.ts` next to `classifyNativeDrops` and export it so the cap logic can be unit-tested. - Add tests for the formatter covering: empty buckets, under-cap rendering, per-extension sample cap with `+N more`, exact-cap boundary, extension cap with `+N more extension(s)` suffix, and descending-count ordering. - Add a drift guard test that parses `crates/codegraph-core/src/parser_registry.rs` and asserts the JS `NATIVE_SUPPORTED_EXTENSIONS` set agrees with the Rust `LanguageKind::from_extension` arms. The native addon doesn't expose its own metadata, so source-level cross-check at CI time is the cheapest way to catch drift before users see mis-classified log levels. Impact: 1 functions changed, 3 affected
1 parent d8921c2 commit 57a6df4

3 files changed

Lines changed: 350 additions & 4 deletions

File tree

src/domain/graph/builder/pipeline.ts

Lines changed: 24 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,13 @@ import type {
3232
ExtractorOutput,
3333
SqliteStatement,
3434
} from '../../../types.js';
35-
import { getActiveEngine, getInstalledWasmExtensions, parseFilesAuto } from '../../parser.js';
35+
import {
36+
classifyNativeDrops,
37+
formatDropExtensionSummary,
38+
getActiveEngine,
39+
getInstalledWasmExtensions,
40+
parseFilesAuto,
41+
} from '../../parser.js';
3642
import { setWorkspaces } from '../resolve.js';
3743
import { PipelineContext } from './context.js';
3844
import { batchInsertNodes, collectFiles as collectFilesUtil, loadPathAliases } from './helpers.js';
@@ -761,18 +767,32 @@ async function backfillNativeDroppedFiles(ctx: PipelineContext): Promise<void> {
761767
// minimal installs) can't be parsed by either engine, so they're not a
762768
// native regression — excluding them keeps the warn count meaningful.
763769
const installedExts = getInstalledWasmExtensions();
770+
const missingRel: string[] = [];
764771
const missingAbs: string[] = [];
765772
for (const rel of expected) {
766773
if (existing.has(rel)) continue;
767774
const ext = path.extname(rel).toLowerCase();
768775
if (!installedExts.has(ext)) continue;
776+
missingRel.push(rel);
769777
missingAbs.push(path.join(ctx.rootDir, rel));
770778
}
771779
if (missingAbs.length === 0) return;
772780

773-
warn(
774-
`Native orchestrator dropped ${missingAbs.length} file(s); backfilling via WASM for engine parity`,
775-
);
781+
// Classify drops so users see per-extension reasons instead of just a count
782+
// (#1011). `unsupported-by-native` is a legitimate parser limit (no Rust
783+
// extractor); `native-extractor-failure` indicates a real native bug since
784+
// the language IS supported by the addon yet the file was dropped anyway.
785+
const { byReason, totals } = classifyNativeDrops(missingRel);
786+
if (totals['unsupported-by-native'] > 0) {
787+
info(
788+
`Native orchestrator skipped ${totals['unsupported-by-native']} file(s) in languages without a Rust extractor; backfilling via WASM: ${formatDropExtensionSummary(byReason['unsupported-by-native'])}`,
789+
);
790+
}
791+
if (totals['native-extractor-failure'] > 0) {
792+
warn(
793+
`Native orchestrator dropped ${totals['native-extractor-failure']} file(s) in natively-supported languages — likely a Rust extractor bug. Backfilling via WASM: ${formatDropExtensionSummary(byReason['native-extractor-failure'])}`,
794+
);
795+
}
776796
const wasmResults = await parseFilesAuto(missingAbs, ctx.rootDir, { engine: 'wasm' });
777797

778798
const rows: unknown[][] = [];

src/domain/parser.ts

Lines changed: 122 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -412,6 +412,128 @@ export function getInstalledWasmExtensions(): Set<string> {
412412
return exts;
413413
}
414414

415+
/**
416+
* Lowercase file extensions covered by the native Rust addon.
417+
*
418+
* Mirrors `LanguageKind::from_extension` in
419+
* `crates/codegraph-core/src/parser_registry.rs`. Used to classify why the
420+
* native orchestrator dropped a file: extensions outside this set are a
421+
* legitimate parser limit (no Rust extractor exists), while extensions inside
422+
* it indicate a real native bug (parse/read/extract failure).
423+
*
424+
* Keep this list in sync with the Rust enum — the native addon is a separate
425+
* npm package, so JS has no runtime way to discover its language coverage.
426+
*/
427+
export const NATIVE_SUPPORTED_EXTENSIONS: ReadonlySet<string> = new Set([
428+
'.js',
429+
'.jsx',
430+
'.mjs',
431+
'.cjs',
432+
'.ts',
433+
'.tsx',
434+
'.py',
435+
'.pyi',
436+
'.tf',
437+
'.hcl',
438+
'.go',
439+
'.rs',
440+
'.java',
441+
'.cs',
442+
'.rb',
443+
'.rake',
444+
'.gemspec',
445+
'.php',
446+
'.phtml',
447+
'.c',
448+
'.h',
449+
'.cpp',
450+
'.cc',
451+
'.cxx',
452+
'.hpp',
453+
'.kt',
454+
'.kts',
455+
'.swift',
456+
'.scala',
457+
'.sh',
458+
'.bash',
459+
'.ex',
460+
'.exs',
461+
'.lua',
462+
'.dart',
463+
'.zig',
464+
'.hs',
465+
'.ml',
466+
'.mli',
467+
]);
468+
469+
/**
470+
* Classification for a file the native orchestrator dropped.
471+
* - `unsupported-by-native`: extension has no Rust extractor (legitimate parser limit).
472+
* - `native-extractor-failure`: extension is supported by native but the file was
473+
* still dropped — points at a real bug (read error, parse failure, extractor crash).
474+
*/
475+
export type NativeDropReason = 'unsupported-by-native' | 'native-extractor-failure';
476+
477+
export interface NativeDropClassification {
478+
/** Per-reason → per-extension → list of relative paths that hit that bucket. */
479+
byReason: Record<NativeDropReason, Map<string, string[]>>;
480+
/** Total file count per reason. */
481+
totals: Record<NativeDropReason, number>;
482+
}
483+
484+
/**
485+
* Group the missing files (relative paths) by drop reason and extension so the
486+
* caller can log per-extension counts and a sample path. Pure function — no
487+
* I/O, safe to unit-test independently of the build pipeline.
488+
*/
489+
export function classifyNativeDrops(relPaths: Iterable<string>): NativeDropClassification {
490+
const byReason: Record<NativeDropReason, Map<string, string[]>> = {
491+
'unsupported-by-native': new Map(),
492+
'native-extractor-failure': new Map(),
493+
};
494+
const totals: Record<NativeDropReason, number> = {
495+
'unsupported-by-native': 0,
496+
'native-extractor-failure': 0,
497+
};
498+
for (const rel of relPaths) {
499+
const ext = path.extname(rel).toLowerCase();
500+
const reason: NativeDropReason = NATIVE_SUPPORTED_EXTENSIONS.has(ext)
501+
? 'native-extractor-failure'
502+
: 'unsupported-by-native';
503+
const bucket = byReason[reason];
504+
let list = bucket.get(ext);
505+
if (!list) {
506+
list = [];
507+
bucket.set(ext, list);
508+
}
509+
list.push(rel);
510+
totals[reason]++;
511+
}
512+
return { byReason, totals };
513+
}
514+
515+
/**
516+
* Render `{ ext → paths[] }` as `ext (n: sample.ext, ...)` slices for log lines.
517+
* Caps at 3 sample paths per extension and 6 extensions total to keep warnings
518+
* readable when many languages are dropped at once. Extensions are sorted by
519+
* descending file count so the loudest offender shows up first; ties keep
520+
* insertion order. Pure function — safe to unit-test independently.
521+
*/
522+
export function formatDropExtensionSummary(buckets: Map<string, string[]>): string {
523+
const MAX_EXTS = 6;
524+
const MAX_SAMPLES = 3;
525+
const entries = Array.from(buckets.entries()).sort((a, b) => b[1].length - a[1].length);
526+
const shown = entries.slice(0, MAX_EXTS).map(([ext, paths]) => {
527+
const sample = paths.slice(0, MAX_SAMPLES).join(', ');
528+
const more = paths.length > MAX_SAMPLES ? `, +${paths.length - MAX_SAMPLES} more` : '';
529+
return `${ext} (${paths.length}: ${sample}${more})`;
530+
});
531+
if (entries.length > MAX_EXTS) {
532+
shown.push(`+${entries.length - MAX_EXTS} more extension(s)`);
533+
}
534+
return shown.join('; ');
535+
}
536+
415537
// ── Unified API ──────────────────────────────────────────────────────────────
416538

417539
function resolveEngine(opts: ParseEngineOpts = {}): ResolvedEngine {
Lines changed: 204 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,204 @@
1+
import fs from 'node:fs';
2+
import path from 'node:path';
3+
import { fileURLToPath } from 'node:url';
4+
import { describe, expect, it } from 'vitest';
5+
import {
6+
classifyNativeDrops,
7+
formatDropExtensionSummary,
8+
NATIVE_SUPPORTED_EXTENSIONS,
9+
} from '../../src/domain/parser.js';
10+
11+
const __filename = fileURLToPath(import.meta.url);
12+
const __dirname = path.dirname(__filename);
13+
const REPO_ROOT = path.resolve(__dirname, '..', '..');
14+
15+
describe('classifyNativeDrops', () => {
16+
it('groups WASM-only languages under unsupported-by-native', () => {
17+
const { byReason, totals } = classifyNativeDrops([
18+
'src/a.fs',
19+
'src/b.gleam',
20+
'src/c.clj',
21+
'src/d.jl',
22+
'src/e.R',
23+
'src/f.erl',
24+
'src/g.sol',
25+
'src/h.cu',
26+
'src/i.groovy',
27+
'src/j.v',
28+
'src/k.m',
29+
]);
30+
expect(totals['unsupported-by-native']).toBe(11);
31+
expect(totals['native-extractor-failure']).toBe(0);
32+
expect(byReason['unsupported-by-native'].get('.fs')).toEqual(['src/a.fs']);
33+
expect(byReason['unsupported-by-native'].get('.gleam')).toEqual(['src/b.gleam']);
34+
expect(byReason['unsupported-by-native'].get('.r')).toEqual(['src/e.R']);
35+
});
36+
37+
it('flags natively-supported extensions as native-extractor-failure', () => {
38+
const { byReason, totals } = classifyNativeDrops([
39+
'src/a.ts',
40+
'src/b.py',
41+
'src/c.go',
42+
'src/d.rs',
43+
]);
44+
expect(totals['native-extractor-failure']).toBe(4);
45+
expect(totals['unsupported-by-native']).toBe(0);
46+
expect(byReason['native-extractor-failure'].get('.ts')).toEqual(['src/a.ts']);
47+
expect(byReason['native-extractor-failure'].get('.py')).toEqual(['src/b.py']);
48+
});
49+
50+
it('handles a mix of supported and unsupported extensions', () => {
51+
const { byReason, totals } = classifyNativeDrops([
52+
'src/a.ts',
53+
'src/b.fs',
54+
'src/c.fs',
55+
'src/d.gleam',
56+
]);
57+
expect(totals['native-extractor-failure']).toBe(1);
58+
expect(totals['unsupported-by-native']).toBe(3);
59+
expect(byReason['unsupported-by-native'].get('.fs')).toEqual(['src/b.fs', 'src/c.fs']);
60+
expect(byReason['unsupported-by-native'].get('.gleam')).toEqual(['src/d.gleam']);
61+
});
62+
63+
it('lowercases extensions so .R and .r share a bucket', () => {
64+
const { byReason, totals } = classifyNativeDrops(['scripts/a.R', 'scripts/b.r']);
65+
expect(totals['unsupported-by-native']).toBe(2);
66+
expect(byReason['unsupported-by-native'].get('.r')).toEqual(['scripts/a.R', 'scripts/b.r']);
67+
});
68+
69+
it('returns empty buckets when no files are passed', () => {
70+
const { byReason, totals } = classifyNativeDrops([]);
71+
expect(totals['native-extractor-failure']).toBe(0);
72+
expect(totals['unsupported-by-native']).toBe(0);
73+
expect(byReason['native-extractor-failure'].size).toBe(0);
74+
expect(byReason['unsupported-by-native'].size).toBe(0);
75+
});
76+
77+
it('exposes the native-supported extension set for callers', () => {
78+
expect(NATIVE_SUPPORTED_EXTENSIONS.has('.ts')).toBe(true);
79+
expect(NATIVE_SUPPORTED_EXTENSIONS.has('.py')).toBe(true);
80+
expect(NATIVE_SUPPORTED_EXTENSIONS.has('.fs')).toBe(false);
81+
expect(NATIVE_SUPPORTED_EXTENSIONS.has('.gleam')).toBe(false);
82+
});
83+
});
84+
85+
describe('formatDropExtensionSummary', () => {
86+
it('returns an empty string when no buckets are present', () => {
87+
expect(formatDropExtensionSummary(new Map())).toBe('');
88+
});
89+
90+
it('lists every extension when under the cap', () => {
91+
const buckets = new Map<string, string[]>([
92+
['.ts', ['a.ts', 'b.ts']],
93+
['.py', ['c.py']],
94+
]);
95+
expect(formatDropExtensionSummary(buckets)).toBe('.ts (2: a.ts, b.ts); .py (1: c.py)');
96+
});
97+
98+
it('caps samples per extension at 3 and renders +N more', () => {
99+
const buckets = new Map<string, string[]>([['.ts', ['a.ts', 'b.ts', 'c.ts', 'd.ts', 'e.ts']]]);
100+
expect(formatDropExtensionSummary(buckets)).toBe('.ts (5: a.ts, b.ts, c.ts, +2 more)');
101+
});
102+
103+
it('shows exactly MAX_SAMPLES samples without a +N suffix when count equals the cap', () => {
104+
const buckets = new Map<string, string[]>([['.ts', ['a.ts', 'b.ts', 'c.ts']]]);
105+
expect(formatDropExtensionSummary(buckets)).toBe('.ts (3: a.ts, b.ts, c.ts)');
106+
});
107+
108+
it('caps extensions at 6 and renders +N more extension(s)', () => {
109+
// 8 extensions, all with 1 file — sorted by count is a stable tie so insertion
110+
// order wins, and the first 6 are shown.
111+
const buckets = new Map<string, string[]>([
112+
['.a', ['1.a']],
113+
['.b', ['1.b']],
114+
['.c', ['1.c']],
115+
['.d', ['1.d']],
116+
['.e', ['1.e']],
117+
['.f', ['1.f']],
118+
['.g', ['1.g']],
119+
['.h', ['1.h']],
120+
]);
121+
const out = formatDropExtensionSummary(buckets);
122+
expect(out.endsWith('; +2 more extension(s)')).toBe(true);
123+
// First 6 extensions are present, the last 2 (.g, .h) are not.
124+
expect(out).toContain('.a (1: 1.a)');
125+
expect(out).toContain('.f (1: 1.f)');
126+
expect(out).not.toContain('.g (');
127+
expect(out).not.toContain('.h (');
128+
});
129+
130+
it('sorts by descending file count so the loudest offender is first', () => {
131+
const buckets = new Map<string, string[]>([
132+
['.small', ['x']],
133+
['.huge', ['a', 'b', 'c', 'd']],
134+
['.medium', ['m', 'n']],
135+
]);
136+
const out = formatDropExtensionSummary(buckets);
137+
const positions = ['.huge', '.medium', '.small'].map((ext) => out.indexOf(ext));
138+
expect(positions[0]).toBeLessThan(positions[1]);
139+
expect(positions[1]).toBeLessThan(positions[2]);
140+
});
141+
});
142+
143+
/**
144+
* Drift guard for `NATIVE_SUPPORTED_EXTENSIONS`.
145+
*
146+
* Greptile flagged that this set is keyed to one snapshot of
147+
* `LanguageKind::from_extension` in the Rust addon, and silent drift between
148+
* the JS and Rust sides would mis-classify drops (real native failures shown
149+
* as info, parser-limit gaps shown as warn). The native addon doesn't expose
150+
* its own metadata, so we parse the Rust source instead and assert the two
151+
* lists agree at build time. If `parser_registry.rs` is ever refactored, this
152+
* test fails loudly so the maintainer notices.
153+
*/
154+
describe('NATIVE_SUPPORTED_EXTENSIONS drift guard', () => {
155+
it('matches the extension set in crates/codegraph-core/src/parser_registry.rs', () => {
156+
const registryPath = path.join(
157+
REPO_ROOT,
158+
'crates',
159+
'codegraph-core',
160+
'src',
161+
'parser_registry.rs',
162+
);
163+
const src = fs.readFileSync(registryPath, 'utf8');
164+
const fromExtStart = src.indexOf('pub fn from_extension');
165+
expect(fromExtStart, 'from_extension not found in parser_registry.rs').toBeGreaterThan(-1);
166+
// Slice from `pub fn from_extension` to the next `pub fn` (boundary of
167+
// the next method) so we don't accidentally pick up extensions from
168+
// unrelated functions like `from_lang_id` (which contains lang_id
169+
// strings that look extension-like, e.g. "javascript", "python").
170+
const tail = src.slice(fromExtStart);
171+
const nextFnRel = tail.slice(1).search(/\n\s*\/\/\/|\n\s*pub fn /);
172+
const body = nextFnRel === -1 ? tail : tail.slice(0, nextFnRel + 1);
173+
const rustExts = new Set<string>();
174+
// Match string literals like "ts", "py", "tsx", "d.ts" etc.
175+
for (const m of body.matchAll(/"([A-Za-z0-9.]+)"/g)) {
176+
rustExts.add(m[1]);
177+
}
178+
// Normalize Rust forms to the JS `.ext` form. The function mixes:
179+
// - prefix branches with leading dot: ".tsx", ".d.ts"
180+
// - `match ext` arms without dot: "ts", "py", "rb", ...
181+
// `.d.ts` is a TypeScript declaration file alias mapped to TypeScript;
182+
// JS treats those files via `.ts` so the alias is not in the JS set.
183+
const normalized = new Set<string>();
184+
for (const e of rustExts) {
185+
// `.d.ts` declaration files are mapped to TypeScript via a special
186+
// prefix branch — JS handles those via the `.ts` entry, so skip the
187+
// alias in either matched form.
188+
if (e === 'd.ts' || e === '.d.ts') continue;
189+
const withDot = e.startsWith('.') ? e : `.${e}`;
190+
normalized.add(withDot.toLowerCase());
191+
}
192+
const jsExts = new Set(NATIVE_SUPPORTED_EXTENSIONS);
193+
const onlyInRust = [...normalized].filter((e) => !jsExts.has(e));
194+
const onlyInJs = [...jsExts].filter((e) => !normalized.has(e));
195+
expect(
196+
onlyInRust,
197+
`Extensions in parser_registry.rs but missing from NATIVE_SUPPORTED_EXTENSIONS: ${onlyInRust.join(', ')}`,
198+
).toEqual([]);
199+
expect(
200+
onlyInJs,
201+
`Extensions in NATIVE_SUPPORTED_EXTENSIONS but missing from parser_registry.rs: ${onlyInJs.join(', ')}`,
202+
).toEqual([]);
203+
});
204+
});

0 commit comments

Comments
 (0)