Commit 04b97d2

fix(telemetry): extend error classifier + stamp category/fatal on failed audits
B-007 addressed the two root causes that made the admin "Top error classes" panel useless for triage:

1) `classifyError()` had no pattern for `ERR_INVALID_ARG_TYPE` / `fileURLToPath(undefined)`, so every B-006 failure collapsed into `unknown`. All 16 failed audits in the last 30 days on prod landed in the same opaque bucket, impossible to distinguish from unrelated regressions.
2) `audit_complete` with `outcome="failed"` never set `category` or `fatal`, so failed audits did not hit the (category, error_class) composite index on the backend. They showed up with NULL category on the dashboard.

Vocabulary changes (src/telemetry.ts):
- Add specific node codes: `node_invalid_arg`, `module_not_found`, `spawn_error`, `out_of_memory`. These match BEFORE the generic fallbacks below so B-006-class failures keep their triage signal.
- Add generic JS kinds as last resort before `unknown`: `type_error`, `reference_error`. They are dispatched via `err.name` rather than message text, so a bare `TypeError: x is not a function` at least lands in a non-empty bucket.
- Drop the early `msg.includes("enoent")` → `transcript_not_found` shortcut; a bare ENOENT is a generic missing-file hit, not a transcript issue. The literal `transcript not found` stays as the transcript-specific matcher. Generic ENOENT still falls through to `transcript_not_found`, but only after `spawn ENOENT` has been checked.
- Network: `econnreset` added alongside `econnrefused`.
- Doc-comment the load-bearing match order.

audit_complete event (src/session-cleanup.ts):
- When `outcome === "failed"`, stamp `category: "audit"` and `fatal: false`. Audit failures are non-fatal: the session still closes and the user's work is unaffected; only background extraction is lost until the next attempt. Setting these fields lets the existing backend index surface failed audits in the same panel as other categorized errors.

Tests (test/telemetry.test.ts):
- B-006 reproducer: exact TypeError message from the audit-worker-logs → `node_invalid_arg`.
- Order guard: the same message through the `TypeError` path must NOT degrade into the generic `type_error` fallback.
- One test per new class (module_not_found, spawn_error, out_of_memory, type_error, reference_error fallback).

Verified:
- 481/481 unit tests pass (6 new cases; full suite was 478 before)
- `tsc --noEmit` clean
- `npm run build` clean
- `grep` in `dist/cli.mjs` shows all 6 new slugs present in the bundle

Follow-up for v0.2.8 release: re-query

  SELECT error_class, COUNT(*)
  FROM telemetry_events
  WHERE event='audit_complete' AND outcome='failed'
  GROUP BY error_class
  ORDER BY COUNT(*) DESC

`unknown` should drop from 100% to a small minority; `node_invalid_arg` should be 0 on v0.2.8 (B-006 fixed in PR #105).

Fixes B-007.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
#!axme pr=none repo=AxmeAI/axme-code
1 parent e05de98 commit 04b97d2

3 files changed

Lines changed: 141 additions & 3 deletions

src/session-cleanup.ts

Lines changed: 8 additions & 0 deletions
@@ -817,6 +817,13 @@ export async function runSessionCleanup(
   try {
     const { sendTelemetryBlocking } = await import("./telemetry.js");
     const outcome = result.auditRan ? "success" : "failed";
+    // When the audit failed, also stamp the category + fatal fields so the
+    // event lands in the (category, error_class) index on the backend. Audit
+    // failures are non-fatal: the session still closes normally, user work
+    // continues — only the background extraction is lost until next attempt.
+    // Without these fields, failed audits collapse into the NULL bucket on
+    // the admin dashboard and make triage useless (B-007).
+    const isFailed = outcome === "failed";
     await sendTelemetryBlocking("audit_complete", {
       outcome,
       duration_ms: auditStartMs > 0 ? Date.now() - auditStartMs : 0,
@@ -828,6 +835,7 @@ export async function runSessionCleanup(
       safety_saved: result.safetyRules,
       dropped_count: auditDroppedCount,
       error_class: auditErrorClass,
+      ...(isFailed ? { category: "audit" as const, fatal: false } : {}),
     });
   } catch { /* never throw from telemetry */ }
 }
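
The conditional spread in this hunk adds `category` and `fatal` only when the audit failed and leaves successful events without those keys. A minimal standalone sketch of that pattern follows; `AuditPayload` and `buildAuditPayload` are hypothetical names used for illustration and are not identifiers from the repository.

```typescript
// Hypothetical payload shape, reduced to the fields relevant to this hunk.
type AuditPayload = {
  outcome: "success" | "failed";
  error_class?: string;
  category?: "audit";
  fatal?: boolean;
};

// Hypothetical helper: stamps category/fatal only on failed audits.
function buildAuditPayload(auditRan: boolean, errorClass?: string): AuditPayload {
  const outcome = auditRan ? "success" : "failed";
  const isFailed = outcome === "failed";
  return {
    outcome,
    error_class: errorClass,
    // Spreading an empty object adds no keys, so successful events
    // carry no category/fatal properties at all.
    ...(isFailed ? { category: "audit" as const, fatal: false } : {}),
  };
}

const ok = buildAuditPayload(true);
const failed = buildAuditPayload(false, "node_invalid_arg");
console.log("category" in ok);              // false
console.log(failed.category, failed.fatal); // audit false
```

Omitting the keys on success, rather than sending `category: undefined`, keeps the successful event shape identical to what it was before the change.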

src/telemetry.ts

Lines changed: 68 additions & 3 deletions
@@ -71,7 +71,15 @@ export interface TelemetryCommonFields {
 
 export type TelemetryEvent = TelemetryCommonFields & Record<string, unknown>;
 
-/** Bounded vocabulary of error classes. Add new entries only when seen in the wild. */
+/**
+ * Bounded vocabulary of error classes. Add new entries only when seen in the wild.
+ *
+ * Backend schema: `error_class varchar(40)` — keep slugs short.
+ *
+ * Order of evaluation inside `classifyError` matters: specific node error codes
+ * (ERR_*) must be matched BEFORE the generic JS error kinds, or they'll collapse
+ * into `type_error` / `reference_error` and lose signal.
+ */
 export type ErrorClass =
   | "prompt_too_long"
   | "api_error"
@@ -84,6 +92,15 @@ export type ErrorClass =
   | "permission_denied"
   | "disk_full"
   | "config_invalid"
+  // Node-specific error codes (ERR_*). Match before the generic JS kinds below.
+  | "node_invalid_arg" // ERR_INVALID_ARG_TYPE / fileURLToPath(undefined) — B-006
+  | "module_not_found" // ERR_MODULE_NOT_FOUND / Cannot find module
+  | "spawn_error" // spawn ENOENT / subprocess failed to start
+  | "out_of_memory" // ENOMEM / JavaScript heap out of memory
+  // Generic JS error kinds. Last-resort before "unknown" so a bare TypeError
+  // at least lands in a non-empty bucket for triage.
+  | "type_error"
+  | "reference_error"
   | "unknown";
 
 // --- Process state ---
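
The doc comment in this hunk notes that the backend column is `error_class varchar(40)`. One way to keep that constraint from drifting is a small length check, sketched below. `MAX_SLUG_LENGTH`, `KNOWN_SLUGS`, and `oversizedSlugs` are hypothetical names, and the list contains only the slugs visible in this diff, not the full `ErrorClass` union.

```typescript
// Limit taken from the doc comment in the diff: `error_class varchar(40)`.
const MAX_SLUG_LENGTH = 40;

// Hypothetical list: only the slugs visible in this diff, not the full union.
const KNOWN_SLUGS = [
  "prompt_too_long",
  "api_error",
  "permission_denied",
  "disk_full",
  "config_invalid",
  "node_invalid_arg",
  "module_not_found",
  "spawn_error",
  "out_of_memory",
  "type_error",
  "reference_error",
  "unknown",
] as const;

// Returns the slugs that would not fit in a column of the given width.
function oversizedSlugs(slugs: readonly string[], limit: number): string[] {
  return slugs.filter((slug) => slug.length > limit);
}

console.log(oversizedSlugs(KNOWN_SLUGS, MAX_SLUG_LENGTH)); // []
```

A check like this could sit in the unit tests next to the `classifyError` cases, so that adding an overly long slug fails locally rather than at insert time.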
@@ -399,19 +416,67 @@ export async function sendStartupEvents(): Promise<void> {
 /**
  * Map a caught exception to a bounded ErrorClass slug.
  * Never sends raw exception messages to telemetry — only the slug.
+ *
+ * Match order is load-bearing: specific node error codes must be checked
+ * before generic JS error kinds, and domain-specific substrings (transcript
+ * not found, prompt too long) before broad fallbacks (spawn_error).
  */
 export function classifyError(err: unknown): ErrorClass {
   const msg = err instanceof Error ? err.message.toLowerCase() : String(err).toLowerCase();
+  // Include the Error subtype name so we can catch TypeError / ReferenceError
+  // even when the message text is too bland to identify on its own.
+  const name = err instanceof Error ? err.name.toLowerCase() : "";
+
+  // Domain-specific signals (our code or LLM provider) — check first.
   if (msg.includes("prompt is too long") || msg.includes("max tokens") || msg.includes("context length")) return "prompt_too_long";
   if (msg.includes("rate limit") || msg.includes("429")) return "api_rate_limit";
   if (msg.includes("authentication") || msg.includes("api key") || msg.includes("apikey") || msg.includes("authtoken")) return "oauth_missing";
   if (msg.includes("timeout") || msg.includes("timed out") || msg.includes("aborted")) return "timeout";
-  if (msg.includes("enoent") || msg.includes("transcript not found")) return "transcript_not_found";
+  if (msg.includes("transcript not found")) return "transcript_not_found";
+
+  // Node-specific error codes. Check these BEFORE the generic TypeError /
+  // ReferenceError / ENOENT fallbacks so ERR_INVALID_ARG_TYPE (B-006),
+  // ERR_MODULE_NOT_FOUND, and spawn ENOENT don't collapse into the generic
+  // bucket and lose their triage signal.
+  if (msg.includes("err_invalid_arg_type") || msg.includes("fileurltopath") ||
+      (msg.includes("argument must be of type") && msg.includes("received undefined"))) {
+    return "node_invalid_arg";
+  }
+  if (msg.includes("err_module_not_found") || msg.includes("cannot find module") ||
+      msg.includes("cannot find package")) {
+    return "module_not_found";
+  }
+  if (msg.includes("spawn enoent") || msg.includes("spawn eacces") ||
+      msg.includes("child_process") && msg.includes("enoent")) {
+    return "spawn_error";
+  }
+  if (msg.includes("enomem") || msg.includes("heap out of memory") ||
+      msg.includes("allocation failed") || msg.includes("out of memory")) {
+    return "out_of_memory";
+  }
+
+  // Filesystem / OS errors. ENOENT here (after transcript_not_found / spawn
+  // ENOENT) is a generic missing-file hit.
+  if (msg.includes("enoent")) return "transcript_not_found";
   if (msg.includes("eacces") || msg.includes("permission denied")) return "permission_denied";
   if (msg.includes("enospc") || msg.includes("no space")) return "disk_full";
-  if (msg.includes("network") || msg.includes("econnrefused") || msg.includes("fetch failed") || msg.includes("dns")) return "network_error";
+
+  // Network.
+  if (msg.includes("network") || msg.includes("econnrefused") || msg.includes("econnreset") ||
+      msg.includes("fetch failed") || msg.includes("dns")) return "network_error";
+
+  // Parsing.
   if (msg.includes("unexpected token") || msg.includes("invalid json") || msg.includes("parse")) return "parse_error";
+
+  // Remote API. Keep after the specific 429 (rate_limit) check above.
   if (msg.includes("api error") || msg.includes("500") || msg.includes("503")) return "api_error";
+
+  // Generic JS error kinds. Last resort before "unknown" so a bare TypeError
+  // at least lands in a non-empty bucket and we can distinguish bundler bugs
+  // (ReferenceError is almost always a missing import) from shape mismatches.
+  if (name === "referenceerror") return "reference_error";
+  if (name === "typeerror") return "type_error";
+
   return "unknown";
 }
 
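
`classifyError` as shown matches on message text and `err.name` only. Many Node errors also carry a machine-readable `code` property (for example `ERR_INVALID_ARG_TYPE` or `ENOMEM`), which is less sensitive to wording changes than substrings. The sketch below shows how a code lookup could run ahead of the message checks. `classifyByCode`, `CodeSlug`, and the mapping table are hypothetical and not part of this commit.

```typescript
// Hypothetical subset of slugs that can be derived from `err.code` alone.
type CodeSlug = "node_invalid_arg" | "module_not_found" | "out_of_memory";

// Hypothetical mapping from Node error codes to telemetry slugs.
const CODE_TO_SLUG = new Map<string, CodeSlug>([
  ["ERR_INVALID_ARG_TYPE", "node_invalid_arg"],
  ["ERR_MODULE_NOT_FOUND", "module_not_found"], // ESM resolution failure
  ["MODULE_NOT_FOUND", "module_not_found"],     // CommonJS require() failure
  ["ENOMEM", "out_of_memory"],
]);

// Returns a slug when the error carries a recognized string `code`,
// otherwise undefined so the caller can fall back to message matching.
function classifyByCode(err: unknown): CodeSlug | undefined {
  if (typeof err !== "object" || err === null) return undefined;
  const code = (err as { code?: unknown }).code;
  if (typeof code !== "string") return undefined;
  return CODE_TO_SLUG.get(code);
}

const coded = Object.assign(new TypeError("bland message"), {
  code: "ERR_INVALID_ARG_TYPE",
});
console.log(classifyByCode(coded));                // node_invalid_arg
console.log(classifyByCode(new Error("no code"))); // undefined
```

Under this sketch the message-based rules stay in place as the fallback, so errors that lose their `code` (for example after being re-wrapped) are still classified by text.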

test/telemetry.test.ts

Lines changed: 65 additions & 0 deletions
@@ -220,6 +220,71 @@ describe("classifyError", () => {
     assert.equal(classifyError(new Error("Invalid JSON output")), "parse_error");
   });
 
+  it("classifies node ERR_INVALID_ARG_TYPE / fileURLToPath(undefined) — B-006", () => {
+    // Actual B-006 message from audit-worker-logs:
+    const b006 = new TypeError(
+      'The "path" argument must be of type string or an instance of URL. Received undefined',
+    );
+    (b006 as any).code = "ERR_INVALID_ARG_TYPE";
+    assert.equal(classifyError(b006), "node_invalid_arg");
+    // ERR_ code in the message also matches:
+    assert.equal(
+      classifyError(new Error("ERR_INVALID_ARG_TYPE: path must be string")),
+      "node_invalid_arg",
+    );
+    // fileURLToPath specifically:
+    assert.equal(
+      classifyError(new Error("fileURLToPath received undefined")),
+      "node_invalid_arg",
+    );
+  });
+
+  it("classifies module-not-found errors", () => {
+    assert.equal(
+      classifyError(new Error("Cannot find module '@anthropic-ai/claude-agent-sdk'")),
+      "module_not_found",
+    );
+    assert.equal(
+      classifyError(new Error("ERR_MODULE_NOT_FOUND")),
+      "module_not_found",
+    );
+    assert.equal(
+      classifyError(new Error("Cannot find package 'foo' imported from bar")),
+      "module_not_found",
+    );
+  });
+
+  it("classifies subprocess spawn errors", () => {
+    assert.equal(classifyError(new Error("spawn ENOENT")), "spawn_error");
+    assert.equal(classifyError(new Error("spawn EACCES")), "spawn_error");
+  });
+
+  it("classifies out-of-memory errors", () => {
+    assert.equal(
+      classifyError(new Error("JavaScript heap out of memory")),
+      "out_of_memory",
+    );
+    assert.equal(classifyError(new Error("ENOMEM")), "out_of_memory");
+    assert.equal(classifyError(new Error("allocation failed")), "out_of_memory");
+  });
+
+  it("classifies bare TypeError / ReferenceError by name (last-resort fallback)", () => {
+    // A TypeError whose message matches no specific rule should still land in
+    // type_error (not unknown), so a bundler shape bug is distinguishable from
+    // a fully opaque error on the dashboard.
+    assert.equal(classifyError(new TypeError("x is not a function")), "type_error");
+    assert.equal(classifyError(new ReferenceError("foo is not defined")), "reference_error");
+  });
+
+  it("ERR_INVALID_ARG_TYPE beats the generic type_error fallback (order matters)", () => {
+    const err = new TypeError(
+      'The "path" argument must be of type string or an instance of URL. Received undefined',
+    );
+    // Must NOT degrade to type_error — the specific Node code gives us B-006
+    // triage signal that bare type_error does not.
+    assert.equal(classifyError(err), "node_invalid_arg");
+  });
+
   it("returns 'unknown' for unrecognized errors", () => {
     assert.equal(classifyError(new Error("something completely random")), "unknown");
     assert.equal(classifyError("string error"), "unknown");
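
The spawn tests in this hunk use the messages `spawn ENOENT` and `spawn EACCES`. Node typically formats a failed spawn as `spawn <command> ENOENT`, with the command name between the two words. The substring `spawn enoent` would not match that shape, so such an error would reach the generic ENOENT rule instead. A hedged sketch of a more tolerant matcher follows; `SPAWN_FAILURE` and `isSpawnFailure` are hypothetical and not part of this commit.

```typescript
// Hypothetical matcher: the word "spawn", any text (such as the command
// name), then ENOENT or EACCES as a whole word. Case-insensitive.
const SPAWN_FAILURE = /\bspawn\b.*\b(enoent|eacces)\b/i;

function isSpawnFailure(message: string): boolean {
  return SPAWN_FAILURE.test(message);
}

console.log(isSpawnFailure("spawn ENOENT"));     // true
console.log(isSpawnFailure("spawn git ENOENT")); // true
console.log(isSpawnFailure("ENOENT: no such file or directory, open 'x'")); // false
```

If a matcher along these lines were adopted, a test case with a command name in the message (for example `spawn git ENOENT`) would guard the behavior.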
