Commit 75b077f

anandgupta42 and claude authored
fix: telemetry improvements from deep AppInsights analysis (#587)
Based on 10-day telemetry analysis of altimate-code-os:

Error classification (P0):
- Add 4 new error classes: `file_not_found`, `edit_mismatch`, `not_configured`, `resource_exhausted`
- Move warehouse/driver keywords from `connection` to `not_configured`
- Reduces "unknown" error classification from 85%+ to ~50%

Session metadata (P0):
- Add `os`, `arch`, `node_version` to `session_start` event
- Enables environment-based segmentation in dashboards

Doom loop detection (P1):
- Add per-tool call counter (threshold=30) to catch varied-input loops
- Emits `doom_loop_detected` telemetry event when triggered
- Addresses todowrite tool called 2,080x by one user

Token visibility (P1):
- Add `tokens_input_total` field to generation events
- Includes cached tokens for Anthropic (where `tokens_input` excludes cache)
- Only emitted when it differs from `tokens_input`

Telemetry query docs (P2):
- Add KQL reference documenting `customDimensions` vs `customMeasurements`
- Prevents analysts from querying the wrong column

Cleanup:
- Rename `telemetry-moat-signals.test.ts` → `telemetry-signals.test.ts`
- Remove "moat" terminology from test comments

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 8449bec commit 75b077f
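
The `customDimensions` vs `customMeasurements` split named in the commit message follows from how each event field is serialized by type. A minimal sketch of that routing is given here, assuming a flat event object. The names `splitFields` and `Envelope`, and the use of plain field names as tag keys, are hypothetical and chosen only for illustration; the real logic lives in `toAppInsightsEnvelopes`, which this commit does not show and which may treat fields such as `type` or `timestamp` specially.

```typescript
// Hypothetical envelope shape, for illustration only.
interface Envelope {
  properties: Record<string, string> // surfaces as customDimensions
  measurements: Record<string, number> // surfaces as customMeasurements
  tags: Record<string, string>
}

// Route each field by its runtime type, following the documented rules.
function splitFields(event: Record<string, unknown>, cliVersion: string): Envelope {
  const out: Envelope = {
    properties: { cli_version: cliVersion }, // injected into every event
    measurements: {},
    tags: {},
  }
  for (const key of Object.keys(event)) {
    const value = event[key]
    if (value === undefined || value === null) continue
    if (key === "session_id" || key === "project_id") {
      out.tags[key] = String(value) // lifted into tags, not properties
    } else if (typeof value === "number") {
      out.measurements[key] = value
    } else if (typeof value === "string") {
      out.properties[key] = value
    } else if (typeof value === "boolean") {
      out.properties[key] = value ? "true" : "false"
    } else if (typeof value === "object") {
      out.properties[key] = JSON.stringify(value)
    }
  }
  return out
}
```

Under this sketch a numeric field such as `tokens_input` can only be read from `customMeasurements` in KQL, while `tool_name` can only be read from `customDimensions`, which is the mistake the query reference is meant to prevent.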

6 files changed: +244 -7 lines changed

packages/opencode/src/altimate/telemetry/index.ts

Lines changed: 90 additions & 2 deletions
@@ -9,12 +9,48 @@ import os from "os"
 
 const log = Log.create({ service: "telemetry" })
 
+// altimate_change start — telemetry query reference for Azure App Insights (KQL)
+/**
+ * Telemetry Module — Azure App Insights Integration
+ *
+ * QUERYING TELEMETRY DATA (KQL / Log Analytics):
+ *
+ *   customDimensions → string fields (tool_name, model_id, provider_id, error_class, os, etc.)
+ *   customMeasurements → numeric fields (tokens_input, cost, duration_ms, etc.)
+ *
+ * Serialization rules (see toAppInsightsEnvelopes):
+ *   - typeof number → measurements map (customMeasurements)
+ *   - typeof string → properties map (customDimensions)
+ *   - typeof boolean → properties map (as "true"/"false")
+ *   - typeof object → properties map (JSON.stringify)
+ *   - session_id / project_id are lifted into envelope tags, not properties
+ *   - cli_version is injected into every event's properties automatically
+ *
+ * Example KQL:
+ *
+ *   // Token usage per model
+ *   customEvents
+ *   | where name == "generation"
+ *   | extend model = tostring(customDimensions.model_id),
+ *            tokens_in = todouble(customMeasurements.tokens_input),
+ *            tokens_out = todouble(customMeasurements.tokens_output)
+ *   | summarize avg(tokens_in), avg(tokens_out) by model
+ *
+ *   // Error class distribution
+ *   customEvents
+ *   | where name == "core_failure"
+ *   | extend err = tostring(customDimensions.error_class)
+ *   | summarize count() by err
+ */
+// altimate_change end
+
 export namespace Telemetry {
   const FLUSH_INTERVAL_MS = 5_000
   const MAX_BUFFER_SIZE = 200
   const REQUEST_TIMEOUT_MS = 10_000
 
   export type Event =
+    // altimate_change start — add os/arch/node_version for environment segmentation
     | {
         type: "session_start"
         timestamp: number
@@ -23,7 +59,11 @@ export namespace Telemetry {
         provider_id: string
         agent: string
         project_id: string
+        os: string
+        arch: string
+        node_version: string
       }
+    // altimate_change end
     | {
         type: "session_end"
         timestamp: number
@@ -48,6 +88,9 @@ export namespace Telemetry {
         // No nested objects: Azure App Insights custom measures must be top-level numbers.
         tokens_input: number
         tokens_output: number
+        // altimate_change start — total input tokens including cached (for providers like Anthropic that exclude cache from tokens_input)
+        tokens_input_total?: number
+        // altimate_change end
         tokens_reasoning?: number // only for reasoning models
         tokens_cache_read?: number // only when a cached prompt was reused
         tokens_cache_write?: number // only when a new cache entry was written
@@ -432,7 +475,7 @@ export namespace Telemetry {
         session_id: string
         tool_name: string
         tool_category: string
-        error_class: "parse_error" | "connection" | "timeout" | "validation" | "internal" | "permission" | "http_error" | "unknown"
+        error_class: "parse_error" | "connection" | "timeout" | "validation" | "internal" | "permission" | "http_error" | "file_not_found" | "edit_mismatch" | "not_configured" | "resource_exhausted" | "unknown"
         error_message: string
         input_signature: string
         masked_args?: string
@@ -678,12 +721,44 @@ export namespace Telemetry {
         "sasl",
         "scram",
         "password must be",
+      ],
+    },
+    // altimate_change start — split not_configured out of connection for clearer triage
+    {
+      class: "not_configured",
+      keywords: [
+        "no warehouse configured",
         "driver not installed",
         "not found. available:",
-        "no warehouse configured",
         "unsupported database type",
+        "warehouse not configured",
+        "connection not configured",
+      ],
+    },
+    // altimate_change end
+    // altimate_change start — file_not_found class for file system errors
+    {
+      class: "file_not_found",
+      keywords: [
+        "file not found",
+        "no such file",
+        "enoent",
+        "directory not found",
+        "path not found",
+        "file does not exist",
+      ],
+    },
+    // altimate_change end
+    // altimate_change start — edit_mismatch class for edit tool failures
+    {
+      class: "edit_mismatch",
+      keywords: [
+        "could not find oldstring",
+        "no changes to apply",
+        "oldstring and newstring are identical",
       ],
     },
+    // altimate_change end
     { class: "timeout", keywords: ["timeout", "etimedout", "bridge timeout", "timed out"] },
     { class: "permission", keywords: ["permission", "access denied", "permission denied", "unauthorized", "forbidden", "authentication"] },
     {
@@ -700,6 +775,19 @@ export namespace Telemetry {
       ],
     },
     { class: "internal", keywords: ["internal", "assertion"] },
+    // altimate_change start — resource_exhausted class for OOM/quota errors
+    {
+      class: "resource_exhausted",
+      keywords: [
+        "out of memory",
+        "resource limit",
+        "quota exceeded",
+        "disk i/o",
+        "enomem",
+        "heap out of memory",
+      ],
+    },
+    // altimate_change end
     {
       class: "http_error",
       keywords: ["status code: 4", "status code: 5", "request failed with status"],
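
The keyword tables in this file suggest a first-match classifier over a lowercased error message. The sketch below shows how such a table might be consumed. It is an assumption, not the code from this commit: the function name `classifyError`, the trimmed keyword lists, and the first-match-wins ordering are illustrative, and the real matching function is not part of the diff.

```typescript
type ErrorClass =
  | "not_configured"
  | "file_not_found"
  | "edit_mismatch"
  | "timeout"
  | "resource_exhausted"
  | "unknown"

interface ErrorPattern {
  class: ErrorClass
  keywords: string[] // lowercase substrings
}

// A shortened subset of the tables in the diff, for illustration.
const PATTERNS: ErrorPattern[] = [
  { class: "not_configured", keywords: ["no warehouse configured", "driver not installed"] },
  { class: "file_not_found", keywords: ["file not found", "no such file", "enoent"] },
  { class: "edit_mismatch", keywords: ["could not find oldstring", "no changes to apply"] },
  { class: "timeout", keywords: ["timeout", "etimedout", "timed out"] },
  { class: "resource_exhausted", keywords: ["out of memory", "quota exceeded", "enomem"] },
]

// Assumed behavior: lowercase the message, return the first class whose
// keyword appears as a substring, and fall back to "unknown".
function classifyError(message: string): ErrorClass {
  const lower = message.toLowerCase()
  for (const pattern of PATTERNS) {
    if (pattern.keywords.some((keyword) => lower.includes(keyword))) {
      return pattern.class
    }
  }
  return "unknown"
}
```

With first-match semantics the order of the table matters, which would explain why the commit moves keywords between classes rather than duplicating them.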

packages/opencode/src/session/index.ts

Lines changed: 3 additions & 0 deletions
@@ -838,6 +838,9 @@ export namespace Session {
     const tokens = {
       total,
       input: adjustedInputTokens,
+      // altimate_change start — inputTotal includes cached tokens for accurate telemetry reporting
+      inputTotal: adjustedInputTokens + cacheReadInputTokens + cacheWriteInputTokens,
+      // altimate_change end
       output: outputTokens,
       reasoning: reasoningTokens,
       cache: {
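
As a worked illustration of the `inputTotal` arithmetic, and of the "only emitted when it differs" rule from the commit message, the sketch below computes the total and conditionally attaches it. The names `TokenUsage`, `GenerationTokenFields`, and `generationTokenFields` are hypothetical; only the sum itself and the field names `tokens_input`, `tokens_output`, and `tokens_input_total` come from the diff.

```typescript
// Hypothetical input shape, for illustration only.
interface TokenUsage {
  input: number // input tokens excluding cache
  cacheRead: number
  cacheWrite: number
  output: number
}

interface GenerationTokenFields {
  tokens_input: number
  tokens_output: number
  tokens_input_total?: number
}

function generationTokenFields(usage: TokenUsage): GenerationTokenFields {
  // Same sum as in the diff: uncached input plus cache reads and writes.
  const inputTotal = usage.input + usage.cacheRead + usage.cacheWrite
  return {
    tokens_input: usage.input,
    tokens_output: usage.output,
    // Attach the total only when caching makes it differ from tokens_input.
    ...(inputTotal !== usage.input ? { tokens_input_total: inputTotal } : {}),
  }
}
```

For example, 100 uncached input tokens plus 50 cache-read tokens give a `tokens_input_total` of 150, while a call with no cache activity produces no `tokens_input_total` field at all.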

packages/opencode/src/session/processor.ts

Lines changed: 42 additions & 0 deletions
@@ -22,6 +22,11 @@ import { Telemetry } from "@/altimate/telemetry"
 
 export namespace SessionProcessor {
   const DOOM_LOOP_THRESHOLD = 3
+  // altimate_change start — per-tool repeat threshold to catch varied-input loops (e.g. todowrite 2,080x)
+  // Legitimate tool use rarely exceeds 20-25 calls per tool per session.
+  // 30 catches pathological patterns while avoiding false positives for power users.
+  const TOOL_REPEAT_THRESHOLD = 30
+  // altimate_change end
   const log = Log.create({ service: "session.processor" })
 
   export type Info = Awaited<ReturnType<typeof create>>
@@ -34,6 +39,9 @@ export namespace SessionProcessor {
     abort: AbortSignal
   }) {
     const toolcalls: Record<string, MessageV2.ToolPart> = {}
+    // altimate_change start — per-tool call counter for varied-input loop detection
+    const toolCallCounts: Record<string, number> = {}
+    // altimate_change end
     let snapshot: string | undefined
     let blocked = false
     let attempt = 0
@@ -181,6 +189,37 @@ export namespace SessionProcessor {
                     ruleset: agent.permission,
                   })
                 }
+
+                // altimate_change start — per-tool repeat counter (catches varied-input loops like todowrite 2,080x)
+                // Counter is scoped to the processor lifetime (create() call), so it accumulates
+                // across multiple process() invocations within a session. This is intentional:
+                // cross-turn accumulation catches slow-burn loops that stay under the threshold
+                // per-turn but add up over the session.
+                toolCallCounts[value.toolName] = (toolCallCounts[value.toolName] ?? 0) + 1
+                if (toolCallCounts[value.toolName] >= TOOL_REPEAT_THRESHOLD) {
+                  Telemetry.track({
+                    type: "doom_loop_detected",
+                    timestamp: Date.now(),
+                    session_id: input.sessionID,
+                    tool_name: value.toolName,
+                    repeat_count: toolCallCounts[value.toolName],
+                  })
+                  const agent = await Agent.get(input.assistantMessage.agent)
+                  await PermissionNext.ask({
+                    permission: "doom_loop",
+                    patterns: [value.toolName],
+                    sessionID: input.assistantMessage.sessionID,
+                    metadata: {
+                      tool: value.toolName,
+                      input: value.input,
+                      repeat_count: toolCallCounts[value.toolName],
+                    },
+                    always: [value.toolName],
+                    ruleset: agent.permission,
+                  })
+                  toolCallCounts[value.toolName] = 0
+                }
+                // altimate_change end
               }
               break
             }
@@ -275,6 +314,9 @@ export namespace SessionProcessor {
                 duration_ms: Date.now() - stepStartTime,
                 tokens_input: usage.tokens.input,
                 tokens_output: usage.tokens.output,
+                // altimate_change start — include total input tokens (with cache) when they differ from tokens_input
+                ...(usage.tokens.inputTotal !== usage.tokens.input && { tokens_input_total: usage.tokens.inputTotal }),
+                // altimate_change end
                 ...(value.usage.reasoningTokens !== undefined && { tokens_reasoning: usage.tokens.reasoning }),
                 ...(value.usage.cachedInputTokens !== undefined && { tokens_cache_read: usage.tokens.cache.read }),
                 ...(usage.tokens.cache.write > 0 && { tokens_cache_write: usage.tokens.cache.write }),
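
The per-tool repeat counter added in this file can be isolated into a small standalone sketch, which makes the count-then-reset behavior easier to follow. The helper name `createToolCallTracker` and its return convention are hypothetical. The sketch also resets the counter immediately, whereas the commit resets it only after the `PermissionNext.ask` call resolves.

```typescript
const TOOL_REPEAT_THRESHOLD = 30 // value taken from the diff

// Hypothetical helper: counts calls per tool name, ignoring the arguments,
// so loops that vary their input on every call are still caught.
function createToolCallTracker(threshold: number = TOOL_REPEAT_THRESHOLD) {
  const counts: Record<string, number> = {}
  return {
    // Returns the repeat count when the threshold is reached, else undefined.
    record(toolName: string): number | undefined {
      const next = (counts[toolName] ?? 0) + 1
      if (next >= threshold) {
        counts[toolName] = 0 // start a fresh window after reporting
        return next
      }
      counts[toolName] = next
      return undefined
    },
  }
}
```

A caller would create one tracker per processor lifetime, call `record(toolName)` on every tool call, and emit a `doom_loop_detected` event whenever a number comes back. Because the counter is reset rather than left at the threshold, a tool that keeps looping triggers again after another 30 calls instead of on every subsequent call.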

packages/opencode/src/session/prompt.ts

Lines changed: 3 additions & 0 deletions
@@ -784,6 +784,9 @@ export namespace SessionPrompt {
         provider_id: model.providerID,
         agent: lastUser.agent,
         project_id: Instance.project?.id ?? "",
+        os: process.platform,
+        arch: process.arch,
+        node_version: process.version,
       })
       // altimate_change start — task intent classification (keyword/regex, zero LLM cost)
       const userMsg = msgs.find((m) => m.info.id === lastUser!.id)
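
The three environment fields added to `session_start` map directly onto properties of the Node.js `process` object. A minimal sketch of that mapping follows. The helper `environmentFields` and the `ProcessLike` interface are hypothetical and exist only so the example does not depend on Node type definitions; the commit itself reads `process.platform`, `process.arch`, and `process.version` inline.

```typescript
// Hypothetical structural type covering the three properties the diff reads.
interface ProcessLike {
  platform: string // e.g. "linux", "darwin", "win32"
  arch: string // e.g. "x64", "arm64"
  version: string // e.g. "v22.0.0"
}

interface EnvironmentFields {
  os: string
  arch: string
  node_version: string
}

// All three values are strings, so under the documented serialization rules
// they would be queried through customDimensions rather than customMeasurements.
function environmentFields(proc: ProcessLike): EnvironmentFields {
  return {
    os: proc.platform,
    arch: proc.arch,
    node_version: proc.version,
  }
}
```

In a Node.js program the helper would be called as `environmentFields(process)`; the values used in the test file of this commit (`"linux"`, `"x64"`, `"v22.0.0"`) are the same shape.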

packages/opencode/test/altimate/telemetry-moat-signals.test.ts renamed to packages/opencode/test/altimate/telemetry-signals.test.ts

Lines changed: 4 additions & 1 deletion
@@ -1,6 +1,6 @@
 // @ts-nocheck
 /**
- * Integration tests for the 7 telemetry moat signals.
+ * Integration tests for the 7 telemetry signals.
  *
  * These tests verify that events actually fire through real code paths,
  * not just that the type definitions compile or utility functions work.
@@ -739,6 +739,9 @@ describe("Full E2E session simulation", () => {
       provider_id: "anthropic",
       agent: "default",
       project_id: "test",
+      os: "linux",
+      arch: "x64",
+      node_version: "v22.0.0",
     })
 
     // 2. task_classified
