
feat(telemetry): integrate telemetry tracking across interactive and …#1798

Open
RealKai42 wants to merge 3 commits into main from kaiyi/lyon

Conversation


@RealKai42 RealKai42 commented Apr 8, 2026

Copilot AI review requested due to automatic review settings April 8, 2026 09:57

Copilot AI left a comment


Pull request overview

This PR introduces an opt-out telemetry system and wires event tracking through key CLI/UI, soul, background-task, hooks, and auth flows to enable anonymous usage analytics and reliability diagnostics.

Changes:

  • Added telemetry core: module-level track() buffering, an EventSink for enrichment/batching/periodic flush, and an AsyncTransport with HTTP sending + disk fallback + startup retry.
  • Instrumented many user interactions and runtime events (slash commands, shortcuts, cancel/interrupt, OAuth refresh, hooks, tool approval/errors, background tasks, MCP connection, etc.).
  • Added config flag (telemetry: bool) and a new test suite covering telemetry behavior and (nominal) instrumentation expectations.
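The telemetry core described above (module-level `track()` with pre-sink buffering, a sink attached later for enrichment and flushing) could be sketched roughly as follows. The public names `track`, `set_context`, and `attach_sink` come from the PR's file summary; everything else here is an assumption for illustration, and the real module is more involved:

```python
import threading

# Illustrative sketch only; internals of src/kimi_cli/telemetry/__init__.py may differ.
_buffer: list[dict] = []
_sink = None
_context: dict = {}
_lock = threading.Lock()


def set_context(**props) -> None:
    """Attach shared properties (device_id, session_id, ...) to all future events."""
    _context.update(props)


def track(event: str, **props) -> None:
    """Record an event; buffered in memory until a sink is attached."""
    payload = {"event": event, **_context, **props}
    with _lock:
        if _sink is None:
            _buffer.append(payload)  # pre-sink buffering
        else:
            _sink.accept(payload)


def attach_sink(sink) -> None:
    """Attach the EventSink and drain anything buffered before startup finished."""
    global _sink
    with _lock:
        _sink = sink
        for payload in _buffer:
            sink.accept(payload)
        _buffer.clear()
```

The point of the buffering is that call sites can emit events during early startup, before config is read and the sink exists, without losing them.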

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `tests/telemetry/test_telemetry.py` | Adds unit tests for telemetry queueing, sink enrichment/flush, transport disk fallback/retry behaviors. |
| `tests/telemetry/test_instrumentation.py` | Adds tests intended to validate emitted event names/properties for production instrumentation paths. |
| `tests/telemetry/__init__.py` | Introduces telemetry test package marker. |
| `tests/core/test_config.py` | Updates default config dump expectations to include `telemetry: True`. |
| `src/kimi_cli/web/runner/worker.py` | Passes `ui_mode="wire"` into `KimiCLI.create()` for telemetry context in web worker runs. |
| `src/kimi_cli/ui/shell/visualize/_live_view.py` | Tracks expand shortcut, question answered/dismissed methods, and run cancel events. |
| `src/kimi_cli/ui/shell/visualize/_interactive.py` | Tracks input queueing and immediate steer actions. |
| `src/kimi_cli/ui/shell/slash.py` | Tracks various slash-command actions (model/thinking/theme/session ops/web/vis/undo/fork/etc.). |
| `src/kimi_cli/ui/shell/prompt.py` | Tracks prompt shortcuts (mode switch, plan toggle, newline, editor, paste). |
| `src/kimi_cli/ui/shell/oauth.py` | Tracks login/logout events. |
| `src/kimi_cli/ui/shell/export_import.py` | Tracks export/import actions. |
| `src/kimi_cli/ui/shell/__init__.py` | Starts/stops periodic telemetry flush + retry on startup; tracks input classifications, command invalid, exit duration, interrupt step, update prompt. |
| `src/kimi_cli/telemetry/transport.py` | Implements async HTTP transport with token/anonymous retry logic plus JSONL disk fallback and retry-on-start. |
| `src/kimi_cli/telemetry/sink.py` | Implements buffering, context enrichment, periodic flushing, and sync flush-to-disk for exit. |
| `src/kimi_cli/telemetry/__init__.py` | Implements public telemetry API (`track`, `set_context`, `attach_sink`, `disable`, `flush_sync`) with pre-sink buffering and atexit flush. |
| `src/kimi_cli/subagents/runner.py` | Tracks subagent creation. |
| `src/kimi_cli/soul/toolset.py` | Tracks tool execution errors with tool name. |
| `src/kimi_cli/soul/slash.py` | Tracks init completion and YOLO toggle state changes. |
| `src/kimi_cli/soul/kimisoul.py` | Tracks skill/flow invocation, MCP connection outcomes, and API error classification. |
| `src/kimi_cli/soul/approval.py` | Tracks tool approval/rejection outcomes (including cancel path). |
| `src/kimi_cli/hooks/engine.py` | Tracks hook trigger outcomes (allow/block) with event type. |
| `src/kimi_cli/config.py` | Adds telemetry boolean config option (default enabled). |
| `src/kimi_cli/cli/__init__.py` | Passes `ui_mode` down to `KimiCLI.create()` for consistent telemetry context. |
| `src/kimi_cli/background/manager.py` | Tracks background task creation and completion outcomes with duration/success. |
| `src/kimi_cli/auth/oauth.py` | Tracks first-launch device ID creation and OAuth refresh success/failure; adds cached token accessor for transport. |
| `src/kimi_cli/app.py` | Initializes telemetry based on config/env, attaches sink with context, and tracks startup events/perf. |
| `src/kimi_cli/acp/server.py` | Passes `ui_mode="acp"` into `KimiCLI.create()` for telemetry context in ACP server mode. |


Comment on lines +84 to +95
```python
    if retry_resp.status >= 500:
        raise _TransientError(f"HTTP {retry_resp.status}")
    elif retry_resp.status >= 400:
        # Client error (4xx) — not recoverable, don't retry
        logger.debug(
            "Anonymous retry got client error HTTP {status}, dropping",
            status=retry_resp.status,
        )
        return
    return
elif resp.status >= 400:
    raise _TransientError(f"HTTP {resp.status}")
```

Copilot AI Apr 8, 2026


AsyncTransport._send_http currently raises _TransientError for any HTTP status >= 400 (except the anonymous-retry 4xx early-return). This means permanent client errors like 400/403 will be treated as transient and will repeatedly fall back to disk + retry until expiry, potentially generating lots of on-disk telemetry files without any chance of recovery. Consider classifying 4xx (except possibly 429) as non-transient and dropping them (or logging once), reserving _TransientError for timeouts/connection issues and 5xx.

Suggested change:

```python
# Before:
    if retry_resp.status >= 500:
        raise _TransientError(f"HTTP {retry_resp.status}")
    elif retry_resp.status >= 400:
        # Client error (4xx) — not recoverable, don't retry
        logger.debug(
            "Anonymous retry got client error HTTP {status}, dropping",
            status=retry_resp.status,
        )
        return
    return
elif resp.status >= 400:
    raise _TransientError(f"HTTP {resp.status}")
```

```python
# After:
    if retry_resp.status >= 500 or retry_resp.status == 429:
        raise _TransientError(f"HTTP {retry_resp.status}")
    elif retry_resp.status >= 400:
        # Client error (4xx, except 429) — not recoverable, don't retry
        logger.debug(
            "Anonymous retry got client error HTTP {status}, dropping",
            status=retry_resp.status,
        )
        return
    return
elif resp.status >= 500 or resp.status == 429:
    raise _TransientError(f"HTTP {resp.status}")
elif resp.status >= 400:
    # Client error (4xx, except 429) — not recoverable, don't retry
    logger.debug(
        "Telemetry got client error HTTP {status}, dropping",
        status=resp.status,
    )
    return
```

Comment on lines +103 to +109
```python
try:
    path = _telemetry_dir() / f"failed_{uuid.uuid4().hex[:12]}.jsonl"
    with open(path, "a", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event, ensure_ascii=False, separators=(",", ":")))
            f.write("\n")
    logger.debug(
```

Copilot AI Apr 8, 2026


Telemetry fallback files are written under ~/.kimi (via get_share_dir()) with default mkdir/open permissions, which may be group/world-readable depending on umask. Since these JSONL files include device_id/session_id and environment context, it would be safer to explicitly restrict permissions (e.g., chmod 0o700 on the telemetry dir and 0o600 on failed_*.jsonl files, similar to auth/oauth.py’s _ensure_private_file).

Comment on lines +374 to +389
```python
transport = AsyncTransport(endpoint="https://mock.test/events")

call_count = 0

async def mock_send_http(payload: dict) -> None:
    nonlocal call_count
    call_count += 1
    if call_count == 1:
        # Simulate: first call succeeds (no _TransientError), events are "sent"
        # But we need to test the internal 4xx path. We'll test via send() + save_to_disk.
        pass

# More direct: patch _send_http to NOT raise (simulating 4xx handled internally)
# The 4xx path returns without raising, so send() should not call save_to_disk.
with (
    patch.object(transport, "_send_http", new_callable=AsyncMock),
```

Copilot AI Apr 8, 2026


test_anonymous_retry_4xx_drops_events doesn’t currently exercise the “anonymous retry returned 4xx” branch in AsyncTransport._send_http. It patches _send_http itself (bypassing all status-handling logic) and also defines mock_send_http/call_count that are unused. To validate the behavior, mock the aiohttp session.post responses so the first request returns 401 with a token and the anonymous retry returns a 4xx, then assert send() does not call save_to_disk().

Suggested change:

```python
# Before:
transport = AsyncTransport(endpoint="https://mock.test/events")

call_count = 0

async def mock_send_http(payload: dict) -> None:
    nonlocal call_count
    call_count += 1
    if call_count == 1:
        # Simulate: first call succeeds (no _TransientError), events are "sent"
        # But we need to test the internal 4xx path. We'll test via send() + save_to_disk.
        pass

# More direct: patch _send_http to NOT raise (simulating 4xx handled internally)
# The 4xx path returns without raising, so send() should not call save_to_disk.
with (
    patch.object(transport, "_send_http", new_callable=AsyncMock),
```

```python
# After:
transport = AsyncTransport(
    get_access_token=lambda: "test-token",
    endpoint="https://mock.test/events",
)
first_resp = MagicMock()
first_resp.status = 401
first_resp.__aenter__ = AsyncMock(return_value=first_resp)
first_resp.__aexit__ = AsyncMock(return_value=False)
second_resp = MagicMock()
second_resp.status = 400
second_resp.__aenter__ = AsyncMock(return_value=second_resp)
second_resp.__aexit__ = AsyncMock(return_value=False)
mock_session = MagicMock()
mock_session.post.side_effect = [first_resp, second_resp]
mock_session.__aenter__ = AsyncMock(return_value=mock_session)
mock_session.__aexit__ = AsyncMock(return_value=False)
with (
    patch("kimi_cli.utils.aiohttp.new_client_session", return_value=mock_session),
```

Comment on lines +1 to +8
"""Tests for telemetry instrumentation correctness.

These tests verify that track() calls in production code emit the correct events
with the correct properties under the correct conditions. They do NOT test the
telemetry infrastructure itself (that's in test_telemetry.py).

Strategy: patch `kimi_cli.telemetry.track` at the call site and assert the
event name / properties, rather than running the full UI/soul stack.

Copilot AI Apr 8, 2026


The module docstring says the strategy is to “patch kimi_cli.telemetry.track at the call site and assert … rather than running the full UI/soul stack”, but most tests below call track() directly (which will still pass even if the production instrumentation is removed). Either update the docstring to match the intended scope (event schema tests), or change the tests to patch/execute the actual production functions that should emit these events so they truly validate instrumentation correctness.

Suggested change:

```python
# Before:
"""Tests for telemetry instrumentation correctness.

These tests verify that track() calls in production code emit the correct events
with the correct properties under the correct conditions. They do NOT test the
telemetry infrastructure itself (that's in test_telemetry.py).

Strategy: patch `kimi_cli.telemetry.track` at the call site and assert the
event name / properties, rather than running the full UI/soul stack.
```

```python
# After:
"""Tests for telemetry event behavior and schema correctness.

These tests exercise the telemetry API directly and verify that calls to
`track()` and related helpers produce the expected event names, properties,
queue entries, and sink-forwarded payloads under the correct conditions.
They do NOT verify that specific production UI/soul call sites are still
instrumented, and they do NOT test the transport/infrastructure layer itself
(that coverage belongs in test_telemetry.py).
```


@devin-ai-integration bot left a comment


Devin Review found 3 potential issues.

View 6 additional findings in Devin Review.


Comment on lines +94 to +95
```python
elif resp.status >= 400:
    raise _TransientError(f"HTTP {resp.status}")
```


🟡 Initial HTTP request treats all 4xx client errors as transient, causing futile disk persistence and retries

In the initial HTTP request path at src/kimi_cli/telemetry/transport.py:94, any response with status >= 400 (except 401 with token) raises _TransientError, which causes events to be saved to disk for later retry. However, most 4xx errors (400 Bad Request, 403 Forbidden, 404 Not Found, etc.) are permanent client errors that will never succeed on retry. The anonymous retry path at lines 84-92 correctly distinguishes between 5xx (transient, raise) and 4xx (non-recoverable, drop with a log message), but this same distinction is missing for the initial request. As a result, a 400 Bad Request response saves events to disk, and retry_disk_events() on next startup re-sends them, gets 400 again, and leaves the file — repeating every startup until the 7-day expiry.

Suggested change:

```python
# Before:
elif resp.status >= 400:
    raise _TransientError(f"HTTP {resp.status}")
```

```python
# After:
elif resp.status >= 500 or resp.status == 429:
    raise _TransientError(f"HTTP {resp.status}")
elif resp.status >= 400:
    # Client error (4xx) — not recoverable, don't retry
    logger.debug(
        "Telemetry got client error HTTP {status}, dropping",
        status=resp.status,
    )
    return
```


Comment on lines +1221 to 1227
```python
if self._name:
    from kimi_cli.telemetry import track

    track("kimi_flow_invoked", flow_name=self._name)
if args.strip():
    command = f"/{FLOW_COMMAND_PREFIX}{self._name}" if self._name else "/flow"
    logger.warning("Agent flow {command} ignores args: {args}", command=command, args=args)
```


🟡 kimi_flow_invoked tracked even when flow does not execute due to args validation failure

In FlowRunner.run() at src/kimi_cli/soul/kimisoul.py:1221-1224, the track("kimi_flow_invoked") call is placed before the args validation check at line 1225. When a user passes arguments to a flow command (e.g., /flow:myflow some args), the flow logs a warning and returns immediately at line 1228 without executing. But the telemetry event has already been emitted, inflating the flow invocation count with failed attempts.

(Refers to lines 1221-1228)
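The fix amounts to reordering: validate args first, return early on failure, and only then emit the event. A standalone sketch (the `run_flow` signature and plain-callable parameters here are hypothetical; the real `FlowRunner.run()` is a method with its own dependencies):

```python
def run_flow(name: str, args: str, track, warn) -> bool:
    """Return True if the flow actually executed.

    Illustrative sketch of the suggested ordering, not the real FlowRunner.
    """
    if args.strip():
        # Flow commands take no args: warn and bail out BEFORE tracking,
        # so failed invocations don't inflate kimi_flow_invoked counts.
        command = f"/flow:{name}" if name else "/flow"
        warn(f"Agent flow {command} ignores args: {args}")
        return False
    if name:
        track("kimi_flow_invoked", flow_name=name)
    return True
```

With this ordering, the telemetry event is emitted only on the code path where the flow genuinely runs.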




@devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 11 additional findings in Devin Review.


Comment on lines +207 to +211
```python
    from kimi_cli.telemetry import track

    has_block = any(r.action == "block" for r in results)
    track("kimi_hook_triggered", event_type=event, action="block" if has_block else "allow")
    return results
```


🟡 Telemetry code inside fail-open try/except can silently discard hook results

The telemetry tracking code (import, any() call, track()) was inserted inside the try block that was originally designed to catch only _execute_hooks() errors and fail open. If any of these new statements raise (e.g., _sink.accept() encounters an unexpected error), the already-computed results are silently discarded and replaced with []. This changes the error-handling semantics: previously only hook-execution failures caused fail-open, now telemetry failures also cause it. For security-critical hooks like PreToolUse that block dangerous operations, a telemetry failure would silently bypass the block.

Original vs new code structure

Original:

```python
try:
    return await self._execute_hooks(...)
except Exception:
    logger.warning("Hook engine error for {}, failing open", event)
    return []
```

New code places track() between result computation and return, still inside the same try/except:

```python
try:
    results = await self._execute_hooks(...)
    from kimi_cli.telemetry import track
    has_block = any(r.action == "block" for r in results)
    track("kimi_hook_triggered", ...)
    return results
except Exception:
    logger.warning("Hook engine error for {}, failing open", event)
    return []
```
Suggested change:

```python
# Before:
    from kimi_cli.telemetry import track

    has_block = any(r.action == "block" for r in results)
    track("kimi_hook_triggered", event_type=event, action="block" if has_block else "allow")
    return results
```

```python
# After:
    from kimi_cli.telemetry import track

    has_block = any(r.action == "block" for r in results)
    track("kimi_hook_triggered", event_type=event, action="block" if has_block else "allow")
    return results
except Exception:
    logger.warning("Hook engine error for {}, failing open", event)
    return []
```


Keep main's work_dir validation check and our telemetry track call.
Track kimi_session_resume only after validation passes to avoid false positives.
