Commit 67bbe7b

feat(kiloclaw): add controller telemetry checkins (#1380)
## Summary

Add controller phone-home telemetry from Fly machines to the kiloclaw worker and store it in a dedicated Analytics Engine dataset for machine-health observability.

- Introduces a new machine-to-worker check-in route: `POST /api/controller/checkin`.
- Uses **dual auth** for check-ins: Bearer `KILOCODE_API_KEY` + `x-kiloclaw-gateway-token`.
- Adds a second AE dataset binding (`KILOCLAW_CONTROLLER_AE` / `kiloclaw_controller_telemetry`) separate from lifecycle event telemetry.
- Wires periodic controller check-ins (5-minute interval, 2-minute initial delay) into controller startup/shutdown.
- Extracts and reuses openclaw version caching logic in `controller/src/openclaw-version.ts`.

**Reported telemetry payload (every check-in):**

| Category | Field | Source |
| --- | --- | --- |
| Identity | `sandboxId` | `KILOCLAW_SANDBOX_ID` env |
| Identity | `machineId` | `FLY_MACHINE_ID` (or explicit dep override) |
| Controller version | `controllerVersion` | compiled controller version constant |
| Controller version | `controllerCommit` | compiled controller commit constant |
| OpenClaw version | `openclawVersion` | cached `openclaw --version` probe |
| OpenClaw version | `openclawCommit` | cached `openclaw --version` probe |
| Process health | `supervisorState` | controller supervisor stats |
| Process health | `totalRestarts` | controller supervisor stats |
| Process health | `restartsSinceLastCheckin` | delta from prior check-in |
| Process health | `uptimeSeconds` | controller supervisor stats |
| Host metric | `loadAvg5m` | `os.loadavg()[1]` |
| Network metric | `bandwidthBytesIn` | `/proc/net/dev` delta |
| Network metric | `bandwidthBytesOut` | `/proc/net/dev` delta |
| Process detail | `lastExitReason` | derived from last exit signal/code |
| Infra label | `fly-region` | request header at worker ingress |

**AE datapoint layout (`kiloclaw_controller_telemetry`):**

| AE slot | Value |
| --- | --- |
| `blob1..blob9` | sandboxId, controllerVersion, controllerCommit, openclawVersion, openclawCommit, supervisorState, flyRegion, machineId, lastExitReason |
| `double1..double6` | restartsSinceLastCheckin, totalRestarts, uptimeSeconds, loadAvg5m, bandwidthBytesIn, bandwidthBytesOut |
| `index1` | sandboxId |

## Verification

- [x] `pnpm lint` (in `kiloclaw/`): pass
- [x] `pnpm typecheck` (in `kiloclaw/`): pass
- [x] `pnpm test` (in `kiloclaw/`): pass (`44` files, `949` tests)
- [x] `pnpm test controller/src/checkin.test.ts src/gateway/env.test.ts`: pass
- [x] `pnpm test controller/src/checkin.test.ts controller/src/routes/health.test.ts src/routes/controller.test.ts`: pass (`23` tests)
- [x] `bash scripts/controller-smoke-test.sh`: pass (`11 passed, 0 failed`)
- [x] `bash scripts/controller-entrypoint-smoke-test.sh`: pass (`5 passed, 0 failed`)
- [x] `bash scripts/controller-proxy-auth-smoke-test.sh`: pass (expected proxy auth behavior: `401` without token, success with token)
- [ ] Additional manual verification details (if any)

## Visual Changes

N/A

## Reviewer Notes

- This PR is backend/controller-only; no UI changes.
- `worker-configuration.d.ts` regeneration changes were intentionally excluded from this branch.
- Auth path for `/api/controller/checkin` is intentionally custom and mounted before JWT/internal API middleware.
- Network stats parser prefers `eth0`, then falls back to summing non-loopback interfaces.
- Detailed implementation deviations/deferrals are logged at `~/fd-plans/kiloclaw/controller-telemetry-deviations.md`.
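The AE datapoint layout in the summary can be read as a pure mapping from the check-in payload to Analytics Engine slots. The sketch below only illustrates that mapping and is not the worker's code: `CheckinPayload`, `ControllerDataPoint`, and `toControllerDataPoint` are hypothetical names, and mapping `null` versions to an empty string is an assumption. The returned object follows the `{ blobs, doubles, indexes }` shape that Analytics Engine's `writeDataPoint` accepts.

```typescript
// Hypothetical payload type mirroring the fields listed in the summary table.
type CheckinPayload = {
  sandboxId: string;
  machineId: string;
  controllerVersion: string;
  controllerCommit: string;
  openclawVersion: string | null;
  openclawCommit: string | null;
  supervisorState: string;
  totalRestarts: number;
  restartsSinceLastCheckin: number;
  uptimeSeconds: number;
  loadAvg5m: number;
  bandwidthBytesIn: number;
  bandwidthBytesOut: number;
  lastExitReason: string;
};

// Shape accepted by an Analytics Engine dataset's writeDataPoint().
type ControllerDataPoint = {
  blobs: string[];
  doubles: number[];
  indexes: string[];
};

// Slot order follows the layout table: blob1..blob9, double1..double6, index1.
// flyRegion is passed separately because the table sources it from a request
// header at worker ingress, not from the controller payload.
function toControllerDataPoint(payload: CheckinPayload, flyRegion: string): ControllerDataPoint {
  return {
    blobs: [
      payload.sandboxId,
      payload.controllerVersion,
      payload.controllerCommit,
      payload.openclawVersion ?? '', // assumption: unknown versions become ''
      payload.openclawCommit ?? '',
      payload.supervisorState,
      flyRegion,
      payload.machineId,
      payload.lastExitReason,
    ],
    doubles: [
      payload.restartsSinceLastCheckin,
      payload.totalRestarts,
      payload.uptimeSeconds,
      payload.loadAvg5m,
      payload.bandwidthBytesIn,
      payload.bandwidthBytesOut,
    ],
    indexes: [payload.sandboxId],
  };
}
```

Under these assumptions, a worker handler could pass the result to the `KILOCLAW_CONTROLLER_AE` binding's `writeDataPoint` call. Keeping the mapping in one function makes the slot order easy to unit-test against the table.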
2 parents 4071392 + 0333d96 commit 67bbe7b

15 files changed

Lines changed: 497 additions & 44 deletions

kiloclaw/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -111,7 +111,7 @@ RUN mkdir -p /root/.openclaw \

 # Copy helper scripts (used at runtime by the controller/gateway)
 # Build cache bust: 2026-03-17-v65-remove-startup-script
-RUN echo "10"
+RUN echo "11"
 COPY openclaw-pairing-list.js /usr/local/bin/openclaw-pairing-list.js
 COPY openclaw-device-pairing-list.js /usr/local/bin/openclaw-device-pairing-list.js

kiloclaw/controller/src/checkin.test.ts

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+import { describe, expect, it } from 'vitest';
+import { parseNetDevText } from './checkin';
+
+describe('parseNetDevText', () => {
+  it('prefers eth0 when present', () => {
+    const raw = `Inter-| Receive | Transmit
+face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
+lo: 100 1 0 0 0 0 0 0 200 2 0 0 0 0 0 0
+eth0: 3000 10 0 0 0 0 0 0 4000 12 0 0 0 0 0 0
+`;
+
+    expect(parseNetDevText(raw)).toEqual({ bytesIn: 3000, bytesOut: 4000 });
+  });
+
+  it('falls back to summing non-loopback interfaces when eth0 is absent', () => {
+    const raw = `Inter-| Receive | Transmit
+face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
+lo: 50 1 0 0 0 0 0 0 60 1 0 0 0 0 0 0
+ens5: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0
+eth1: 300 3 0 0 0 0 0 0 700 7 0 0 0 0 0 0
+`;
+
+    expect(parseNetDevText(raw)).toEqual({ bytesIn: 1300, bytesOut: 2700 });
+  });
+});

kiloclaw/controller/src/checkin.ts

Lines changed: 189 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,189 @@
1+
import { readFile } from 'node:fs/promises';
2+
import { loadavg } from 'node:os';
3+
import { z } from 'zod';
4+
import type { OpenclawVersionInfo } from './openclaw-version';
5+
import { CONTROLLER_COMMIT, CONTROLLER_VERSION } from './version';
6+
import type { SupervisorStats } from './supervisor';
7+
8+
const CHECKIN_INTERVAL_MS = 5 * 60 * 1000;
9+
const INITIAL_DELAY_MS = 2 * 60 * 1000;
10+
const REQUEST_TIMEOUT_MS = 10 * 1000;
11+
12+
export type NetStats = { bytesIn: number; bytesOut: number };
13+
14+
const NetStatsSchema = z.object({
15+
bytesIn: z.number().int().min(0),
16+
bytesOut: z.number().int().min(0),
17+
});
18+
19+
function normalizeNetStats(value: unknown): NetStats {
20+
const parsed = NetStatsSchema.safeParse(value);
21+
if (!parsed.success) {
22+
return { bytesIn: 0, bytesOut: 0 };
23+
}
24+
return parsed.data;
25+
}
26+
27+
export type CheckinDeps = {
28+
getApiKey: () => string;
29+
getGatewayToken: () => string;
30+
getSandboxId: () => string;
31+
getCheckinUrl: () => string;
32+
getSupervisorStats: () => SupervisorStats;
33+
getOpenclawVersion: () => Promise<OpenclawVersionInfo>;
34+
getMachineId?: () => string;
35+
};
36+
37+
export function parseNetLine(line: string): NetStats {
38+
const parts = line.trim().split(/\s+/);
39+
return normalizeNetStats({
40+
bytesIn: Number.parseInt(parts[1] ?? '', 10) || 0,
41+
bytesOut: Number.parseInt(parts[9] ?? '', 10) || 0,
42+
});
43+
}
44+
45+
export function parseNetDevText(raw: string): NetStats {
46+
const lines = raw.split('\n');
47+
48+
const eth0Line = lines.find(line => line.trim().startsWith('eth0:'));
49+
if (eth0Line) {
50+
return parseNetLine(eth0Line);
51+
}
52+
53+
let bytesIn = 0;
54+
let bytesOut = 0;
55+
56+
for (const line of lines) {
57+
const trimmed = line.trim();
58+
if (!trimmed || trimmed.includes('|') || !trimmed.includes(':') || trimmed.startsWith('lo:')) {
59+
continue;
60+
}
61+
const stats = parseNetLine(trimmed);
62+
bytesIn += stats.bytesIn;
63+
bytesOut += stats.bytesOut;
64+
}
65+
66+
return normalizeNetStats({ bytesIn, bytesOut });
67+
}
68+
69+
export async function readNetStats(): Promise<NetStats> {
70+
try {
71+
const raw = await readFile('/proc/net/dev', 'utf8');
72+
return parseNetDevText(raw);
73+
} catch {
74+
return { bytesIn: 0, bytesOut: 0 };
75+
}
76+
}
77+
78+
export function startCheckin(deps: CheckinDeps): () => void {
79+
const checkinUrl = deps.getCheckinUrl();
80+
if (!checkinUrl) {
81+
return () => {};
82+
}
83+
84+
let previousRestarts = deps.getSupervisorStats().restarts;
85+
let previousNetStats: NetStats = { bytesIn: 0, bytesOut: 0 };
86+
let checkinInFlight = false;
87+
88+
void readNetStats().then(stats => {
89+
previousNetStats = stats;
90+
});
91+
92+
const doCheckin = async (): Promise<void> => {
93+
if (checkinInFlight) {
94+
return;
95+
}
96+
checkinInFlight = true;
97+
98+
try {
99+
const apiKey = deps.getApiKey();
100+
const gatewayToken = deps.getGatewayToken();
101+
const sandboxId = deps.getSandboxId();
102+
if (!apiKey || !gatewayToken || !sandboxId) {
103+
return;
104+
}
105+
106+
const stats = deps.getSupervisorStats();
107+
const openclawVersion = await deps.getOpenclawVersion();
108+
const currentNetStats = await readNetStats();
109+
110+
const restartsSinceLastCheckin = Math.max(0, stats.restarts - previousRestarts);
111+
const bandwidthBytesIn = Math.max(0, currentNetStats.bytesIn - previousNetStats.bytesIn);
112+
const bandwidthBytesOut = Math.max(0, currentNetStats.bytesOut - previousNetStats.bytesOut);
113+
114+
const lastExitReason = stats.lastExit
115+
? stats.lastExit.signal
116+
? `signal:${stats.lastExit.signal}`
117+
: stats.lastExit.code !== null
118+
? `code:${stats.lastExit.code}`
119+
: ''
120+
: '';
121+
122+
const controller = new AbortController();
123+
const timeout = setTimeout(() => {
124+
controller.abort();
125+
}, REQUEST_TIMEOUT_MS);
126+
127+
let response: Response;
128+
try {
129+
response = await fetch(checkinUrl, {
130+
method: 'POST',
131+
headers: {
132+
'content-type': 'application/json',
133+
authorization: `Bearer ${apiKey}`,
134+
'x-kiloclaw-gateway-token': gatewayToken,
135+
},
136+
signal: controller.signal,
137+
body: JSON.stringify({
138+
sandboxId,
139+
machineId: deps.getMachineId?.() ?? process.env.FLY_MACHINE_ID ?? '',
140+
controllerVersion: CONTROLLER_VERSION,
141+
controllerCommit: CONTROLLER_COMMIT,
142+
openclawVersion: openclawVersion.version,
143+
openclawCommit: openclawVersion.commit,
144+
supervisorState: stats.state,
145+
totalRestarts: stats.restarts,
146+
restartsSinceLastCheckin,
147+
uptimeSeconds: stats.uptime,
148+
loadAvg5m: loadavg()[1] ?? 0,
149+
bandwidthBytesIn,
150+
bandwidthBytesOut,
151+
lastExitReason,
152+
}),
153+
});
154+
} finally {
155+
clearTimeout(timeout);
156+
}
157+
158+
if (!response.ok) {
159+
const errorText = await response.text().catch(() => '');
160+
console.error(`[checkin] HTTP ${response.status}: ${errorText}`);
161+
return;
162+
}
163+
164+
// Only advance baselines after a successful checkin.
165+
previousRestarts = stats.restarts;
166+
previousNetStats = currentNetStats;
167+
} catch (err) {
168+
console.error('[checkin] failed:', err);
169+
} finally {
170+
checkinInFlight = false;
171+
}
172+
};
173+
174+
let interval: ReturnType<typeof setInterval> | undefined;
175+
176+
const initialTimeout = setTimeout(() => {
177+
void doCheckin();
178+
interval = setInterval(() => {
179+
void doCheckin();
180+
}, CHECKIN_INTERVAL_MS);
181+
}, INITIAL_DELAY_MS);
182+
183+
return () => {
184+
clearTimeout(initialTimeout);
185+
if (interval) {
186+
clearInterval(interval);
187+
}
188+
};
189+
}
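
Reviewer-style note on the baseline handling in `doCheckin`: because `previousRestarts` and `previousNetStats` advance only after a successful response, a failed check-in is not lost. The next successful one reports the accumulated delta. The sketch below isolates that idea; `DeltaTracker` and `report` are hypothetical helpers for illustration and do not exist in this commit.

```typescript
// Hypothetical helper: tracks a monotonically increasing counter and reports
// the change since the last *committed* baseline.
class DeltaTracker {
  private baseline: number;

  constructor(initial: number) {
    this.baseline = initial;
  }

  // Delta since the last committed baseline, clamped at zero so a counter
  // reset never yields a negative value.
  delta(current: number): number {
    return Math.max(0, current - this.baseline);
  }

  // Call only after the report was accepted by the receiver.
  commit(current: number): void {
    this.baseline = current;
  }
}

// Simulated sequence: the second report fails, so its delta rolls into the third.
const tracker = new DeltaTracker(1000);
const reported: number[] = [];

function report(current: number, delivered: boolean): void {
  const value = tracker.delta(current);
  if (delivered) {
    reported.push(value);
    tracker.commit(current);
  }
}

report(1500, true); // delivers 500
report(1800, false); // dropped, baseline stays at 1500
report(2100, true); // delivers 600, covering both intervals
```

The trade-off of this design is that a delta may span more than one interval after failures, so consumers should not assume every datapoint covers exactly five minutes.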

kiloclaw/controller/src/index.ts

Lines changed: 14 additions & 0 deletions
@@ -24,6 +24,8 @@ import { writeGogCredentials } from './gog-credentials';
 import { startWatchRenewal, stopWatchRenewal } from './gmail-watch-renewal';
 import { bootstrap } from './bootstrap';
 import type { ControllerStateRef, ControllerState } from './bootstrap';
+import { getOpenclawVersion } from './openclaw-version';
+import { startCheckin } from './checkin';

 export type RuntimeConfig = {
   port: number;
@@ -226,13 +228,15 @@ export async function startController(env: NodeJS.ProcessEnv = process.env): Pro
   let supervisor: Supervisor | undefined;
   let gmailWatchSupervisor: Supervisor | undefined;
   let pairingCache: ReturnType<typeof createPairingCache> | undefined;
+  let stopCheckin: (() => void) | undefined;

   const onSignal = async (signal: NodeJS.Signals): Promise<void> => {
     if (shuttingDown) return;
     shuttingDown = true;
     console.log(`[controller] Received ${signal}, shutting down`);

     pairingCache?.cleanup();
+    stopCheckin?.();
     stopWatchRenewal();
     const shutdowns: Promise<void>[] = [];
     if (supervisor) shutdowns.push(supervisor.shutdown(signal));
@@ -395,6 +399,16 @@ export async function startController(env: NodeJS.ProcessEnv = process.env): Pro
   }

   controllerState.current = { state: 'ready' };
+
+  stopCheckin = startCheckin({
+    getApiKey: () => env.KILOCODE_API_KEY ?? '',
+    getGatewayToken: () => config.expectedToken,
+    getSandboxId: () => env.KILOCLAW_SANDBOX_ID ?? '',
+    getCheckinUrl: () => env.KILOCLAW_CHECKIN_URL ?? '',
+    getSupervisorStats: () => supervisor.getStats(),
+    getOpenclawVersion,
+  });
+
   console.log(
     `[controller] Ready version=${CONTROLLER_VERSION} commit=${CONTROLLER_COMMIT} requireProxyToken=${config.requireProxyToken} wsIdleTimeoutMs=${config.wsIdleTimeoutMs} wsHandshakeTimeoutMs=${config.wsHandshakeTimeoutMs} maxWsConnections=${config.maxWsConnections}`
   );

kiloclaw/controller/src/openclaw-version.ts

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+import { execFile } from 'node:child_process';
+
+export type OpenclawVersionInfo = { version: string | null; commit: string | null };
+
+const OPENCLAW_VERSION_RE = /(\d{4}\.\d{1,2}\.\d{1,2})(?:\s+\(([0-9a-f]+)\))?/;
+
+export function parseOpenclawVersion(raw: string): OpenclawVersionInfo {
+  const match = raw.match(OPENCLAW_VERSION_RE);
+  if (!match) return { version: null, commit: null };
+  return { version: match[1], commit: match[2] ?? null };
+}
+
+let openclawVersionPromise: Promise<OpenclawVersionInfo> | undefined;
+
+/**
+ * Resolve the installed openclaw version once and cache it for process lifetime.
+ *
+ * If openclaw is upgraded while the controller process is still running,
+ * the cached value remains stale until the controller restarts.
+ */
+export function getOpenclawVersion(): Promise<OpenclawVersionInfo> {
+  if (!openclawVersionPromise) {
+    openclawVersionPromise = new Promise(resolve => {
+      execFile(
+        '/usr/bin/env',
+        ['HOME=/root', 'openclaw', '--version'],
+        { timeout: 5000 },
+        (err, stdout) => {
+          if (err) {
+            resolve({ version: null, commit: null });
+            return;
+          }
+          resolve(parseOpenclawVersion(stdout.toString().trim()));
+        }
+      );
+    });
+  }
+  return openclawVersionPromise;
+}
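
The caching in `getOpenclawVersion` memoizes the promise rather than the resolved value, so callers that arrive while the probe is still running share the same subprocess. Below is a small sketch of that pattern with a counter standing in for `execFile`. `memoizeAsync`, `fakeProbe`, `parseVersion`, and `VersionInfo` are illustrative names rather than part of the controller, and the sample output string is the one quoted in the comment removed from `health.ts`.

```typescript
type VersionInfo = { version: string | null; commit: string | null };

// Same pattern as the regex in the diff, repeated to keep the example self-contained.
const VERSION_RE = /(\d{4}\.\d{1,2}\.\d{1,2})(?:\s+\(([0-9a-f]+)\))?/;

function parseVersion(raw: string): VersionInfo {
  const match = raw.match(VERSION_RE);
  if (!match) return { version: null, commit: null };
  return { version: match[1] ?? null, commit: match[2] ?? null };
}

// Memoize the promise itself so concurrent callers share one in-flight probe.
function memoizeAsync<T>(fn: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) cached = fn();
    return cached;
  };
}

let probeCalls = 0;

// Stand-in for spawning `openclaw --version`.
const fakeProbe = async (): Promise<VersionInfo> => {
  probeCalls += 1;
  return parseVersion('OpenClaw 2026.3.8 (3caab92)');
};

const getVersion = memoizeAsync(fakeProbe);

// Two back-to-back calls return the same promise and run the probe once.
const first = getVersion();
const second = getVersion();
```

One consequence visible in the diff: since the real probe resolves with nulls on error instead of rejecting, a failed first probe is also cached until the controller restarts.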

kiloclaw/controller/src/routes/health.ts

Lines changed: 2 additions & 39 deletions
@@ -1,49 +1,12 @@
-import { execFile } from 'node:child_process';
 import type { Context, Hono } from 'hono';
 import { timingSafeTokenEqual } from '../auth';
 import type { Supervisor } from '../supervisor';
 import type { ControllerStateRef } from '../bootstrap';
 import { CONTROLLER_COMMIT, CONTROLLER_VERSION } from '../version';
 import { getBearerToken } from './gateway';
+import { getOpenclawVersion } from '../openclaw-version';

-/** Parsed result from `openclaw --version` (e.g. "OpenClaw 2026.3.8 (3caab92)"). */
-type OpenclawVersionInfo = { version: string | null; commit: string | null };
-
-const OPENCLAW_VERSION_RE = /(\d{4}\.\d{1,2}\.\d{1,2})(?:\s+\(([0-9a-f]+)\))?/;
-
-export function parseOpenclawVersion(raw: string): OpenclawVersionInfo {
-  const match = raw.match(OPENCLAW_VERSION_RE);
-  if (!match) return { version: null, commit: null };
-  return { version: match[1], commit: match[2] ?? null };
-}
-
-/**
- * Resolve the installed openclaw version once and cache it for the process lifetime.
- * If the user upgrades openclaw while the controller is running, the cached value
- * becomes stale until the next redeploy (which restarts the controller process).
- * This is acceptable: the UI shows a "Modified" badge by comparing image vs running
- * version, and spawning a subprocess on every request is not worth the cost.
- */
-let openclawVersionPromise: Promise<OpenclawVersionInfo> | undefined;
-function getOpenclawVersion(): Promise<OpenclawVersionInfo> {
-  if (!openclawVersionPromise) {
-    openclawVersionPromise = new Promise(resolve => {
-      execFile(
-        '/usr/bin/env',
-        ['HOME=/root', 'openclaw', '--version'],
-        { timeout: 5000 },
-        (err, stdout) => {
-          if (err) {
-            resolve({ version: null, commit: null });
-          } else {
-            resolve(parseOpenclawVersion(stdout.toString().trim()));
-          }
-        }
-      );
-    });
-  }
-  return openclawVersionPromise;
-}
+export { parseOpenclawVersion } from '../openclaw-version';

 export function registerHealthRoute(
   app: Hono,

kiloclaw/src/gateway/env.test.ts

Lines changed: 3 additions & 2 deletions
@@ -418,8 +418,9 @@ describe('buildEnvVars', () => {
       instanceFeatures: ['nonexistent-feature'],
     });

-    // No KILOCLAW_* feature vars should be set (only platform defaults)
-    const featureVars = Object.keys(result.env).filter(k => k.startsWith('KILOCLAW_'));
+    // No feature-flag env vars should be set.
+    const knownFeatureVars = new Set(Object.values(FEATURE_TO_ENV_VAR));
+    const featureVars = Object.keys(result.env).filter(k => knownFeatureVars.has(k));
     expect(featureVars).toEqual([]);
   });
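
This assertion presumably had to change because `buildEnvVars` now always sets `KILOCLAW_SANDBOX_ID`, so a `KILOCLAW_` prefix no longer identifies feature-flag variables. The sketch below contrasts the two filters. The `FEATURE_TO_ENV_VAR` contents and the sample env values are made up for illustration, since the real map is not shown in this diff.

```typescript
// Hypothetical stand-in for the real FEATURE_TO_ENV_VAR map.
const FEATURE_TO_ENV_VAR: Record<string, string> = {
  'example-feature': 'KILOCLAW_EXAMPLE_FEATURE',
};

// Env as it might look for an instance with no recognized features:
// only platform-level variables are present.
const sampleEnv: Record<string, string> = {
  KILOCLAW_SANDBOX_ID: 'sandbox-123',
  AUTO_APPROVE_DEVICES: 'true',
};

// Old check: the prefix filter now also matches the platform variable.
const byPrefix = Object.keys(sampleEnv).filter(k => k.startsWith('KILOCLAW_'));

// New check: only names that are actual feature-flag variables are considered.
const knownFeatureVars = new Set(Object.values(FEATURE_TO_ENV_VAR));
const byKnownSet = Object.keys(sampleEnv).filter(k => knownFeatureVars.has(k));
```

With the prefix filter the list contains `KILOCLAW_SANDBOX_ID`, while the set-based filter stays empty, which is what the updated test expects.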

kiloclaw/src/gateway/env.ts

Lines changed: 2 additions & 0 deletions
@@ -190,10 +190,12 @@ export async function buildEnvVars(
   if (env.DISCORD_DM_POLICY) plainEnv.DISCORD_DM_POLICY = env.DISCORD_DM_POLICY;
   if (env.OPENCLAW_ALLOWED_ORIGINS)
     plainEnv.OPENCLAW_ALLOWED_ORIGINS = env.OPENCLAW_ALLOWED_ORIGINS;
+  if (env.KILOCLAW_CHECKIN_URL) plainEnv.KILOCLAW_CHECKIN_URL = env.KILOCLAW_CHECKIN_URL;
   plainEnv.REQUIRE_PROXY_TOKEN = env.REQUIRE_PROXY_TOKEN ?? 'false';

   // Layer 5: Reserved system vars (cannot be overridden by any user config)
   sensitive.OPENCLAW_GATEWAY_TOKEN = await deriveGatewayToken(sandboxId, gatewayTokenSecret);
+  plainEnv.KILOCLAW_SANDBOX_ID = sandboxId;
   plainEnv.AUTO_APPROVE_DEVICES = 'true';

   // User-selected exec permissions preset (non-sensitive, survives restarts).
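
The gateway token derived in this file is one half of the dual auth described in the summary; the other half is the `KILOCODE_API_KEY` bearer token. The worker-side route is not among the files shown here, so the following is only a sketch of what a dual check could look like. `checkDualAuth`, `constantTimeEqual`, the status codes, and the way the expected values are supplied are all assumptions. A real worker would more likely validate the API key against its own store and re-derive the expected gateway token with `deriveGatewayToken`.

```typescript
// Best-effort comparison that does not return early on the first mismatch.
function constantTimeEqual(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const len = Math.max(a.length, b.length);
  for (let i = 0; i < len; i++) {
    // charCodeAt returns NaN past the end of a string; `| 0` turns that into 0.
    diff |= (a.charCodeAt(i) | 0) ^ (b.charCodeAt(i) | 0);
  }
  return diff === 0;
}

// Status codes are illustrative; the source does not state them for this route.
type AuthResult = { ok: true } | { ok: false; status: 401 | 403 };

// `headers` is expected to use lower-cased header names.
function checkDualAuth(
  headers: Map<string, string>,
  expectedApiKey: string,
  expectedGatewayToken: string
): AuthResult {
  const authorization = headers.get('authorization') ?? '';
  const gatewayToken = headers.get('x-kiloclaw-gateway-token') ?? '';
  const bearer = authorization.startsWith('Bearer ') ? authorization.slice(7) : '';

  // Both credentials must be present.
  if (!bearer || !gatewayToken) return { ok: false, status: 401 };

  // Evaluate both comparisons before deciding, so the response does not
  // reveal which of the two credentials was wrong.
  const apiKeyOk = constantTimeEqual(bearer, expectedApiKey);
  const tokenOk = constantTimeEqual(gatewayToken, expectedGatewayToken);
  return apiKeyOk && tokenOk ? { ok: true } : { ok: false, status: 403 };
}
```

Requiring both credentials means a leaked API key alone cannot submit telemetry for a sandbox, since the gateway token is derived per sandbox.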
