Commit 8f22340
feat(gastown): add observability infrastructure (#1075)
* feat(gastown): add observability — structured logging, event streaming, alerting, and usage metrics

  Adds Sentry integration to the Gastown worker, structured container process logging, bead/convoy event broadcasting over WebSocket, alarm-based alerting for review queue depth, escalation rate, and agent restart loops, and Analytics Engine instrumentation for lifecycle events.

  Closes #228

* fix(gastown): use @cloudflare/workers-types via tsconfig instead of reference directive

  - Regenerate worker-configuration.d.ts with --include-runtime=false
  - Install @cloudflare/workers-types and add to tsconfig types array
  - Update types script to use --include-runtime=false flag
  - Preserve GitTokenService RPC types as manual override

* fix(gastown): address review feedback on observability infrastructure

  - Add userId to all analytics events via cached owner_user_id
  - Remove meta/alerting events (queue_depth_alert, rate_spike, restart_loop) and their check methods — alerting belongs upstream
  - Remove container.cold_start/oom from event union (needs TownContainerDO refactoring, deferred to follow-up)
  - Implement all previously-declared-but-unimplemented event emissions: bead.status_changed, escalation.acknowledged, nudge.queued, nudge.delivered
  - Fix dashboard status WebSocket to reconnect when town ID changes
  - Remove eager connectStatusWs() on page load (was connecting to random placeholder town ID)

* feat(gastown): add Sentry sourcemap uploads and clean up DSN config

  - Enable upload_source_maps and version_metadata in wrangler.jsonc
  - Add sentry-cli sourcemap upload to deploy:prod via postdeploy hook
  - Pass CF_VERSION_METADATA.id as Sentry release for stack trace linking
  - Remove empty SENTRY_DSN var (now a worker secret set via dashboard)
  - Install @sentry/cli as devDependency

* feat(gastown): instrument all HTTP routes and tRPC procedures with analytics

  - Add delivery (http/trpc/internal), route, and error fields to events
  - Add timing middleware to capture high-res request start timestamp
  - Add instrumented() wrapper applied to all 81 HTTP route handlers
  - Add tRPC analytics middleware on base procedure (wraps all 36 procedures)
  - Capture all errors to Sentry in both HTTP and tRPC layers
  - Tag DO-internal events with delivery: 'internal'

* fix(gastown): drop all CHECK constraints from DO SQLite tables

  The beads table on pre-existing DOs still had the old CHECK constraint `status in ('open', 'in_progress', 'closed', 'failed')`, which rejects the newer 'in_review' status, causing SQLITE_CONSTRAINT errors in handleAgentDone.

  - Remove all CHECK constraints from all table definitions (Zod validates at the application layer)
  - Add dropCheckConstraints() migration that detects tables with CHECK constraints via sqlite_master and recreates them without constraints
  - Migration is idempotent and includes rollback on failure

* feat(admin): add Gastown analytics dashboard with charts

  - Add API route that proxies SQL queries to CF Analytics Engine (overview, events timeseries, error rates, top users, latency, delivery)
  - Add React hooks for each query type with 1-minute auto-refresh
  - Build dashboard page with:
    - Overview KPI cards (total events, unique users, avg latency, error rate)
    - Stacked area chart: events over time (top 15 by volume)
    - Stacked bar chart: delivery breakdown (HTTP/tRPC/internal) over time
    - Horizontal bar chart: success vs error rates by event with error % line
    - Latency table: avg response time by event and delivery type
    - Top users table: most active users with links to admin panel
    - Configurable time window (1h to 30d) via dropdown
  - Requires CF_ANALYTICS_ENGINE_TOKEN env var for API access

* feat(gastown): add Grafana dashboard for Analytics Engine data

  23 panels across 6 sections:

  - Overview: total events, unique users, avg latency, error rate stats
  - Throughput: RPS by delivery, event volume stacked bars, top events
  - Errors: error count over time, error rate by delivery, error counts table, top error messages table
  - Latency: avg latency by delivery, avg latency by top events, slowest endpoints table with route-level detail
  - Users & Accounts: active users/towns over time, top users by event count, top users by error count
  - Domain Breakdown: delivery type pie, top events pie, all events summary table with success/error/latency
  - Internal DO Events: bead lifecycle, agent/review/convoy events

  Uses $timeSeries, $timeFilter, $interval_s Grafana macros for the cloudflare-analytics-engine datasource plugin.

* fix(gastown): fix Grafana dashboard panel queries for ClickHouse datasource plugin

  All panels now have the required target properties:

  - dateTimeType: DATETIME
  - dateTimeColDataType: timestamp
  - editorMode: sql
  - table: gastown_events
  - query field set (not just rawSql)
  - datasource type: vertamedia-clickhouse-datasource
  - $interval_s replaced with $interval

* fix(gastown): remove subqueries from Analytics Engine SQL queries

  CF Analytics Engine doesn't support IN (SELECT ...) subqueries. Removed the top-N subquery filter from:

  - Grafana panels 7 (Event Volume top events) and 13 (Avg Latency top events)
  - Admin API events-timeseries query

  The queries now return all events grouped by time — users can toggle individual series via the Grafana legend.

* fix(gastown): address review feedback and fix lint/format

  - Fix Sentry double-capture: remove captureException from instrumented() and tRPC analytics middleware; keep single capture in app.onError() and trpcServer onError (guarded to skip TRPCErrors)
  - Fix CHECK constraint regex: handle nested parens in check(col in (...))
  - Fix sourcemap release mismatch: inject SENTRY_RELEASE via --var at deploy time using sentry-cli propose-version (git SHA)
  - Fix agent-auth userId: fall back to agentJWT.userId when kiloUserId is unset (agent-authenticated routes)
  - Fix status WebSocket: connect on initial page load, not just on change
  - Fix SDK session leak: decrement sessionCount when agent completes normally via session.idle
  - Fix index loss: reorder initBeadTables to run dropCheckConstraints before index creation
  - Fix lint: use type narrowing instead of String() for sqlite_master rows
  - Fix format: prettier on container/src/logger.ts
  - Fix Grafana: convert Total Events and Unique Users to time series stat panels with correct field selectors

* fix(gastown): address remaining review comments

  - Fix SDK session leak on stream errors (process-manager catch path)
  - Add convoyId/role/beadType to Analytics Engine blobs (blob11-13)
  - Fix error-rate line plotted on count axis — add secondary X axis
  - Add client-side top-15 filtering to EventsTimeseriesChart
  - Add LIMIT 500 to unbounded Grafana time series panels (7, 13)
  - Update Grafana panel titles to reflect actual behavior

* fix(gastown): fix event name collisions, latency dilution, and formatting

  - Fix deriveHttpEventName: distinguish list vs get by checking if route ends with a param segment; keep 'mayor' as a meaningful segment
  - Fix overview avg_latency_ms: only average http/trpc events (skip zero-duration internal events)
  - Fix top-users avg_latency_ms: same conditional filtering
  - Format gastown-grafana-dash-1.json with prettier
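The nested-paren problem called out in the CHECK-constraint fix is why a naive regex fails: `check(status in ('open', 'closed'))` contains a ')' before the constraint actually ends. A minimal sketch of one way to strip such constraints with a paren-counting scan follows; the helper name `stripCheckConstraints` and every detail of the body are illustrative assumptions, not the actual dropCheckConstraints() implementation:

```typescript
// Hypothetical sketch: remove CHECK constraints from a CREATE TABLE
// statement, tolerating nested parens as in check(col in ('a', 'b')).
// A single regex stops at the first ')', so we walk the string and
// balance parens manually. Not production code: it would also match
// the word 'check' inside string literals or identifiers.
function stripCheckConstraints(createSql: string): string {
  const lower = createSql.toLowerCase();
  let out = '';
  let i = 0;
  while (i < createSql.length) {
    const at = lower.indexOf('check', i);
    if (at === -1) {
      out += createSql.slice(i);
      break;
    }
    out += createSql.slice(i, at);
    // Find the opening paren after the CHECK keyword, then skip to
    // its balanced closing paren.
    let j = lower.indexOf('(', at);
    if (j === -1) {
      out += createSql.slice(at);
      break;
    }
    let depth = 0;
    do {
      if (createSql[j] === '(') depth++;
      else if (createSql[j] === ')') depth--;
      j++;
    } while (depth > 0 && j < createSql.length);
    i = j; // resume after the constraint
  }
  // Tidy any dangling commas left behind, e.g. "..., )" -> "...)"
  return out.replace(/,\s*\)/g, ')').trim();
}
```

Recreating the table from the stripped SQL (via sqlite_master) and copying rows over would then complete the migration described above.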
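The list-vs-get rule in the last fix (a route ending in a param segment is a 'get', otherwise a 'list') could be sketched roughly like this; only the `deriveHttpEventName` name, the param-segment rule, and the 'mayor'-as-meaningful-segment note come from the commit message — the body is an assumption for illustration:

```typescript
// Hypothetical sketch of deriveHttpEventName: turn an HTTP route into
// an analytics event name. A GET route ending in a parameter segment
// (e.g. '/towns/:townId') reads a single resource ('get'); one ending
// in a literal segment enumerates a collection ('list'). Literal
// segments like 'mayor' stay in the name rather than being collapsed.
function deriveHttpEventName(method: string, route: string): string {
  const segments = route.split('/').filter(Boolean);
  const literals = segments.filter(s => !s.startsWith(':'));
  const resource = literals.join('.') || 'root';
  if (method !== 'GET') return `${resource}.${method.toLowerCase()}`;
  const last = segments[segments.length - 1];
  return last?.startsWith(':') ? `${resource}.get` : `${resource}.list`;
}
```

Under this scheme '/towns' and '/towns/:townId' no longer collide into the same event name, which is the collision the fix refers to.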
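Because Analytics Engine SQL cannot express IN (SELECT ...) subqueries, the top-15 cut described for EventsTimeseriesChart has to happen client-side after the query returns every event series. A rough sketch under an assumed row shape (the ts/event/count fields and the `topNSeries` name are hypothetical):

```typescript
interface TimeseriesRow {
  ts: string;    // time bucket
  event: string; // event name, i.e. the series key
  count: number; // events in this bucket
}

// Keep only the N series with the highest total volume; everything
// else is dropped client-side, since the SQL query (with the subquery
// filter removed) now returns all events grouped by time.
function topNSeries(rows: TimeseriesRow[], n = 15): TimeseriesRow[] {
  const totals = new Map<string, number>();
  for (const r of rows) totals.set(r.event, (totals.get(r.event) ?? 0) + r.count);
  const keep = new Set(
    [...totals.entries()]
      .sort((a, b) => b[1] - a[1])
      .slice(0, n)
      .map(([event]) => event)
  );
  return rows.filter(r => keep.has(r.event));
}
```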
1 parent cba9409 commit 8f22340

30 files changed

Lines changed: 4515 additions & 11414 deletions

cloudflare-gastown/.gitignore

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
 .dev.vars
 container/dist/
 dist-types/
+
+# Sentry Config File
+.sentryclirc
+dist
cloudflare-gastown/container/src/logger.ts

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+/**
+ * Structured logger for the container process.
+ *
+ * Outputs JSON to console.* so entries appear in Workers Logs
+ * (captured via `observability: { enabled: true }` in wrangler.jsonc).
+ * Keep this module dependency-free and synchronous.
+ */
+
+function flatten(data?: unknown): Record<string, unknown> {
+  if (!data || typeof data !== 'object') return {};
+  return data as Record<string, unknown>;
+}
+
+export const log = {
+  info: (msg: string, data?: unknown) =>
+    console.log(
+      JSON.stringify({ level: 'info', msg, ...flatten(data), ts: new Date().toISOString() })
+    ),
+  warn: (msg: string, data?: unknown) =>
+    console.warn(
+      JSON.stringify({ level: 'warn', msg, ...flatten(data), ts: new Date().toISOString() })
+    ),
+  error: (msg: string, data?: unknown) =>
+    console.error(
+      JSON.stringify({ level: 'error', msg, ...flatten(data), ts: new Date().toISOString() })
+    ),
+};
Lines changed: 8 additions & 0 deletions
@@ -1,3 +1,11 @@
 import { startControlServer } from './control-server';
+import { log } from './logger';
+
+log.info('container.cold_start', { uptime: 0, ts: new Date().toISOString() });
+
+process.on('uncaughtException', err => {
+  log.error('container.uncaught_exception', { error: err.message, stack: err.stack });
+  process.exit(1);
+});
 
 startControlServer();

cloudflare-gastown/container/src/process-manager.ts

Lines changed: 58 additions & 15 deletions
@@ -10,6 +10,7 @@ import { createKilo, type KiloClient } from '@kilocode/sdk';
 import { z } from 'zod';
 import type { ManagedAgent, StartAgentRequest } from './types';
 import { reportAgentCompleted } from './completion-reporter';
+import { log } from './logger';
 
 const MANAGER_LOG = '[process-manager]';
 
@@ -272,24 +273,50 @@ async function subscribeToEvents(
       const isTerminal = event.type === 'session.idle' && request.role !== 'mayor';
 
       if (isTerminal) {
-        console.log(
-          `${MANAGER_LOG} Completion detected for agent ${agent.agentId} (${agent.name}) event=${event.type}`
-        );
+        log.info('agent.exit', {
+          agentId: agent.agentId,
+          name: agent.name,
+          reason: 'completed',
+          exitReason: 'completed',
+        });
         agent.status = 'exited';
         agent.exitReason = 'completed';
         broadcastEvent(agent.agentId, 'agent.exited', { reason: 'completed' });
         void reportAgentCompleted(agent, 'completed');
+
+        // Release SDK session so the server can shut down when idle
+        const inst = sdkInstances.get(agent.workdir);
+        if (inst) {
+          inst.sessionCount--;
+          if (inst.sessionCount <= 0) {
+            inst.server.close();
+            sdkInstances.delete(agent.workdir);
+          }
+        }
         break;
       }
     }
   } catch (err) {
     if (!controller.signal.aborted) {
-      console.error(`${MANAGER_LOG} Event stream error for agent ${agent.agentId}:`, err);
+      log.error('agent.stream_error', {
+        agentId: agent.agentId,
+        error: err instanceof Error ? err.message : String(err),
+      });
       if (agent.status === 'running') {
         agent.status = 'failed';
         agent.exitReason = 'Event stream error';
         broadcastEvent(agent.agentId, 'agent.exited', { reason: 'stream error' });
         void reportAgentCompleted(agent, 'failed', 'Event stream error');
+
+        // Release SDK session on stream error (same cleanup as normal completion)
+        const inst = sdkInstances.get(agent.workdir);
+        if (inst) {
+          inst.sessionCount--;
+          if (inst.sessionCount <= 0) {
+            inst.server.close();
+            sdkInstances.delete(agent.workdir);
+          }
+        }
       }
     }
   } finally {
@@ -390,9 +417,13 @@ export async function startAgent(
   }
   agent.messageCount = 1;
 
-  console.log(
-    `${MANAGER_LOG} Started agent ${request.name} (${request.agentId}) session=${sessionId} port=${port}`
-  );
+  log.info('agent.start', {
+    agentId: request.agentId,
+    role: request.role,
+    name: request.name,
+    sessionId,
+    port,
+  });
 
   return agent;
 } catch (err) {
@@ -433,11 +464,15 @@ export async function stopAgent(agentId: string): Promise<void> {
       }
     }
   } catch (err) {
-    console.warn(`${MANAGER_LOG} Failed to abort session for agent ${agentId}:`, err);
+    log.warn('agent.stop_failed', {
+      agentId,
+      error: err instanceof Error ? err.message : String(err),
+    });
   }
 
   agent.status = 'exited';
   agent.exitReason = 'stopped';
+  log.info('agent.exit', { agentId, reason: 'stopped', exitReason: 'stopped' });
   broadcastEvent(agentId, 'agent.exited', { reason: 'stopped' });
 }
 
@@ -454,13 +489,21 @@ export async function sendMessage(agentId: string, prompt: string): Promise<void>
   const instance = sdkInstances.get(agent.workdir);
   if (!instance) throw new Error(`No SDK instance for agent ${agentId}`);
 
-  await instance.client.session.prompt({
-    path: { id: agent.sessionId },
-    body: {
-      parts: [{ type: 'text', text: prompt }],
-      ...(agent.model ? { model: { providerID: 'kilo', modelID: agent.model } } : {}),
-    },
-  });
+  try {
+    await instance.client.session.prompt({
+      path: { id: agent.sessionId },
+      body: {
+        parts: [{ type: 'text', text: prompt }],
+        ...(agent.model ? { model: { providerID: 'kilo', modelID: agent.model } } : {}),
+      },
+    });
+  } catch (err) {
+    log.error('agent.send_failed', {
+      agentId,
+      error: err instanceof Error ? err.message : String(err),
+    });
+    throw err;
+  }
 
   agent.messageCount++;
   agent.lastActivityAt = new Date().toISOString();
