Commit 39164e2

feat(gastown): measure true container cold-start and mayor-ready latency
The prior container-startup panels measured DO→container RPC round-trips (/health and /agents/start), not actual cold-start time: /health is truncated at a 5s client timeout so p99 was bounded below the true cold-start budget, and the dashboard queries filtered on blob8/blob9 values that don't exist in the AE schema so the panels showed nothing.

This replaces those with two metrics that answer the original questions:

1. container.cold_start: TownContainerDO.warmUp() invokes the Container class's startAndWaitForPorts() directly and times it. Emitted only when the container was actually started (state != healthy), so the quantiles reflect real cold starts without being capped by an arbitrary client-side timeout.
2. mayor.session_ready: container stamps mayorReadyAt when the first mayor agent transitions to 'running' and exposes it via /health. Town DO reads it and emits durationMs = mayorReadyAt - startedAt exactly once per container lifetime (deduped in DO storage keyed by containerStartedAt).

Dashboard fixes:

- Rename 'Container Startup Latency' row to 'DO → Container RPC Latency' and clarify panel titles so operators don't read p99 off those and think it's cold-start time.
- Fix broken success/failure filters: blob8='ok' / blob9='true' → blob5='' (error absent), blob5!='' (error present).
- Convert quantile queries from label-column style (which collapsed all three series to 'latency_ms') to column-name style (AS p50/p90/p99), so the legend actually distinguishes the percentiles.
- Add new 'Container Cold Start & Mayor Ready' row with p50/p90/p99 panels for the two new events.
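The gating behind container.cold_start can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this commit: `ContainerLike`, `isHealthy`, `emit`, and the injectable `now` clock are hypothetical stand-ins, and only `warmUp()`, `startAndWaitForPorts()`, and the event name come from the commit message.

```typescript
// Hypothetical stand-in for the Container class. Only startAndWaitForPorts()
// is named in the commit message; isHealthy() is an assumed way to check state.
interface ContainerLike {
  isHealthy(): boolean;
  startAndWaitForPorts(): Promise<void>;
}

type MetricEvent = { event: string; durationMs: number };

// Times startAndWaitForPorts() and emits container.cold_start only when the
// container actually had to be started, so warm hits never dilute the quantiles.
async function warmUp(
  container: ContainerLike,
  emit: (e: MetricEvent) => void,
  now: () => number = Date.now,
): Promise<void> {
  if (container.isHealthy()) return; // already warm: nothing to measure
  const t0 = now();
  await container.startAndWaitForPorts(); // no client-side timeout caps this wait
  emit({ event: 'container.cold_start', durationMs: now() - t0 });
}
```

In this sketch a failed start propagates the error and emits nothing; whether the real implementation also records failed starts is not stated in the commit message.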
1 parent 5f6f613 commit 39164e2

6 files changed

Lines changed: 448 additions & 32 deletions

services/gastown/container/src/control-server.ts

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,7 @@ import {
   activeServerCount,
   getUptime,
   getStartTime,
+  getMayorReadyAt,
   stopAll,
   drainAll,
   isDraining,
@@ -221,6 +222,7 @@ app.get('/health', c => {
     uptime: getUptime(),
     draining: isDraining() || undefined,
     startedAt: getStartTime(),
+    mayorReadyAt: getMayorReadyAt() ?? undefined,
   };
   return c.json(response);
 });

services/gastown/container/src/process-manager.ts

Lines changed: 23 additions & 0 deletions
@@ -78,6 +78,26 @@ export function getStartTime(): string {
   return new Date(startTime).toISOString();
 }

+// Timestamp (ISO 8601) of the moment the first mayor agent in this container
+// reached 'running' status. Used by /health so the Town DO can compute
+// container-start-to-mayor-ready latency. Stays null until a mayor is up;
+// survives subsequent mayor exits since the window is measured against the
+// first mayor ready in the container's lifetime.
+let mayorReadyAt: string | null = null;
+
+export function getMayorReadyAt(): string | null {
+  return mayorReadyAt;
+}
+
+function markMayorReadyOnce(): void {
+  if (mayorReadyAt !== null) return;
+  mayorReadyAt = new Date().toISOString();
+  log.info('mayor.ready', {
+    containerUptimeMs: getUptime(),
+    mayorReadyAt,
+  });
+}
+
 async function hydrateDbFromSnapshot(
   agentId: string,
   apiUrl: string,
@@ -1044,6 +1064,9 @@ export async function startAgent(
     // despite being active — causing the drain to wait indefinitely.
     if (agent.status === 'starting') {
       agent.status = 'running';
+      if (request.role === 'mayor') {
+        markMayorReadyOnce();
+      }
     }

     // 4. Send the initial prompt

services/gastown/container/src/types.ts

Lines changed: 4 additions & 0 deletions
@@ -165,6 +165,10 @@ export type HealthResponse = {
   uptime: number;
   draining?: boolean;
   startedAt?: string;
+  /** ISO 8601 timestamp of the first mayor agent reaching 'running' status
+   * in this container's lifetime. Used by the worker to measure container
+   * cold-start → mayor-session-ready latency. */
+  mayorReadyAt?: string;
 };

 // ── Kilo serve instance ─────────────────────────────────────────────────
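
On the consuming side, the once-per-container emission of mayor.session_ready might look something like the sketch below. It assumes only the `startedAt` and `mayorReadyAt` fields of `HealthResponse`; the `KeyValueStore` interface (a stand-in for Durable Object storage), the key format, and the function name are hypothetical and not taken from this commit.

```typescript
// Subset of HealthResponse used here (field names as in the diff).
type HealthSnapshot = { startedAt?: string; mayorReadyAt?: string };

// Hypothetical stand-in for Durable Object storage; a Map<string, boolean> fits.
interface KeyValueStore {
  get(key: string): boolean | undefined;
  set(key: string, value: boolean): void;
}

// Returns durationMs the first time a container reports mayorReadyAt, and
// null on every later call for the same container lifetime.
function mayorReadyDuration(
  health: HealthSnapshot,
  store: KeyValueStore,
): number | null {
  const { startedAt, mayorReadyAt } = health;
  if (!startedAt || !mayorReadyAt) return null; // no mayor has reached 'running' yet
  const key = `mayorReadyEmitted:${startedAt}`; // assumed key format, one per container start
  if (store.get(key)) return null; // already emitted for this container
  const durationMs = Date.parse(mayorReadyAt) - Date.parse(startedAt);
  // Skip unparseable timestamps and negative windows rather than emit bad data.
  if (!Number.isFinite(durationMs) || durationMs < 0) return null;
  store.set(key, true);
  return durationMs;
}
```

Keying the dedupe flag by the container's start timestamp means a restarted container, which reports a new `startedAt`, gets its own measurement, while repeated /health polls within one lifetime do not re-emit.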
