Commit 8f22340
feat(gastown): add observability infrastructure (#1075)
* feat(gastown): add observability — structured logging, event streaming, alerting, and usage metrics
Adds Sentry integration to the Gastown worker, structured container process
logging, bead/convoy event broadcasting over WebSocket, alarm-based alerting
for review queue depth, escalation rate, and agent restart loops, and
Analytics Engine instrumentation for lifecycle events.
Closes #228
* fix(gastown): use @cloudflare/workers-types via tsconfig instead of reference directive
- Regenerate worker-configuration.d.ts with --include-runtime=false
- Install @cloudflare/workers-types and add to tsconfig types array
- Update types script to use --include-runtime=false flag
- Preserve GitTokenService RPC types as manual override
* fix(gastown): address review feedback on observability infrastructure
- Add userId to all analytics events via cached owner_user_id
- Remove meta/alerting events (queue_depth_alert, rate_spike, restart_loop)
and their check methods — alerting belongs upstream
- Remove container.cold_start/oom from event union (needs TownContainerDO
refactoring, deferred to follow-up)
- Implement all previously-declared-but-unimplemented event emissions:
bead.status_changed, escalation.acknowledged, nudge.queued, nudge.delivered
- Fix dashboard status WebSocket to reconnect when town ID changes
- Remove eager connectStatusWs() on page load (was connecting to
random placeholder town ID)
* feat(gastown): add Sentry sourcemap uploads and clean up DSN config
- Enable upload_source_maps and version_metadata in wrangler.jsonc
- Add sentry-cli sourcemap upload to deploy:prod via postdeploy hook
- Pass CF_VERSION_METADATA.id as Sentry release for stack trace linking
- Remove empty SENTRY_DSN var (now a worker secret set via dashboard)
- Install @sentry/cli as devDependency
* feat(gastown): instrument all HTTP routes and tRPC procedures with analytics
- Add delivery (http/trpc/internal), route, and error fields to events
- Add timing middleware to capture high-res request start timestamp
- Add instrumented() wrapper applied to all 81 HTTP route handlers
- Add tRPC analytics middleware on base procedure (wraps all 36 procedures)
- Capture all errors to Sentry in both HTTP and tRPC layers
- Tag DO-internal events with delivery: 'internal'
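
A minimal sketch of what an `instrumented()`-style wrapper could look like, assuming a plain async handler rather than the real route handler type. The `AnalyticsEvent` shape, the `emit` callback, and the injectable `now` clock are hypothetical stand-ins for illustration; only the `delivery`, `route`, and `error` field names come from the list above.

```typescript
// Hypothetical event shape; only delivery/route/error are named in the commit message.
type Delivery = "http" | "trpc" | "internal";

interface AnalyticsEvent {
  event: string;
  delivery: Delivery;
  route: string;
  durationMs: number;
  error?: string;
}

type Handler<Req, Res> = (req: Req) => Promise<Res>;

// Wraps a handler so that exactly one event is emitted per call,
// whether the handler resolves or throws.
// `emit` stands in for whatever actually writes to Analytics Engine.
function instrumented<Req, Res>(
  event: string,
  route: string,
  handler: Handler<Req, Res>,
  emit: (e: AnalyticsEvent) => void,
  now: () => number = () => Date.now(),
): Handler<Req, Res> {
  return async (req: Req): Promise<Res> => {
    const start = now();
    try {
      const res = await handler(req);
      emit({ event, delivery: "http", route, durationMs: now() - start });
      return res;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      emit({ event, delivery: "http", route, durationMs: now() - start, error: message });
      throw err; // rethrow so a top-level error handler still sees the failure
    }
  };
}
```

Rethrowing instead of reporting inside the wrapper leaves error capture to a single outer handler, which avoids reporting the same exception twice.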
* fix(gastown): drop all CHECK constraints from DO SQLite tables
The beads table on pre-existing DOs still had the old CHECK constraint
`status in ('open', 'in_progress', 'closed', 'failed')` which rejects
the newer 'in_review' status, causing SQLITE_CONSTRAINT errors in
handleAgentDone.
- Remove all CHECK constraints from all table definitions (Zod validates
at the application layer)
- Add dropCheckConstraints() migration that detects tables with CHECK
constraints via sqlite_master and recreates them without constraints
- Migration is idempotent and includes rollback on failure
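
As an illustration of the rewrite step, the sketch below strips column-level `CHECK (...)` clauses from a `CREATE TABLE` statement such as one read from `sqlite_master`. `stripCheckConstraints` is a hypothetical helper, not the actual `dropCheckConstraints()` implementation. It counts paren depth so that nested parens like `check(status in (...))` are consumed whole, and it assumes no parens or the word `check` appear inside string literals.

```typescript
// Removes every `CHECK ( ... )` clause from a CREATE TABLE statement.
// Limitation (assumption): parens inside quoted strings are not handled, and a
// named table-level `CONSTRAINT x CHECK (...)` would leave `CONSTRAINT x` behind.
function stripCheckConstraints(createSql: string): string {
  const checkStart = /check\s*\(/iy; // sticky: only matches at lastIndex
  let out = "";
  let i = 0;
  while (i < createSql.length) {
    checkStart.lastIndex = i;
    const m = checkStart.exec(createSql);
    const prev = i > 0 ? createSql.charAt(i - 1) : " ";
    // Require a non-identifier character before CHECK so `precheck(` is left alone.
    if (m && !/[A-Za-z0-9_]/.test(prev)) {
      let depth = 0;
      let j = i + m[0].length - 1; // index of the opening "("
      for (; j < createSql.length; j++) {
        const ch = createSql.charAt(j);
        if (ch === "(") depth++;
        else if (ch === ")" && --depth === 0) break;
      }
      i = j + 1; // resume after the matching ")"
      out = out.replace(/\s+$/, ""); // drop whitespace left before the clause
      continue;
    }
    out += createSql.charAt(i);
    i++;
  }
  return out;
}
```

Running the function on its own output returns the same string, which is consistent with the idempotency the migration aims for.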
* feat(admin): add Gastown analytics dashboard with charts
- Add API route that proxies SQL queries to CF Analytics Engine
(overview, events timeseries, error rates, top users, latency, delivery)
- Add React hooks for each query type with 1-minute auto-refresh
- Build dashboard page with:
- Overview KPI cards (total events, unique users, avg latency, error rate)
- Stacked area chart: events over time (top 15 by volume)
- Stacked bar chart: delivery breakdown (HTTP/tRPC/internal) over time
- Horizontal bar chart: success vs error rates by event with error % line
- Latency table: avg response time by event and delivery type
- Top users table: most active users with links to admin panel
- Configurable time window (1h to 30d) via dropdown
- Requires CF_ANALYTICS_ENGINE_TOKEN env var for API access
* feat(gastown): add Grafana dashboard for Analytics Engine data
23 panels across 6 sections:
- Overview: total events, unique users, avg latency, error rate stats
- Throughput: RPS by delivery, event volume stacked bars, top events
- Errors: error count over time, error rate by delivery, error
counts table, top error messages table
- Latency: avg latency by delivery, avg latency by top events,
slowest endpoints table with route-level detail
- Users & Accounts: active users/towns over time, top users by
event count, top users by error count
- Domain Breakdown: delivery type pie, top events pie, all events
summary table with success/error/latency
- Internal DO Events: bead lifecycle, agent/review/convoy events
Uses $timeSeries, $timeFilter, $interval_s Grafana macros for the
cloudflare-analytics-engine datasource plugin.
* fix(gastown): fix Grafana dashboard panel queries for ClickHouse datasource plugin
All panels now have the required target properties:
- dateTimeType: DATETIME
- dateTimeColDataType: timestamp
- editorMode: sql
- table: gastown_events
- query field set (not just rawSql)
- datasource type: vertamedia-clickhouse-datasource
- $interval_s replaced with $interval
* fix(gastown): remove subqueries from Analytics Engine SQL queries
CF Analytics Engine doesn't support IN (SELECT ...) subqueries.
Removed the top-N subquery filter from:
- Grafana panels 7 (Event Volume top events) and 13 (Avg Latency top events)
- Admin API events-timeseries query
The queries now return all events grouped by time — users can
toggle individual series via the Grafana legend.
* fix(gastown): address review feedback and fix lint/format
- Fix Sentry double-capture: remove captureException from instrumented()
and tRPC analytics middleware; keep single capture in app.onError() and
trpcServer onError (guarded to skip TRPCErrors)
- Fix CHECK constraint regex: handle nested parens in check(col in (...))
- Fix sourcemap release mismatch: inject SENTRY_RELEASE via --var at
deploy time using sentry-cli propose-version (git SHA)
- Fix agent-auth userId: fall back to agentJWT.userId when kiloUserId
is unset (agent-authenticated routes)
- Fix status WebSocket: connect on initial page load, not just on change
- Fix SDK session leak: decrement sessionCount when agent completes
normally via session.idle
- Fix index loss: reorder initBeadTables to run dropCheckConstraints
before index creation
- Fix lint: use type narrowing instead of String() for sqlite_master rows
- Fix format: prettier on container/src/logger.ts
- Fix Grafana: convert Total Events and Unique Users to time series
stat panels with correct field selectors
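
The session-leak fix comes down to releasing the counter on every exit path. The sketch below shows one generic way to guarantee that with `try`/`finally`. It is an illustration, not the commit's approach (which decrements on `session.idle`), and `SessionTracker` is a hypothetical name.

```typescript
// Hypothetical tracker: the count goes up when a session starts and always
// comes back down, whether the session finishes normally or throws.
class SessionTracker {
  private count = 0;

  get sessionCount(): number {
    return this.count;
  }

  async run<T>(session: () => Promise<T>): Promise<T> {
    this.count++;
    try {
      return await session();
    } finally {
      this.count--; // runs on normal completion and on error
    }
  }
}
```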
* fix(gastown): address remaining review comments
- Fix SDK session leak on stream errors (process-manager catch path)
- Add convoyId/role/beadType to Analytics Engine blobs (blob11-13)
- Fix error-rate line plotted on count axis — add secondary X axis
- Add client-side top-15 filtering to EventsTimeseriesChart
- Add LIMIT 500 to unbounded Grafana time series panels (7, 13)
- Update Grafana panel titles to reflect actual behavior
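
With the top-N subquery gone from the SQL, the filtering can happen after the rows arrive. The following is a rough sketch of client-side top-N selection, assuming each row carries a time bucket, an event name, and a count. `SeriesPoint` and `topNEvents` are hypothetical names, not the actual `EventsTimeseriesChart` code.

```typescript
// Hypothetical row shape for a grouped-by-time events query.
interface SeriesPoint {
  bucket: string;
  event: string;
  count: number;
}

// Keeps only the rows belonging to the n events with the highest total count.
function topNEvents(points: SeriesPoint[], n = 15): SeriesPoint[] {
  // Sum counts per event across all time buckets.
  const totals = new Map<string, number>();
  for (const p of points) {
    totals.set(p.event, (totals.get(p.event) ?? 0) + p.count);
  }
  // Rank by total, breaking ties by name so the result is deterministic.
  const keep = new Set(
    Array.from(totals.entries())
      .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
      .slice(0, n)
      .map(([event]) => event),
  );
  return points.filter((p) => keep.has(p.event));
}
```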
* fix(gastown): fix event name collisions, latency dilution, and formatting
- Fix deriveHttpEventName: distinguish list vs get by checking if route
ends with a param segment; keep 'mayor' as a meaningful segment
- Fix overview avg_latency_ms: only average http/trpc events (skip
zero-duration internal events)
- Fix top-users avg_latency_ms: same conditional filtering
- Format gastown-grafana-dash-1.json with prettier
1 parent cba9409 commit 8f22340
30 files changed
Lines changed: 4515 additions & 11414 deletions
File tree
- cloudflare-gastown
- container/src
- scripts
- src
- db/tables
- dos
- town
- middleware
- trpc
- ui
- util
- src/app/admin
- api/gastown-analytics
- gastown