
Commit cd15013

georgeglarson and claude committed
Update README with observability stack documentation
Add Observability section covering structured logging (Pino), distributed tracing (OpenTelemetry), log-trace correlation, and self-hosted SigNoz dashboards. Update architecture diagram, tech stack table, project structure, running instructions, and portfolio summary to reflect the full monitoring pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 30cf314 commit cd15013

1 file changed

Lines changed: 52 additions & 4 deletions

README.md

@@ -26,7 +26,9 @@ This is the same approach that works on enterprise software. It works on games t
2626

2727
**4. Make the architecture earn its keep.** Every abstraction exists because it solved a real problem. The SpatialManager exists because the aggro tick was O(n\*m) and needed spatial partitioning. The AggroPolicy exists because safe zones, density caps, and level scaling were scattered across three files. The EventBus exists because mob death needed to notify five decoupled systems.
2828

29-
**5. Ship it.** The game is live, behind nginx with SSL. It handles concurrent players. It's been through a full security audit. It's deployed and maintained.
29+
**5. Add observability.** You can't maintain what you can't see. Structured logging (Pino), distributed tracing (OpenTelemetry), and a self-hosted monitoring dashboard (SigNoz) so every request, every save, every AI call is traceable end-to-end. The same stack you'd wire up for a production microservice, applied to a game server.
30+
31+
**6. Ship it.** The game is live, behind nginx with SSL. It handles concurrent players. It's been through a full security audit. It's deployed and maintained.
3032

3133
## What I built on top of the legacy code
3234

@@ -54,6 +56,10 @@ Decomposed an 1,100-line God-object Player class into focused modules using SRP.
5456
│ │ Venice AI SDK (llama-3.3-70b) + Fish Audio TTS │ │
5557
│ └───────────────────────────────────────────────────┘ │
5658
├─────────────────────────────────────────────────────────┤
59+
│ Observability │
60+
│ Pino → OTel Collector → ClickHouse ← SigNoz UI │
61+
│ Structured logs, distributed traces, span metrics │
62+
├─────────────────────────────────────────────────────────┤
5763
│ Shared (4k LOC TypeScript) │
5864
│ Game types, zone data, skills, items, events │
5965
└─────────────────────────────────────────────────────────┘
@@ -75,6 +81,25 @@ Seven progression zones from village (safe) to boss arena. An AggroPolicy engine
7581

7682
Socket.IO WebSocket transport with 105 message types. Spatial partitioning for zone-based broadcasting so mobs only scan players in adjacent groups, not all players globally. Party system with proximity-based XP sharing. Per-message-type rate limiting. Spawn protection. Anti-exploit validation. SQLite persistence for all player state.
7783

84+
### Observability
85+
86+
Production-grade monitoring stack using the same tools and patterns as commercial microservices:
87+
88+
**Structured logging.** Every `console.*` call (316 across 48 files) replaced with Pino structured logging. 50 modules emit JSON logs with typed context: player IDs, item kinds, damage values, zone names. Player-scoped child loggers automatically attach identity to every log line. Hot-path logging (aggro ticks, movement) uses `trace` level that Pino skips entirely unless explicitly enabled.
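
The two ideas in this paragraph, child loggers that carry identity and a level check that runs before any serialization work, can be sketched without Pino itself. The `MiniLogger` class below is a hypothetical stand-in, not Pino's API and not the project's logger; the `service`, `playerId`, and `zone` fields are illustrative.

```typescript
// Hypothetical stand-in for a structured logger; NOT Pino's actual API.
type Level = "trace" | "debug" | "info" | "warn" | "error";
type Fields = Record<string, unknown>;

const LEVEL_ORDER: Record<Level, number> = {
  trace: 10, debug: 20, info: 30, warn: 40, error: 50,
};

class MiniLogger {
  private readonly minLevel: Level;
  private readonly bindings: Fields;
  private readonly sink: (line: string) => void;

  constructor(minLevel: Level, bindings: Fields, sink: (line: string) => void) {
    this.minLevel = minLevel;
    this.bindings = bindings;
    this.sink = sink;
  }

  // A child inherits the parent's bindings and adds its own.
  child(extra: Fields): MiniLogger {
    return new MiniLogger(this.minLevel, { ...this.bindings, ...extra }, this.sink);
  }

  log(level: Level, fields: Fields, msg: string): void {
    // The level check comes first, so disabled levels never reach JSON.stringify.
    if (LEVEL_ORDER[level] < LEVEL_ORDER[this.minLevel]) return;
    this.sink(JSON.stringify({ level, ...this.bindings, ...fields, msg }));
  }
}

const lines: string[] = [];
const root = new MiniLogger("info", { service: "game-server" }, (l) => lines.push(l));
const playerLog = root.child({ playerId: "p-42" }); // illustrative ID

playerLog.log("info", { zone: "village" }, "player entered zone");
playerLog.log("trace", { dx: 1, dy: 0 }, "movement tick"); // dropped: trace < info
```

Only the `info` line reaches the sink, and it carries `playerId` without the call site passing it. That is the property that makes hot-path `trace` calls cheap to leave in place.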
89+
90+
**Distributed tracing.** OpenTelemetry SDK with manual span instrumentation on the paths that matter: message routing (`player.message.{type}`), persistence operations (`storage.saveCharacter`, `storage.loadPlayerState`), aggro ticks (`game.aggro_tick` with mob count attributes), and external AI calls (`ai.venice`, `ai.tts` with latency tracking). HTTP auto-instrumentation covers Socket.IO transport. 10% sampling in production to control volume.
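
To make the 10% figure concrete: ratio-based samplers generally derive the keep/drop decision from the trace ID itself, so every component that sees the same trace reaches the same verdict. The function below is a simplified, dependency-free illustration of that idea. It is not the OpenTelemetry SDK's sampler, and reading the last 8 hex digits of the ID is an assumption made for this sketch.

```typescript
// Simplified ratio sampler: deterministic per trace ID.
// Illustrative only; not the OpenTelemetry SDK implementation.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the last 8 hex digits as an unsigned 32-bit integer.
  const tail = parseInt(traceId.slice(-8), 16);
  if (Number.isNaN(tail)) return false; // unparseable tail: drop the trace
  // Keep the trace when its value falls in the lowest `ratio` share of the range.
  return tail < Math.floor(ratio * 0x100000000);
}

const PRODUCTION_RATIO = 0.1; // the 10% figure from the README

// Made-up 32-character trace IDs at the two extremes of the tail range.
const lowTail = "4bf92f3577b34da6a3ce929d00000000";
const highTail = "4bf92f3577b34da6a3ce929dffffffff";

const keptLow = shouldSample(lowTail, PRODUCTION_RATIO);
const keptHigh = shouldSample(highTail, PRODUCTION_RATIO);
```

Because the decision is a pure function of the trace ID, a sampled trace is kept in full and a dropped one is dropped in full, which avoids partial traces in the dashboard.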
91+
92+
**Log-trace correlation.** `pino-opentelemetry-transport` injects `trace_id` and `span_id` into every log line and ships logs via OTLP to the same collector that receives traces. Click a trace in SigNoz, see every log that happened during that request.
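
Conceptually, correlation is a join on `trace_id`: each log record is stamped with the IDs of the span that was active when it was written, and the UI filters on that field. The sketch below shows the stamping and the lookup with plain objects. `SpanContext`, `stamp`, and `logsForTrace` are hypothetical names; in the real stack the transport and SigNoz perform these steps.

```typescript
// Hypothetical shapes for illustration; not types from any OTel or Pino package.
interface SpanContext {
  traceId: string;
  spanId: string;
}

interface LogRecord {
  msg: string;
  trace_id?: string;
  span_id?: string;
  [key: string]: unknown;
}

// Attach the active span's IDs to a log record, if a span is active.
function stamp(record: LogRecord, active: SpanContext | undefined): LogRecord {
  if (!active) return record;
  return { ...record, trace_id: active.traceId, span_id: active.spanId };
}

// The "click a trace, see its logs" lookup, reduced to a filter.
function logsForTrace(records: LogRecord[], traceId: string): LogRecord[] {
  return records.filter((r) => r.trace_id === traceId);
}

// Made-up IDs with the usual lengths (32 and 16 hex characters).
const saveSpan: SpanContext = { traceId: "a".repeat(32), spanId: "b".repeat(16) };

const records: LogRecord[] = [
  stamp({ msg: "saving character" }, saveSpan),
  stamp({ msg: "save complete", durationMs: 12 }, saveSpan),
  stamp({ msg: "server heartbeat" }, undefined), // logged outside any span
];

const related = logsForTrace(records, saveSpan.traceId);
```

Here `related` holds the two save-related records, while the heartbeat record, written with no active span, carries no `trace_id` and is excluded.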
93+
94+
**Self-hosted dashboards.** SigNoz (ClickHouse-backed) with dashboards for server operations and AI/persistence monitoring. Four alert rules: aggro tick P99 latency, Venice AI response time, error rate spike, and save frequency drop. All running on the same VPS with ClickHouse capped at 2GB.
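
For readers unfamiliar with the P99 metric in the first alert rule: it is the duration that 99% of observed samples do not exceed. A minimal nearest-rank computation looks like the following. The sample durations are made up, and the real alerts are evaluated by SigNoz, not by code like this.

```typescript
// Nearest-rank percentile over a list of durations (illustrative helper).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...values].sort((a, b) => a - b);
  // 1-based rank; multiply before dividing to stay in integer arithmetic.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

// 98 fast aggro ticks (2 ms) and 2 slow ones (40 ms); made-up numbers.
const tickDurationsMs: number[] = [
  ...Array.from({ length: 98 }, () => 2),
  40,
  40,
];

const p50 = percentile(tickDurationsMs, 50);
const p99 = percentile(tickDurationsMs, 99);
```

The median stays at 2 ms while the P99 lands on 40 ms, which is why tail percentiles rather than averages are the usual choice for latency alerts.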
95+
96+
```
97+
Game Server ──OTLP HTTP──→ OTel Collector ──→ ClickHouse
98+
│ Pino (JSON logs) │ ↑
99+
│ OTel SDK (traces) │ SigNoz UI
100+
└─────────────────────────────┘ (dashboards, alerts)
101+
```
102+
78103
## Test suite
79104

80105
```
@@ -101,8 +126,9 @@ Vitest with v8 coverage. Storage tests use in-memory SQLite. Coverage thresholds
101126
| **Server** | Node.js, TypeScript 5.8, Socket.IO 4 |
102127
| **Database** | SQLite (better-sqlite3) |
103128
| **AI** | Venice AI SDK (llama-3.3-70b), Fish Audio TTS |
129+
| **Observability** | OpenTelemetry, Pino, SigNoz, ClickHouse |
104130
| **Testing** | Vitest 4, v8 coverage |
105-
| **Production** | nginx, Let's Encrypt SSL |
131+
| **Production** | nginx, Let's Encrypt SSL, Docker Compose |
106132
| **Package manager** | pnpm |
107133

108134
## Running locally
@@ -130,6 +156,21 @@ pnpm test:coverage
130156

131157
Client connects to `localhost:8000` by default. For production, configure `client/config/config.prod.json`.
132158

159+
### Observability stack (optional)
160+
161+
```bash
162+
# Start SigNoz (ClickHouse + OTel Collector + UI)
163+
docker compose -f docker-compose.signoz.yml up -d
164+
165+
# Start the server with OTel export enabled
166+
NODE_ENV=production OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
167+
node dist/server/ts/main.js
168+
169+
# SigNoz UI at http://localhost:3301
170+
```
171+
172+
In dev mode (`NODE_ENV !== 'production'`), traces print to the console and logs use pino-pretty. No collector needed for local development.
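
The dev/production split described above might be expressed as a single selection function. This is a sketch, not the contents of `tracing.ts` or `utils/logger.ts`: `selectTelemetry` and the `TelemetryPlan` shape are hypothetical, and the 100% dev sampling ratio and the fallback endpoint are assumptions the README does not state.

```typescript
// Hypothetical config shape; the project's real bootstrap may differ.
interface TelemetryPlan {
  traceExporter: "console" | "otlp";
  logTransport: "pino-pretty" | "pino-opentelemetry-transport";
  samplingRatio: number;
  otlpEndpoint?: string;
}

function selectTelemetry(env: Record<string, string | undefined>): TelemetryPlan {
  if (env["NODE_ENV"] !== "production") {
    // Dev: print traces to the console, pretty-print logs, no collector needed.
    // Sampling everything in dev is an assumption, not stated in the README.
    return { traceExporter: "console", logTransport: "pino-pretty", samplingRatio: 1 };
  }
  return {
    traceExporter: "otlp",
    logTransport: "pino-opentelemetry-transport",
    samplingRatio: 0.1, // 10% sampling in production, per the README
    // Fallback value is an assumption, matching the endpoint used in the commands above.
    otlpEndpoint: env["OTEL_EXPORTER_OTLP_ENDPOINT"] ?? "http://localhost:4318",
  };
}

const devPlan = selectTelemetry({});
const prodPlan = selectTelemetry({
  NODE_ENV: "production",
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://localhost:4318",
});
```

Keeping the decision in one pure function makes the two modes easy to unit-test without starting a collector.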
173+
133174
## Project structure
134175

135176
```
@@ -141,27 +182,34 @@ Fracture/
141182
│ ├── renderer/ # Canvas rendering, camera, particles
142183
│ └── ui/ # HUD, inventory, shop, achievement panels
143184
├── server/ts/ # Game server (128 files)
185+
│ ├── tracing.ts # OTel SDK bootstrap (imported first)
144186
│ ├── ai/ # Venice AI, narration, TTS
145187
│ ├── combat/ # Aggro policy, combat tracker, kill streaks, nemesis
146188
│ ├── player/ # MessageRouter + handler modules
147-
│ ├── storage/ # SQLite persistence
189+
│ ├── storage/ # SQLite persistence (instrumented with spans)
190+
│ ├── utils/logger.ts # Pino structured logging + OTel transport
148191
│ ├── world/ # Spatial manager, spawn manager, game loop
149192
│ └── __tests__/ # Test suite (43 files)
150193
├── shared/ts/ # Shared types (27 files)
151194
│ ├── zones/ # Zone boundaries and bonuses
152195
│ ├── skills/ # Skill definitions
153196
│ ├── items/ # Item types, legendaries, rarity
154197
│ └── events/ # Typed event bus
198+
├── deploy/ # Deployment configs
199+
│ ├── signoz/ # OTel Collector config
200+
│ └── common/ # ClickHouse configs
201+
├── docker-compose.signoz.yml # SigNoz observability stack
155202
└── specs/ # Feature specifications
156203
```
157204

158205
## What this demonstrates
159206

160207
- **Legacy modernization.** Taking a real codebase from 2012 and systematically improving it without rewriting from scratch.
161208
- **Systems design.** Combat, inventory, progression, zones, AI, persistence, real-time networking, all integrated and tested.
209+
- **Observability engineering.** Structured logging, distributed tracing, and self-hosted monitoring wired end-to-end. The same OTel + Pino + SigNoz stack used in production microservices, applied to a game server.
162210
- **AI-augmented development.** Built with Claude as a development partner, showing what one engineer can ship with AI tooling.
163211
- **Testing discipline.** 2,255 tests, coverage thresholds enforced, tests written before refactors.
164-
- **Production operations.** SSL, reverse proxy, rate limiting, anti-exploit guards, persistence, deployed and running.
212+
- **Production operations.** SSL, reverse proxy, rate limiting, anti-exploit guards, Docker Compose infrastructure, deployed and running.
165213

166214
## Credits
167215
