graph LR
App["FastAPI App"]
App -->|"LLM traces\n(every call)"| Langfuse
App -->|"HTTP metrics\n(/metrics)"| Prometheus
Prometheus --> Grafana
App -->|"structured logs\n(stdout)"| Logs["Log aggregator\n(or stdout)"]
App -->|"slow request profiles\n(DEBUG only)"| Profiles["JSON files\n(PROFILING_DIR)"]
Every LLM call is traced via the LangChain CallbackHandler. Traces include:
- Input messages and output
- Token usage and cost
- Latency per call and per session
- Model name, temperature, and other parameters
Setup:
LANGFUSE_TRACING_ENABLED=true
LANGFUSE_PUBLIC_KEY=pk-...
LANGFUSE_SECRET_KEY=sk-...
LANGFUSE_HOST=https://cloud.langfuse.com # or your self-hosted URL
Disable for local dev:
LANGFUSE_TRACING_ENABLED=false
Traces are also used as the data source for the evaluation framework.
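The environment variables above can be read in one place before any tracing handler is created. The following is a minimal sketch, not the project's actual settings code: `LangfuseSettings` and `load_langfuse_settings` are hypothetical names, and the defaults (tracing off when the flag is unset, the cloud host when `LANGFUSE_HOST` is unset) are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LangfuseSettings:
    """Hypothetical container for the Langfuse variables listed above."""
    public_key: str
    secret_key: str
    host: str


def load_langfuse_settings(env: Mapping[str, str] = os.environ) -> Optional[LangfuseSettings]:
    """Return settings when tracing is enabled, or None when it is disabled."""
    # Assumption: an unset flag means tracing is off.
    enabled = env.get("LANGFUSE_TRACING_ENABLED", "false").strip().lower() == "true"
    if not enabled:
        return None
    # Fail early instead of sending traces with empty credentials.
    missing = [k for k in ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY") if not env.get(k)]
    if missing:
        raise ValueError("tracing enabled but missing: " + ", ".join(missing))
    return LangfuseSettings(
        public_key=env["LANGFUSE_PUBLIC_KEY"],
        secret_key=env["LANGFUSE_SECRET_KEY"],
        # Assumption: fall back to the cloud URL shown in the setup block.
        host=env.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
    )
```

A caller would attach the LangChain callback handler only when this function returns a non-`None` value, so local development with `LANGFUSE_TRACING_ENABLED=false` never needs the keys.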
All logs use structlog in a consistent format:
- Development: coloured console output
- Production: JSON (pipe to your log aggregator)
Every log line automatically carries request_id, session_id, and user_id when available — bound by LoggingContextMiddleware.
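To illustrate how request-scoped fields can reach every log line without being passed explicitly, here is a standard-library sketch of the underlying mechanism. It is not the code of LoggingContextMiddleware: `bind_context` and `render_event` are hypothetical helpers, and the IDs are made-up sample values. structlog provides a comparable facility in its `structlog.contextvars` module.

```python
import contextvars
import json

# Request-scoped fields. Each request (or asyncio task) sees its own value.
_log_context = contextvars.ContextVar("log_context", default=None)


def bind_context(**fields):
    """Merge fields into the current context and return a token for reset()."""
    merged = {**(_log_context.get() or {}), **fields}
    return _log_context.set(merged)


def render_event(event, **fields):
    """Render one JSON log line: event name, bound context, call-site fields."""
    record = {"event": event, **(_log_context.get() or {}), **fields}
    return json.dumps(record)


# What a middleware would do at the start of a request (sample values).
token = bind_context(request_id="req-123", session_id="sess-9")
line = render_event("chat_request_received", message_count=5)
# What it would do at the end, so the next request starts clean.
_log_context.reset(token)
```

The call site only supplies `message_count`; `request_id` and `session_id` come from the bound context.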
# Good
logger.info("chat_request_received", session_id=session.id, message_count=5)
# Never
logger.info(f"chat request received for {session.id}") # no f-strings
logger.error("something failed", error=str(e)) # use logger.exception for exceptions
Rules:
- Event names are lowercase_with_underscores
- Variables are keyword arguments, never interpolated into the event string
- Use logger.exception() (not .error()) when inside an except block; it preserves the full traceback
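The difference between `.error()` and `.exception()` is easiest to see in captured output. This sketch uses the standard-library `logging` module as a stand-in, since its `Logger.exception` has the same traceback-preserving behaviour; the `capture` helper and the event name are illustrative only.

```python
import io
import logging


def capture(log_call):
    """Run log_call inside an except block and return what the handler wrote."""
    stream = io.StringIO()
    logger = logging.getLogger("traceback_demo")
    logger.handlers = [logging.StreamHandler(stream)]
    logger.propagate = False
    logger.setLevel(logging.DEBUG)
    try:
        raise ValueError("boom")
    except ValueError as exc:
        log_call(logger, exc)
    return stream.getvalue()


# .error() with the stringified exception: the stack trace is lost.
with_error = capture(lambda log, exc: log.error("llm_call_failed", extra={"error": str(exc)}))

# .exception(): the active exception and its traceback are attached automatically.
with_exception = capture(lambda log, exc: log.exception("llm_call_failed"))
```

Here `with_error` holds only the event name, while `with_exception` also contains the `Traceback (most recent call last)` block ending in `ValueError: boom`.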
| Environment | Level |
|---|---|
| development | DEBUG |
| staging | INFO |
| production | WARNING |
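The table above amounts to a small lookup. A minimal sketch, assuming the environment name arrives as a plain string; `log_level_for` is a hypothetical helper and the INFO fallback for unknown names is an assumption:

```python
import logging

# Mirrors the environment/level table above.
LOG_LEVELS = {
    "development": logging.DEBUG,
    "staging": logging.INFO,
    "production": logging.WARNING,
}


def log_level_for(environment: str) -> int:
    """Return the numeric log level for an environment name."""
    # Assumption: unknown environments fall back to INFO.
    return LOG_LEVELS.get(environment.strip().lower(), logging.INFO)


def is_emitted(event_level: int, environment: str) -> bool:
    """True when an event at event_level passes the environment's threshold."""
    return event_level >= log_level_for(environment)
```

One practical consequence: `is_emitted(logging.INFO, "production")` is false, so an event logged with `logger.info(...)` is visible in development and staging but dropped in production.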
Metrics are exposed at GET /metrics and scraped by Prometheus.
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Request count by method, endpoint, status |
| http_request_duration_seconds | Histogram | Request latency by method, endpoint |
| llm_inference_duration_seconds | Histogram | LLM call latency by model |
| llm_stream_duration_seconds | Histogram | Streaming call latency by model |
| db_connections | Gauge | Active database connections |
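When checking the endpoint by hand, it can help to turn the text that `GET /metrics` returns into structured samples. The sketch below parses a simplified subset of the Prometheus text exposition format (no timestamps, no escaped quotes inside the closing brace). The sample payload, its values, and the exact label names are assumptions based on the descriptions in the table, not output captured from the app.

```python
import re

# name, optional {labels}, then a single numeric value
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:  # outside the simplified subset handled here
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Illustrative payload only; real output will differ.
example = """\
# TYPE http_requests_total counter
http_requests_total{method="POST",endpoint="/api/v1/chatbot/chat",status="200"} 42.0
# TYPE db_connections gauge
db_connections 3.0
"""
samples = parse_metrics(example)
```

For anything beyond a quick check, Prometheus and the Grafana dashboards remain the intended way to read these metrics.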
Grafana dashboards are pre-configured in grafana/. Start the full stack with make stack-up ENV=development to access them at http://localhost:3000 (admin/admin).
When DEBUG=true, ProfilingMiddleware profiles every request using pyinstrument. When a request exceeds PROFILING_THRESHOLD_SECONDS, a JSON report is saved to PROFILING_DIR.
Each report file is named {request_id}.json and contains:
{
"request_id": "...",
"endpoint": "POST /api/v1/chatbot/chat",
"wall_time_ms": 1842,
"cpu_time_ms": 145,
"io_wait_ms": 1697,
"memory_peak_kb": 4820,
"top_memory_allocators": [...],
"call_tree": {...}
}
Set PROFILING_THRESHOLD_SECONDS=0 to save a report for every request.
The request_id in the filename matches the X-Request-ID response header, so you can correlate profiles with specific log lines.
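The threshold rule and file naming described above can be sketched as follows. This is not the ProfilingMiddleware source: `save_profile_if_slow` is a hypothetical helper, the report dict is a trimmed-down stand-in for the real pyinstrument-based report, and the strict `>` comparison is an assumption drawn from the word "exceeds".

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def save_profile_if_slow(
    report: dict,
    wall_time_seconds: float,
    threshold_seconds: float,
    profiling_dir: Path,
) -> Optional[Path]:
    """Write {request_id}.json when the request exceeded the threshold."""
    if wall_time_seconds <= threshold_seconds:
        return None  # fast request: nothing is written
    profiling_dir.mkdir(parents=True, exist_ok=True)
    path = profiling_dir / f"{report['request_id']}.json"
    path.write_text(json.dumps(report, indent=2))
    return path


# Demo with a temporary directory standing in for PROFILING_DIR.
with tempfile.TemporaryDirectory() as tmp:
    slow = {"request_id": "req-123", "wall_time_ms": 1842, "cpu_time_ms": 145}
    saved = save_profile_if_slow(slow, 1.842, 1.0, Path(tmp))
    saved_name = saved.name
    saved_body = json.loads(saved.read_text())
    fast = {"request_id": "req-456", "wall_time_ms": 200, "cpu_time_ms": 20}
    skipped = save_profile_if_slow(fast, 0.2, 1.0, Path(tmp))
```

In the sample report, `io_wait_ms` equals `wall_time_ms` minus `cpu_time_ms` (1842 - 145 = 1697), which suggests a request like that one spends most of its time waiting on I/O such as the LLM call rather than on CPU work.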
Every request gets a unique X-Request-ID header via asgi-correlation-id. This ID is:
- Returned in the response headers
- Bound to every log line for that request
- Used as the filename for profile reports
Use the X-Request-ID from a response to grep logs, find profiles, and look up Langfuse traces for that exact request.
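For JSON-formatted production logs, the lookup by request ID can be done with a few lines of Python instead of `grep`. A minimal sketch, assuming one JSON object per line with a `request_id` field as described above; `lines_for_request` and the sample log lines are illustrative.

```python
import json


def lines_for_request(log_lines, request_id):
    """Return the parsed log records that belong to one request."""
    matches = []
    for raw in log_lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as server banners
        if isinstance(record, dict) and record.get("request_id") == request_id:
            matches.append(record)
    return matches


# Illustrative log lines only.
sample_logs = [
    '{"event": "chat_request_received", "request_id": "req-123", "message_count": 5}',
    '{"event": "chat_request_received", "request_id": "req-456", "message_count": 2}',
    "Uvicorn running on http://0.0.0.0:8000",
    '{"event": "llm_call_completed", "request_id": "req-123"}',
]
events = [r["event"] for r in lines_for_request(sample_logs, "req-123")]
```

The same ID then names the profile file (`req-123.json` in this example) if the request was slow enough to be saved.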