# Observability and Monitoring

The platform includes a comprehensive observability stack designed for production readiness, built on **OpenTelemetry**, **structured logging**, and **forensic audit logs**.

## 1. Metrics & Tracing (OpenTelemetry)

We use **OpenTelemetry (OTel)** for vendor-neutral instrumentation.

### Configuration

Set the following environment variables:

- `OBSERVABILITY_EXPORTER="otlp"`: Enables the OTLP exporter (requires a collector such as Jaeger or the Datadog Agent).
- `OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"`: The endpoint for the collector (gRPC).
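
How an application consumes these variables is implementation-specific; as a rough sketch (the helper name and fallback values here are illustrative, not the platform's actual API):

```python
import os

def resolve_exporter_config() -> dict:
    """Resolve the exporter configuration from the environment.

    Mirrors the variables documented above; the defaults are
    illustrative, not the platform's actual fallback behavior.
    """
    exporter = os.environ.get("OBSERVABILITY_EXPORTER", "none")
    endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
    return {
        "otlp_enabled": exporter == "otlp",
        "endpoint": endpoint if exporter == "otlp" else None,
    }

os.environ["OBSERVABILITY_EXPORTER"] = "otlp"
config = resolve_exporter_config()
print(config["otlp_enabled"])  # True
```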

### Key Metrics

| Metric Name | Type | Unit | Attributes | Description |
| :--- | :--- | :--- | :--- | :--- |
| `nl2sql.token.usage` | Counter | `1` | `model`, `agent`, `datasource_id` | Total LLM tokens consumed. |
| `nl2sql.node.duration` | Histogram | `s` | `node`, `datasource_id` | Execution duration of graph nodes. |
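
For illustration, each data point recorded against these instruments carries the metric name, a value in the declared unit, and the attributes listed above. A minimal stdlib sketch of that shape (a stand-in for the OTel SDK's `Histogram.record()`, which the platform presumably uses internally):

```python
import time

def record_node_duration(measurements: list, node: str, datasource_id: str, duration_s: float) -> None:
    """Append a data point shaped like the `nl2sql.node.duration` histogram above."""
    measurements.append({
        "name": "nl2sql.node.duration",
        "unit": "s",
        "value": duration_s,
        "attributes": {"node": node, "datasource_id": datasource_id},
    })

measurements: list = []
start = time.perf_counter()
time.sleep(0.01)  # stand-in for executing a graph node
record_node_duration(measurements, "planner", "ds_1", time.perf_counter() - start)
print(measurements[0]["name"])  # nl2sql.node.duration
```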

### Visualization

- **Local**: Use [Jaeger](https://www.jaegertracing.io/) for traces and [Prometheus](https://prometheus.io/) for metrics.
- **Production**: Compatible with Datadog, Honeycomb, New Relic, and other OTLP-compatible backends.

## 2. Structured Logging

For production, logs are output in **JSON format** to facilitate parsing by aggregators (Splunk, ELK).

- **Activation**: JSON logging is automatically enabled when `OBSERVABILITY_EXPORTER="otlp"`.
- **Correlation**: Every log entry includes a `trace_id` and a `tenant_id` (if authenticated) to correlate logs across the request lifecycle.

**Example Log Entry:**

```json
{
  "timestamp": "2024-01-01T12:00:00",
  "level": "INFO",
  "message": "Planning phase completed",
  "trace_id": "8a3c...",
  "tenant_id": "org_123",
  "node": "planner"
}
```
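
The layout above can be produced with a stdlib `logging.Formatter` subclass. A minimal sketch (field names follow the example; the wiring is illustrative, not the platform's exact logger setup):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON matching the layout above."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
            # Correlation fields are attached via `extra=`; default to None when absent.
            "trace_id": getattr(record, "trace_id", None),
            "tenant_id": getattr(record, "tenant_id", None),
            "node": getattr(record, "node", None),
        }
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("nl2sql")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info(
    "Planning phase completed",
    extra={"trace_id": "8a3c...", "tenant_id": "org_123", "node": "planner"},
)
```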

## 3. Persistent Audit Log

For forensic analysis and "Time Travel" debugging, the system maintains a separate, persistent audit log.

- **Location**: `logs/audit_events.log` (rotation enabled: 10 MB x 5 backups).
- **Content**: A detailed record of AI decisions (prompt inputs, model responses, token usage).
- **Purpose**: Allows operators to answer "Why did the AI say X?" hours or days later.

**Event Structure:**

```json
{
  "timestamp": "...",
  "event_type": "llm_interaction",
  "trace_id": "...",
  "tenant_id": "...",
  "data": {
    "agent": "planner",
    "model": "gpt-4o",
    "response_snippet": "SELECT * FROM...",
    "token_usage": {"total_tokens": 150}
  }
}
```
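
The rotation policy above maps directly onto the stdlib's `RotatingFileHandler`. A minimal sketch of an audit writer (the helper function is illustrative, not the platform's audit API):

```python
import json
import logging
import logging.handlers
import os
import time

os.makedirs("logs", exist_ok=True)

# 10 MB per file, 5 backups, matching the rotation policy above.
handler = logging.handlers.RotatingFileHandler(
    "logs/audit_events.log", maxBytes=10 * 1024 * 1024, backupCount=5
)
audit_logger = logging.getLogger("nl2sql.audit")
audit_logger.addHandler(handler)
audit_logger.setLevel(logging.INFO)
audit_logger.propagate = False  # keep audit events out of the main log stream

def log_llm_interaction(trace_id: str, tenant_id: str, data: dict) -> None:
    """Append one audit event shaped like the structure above."""
    audit_logger.info(json.dumps({
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "event_type": "llm_interaction",
        "trace_id": trace_id,
        "tenant_id": tenant_id,
        "data": data,
    }))

log_llm_interaction("8a3c...", "org_123", {"agent": "planner", "model": "gpt-4o"})
```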

## 4. Legacy Tooling

The CLI `Performance Tree` is preserved for local development convenience but piggybacks on the same instrumentation hooks.