nadeem4
diff --git a/‎README.md‎
Lines changed: 8 additions & 0 deletions b/‎README.md‎
Lines changed: 8 additions & 0 deletions
diff --git a/‎audit/remediation_plan.md‎
Lines changed: 10 additions & 5 deletions b/‎audit/remediation_plan.md‎
Lines changed: 10 additions & 5 deletions
diff --git a/‎audit/remediation_plan_observability.md‎
Lines changed: 64 additions & 0 deletions b/‎audit/remediation_plan_observability.md‎
Lines changed: 64 additions & 0 deletions
diff --git a/‎configs/llm.demo.yaml‎
Lines changed: 1 addition & 1 deletion b/‎configs/llm.demo.yaml‎
Lines changed: 1 addition & 1 deletion
diff --git a/‎docs/ops/configuration.md‎
Lines changed: 3 additions & 0 deletions b/‎docs/ops/configuration.md‎
Lines changed: 3 additions & 0 deletions
diff --git a/‎docs/ops/observability.md‎
Lines changed: 58 additions & 29 deletions b/‎docs/ops/observability.md‎
Lines changed: 58 additions & 29 deletions
diff --git a/‎packages/core/pyproject.toml‎
Lines changed: 1 addition & 0 deletions b/‎packages/core/pyproject.toml‎
Lines changed: 1 addition & 0 deletions
diff --git a/‎packages/core/src/nl2sql/common/event_logger.py‎
Lines changed: 89 additions & 0 deletions b/‎packages/core/src/nl2sql/common/event_logger.py‎
Lines changed: 89 additions & 0 deletions
diff --git a/‎packages/core/src/nl2sql/common/logger.py‎
Lines changed: 17 additions & 6 deletions b/‎packages/core/src/nl2sql/common/logger.py‎
Lines changed: 17 additions & 6 deletions
@@ -39,6 +39,13 @@ The architecture is composed of three distinct planes, ensuring separation of co
 * **Layered Defense**: A combination of **[Retries, Circuit Breakers, and Sandboxing](docs/core/reliability.md)** ensures the system stays up even when LLMs or Databases go down.
 * **Fail-Fast**: We stop processing immediately if a dependency is unresponsive, preserving resources.
 
+### 5. The Observability Plane (The Watchtower)
+
+**Responsibility**: Visibility, Forensics, and Compliance.
+
+* **Full-Stack Telemetry**: Native [OpenTelemetry](docs/ops/observability.md) integration provides distributed tracing (Jaeger) and metrics (Prometheus) for every node execution.
+* **Forensic Audit Logs**: A tamper-evident, persistent [Audit Log](docs/ops/observability.md#3-persistent-audit-log) records every AI decision (Prompt/Response/Reasoning) for compliance and debugging.
+
 ---
 
 ## 📐 Architectural Invariants
@@ -92,6 +99,7 @@ nl2sql setup --demo
 * **[Security Model](docs/safety/security.md)**: Defense-in-depth strategy against prompt injection and unauthorized access.
 * **[Security Model](docs/safety/security.md)**: Defense-in-depth strategy against prompt injection and unauthorized access.
 * **[Reliability & Fault Tolerance](docs/core/reliability.md)**: Guide to Circuit Breakers, Sandbox isolation, and Recovery strategies.
+* **[Observability & Operations](docs/ops/observability.md)**: Configuring OpenTelemetry, Logging, and Audit Trails.
 
 ---
 
 
@@ -71,17 +71,22 @@ This document serves as the master backlog for addressing findings from the Arch
   - **Status**: Fixed. Implemented in `nl2sql.common.resilience` and verified in `tests/unit/test_resilience.py`.
 
 - [ ] **ENH-003: OpenTelemetry Integration** (P1 - High)
-  - **Value**: Enable standard APM features (Datadog/Jaeger) for trace visualization.
-  - **Action**: Replace custom `json` logging with OTelSDK.
+  - **Value**: Standardize metrics export (Latency, Token Counts, Errors) to OTLP-compatible backends (Datadog, Honeycomb).
+  - **Action**: Replace in-memory `LATENCY_LOG` with `opentelemetry-sdk` MeterProvider.
+  - **Dependencies**: `opentelemetry-api`, `opentelemetry-sdk`, `opentelemetry-exporter-otlp`.
 
 - [ ] **ENH-004: Persistent Audit Log** (P1 - High)
-  - **Value**: Required for Compliance and Regression Testing.
-  - **Action**: Create a `request_audit` database table to store Query/Plan/SQL tuples.
+  - **Value**: Enable forensic debugging of AI decisions ("Time Travel Debugging").
+  - **Action**: Implement `EventLogger` that writes {prompt, response, trace_id, duration} to a persistent store (file/DB) securely.
 
 - [ ] **ENH-005: Tenant-Aware RLS Middleware** (P2 - Medium)
   - **Value**: Defense-in-depth enforcement of multi-tenancy.
-  - **Action**: Implement a SQL transformation layer in `Generator` that automatically injects `WHERE tenant_id = ?` clauses into every generated AST.
+  - **Action**: Implement a SQL transformation layer in `Generator` that automatically injects `WHERE tenant_id = ?` clauses into every generic AST.
 
 - [ ] **ENH-006: Streaming Response Support** (P2 - Medium)
   - **Value**: Improves perceived latency.
   - **Action**: Update `AggregatorNode` to stream tokens to the frontend instead of waiting for full generation.
+
+- [ ] **ENH-007: Structured Logging (JSON)** (P1 - High)
+  - **Value**: Machine-readable logs for Splunk/ELK.
+  - **Action**: Update `nl2sql.common.logger` to support `JsonFormatter` by default in production.
@@ -0,0 +1,64 @@
+# Remediation Plan: Observability & Reliability
+
+**Source Audit**: `production_readiness_report.md`
+**Date**: 2026-01-13
+**Focus**: Telemetry, Forensics, and Production Visibility.
+
+---
+
+## 🚀 High Priority Enhancements
+
+### [x] **ENH-OBS-001: OpenTelemetry Integration** (P0 - Critical)
+
+- **Goal**: Enable standard APM features (Datadog, Jaeger, Honeycomb) for trace and metric visualization.
+- **Problem**: Current metrics (`LATENCY_LOG`) are in-memory only and lost on restart. No visualization of latency distribution.
+- **Implementation**:
+  - [x] **Wire Up**: Connect `monitor.py` to the existing `opentelemetry-sdk` (dependencies already present).
+  - [x] **Refactor**: Update `nl2sql.common.metrics` to replace in-memory lists with `MeterProvider`.
+  - [x] **Instrument**: Update `monitor.py` to record OTeL Histograms for node execution duration.
+  - [x] **Instrument**: Update `TokenHandler` to record OTeL Counters for token usage.
+  - [x] **Config**: Wire up `OBSERVABILITY_EXPORTER` setting to initialize the generic exporter in `monitor.py`.
+
+### Backend Strategy (Local & Prod)
+>
+> **Why OTLP?** It decouples Python code from the backend. Code sends to `OTLP Collector`, which routes data.
+
+- **Traces** (Waterfalls): Sent to **Jaeger** (Local) or Datadog/Honeycomb (Prod).
+- **Metrics** (Latency/Errors): Sent to **Prometheus** (Local) or Datadog (Prod).
+- **Visualization**: Use **Grafana** to view Prometheus metrics and Jaeger traces in one UI.
+
+### [x] **ENH-OBS-002: Structured Logging (JSON)** (P0 - Critical)
+
+- **Goal**: Machine-readable logs for ingestion by Splunk/ELK/Datadog.
+- **Problem**: Logs are text-based and lack easy parsing for fields like `trace_id` or `user_id`.
+- **Implementation**:
+  - [x] **Enable**: Wire `OBSERVABILITY_EXPORTER=otlp` to trigger the existing `JsonFormatter` in `configure_logging()`.
+  - [x] **Verify**: Ensure `trace_id` injection (already implemented in `TraceContextFilter`) works correctly with the JSON output.
+
+### [x] **ENH-OBS-003: Persistent Audit Log** (P1 - High)
+
+- **Goal**: Forensic "Time Travel" debugging for AI decisions.
+- **Problem**: "Reasoning" is transient. We cannot explain past AI decisions to customers.
+- **Implementation**:
+  - [x] Create `EventLogger` class.
+  - [x] Log `{trace_id, timestamp, node, prompt_text, response_text, model, tokens}` to a persistent store (initially `events.log` rotated file, extensible to DB).
+  - [x] Ensure PII/Secrets are sanitized before logging prompts.
+
+## 🛠️ Medium Priority
+
+### [x] **ENH-OBS-004: Tenant Context Propagation** (P2 - Medium)
+
+- **Goal**: Multi-tenant observability.
+- **Problem**: Logs don't consistently show which tenant/user initiated the request.
+- **Implementation**:
+  - [x] **Schema Validation**: Define strict Pydantic model for `user_context` in `GraphState` (currently untyped Dict).
+  - [x] **Correlation**: Inject `tenant_id` from `user_context` into `trace_context` for log correlation.
+
+---
+
+## 📉 Success Metrics
+
+- **Latency Visibility**: Can view p95 latency per node in APM.
+- **Error Tracking**: Can alert on "Validation Failure Rate > 5%".
+- **Cost Tracking**: Can verify "Token Usage per Tenant".
+- **Debuggability**: Can retrieve the exact prompt that caused a specific error 24 hours later.
@@ -3,6 +3,6 @@
 version: 1
 default:
   provider: openai
-  model: gpt-4o
+  model: gpt-5.2
   temperature: 0.0
   api_key: ${env:OPENAI_API_KEY}
@@ -20,6 +20,9 @@ These settings control the startup behavior and file locations.
 | `SECRETS_CONFIG` | `configs/secrets.yaml` | Path to the secrets provider file. |
 | `POLICIES_CONFIG` | `configs/policies.json` | Path to the RBAC definitions. |
 | `ROUTER_L1_THRESHOLD` | `0.4` | Vector search similarity threshold. |
+| `OBSERVABILITY_EXPORTER` | `none` | Telemetry exporter: `otlp` (prod), `console` (dev), `none`. |
+| `OTEL_EXPORTER_OTLP_ENDPOINT`| `None` | Endpoint for OTeL Collector (e.g. `http://localhost:4317`). |
+| `AUDIT_LOG_PATH` | `logs/audit_events.log` | Path for the persistent forensic audit log. |
 
 ## 2. Datasources (`datasources.yaml`)
 
 
@@ -1,46 +1,75 @@
-# Observability
+# Observability and Monitoring
 
-## Logging
+The platform includes a comprehensive observability stack designed for production readiness, leveraging **OpenTelemetry**, **Structured Logging**, and **Forensic Audit Logs**.
 
-We use a structured logging approach suitable for production environments (Splunk, Datadog, ELK).
+## 1. Metrics & Tracing (OpenTelemetry)
 
-* **Format**: JSON (Production) or Human-Readable (Dev).
-* **Attributes**: Logs include `request_id`, `user_id`, `node_name`, and `execution_time`.
+We use **OpenTelemetry (OTel)** for vendor-neutral instrumentation.
 
-### Enabling JSON Logs
+### Configuration
 
-Set the environment variable or use the flag:
+Set the following environment variables:
 
-```bash
-export LOG_FORMAT=json
-# or
-nl2sql run "query" --json-logs
-```
+- `OBSERVABILITY_EXPORTER="otlp"`: Enables the OTLP exporter (requires a collector like Jaeger or Datadog Agent).
+- `OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"`: The endpoint for the collector (gRPC).
 
-::: nl2sql.common.logger.JsonFormatter
+### Key Metrics
 
-## Tracing
+| Metric Name | Type | Unit | Attributes | Description |
+| :--- | :--- | :--- | :--- | :--- |
+| `nl2sql.token.usage` | Counter | `1` | `model`, `agent`, `datasource_id` | Total LLM tokens consumed. |
+| `nl2sql.node.duration` | Histogram | `s` | `node`, `datasource_id` | Execution duration of graph nodes. |
 
-The platform is instrumented with [LangSmith](https://smith.langchain.com/) for deep tracing of the Agentic Graph.
+### Visualization
 
-1. Set `LANGCHAIN_TRACING_V2=true`.
-2. Set `LANGCHAIN_API_KEY=...`.
+- **Local**: Use [Jaeger](https://www.jaegertracing.io/) for traces and [Prometheus](https://prometheus.io/) for metrics.
+- **Production**: Compatible with Datadog, Honeycomb, New Relic, etc.
 
-This will stream full traces of the Planner, Validator, and Generator steps to the LangSmith dashboard.
+## 2. Structured Logging
 
-## Metrics (Prometheus)
+For production, logs are output in **JSON format** to facilitate parsing by aggregators (Splunk, ELK).
 
-The platform exposes a `/metrics` endpoint for Prometheus scraping.
+- **Activation**: JSON logging is automatically enabled when `OBSERVABILITY_EXPORTER="otlp"`.
+- **Correlation**: Every log entry includes a `trace_id` and `tenant_id` (if authenticated) to correlate logs across the request lifecycle.
 
-### Key Metrics
+**Example Log Entry:**
 
-| Metric Name | Type | Description |
-| :--- | :--- | :--- |
-| `nl2sql_requests_total` | Counter | Total number of requests served. |
-| `nl2sql_request_latency_seconds` | Histogram | End-to-end latency distribution. |
-| `nl2sql_token_usage_total` | Counter | Total LLM tokens consumed (prompt + completion). |
-| `nl2sql_active_connections` | Gauge | Current number of active DB connections. |
+```json
+{
+  "timestamp": "2024-01-01T12:00:00",
+  "level": "INFO",
+  "message": "Planning phase completed",
+  "trace_id": "8a3c...",
+  "tenant_id": "org_123",
+  "node": "planner"
+}
+```
+
+## 3. Persistent Audit Log
+
+For forensic analysis and "Time Travel" debugging, the system maintains a separate, persistent audit log.
+
+- **Location**: `logs/audit_events.log` (Rotation enabled: 10MB x 5 backups).
+- **Content**: detailed record of AI Decisions (Prompt inputs, Model responses, Token usage).
+- **Purpose**: Allows operators to answer "Why did the AI say X?" hours or days later.
+
+**Event Structure:**
+
+```json
+{
+  "timestamp": "...",
+  "event_type": "llm_interaction",
+  "trace_id": "...",
+  "tenant_id": "...",
+  "data": {
+    "agent": "planner",
+    "model": "gpt-4o",
+    "response_snippet": "SELECT * FROM...",
+    "token_usage": {"total_tokens": 150}
+  }
+}
+```
 
-### Grafana Dashboard
+## 4. Legacy Tooling
 
-A standard Grafana dashboard ID `#12345` is available for import to visualize these metrics.
+The CLI `Performance Tree` is preserved for local development convenience but piggybacks on the same instrumentation hooks.
@@ -17,6 +17,7 @@ dependencies = [
     "pydantic>=1.10",
     "opentelemetry-api>=1.20.0",
     "opentelemetry-sdk>=1.20.0",
+    "opentelemetry-exporter-otlp>=1.20.0",
     "sqlglot>=23.0.0",
     "pydantic-settings>=2.0.0",
     "pandas>=1.5.0", # For metrics/evals
 
@@ -0,0 +1,89 @@
+import logging
+import json
+import os
+from logging.handlers import RotatingFileHandler
+from typing import Any, Dict, Optional
+from datetime import datetime
+from nl2sql.common.settings import settings
+
+class EventLogger:
+    """Persistent audit logger for high-value AI events.
+    
+    Writes structured JSON events to a dedicated log file, separate from
+    application debug logs. Used for forensic analysis and "Time Travel" debugging.
+    """
+    
+    def __init__(self):
+        self.logger = logging.getLogger("nl2sql.audit")
+        self.logger.setLevel(logging.INFO)
+        self.logger.propagate = False  # Do not bubble up to root logger (avoid stdout spam)
+        
+        # Ensure handlers are set up (singleton-ish check)
+        if not self.logger.handlers:
+            log_path = getattr(settings, "audit_log_path", "logs/audit_events.log")
+            
+            # Ensure directory exists
+            os.makedirs(os.path.dirname(log_path), exist_ok=True)
+            
+            # 10MB per file, max 5 backup files
+            handler = RotatingFileHandler(
+                log_path, maxBytes=10*1024*1024, backupCount=5, encoding="utf-8"
+            )
+            
+            # Use specific JSON formatter for audit events
+            formatter = logging.Formatter("%(message)s")
+            handler.setFormatter(formatter)
+            
+            self.logger.addHandler(handler)
+
+    def log_event(
+        self, 
+        event_type: str, 
+        payload: Dict[str, Any], 
+        trace_id: Optional[str] = None, 
+        tenant_id: Optional[str] = None
+    ):
+        """Logs a structured event to the audit log.
+        
+        Args:
+            event_type: Category of event (e.g., 'llm_interaction', 'security_violation')
+            payload: The event data dictionary.
+            trace_id: Correlation ID.
+            tenant_id: Tenant/Customer ID.
+        """
+        
+        sensitive_keys = {"api_key", "password", "secret", "authorization"}
+        cleaned_payload = self._redact(payload, sensitive_keys)
+        
+        event = {
+            "timestamp": datetime.utcnow().isoformat(),
+            "event_type": event_type,
+            "trace_id": trace_id,
+            "tenant_id": tenant_id,
+            "data": cleaned_payload
+        }
+        
+        self.logger.info(json.dumps(event))
+
+    def _redact(self, data: Any, keys_to_redact: set) -> Any:
+        """Recursively redact sensitive keys from dictionary.
+
+        Args:
+            data: Input data (dict, list, or primitive).
+            keys_to_redact: Set of lowercase keys to match and redact.
+
+        Returns:
+            The sanitized data structure with sensitive values replaced by '***REDACTED***'.
+        """
+        if isinstance(data, dict):
+            return {
+                k: ("***REDACTED***" if k.lower() in keys_to_redact else self._redact(v, keys_to_redact))
+                for k, v in data.items()
+            }
+        elif isinstance(data, list):
+            return [self._redact(item, keys_to_redact) for item in data]
+        else:
+            return data
+
+# Global instance
+event_logger = EventLogger()
@@ -6,11 +6,13 @@
 from typing import Any, Dict, Optional
 
 _trace_id_ctx = contextvars.ContextVar("trace_id", default=None)
+_tenant_id_ctx = contextvars.ContextVar("tenant_id", default=None)
 
 class TraceContextFilter(logging.Filter):
-    """Injects trace_id from contextvar into the log record."""
+    """Injects trace_id and tenant_id from contextvars into the log record."""
     def filter(self, record):
         record.trace_id = _trace_id_ctx.get()
+        record.tenant_id = _tenant_id_ctx.get()
         return True
 
 @contextmanager
@@ -22,6 +24,15 @@ def trace_context(trace_id: str):
     finally:
         _trace_id_ctx.reset(token)
 
+@contextmanager
+def tenant_context(tenant_id: Optional[str]):
+    """Context manager to set the tenant_id for the current context."""
+    token = _tenant_id_ctx.set(tenant_id)
+    try:
+        yield
+    finally:
+        _tenant_id_ctx.reset(token)
+
 class JsonFormatter(logging.Formatter):
     """Formatter that outputs JSON strings after parsing the LogRecord."""
 
@@ -43,14 +54,17 @@ def format(self, record: logging.LogRecord) -> str:
 
         if getattr(record, "trace_id", None):
             log_record["trace_id"] = record.trace_id
+            
+        if getattr(record, "tenant_id", None):
+            log_record["tenant_id"] = record.tenant_id
 
         # Standard LogRecord attributes to ignore
         standard_attrs = {
             "args", "asctime", "created", "exc_info", "exc_text", "filename",
             "funcName", "levelname", "levelno", "lineno", "module",
             "msecs", "message", "msg", "name", "pathname", "process",
             "processName", "relativeCreated", "stack_info", "thread", "threadName",
-            "taskName", "trace_id"
+            "taskName", "trace_id", "tenant_id"
         }
 
         for key, value in record.__dict__.items():
@@ -80,10 +94,7 @@ def configure_logging(level: str = "INFO", json_format: bool = False):
     if json_format:
         handler.setFormatter(JsonFormatter())
     else:
-        # Include trace_id in standard format if present
-        # This is a bit tricky with dynamic formatting, usually easier to check record in formatter
-        # For simplicity, we stick to standard format but maybe prepend trace_id if possible?
-        # We'll stick to a standard format for text logs for now, trace_id mainly for JSON/Production
+        # Standard text format
         formatter = logging.Formatter(
             "%(asctime)s - [%(trace_id)s] - %(name)s - %(levelname)s - %(message)s"
         )