feat(telemetry): default OTLP to otel.proto-labs.ai over HTTP, bearer auth, strict opt-in (Phase 4 of homelab-iac#34) (#171)

mabry1985 · Automaker · claude · web-flow · commit 0707bd701e55 · 2026-05-01T01:27:32.000Z
* feat(telemetry): default OTLP to otel.proto-labs.ai over HTTP, bearer auth, strict opt-in

Phase 4 of homelab-iac#34: wires protoCLI to the public LGTM ingress that just landed on the ava node. Also tightens the opt-in posture so no telemetry leaves the host unless the user explicitly enables it.

Defaults
- DEFAULT_OTLP_ENDPOINT: 'http://localhost:4317' → 'https://otel.proto-labs.ai' (Cloudflare-fronted, TLS-terminated, hosts the Tempo / Loki / Mimir stack chosen in homelab-iac#34)
- getTelemetryOtlpProtocol() default: 'grpc' → 'http' to match the ingress shape; gRPC override still works for users who run their own local OTel collector.

Auth
- OTEL_INGRESS_TOKEN env var, plumbed as `Authorization: Bearer <token>` into all three HTTP exporters (trace/log/metric). Header is omitted entirely when the env var is unset, preserving exact-match shape for existing tests against arbitrary collectors.
- gRPC path picks up the same token via grpc-js Metadata for users on the gRPC override.
- Token convention matches Infisical (homelab-media/prod), composes with the existing settings.json `env` block alongside Langfuse keys.

Strict opt-in
- initializeTelemetry now requires telemetry.enabled === true for ANY outbound exporter to activate. Previously, Langfuse env vars alone could spin up the Langfuse exporter without an explicit opt-in; this closes that hole. Privacy is the default; users opt in via `"telemetry": { "enabled": true }` in settings.json (or --telemetry).
- Debug log surfaces when Langfuse env vars are detected but telemetry is disabled, so the new behavior is discoverable rather than silent.

Tests
- New: bearer header attached to HTTP exporters when OTEL_INGRESS_TOKEN is set; omitted when unset.
- Updated: existing "Langfuse auto-activates with disabled telemetry" test inverted to assert the opt-in semantics.
- Updated: protocol-default tests on both core and cli sides expect 'http'.
- Updated: endpoint-default tests on cli side expect the public ingress.

Verified locally: 5311 core / 3774 cli tests pass, typecheck + lint clean. End-to-end smoke against the live ingress is gated on the maintainer adding OTEL_INGRESS_TOKEN to settings.json `env`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(telemetry): document strict opt-in policy and OTel ingress defaults

Updates user-facing docs to match the Phase 4 behavior change:
- README Observability section: shows the opt-in setup with both OTEL_INGRESS_TOKEN and Langfuse keys, removes the now-incorrect 'Langfuse activates independently' subsection (it doesn't anymore), notes the new https://otel.proto-labs.ai default endpoint and bearer auth, surfaces gen_ai.response.thinking and thinking_tokens in the trace table.
- README upstream-comparison row reflects the LGTM + Langfuse fan-out.
- docs/contributing/telemetry.md: rewritten around the opt-in default, documents OTEL_INGRESS_TOKEN, expands the configuration reference table to include otlpProtocol and the new endpoint default, updates the privacy section to lead with 'silence unless you say otherwise'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: trigger checks (no-op)

---------

Co-authored-by: Automaker <automaker@localhost>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
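
The strict opt-in rule described in the commit message can be sketched as a small pure function. This is an illustration only, not the code in `initializeTelemetry`: the `TelemetryInputs` shape and the `activeExporters` name are hypothetical, and the sketch assumes the OTLP target is selected, so the local file target is left out.

```typescript
// Hypothetical illustration of the opt-in gate; names are not from the codebase.
interface TelemetryInputs {
  enabled: boolean; // mirrors telemetry.enabled
  langfusePublicKey?: string; // mirrors LANGFUSE_PUBLIC_KEY
  langfuseSecretKey?: string; // mirrors LANGFUSE_SECRET_KEY
}

type OutboundExporter = 'otlp' | 'langfuse';

function activeExporters(inputs: TelemetryInputs): OutboundExporter[] {
  // Nothing activates without an explicit opt-in, regardless of env vars.
  if (!inputs.enabled) {
    return [];
  }
  const exporters: OutboundExporter[] = ['otlp'];
  // Langfuse additionally needs both keys to be present.
  if (inputs.langfusePublicKey && inputs.langfuseSecretKey) {
    exporters.push('langfuse');
  }
  return exporters;
}
```

With Langfuse keys present but `enabled: false`, the function returns an empty list, which is the case the inverted test in `sdk.test.ts` asserts through `NodeSDK.prototype.start` not being called.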
diff --git a/README.md b/README.md
@@ -27,7 +27,7 @@ At-a-glance overview vs. upstream Qwen Code. For the full architectural breakdow
 | Ignore files          | `.qwenignore`             | `.protoignore` + inherits `.claudeignore` patterns                                                                                                       |
 | ACP / Zed integration | Stock                     | Cron-in-Session, concurrent Agent calls, SSE/HTTP MCP, internal-part filtering                                                                           |
 | Extra built-in tools  | Standard set              | + browser automation, repo-map (PageRank), task tools, mailbox, LSP, voice/STT                                                                           |
-| Observability         | Console                   | Langfuse OTLP traces with harness-intervention spans (SFT-ready)                                                                                         |
+| Observability         | Console                   | OTLP/HTTP to LGTM stack + Langfuse, opt-in, with `gen_ai.response.thinking` and harness-intervention spans (SFT-ready)                                   |
 | Release pipeline      | Manual                    | Conventional-commit auto-release (`feat:` → minor, `fix:` → patch)                                                                                       |
 | VS Code companion     | Included                  | Removed (focus on TUI + ACP/Zed)                                                                                                                         |
 
@@ -206,53 +206,56 @@ Both no-op outside a TTY, in screen-reader mode, or under tmux/SSH.
 
 ## Observability
 
-proto supports [Langfuse](https://langfuse.com) tracing out of the box. Set three environment variables and every session is fully traced — LLM calls (all providers), tool executions, subagent lifecycles, and turn hierarchy.
+proto ships OpenTelemetry-native, with both a Tempo/LGTM-style ops backend and Langfuse for prompt-grade trace UI. Both are **opt-in** — nothing is sent anywhere until `telemetry.enabled` is `true`.
 
 ### Setup
 
-Add to the `env` block in `~/.proto/settings.json`:
+Add to `~/.proto/settings.json`:
 
 ```json
 {
+  "telemetry": { "enabled": true },
   "env": {
+    "OTEL_INGRESS_TOKEN": "<bearer token from your Infisical or vault>",
     "LANGFUSE_PUBLIC_KEY": "pk-lf-...",
     "LANGFUSE_SECRET_KEY": "sk-lf-...",
-    "LANGFUSE_BASE_URL": "https://cloud.langfuse.com"
+    "LANGFUSE_BASE_URL": "https://your-langfuse-instance.example.com"
   }
 }
 ```
 
-`LANGFUSE_BASE_URL` is optional and defaults to `https://cloud.langfuse.com`. For a self-hosted instance, set it to your deployment URL.
+With `telemetry.enabled = true`:
 
-> **Why `settings.json` and not `.env`?** proto walks up from your CWD loading `.env` files, so a project-level `.env` with Langfuse keys would bleed into proto's tracing and mix your traces into the wrong dataset. The `env` block in `settings.json` is proto-namespaced and completely isolated from your projects.
+- **OTLP traces** ship to `https://otel.proto-labs.ai` over HTTP, bearer-auth via `OTEL_INGRESS_TOKEN`. Override `telemetry.otlpEndpoint` / `telemetry.otlpProtocol` to point at a local OTel collector or a different vendor.
+- **Langfuse traces** ship to `LANGFUSE_BASE_URL` (defaults to `https://cloud.langfuse.com`) when both Langfuse keys are present.
+
+Without `telemetry.enabled = true`, neither exporter activates regardless of env vars.
+
+> **Why `settings.json` and not `.env`?** proto walks up from your CWD loading `.env` files, so a project-level `.env` with telemetry keys would bleed into proto's tracing and mix your traces into the wrong dataset. The `env` block in `settings.json` is proto-namespaced and completely isolated from your projects.
 
 ### What gets traced
 
-| Span                  | Attributes                                                                                           |
-| --------------------- | ---------------------------------------------------------------------------------------------------- |
-| `turn`                | `session.id`, `turn.id` — root span per user prompt                                                  |
-| `gen_ai chat {model}` | `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.request.model` — one per LLM call |
-| `tool/{name}`         | `tool.name`, `tool.type`, `tool.duration_ms` — one per tool execution                                |
-| `agent/{name}`        | `agent.name`, `agent.status`, `agent.duration_ms` — one per subagent                                 |
+| Span                  | Attributes                                                                                                                          |
+| --------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
+| `turn`                | `session.id`, `turn.id` — root span per user prompt                                                                                 |
+| `gen_ai chat {model}` | `gen_ai.usage.{input,output,thinking}_tokens`, `gen_ai.request.model`, `gen_ai.response.thinking` (when present) — one per LLM call |
+| `tool/{name}`         | `tool.name`, `tool.type`, `tool.duration_ms` — one per tool execution                                                               |
+| `agent/{name}`        | `agent.name`, `agent.status`, `agent.duration_ms` — one per subagent                                                                |
 
 All three provider backends are covered: OpenAI-compatible, Anthropic, and Gemini.
 
 ### Prompt content logging
 
-Full prompt messages and response text are included in traces by default. To disable:
+Full prompt messages, response text, and reasoning text are included in traces by default. To disable:
 
 ```json
 // ~/.proto/settings.json
 {
-  "telemetry": { "logPrompts": false }
+  "telemetry": { "enabled": true, "logPrompts": false }
 }
 ```
 
-> **Privacy note:** `logPrompts` is enabled by default. When enabled, full prompt and response content is sent to your Langfuse instance. Set to `false` if you want traces without message content.
-
-### Langfuse activates independently
-
-Langfuse tracing activates from env vars alone — it does not require `telemetry.enabled: true` in settings. The general telemetry pipeline (OTLP/GCP) and Langfuse are independent.
+> **Privacy note:** Telemetry is off by default. When you opt in, `logPrompts` defaults to `true` — full prompt, response, and reasoning content are attached to spans (truncated at 10K chars each). Set `logPrompts: false` if you want token counts and timings without message content.
 
 ## Task Management
 
diff --git a/docs/contributing/telemetry.md b/docs/contributing/telemetry.md
@@ -2,65 +2,111 @@
 
 proto is built on [OpenTelemetry](https://opentelemetry.io/) — the vendor-neutral observability standard. All traces, spans, and metrics use OTLP format and can be exported to any compatible backend.
 
-## Langfuse (built-in, recommended)
+## Opt-in by default
 
-proto ships with a Langfuse exporter. Set these environment variables to activate it — no other configuration needed:
+**No telemetry leaves the host until you explicitly enable it.** This is true for every exporter — OTLP, Langfuse, file output. Set:
 
-```bash
-export LANGFUSE_PUBLIC_KEY="pk-lf-..."
-export LANGFUSE_SECRET_KEY="sk-lf-..."
-export LANGFUSE_BASE_URL="https://your-langfuse-instance.example.com"  # optional, defaults to cloud
+```json
+{ "telemetry": { "enabled": true } }
 ```
 
-What is traced:
+in `settings.json`, or pass `--telemetry` on the CLI. Without that flag, nothing is sent anywhere — even if Langfuse env vars are present.
+
+## Default ingress: `otel.proto-labs.ai`
+
+When opted in, traces ship to the homelab LGTM stack at `https://otel.proto-labs.ai` over OTLP/HTTP. The endpoint is bearer-token authenticated:
+
+```json
+{
+  "telemetry": { "enabled": true },
+  "env": {
+    "OTEL_INGRESS_TOKEN": "<token from Infisical homelab-media/prod>"
+  }
+}
+```
+
+Without `OTEL_INGRESS_TOKEN`, the ingress returns 401 (debug-logged in `~/.proto/debug/latest`).
+
+## Langfuse (LLM-grade trace UI)
+
+Langfuse is wired in addition to the LGTM stack — keeps prompt-grade trace debugging available alongside Tempo's APM views. Activate by setting all of these in `settings.json` `env`:
+
+```json
+{
+  "telemetry": { "enabled": true },
+  "env": {
+    "LANGFUSE_PUBLIC_KEY": "pk-lf-...",
+    "LANGFUSE_SECRET_KEY": "sk-lf-...",
+    "LANGFUSE_BASE_URL": "https://your-langfuse-instance.example.com"
+  }
+}
+```
+
+`LANGFUSE_BASE_URL` is optional; defaults to `https://cloud.langfuse.com`.
+
+What's traced:
 
 - Every session turn
 - All LLM calls (all providers) with token counts
 - Tool calls with input/output
 - Sub-agent spawns and completions
+- Model reasoning content as `gen_ai.response.thinking` span attribute (when surfaced by the model or gateway)
 
-## OpenTelemetry configuration
-
-Configure via `settings.json` or environment variables:
+## Configuration reference
 
-| Setting              | Env var                    | CLI flag                           | Values          | Default |
-| -------------------- | -------------------------- | ---------------------------------- | --------------- | ------- |
-| `telemetry.enabled`  | `PROTO_TELEMETRY_ENABLED`  | `--telemetry` / `--no-telemetry`   | `true`/`false`  | `false` |
-| `telemetry.target`   | `PROTO_TELEMETRY_TARGET`   | `--telemetry-target <local\|otel>` | `local`, `otel` | `local` |
-| `telemetry.endpoint` | `PROTO_TELEMETRY_ENDPOINT` | `--telemetry-endpoint <url>`       | OTLP URL        | —       |
+| Setting                  | Env var                    | CLI flag                            | Values          | Default                           |
+| ------------------------ | -------------------------- | ----------------------------------- | --------------- | --------------------------------- |
+| `telemetry.enabled`      | `PROTO_TELEMETRY_ENABLED`  | `--telemetry` / `--no-telemetry`    | `true`/`false`  | `false` _(opt-in)_                |
+| `telemetry.target`       | `PROTO_TELEMETRY_TARGET`   | `--telemetry-target <local\|otel>`  | `local`, `otel` | `local`                           |
+| `telemetry.otlpEndpoint` | `PROTO_TELEMETRY_ENDPOINT` | `--telemetry-otlp-endpoint <url>`   | OTLP URL        | `https://otel.proto-labs.ai`      |
+| `telemetry.otlpProtocol` | —                          | `--telemetry-otlp-protocol <proto>` | `http`, `grpc`  | `http`                            |
+| `telemetry.logPrompts`   | —                          | `--telemetry-log-prompts`           | `true`/`false`  | `true`                            |
+| `OTEL_INGRESS_TOKEN`     | —                          | —                                   | bearer token    | — (required for default endpoint) |
 
 ### File-based local output
 
+For local-only debugging without shipping anywhere:
+
 ```json
 {
   "telemetry": {
     "enabled": true,
     "target": "local",
-    "logFile": "~/.proto/telemetry/traces.jsonl"
+    "outfile": "~/.proto/telemetry/traces.jsonl"
   }
 }
 ```
 
-### Export to any OTLP backend (Jaeger, Datadog, etc.)
+### Override to a different OTLP backend
+
+If you run a local OTel Collector or want to point at a different vendor:
 
 ```json
 {
   "telemetry": {
     "enabled": true,
-    "target": "otel",
-    "endpoint": "http://localhost:4318/v1/traces"
+    "otlpEndpoint": "http://localhost:4318",
+    "otlpProtocol": "http"
   }
 }
 ```
 
+For a local gRPC collector, set `otlpProtocol: "grpc"` and use port `4317`.
+
 ## What is instrumented
 
 - **Session turns** — user prompt, model response, duration
-- **LLM calls** — provider, model, input/output tokens, latency
+- **LLM calls** — provider, model, input/output/thinking tokens, latency, reasoning content
 - **Tool calls** — name, arguments, result, duration
 - **Sub-agent lifecycle** — spawn, completion, token usage
 - **Harness interventions** — doom loop detection, multi-sample retries
+- **Hook executions** — which hook fired, success, duration, exit code, captured stdout/stderr
 
 ## Privacy
 
-Traces include prompt content and tool outputs by default. For production environments, configure sampling or filtering at your OTLP collector level to avoid capturing sensitive data.
+The opt-in default means privacy is the baseline — silence unless you say otherwise. When opted in:
+
+- `telemetry.logPrompts` controls whether prompt content + reasoning text are attached to spans (default `true`).
+- Reasoning text and completion content are truncated at 10K chars on each span.
+- For production deployments shipping to shared infra, configure sampling/filtering at the OTel Collector layer.
+- Local sandboxed runs default to a per-process session ID; no stable user identifier is exported.
diff --git a/packages/cli/src/config/config.test.ts b/packages/cli/src/config/config.test.ts
@@ -889,7 +889,9 @@ describe('loadCliConfig telemetry', () => {
     const argv = await parseArguments();
     const settings: Settings = { telemetry: { enabled: true } };
     const config = await loadCliConfig(settings, argv);
-    expect(config.getTelemetryOtlpEndpoint()).toBe('http://localhost:4317');
+    expect(config.getTelemetryOtlpEndpoint()).toBe(
+      'https://otel.proto-labs.ai',
+    );
   });
 
   it('should use telemetry target from settings if CLI flag is not present', async () => {
@@ -981,7 +983,7 @@ describe('loadCliConfig telemetry', () => {
     const argv = await parseArguments();
     const settings: Settings = { telemetry: { enabled: true } };
     const config = await loadCliConfig(settings, argv);
-    expect(config.getTelemetryOtlpProtocol()).toBe('grpc');
+    expect(config.getTelemetryOtlpProtocol()).toBe('http');
   });
 
   it('should reject invalid --telemetry-otlp-protocol values', async () => {
diff --git a/packages/core/src/config/config.test.ts b/packages/core/src/config/config.test.ts
@@ -710,20 +710,20 @@ describe('Server Config (config.ts)', () => {
       expect(config.getTelemetryOtlpProtocol()).toBe('http');
     });
 
-    it('should return default OTLP protocol if not provided', () => {
+    it('should return default OTLP protocol of "http" if not provided (matches public ingress at otel.proto-labs.ai)', () => {
       const params: ConfigParameters = {
         ...baseParams,
         telemetry: { enabled: true },
       };
       const config = new Config(params);
-      expect(config.getTelemetryOtlpProtocol()).toBe('grpc');
+      expect(config.getTelemetryOtlpProtocol()).toBe('http');
     });
 
-    it('should return default OTLP protocol if telemetry object is not provided', () => {
+    it('should return default OTLP protocol of "http" if telemetry object is not provided', () => {
       const paramsWithoutTelemetry: ConfigParameters = { ...baseParams };
       delete paramsWithoutTelemetry.telemetry;
       const config = new Config(paramsWithoutTelemetry);
-      expect(config.getTelemetryOtlpProtocol()).toBe('grpc');
+      expect(config.getTelemetryOtlpProtocol()).toBe('http');
     });
   });
 
diff --git a/packages/core/src/config/config.ts b/packages/core/src/config/config.ts
@@ -1788,7 +1788,10 @@ export class Config {
   }
 
   getTelemetryOtlpProtocol(): 'grpc' | 'http' {
-    return this.telemetrySettings.otlpProtocol ?? 'grpc';
+    // Default 'http' aligns with the public OTLP ingress at
+    // otel.proto-labs.ai (Cloudflare-fronted, HTTPS only). Set
+    // telemetry.otlpProtocol = 'grpc' for a local OTel collector.
+    return this.telemetrySettings.otlpProtocol ?? 'http';
   }
 
   getTelemetryTarget(): TelemetryTarget {
diff --git a/packages/core/src/telemetry/index.ts b/packages/core/src/telemetry/index.ts
@@ -11,7 +11,10 @@ export enum TelemetryTarget {
 }
 
 const DEFAULT_TELEMETRY_TARGET = TelemetryTarget.LOCAL;
-const DEFAULT_OTLP_ENDPOINT = 'http://localhost:4317';
+// Public OTLP/HTTP ingress fronting the homelab LGTM stack (Cloudflare-fronted,
+// TLS-terminated). Authenticated via OTEL_INGRESS_TOKEN bearer token. See
+// homelab-iac#34. Override per-host via telemetry.otlpEndpoint in settings.
+const DEFAULT_OTLP_ENDPOINT = 'https://otel.proto-labs.ai';
 
 export { DEFAULT_TELEMETRY_TARGET, DEFAULT_OTLP_ENDPOINT };
 export {
diff --git a/packages/core/src/telemetry/sdk.test.ts b/packages/core/src/telemetry/sdk.test.ts
@@ -108,6 +108,44 @@ describe('Telemetry SDK', () => {
     expect(NodeSDK.prototype.start).toHaveBeenCalled();
   });
 
+  it('attaches Authorization: Bearer header to HTTP exporters when OTEL_INGRESS_TOKEN is set', () => {
+    process.env['OTEL_INGRESS_TOKEN'] = 'test-bearer-token';
+    vi.spyOn(mockConfig, 'getTelemetryOtlpProtocol').mockReturnValue('http');
+    vi.spyOn(mockConfig, 'getTelemetryOtlpEndpoint').mockReturnValue(
+      'https://otel.proto-labs.ai',
+    );
+
+    initializeTelemetry(mockConfig);
+
+    const expectedHeaders = { Authorization: 'Bearer test-bearer-token' };
+    expect(OTLPTraceExporterHttp).toHaveBeenCalledWith(
+      expect.objectContaining({ headers: expectedHeaders }),
+    );
+    expect(OTLPLogExporterHttp).toHaveBeenCalledWith(
+      expect.objectContaining({ headers: expectedHeaders }),
+    );
+    expect(OTLPMetricExporterHttp).toHaveBeenCalledWith(
+      expect.objectContaining({ headers: expectedHeaders }),
+    );
+
+    delete process.env['OTEL_INGRESS_TOKEN'];
+  });
+
+  it('omits the headers field on HTTP exporters when OTEL_INGRESS_TOKEN is unset', () => {
+    delete process.env['OTEL_INGRESS_TOKEN'];
+    vi.spyOn(mockConfig, 'getTelemetryOtlpProtocol').mockReturnValue('http');
+    vi.spyOn(mockConfig, 'getTelemetryOtlpEndpoint').mockReturnValue(
+      'http://localhost:4318',
+    );
+
+    initializeTelemetry(mockConfig);
+
+    // Exact match — no `headers` key on the call args.
+    expect(OTLPTraceExporterHttp).toHaveBeenCalledWith({
+      url: 'http://localhost:4318/',
+    });
+  });
+
   it('should parse gRPC endpoint correctly', () => {
     vi.spyOn(mockConfig, 'getTelemetryOtlpEndpoint').mockReturnValue(
       'https://my-collector.com',
@@ -272,33 +310,20 @@ describe('Telemetry SDK', () => {
       });
     });
 
-    it('still initializes telemetry when only Langfuse is configured and primary telemetry is disabled', () => {
-      process.env['LANGFUSE_PUBLIC_KEY'] = 'pk-lf-test';
-      process.env['LANGFUSE_SECRET_KEY'] = 'sk-lf-test';
-      vi.spyOn(mockConfig, 'getTelemetryEnabled').mockReturnValue(false);
-
-      initializeTelemetry(mockConfig);
-
-      // Should still start the SDK because Langfuse processor is non-null
-      expect(NodeSDK.prototype.start).toHaveBeenCalled();
-    });
-
-    it('does not create the default gRPC OTLP exporter when only Langfuse is configured', () => {
+    it('does NOT initialize telemetry when only Langfuse env vars are set and telemetry.enabled is false', () => {
+      // Opt-in policy: privacy is the default. Even if Langfuse env vars are
+      // present, telemetry stays off until the user explicitly opts in via
+      // telemetry.enabled = true.
       process.env['LANGFUSE_PUBLIC_KEY'] = 'pk-lf-test';
       process.env['LANGFUSE_SECRET_KEY'] = 'sk-lf-test';
       vi.spyOn(mockConfig, 'getTelemetryEnabled').mockReturnValue(false);
 
       initializeTelemetry(mockConfig);
 
-      // The default endpoint is localhost:4317 but with telemetry disabled,
-      // no gRPC exporter should be instantiated (only the Langfuse HTTP one).
+      expect(NodeSDK.prototype.start).not.toHaveBeenCalled();
       expect(OTLPTraceExporter).not.toHaveBeenCalled();
       expect(OTLPLogExporter).not.toHaveBeenCalled();
       expect(OTLPMetricExporter).not.toHaveBeenCalled();
-
-      const sdkCalls = vi.mocked(NodeSDK).mock.calls;
-      const spanProcessors = sdkCalls[0][0]?.spanProcessors;
-      expect(spanProcessors).toHaveLength(1);
     });
   });
 });
diff --git a/packages/core/src/telemetry/sdk.ts b/packages/core/src/telemetry/sdk.ts
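
For orientation, the header behavior asserted by the two new tests in `sdk.test.ts` can be sketched as below. This is not the `sdk.ts` change itself: the `ExporterOptions` type and the `httpExporterOptions` helper are hypothetical names, and the sketch takes the token as a parameter rather than reading `OTEL_INGRESS_TOKEN` from the environment.

```typescript
// Hypothetical helper; a sketch of the "omit headers when no token" behavior.
type ExporterOptions = { url: string; headers?: Record<string, string> };

function httpExporterOptions(
  url: string,
  token: string | undefined,
): ExporterOptions {
  // Leave the `headers` key out entirely when no token is set, so that
  // exact-match assertions such as toHaveBeenCalledWith({ url }) still hold.
  return token
    ? { url, headers: { Authorization: `Bearer ${token}` } }
    : { url };
}
```

The same options object could then be passed to each of the trace, log, and metric HTTP exporters, which is why the tests check all three constructors for an identical `headers` value.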
