fix: enrich OTEL log records with per-request LogContext #8812

Merged
rubenfiszel merged 7 commits into main from propagate-span-fields
Apr 13, 2026
Conversation

@rubenfiszel
Contributor

@rubenfiszel rubenfiszel commented Apr 13, 2026

Summary

Exported OTEL LogRecords were missing the per-request identifiers that live on the surrounding tracing spans in windmill-api/src/tracing_init.rs (method, uri, email, workspace_id) and windmill-worker/src/worker.rs (script_path, job_kind, language, worker, job_id, …). The EE opentelemetry-appender-tracing bridge only visits event-level fields, so a failing HTTP request's "response" log in Sentry reads as "something returned 500 somewhere" — with no way to filter by workspace or user.

This PR carries the same identifiers through a tokio task-local (LogContext) seeded at the request/job entry points, plus a new EE log bridge that reads the task-local in on_event and attaches the fields to each exported LogRecord. The design mirrors how Java/Go/Python/Node/.NET OTEL stacks all solve this problem (MDC / context.Context / contextvars / AsyncLocalStorage / ILogger.BeginScope) — log enrichment lives in a per-task store that's independent of the trace span tree.

Changes

windmill-common

  • New windmill-common/src/log_context.rs module:
    • LogContext struct with the allowlisted fields we promote to log records
    • LOG_CONTEXT tokio task_local! holding Arc<Mutex<LogContext>>
    • with_log_context(ctx, fut) — wrap a future in a LogContext scope
    • current_log_context() — snapshot the active context
    • update_log_context(|c| …) — mutate in place (used by auth after the scope is already entered)
    • spawn_with_log_context(fut) — tokio::spawn with the current context forwarded as a snapshot

windmill-common (EE, via otel_ee.rs symlink)

  • Replace the vanilla OpenTelemetryTracingBridge in init_logs_bridge with a new LogContextBridge:
    • In on_event, the bridge constructs a LogRecord, calls attach_log_context to read LOG_CONTEXT and push the active fields onto the record, then visits the event fields (event-level values override inherited ones on collision)
    • No span tracking, no extensions, no allowlist scan per span, no register_callsite gymnastics. The bridge is ~150 lines including a full event-field visitor.
    • Works cleanly under the existing .with_filter(otel_logs_filter) wrap in tracing_init.rs — no filter interaction bugs, because the bridge doesn't care about span creation.

windmill-api

  • tracing_init.rs: new log_context_middleware that seeds LogContext with method/uri/trace_id and wraps next.run(request) in with_log_context(...).
  • lib.rs: registers the middleware one layer outside TraceLayer (and inside CatchPanicLayer) so MyOnResponse::on_response's "response" log fires inside the scope.
  • windmill-api-auth/src/auth.rs: alongside the existing Span::current().record("username"/"email"/"workspace_id") calls, mirrors the same values into LOG_CONTEXT via update_log_context(...) so logs fired after auth carry user/workspace identifiers.

windmill-worker

  • worker.rs: new log_context_for_job helper that mirrors the fields recorded on the job span from arc_job, and the job execution call site wraps handle_queued_job(...).instrument(span) in with_log_context(log_ctx, ...).
  • worker.rs:1297 (wait-time insertion) and worker_utils.rs:367 (queue vacuum) — the two existing .in_current_span() / .instrument(current_span) spawn sites now use spawn_with_log_context to forward the context snapshot across the task boundary. Tracing span instrumentation preserved alongside.

Tests

10 new unit tests, all passing:

windmill-common/src/log_context.rs (5)

  • reads_and_mutates_current_context — seed + update_log_context + current_log_context
  • outside_scope_returns_none — update_log_context is a no-op outside a scope, doesn't panic
  • spawn_inherits_snapshot — spawn_with_log_context forwards the context
  • bare_spawn_has_no_context — plain tokio::spawn drops context (documents why we need the helper)
  • spawn_snapshot_is_independent_of_parent_mutations — snapshot semantics

windmill-common/src/otel_ee.rs::tests (5)

  • vanilla_bridge_drops_parent_span_fields — regression doc: confirms upstream OpenTelemetryTracingBridge drops span fields
  • bridge_reads_log_context_on_event — full HTTP-path scenario, verifies method/uri/workspace_id/email/username/traceId all propagate
  • late_context_mutations_propagate — auth-middleware scenario, mutations after entry visible to the bridge
  • no_context_is_not_an_error — events outside any scope still emit, just without context fields
  • filter_gates_events_but_context_still_propagates — production-realistic .with_filter(EnvFilter::ERROR); INFO is suppressed while ERROR carries its context

Test plan

  • cargo test --features enterprise,private,otel -p windmill-common — 381/381 pass
  • cargo check — OSS clean
  • cargo check --features enterprise,private — EE no-otel clean
  • cargo check --features enterprise,private,otel — EE+otel clean
  • Staging: OTEL_LOGS_ENABLED=true + OTLP endpoint. Hit a deliberately-failing route and a failing job. Verify the exported log records in Sentry / Grafana Loki / Datadog contain workspace_id, email, method, uri for the HTTP error and workspace_id, script_path, job_kind, language, worker for the job error.
  • Confirm INFO events are still suppressed under the default RUST_LOG=error and only ERROR/WARN reach OTEL.

Performance

  • Per span creation: zero. No on_new_span, no extensions, no visitor. sqlx / hyper / tonic / internal spans pay nothing.
  • Per request: one tokio::task_local::scope push at the log_context middleware (~15–30 ns) plus one Arc<Mutex<LogContext>> allocation.
  • Per log event (default ERROR filter, so handful per minute): one task-local lookup + one Mutex::lock + one LogContext::clone — total sub-µs.
  • Per update_log_context call (auth path): one Mutex::lock + field write — ~50 ns.
  • Per spawn_with_log_context: one context clone + one new Arc<Mutex<…>>. Two call sites in the entire backend today.

Companion EE PR

windmill-labs/windmill-ee-private#529

Design notes

This replaces an earlier attempt (commit history overwritten — force-pushed) that used a tracing_subscriber::Layer walking parent spans via ctx.event_scope(). That approach:

  • had a subtle filter-interaction bug (.with_filter(ERROR) blocked on_new_span for INFO spans, so fields weren't captured in production) that required custom register_callsite / enabled overrides to work around
  • paid ~5–20µs per span creation in the worker / API hot path
  • diverged from every other OTEL SDK's solution to the same problem

LogContext is cross-ecosystem canonical: it matches Java MDC, Go context.Context, Python contextvars, Node AsyncLocalStorage, and .NET ILogger.BeginScope.


Generated with Claude Code

@rubenfiszel rubenfiszel marked this pull request as ready for review April 13, 2026 17:28
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Apr 13, 2026

Deploying windmill with Cloudflare Pages

Latest commit: 42d8173
Status: ⚡️ Build in progress...


Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 3 files

@rubenfiszel rubenfiszel force-pushed the propagate-span-fields branch from fd69159 to 7d35c0e Compare April 13, 2026 20:31
@rubenfiszel rubenfiszel changed the title from "[ee] fix: propagate parent span fields to OTEL log records" to "[ee] fix: enrich OTEL log records with per-request LogContext" Apr 13, 2026
@rubenfiszel
Contributor Author

End-to-end verification with a real OTEL collector

Spun up otel/opentelemetry-collector-contrib locally on localhost:4317 (gRPC) with a debug exporter, then ran cargo run --example otlp_smoke -p windmill-common --features enterprise,private,otel with OTEL_LOGS=1 OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317. All four expected LogRecords were received over real OTLP gRPC and decoded correctly by the collector:

LogRecord #0 — HTTP 5xx "response" log (the customer's primary complaint)

Body: Str(response)
Attributes:
  -> method: Str(POST)
  -> uri: Str(/api/w/acme/jobs/run/f/foo)
  -> traceId: Str(smoke-trace-id-abc-123)
  -> email: Str(alice@acme.co)
  -> username: Str(alice)
  -> workspace_id: Str(acme)
  -> status: Int(500)
  -> latency: Int(123)

LogRecord #1 — ad-hoc handler error inside a request

Body: Str(database connection refused)
Attributes:
  -> method / uri / traceId / email / username / workspace_id

LogRecord #2 — worker job failure

Body: Str(job failed: ValueError: bad input)
Attributes:
  -> workspace_id / worker / hostname / tag
  -> job_id / script_path / job_kind / language

LogRecord #3 — background task outside any scope (zero-context is still a valid log)

Body: Str(background task failure (no context))
(no context attributes — correct behavior)

Resource attributes (service.name=windmill-server, service.version=v1.682.0-2-g7d35c0eb82, host.name, deployment.environment=dev) are all populated correctly from otlp_service_resource(...).

The smoke test is committed as backend/windmill-common/examples/otlp_smoke.rs so anyone can re-run it against a fresh collector (instructions in the file's doc comment).

@rubenfiszel
Contributor Author

Full axum pipeline verification with a running backend

Followed up the earlier standalone smoke test by running a real windmill backend instance (MODE=server PORT=8261, same DB as dev, OTEL_LOGS_ENABLED=true OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317, full feature set enterprise,private,otel) against the local OTEL collector, then hit it with a series of real HTTP requests and inspected the exported LogRecords.

Note on env var naming: discovered during this exercise that windmill's load_otel reads OTEL_LOGS_ENABLED (not OTEL_LOGS), and it runs before initialize_tracing — so you have to set OTEL_LOGS_ENABLED=true to actually enable the bridge on a fresh instance. Documenting here for anyone trying to reproduce.

The collector received 45 LogRecords during a ~3 second window of traffic. The notable "response" records, all exported over real OTLP gRPC:

| Request | Status | email / username / workspace_id |
| --- | --- | --- |
| POST /api/auth/login | 200 (INFO) | (pre-auth) |
| GET /api/workspaces/list (unauthed) | 401 (ERROR) | (unauthed) |
| POST /api/w/admins/scripts/create (bad body) | 422 (ERROR) | admin@windmill.dev / admin@windmill.dev / admins |
| GET /api/w/__nonexistent__/scripts/list | 200 (INFO) | admin@windmill.dev / admin@windmill.dev / __nonexistent__ |
| POST /api/w/admins/jobs/run/p/does_not_exist | 404 (WARN) | admin@windmill.dev / admin@windmill.dev / admins |

Sample LogRecord from the collector's debug exporter (the 422 / bad-body case — this is exactly the shape Sentry will now see for failing authenticated requests):

LogRecord #2
SeverityText: ERROR
SeverityNumber: Error(17)
EventName: event windmill-api/src/tracing_init.rs:53
Body: Str(response)
Attributes:
     -> method: Str(POST)
     -> uri: Str(/api/w/admins/scripts/create)
     -> email: Str(admin@windmill.dev)
     -> username: Str(admin@windmill.dev)
     -> workspace_id: Str(admins)
     -> latency: Str(2)
     -> status: Int(422)

Compare to the unauthenticated 401 case — email/username/workspace_id correctly absent because auth hasn't run yet, but method / uri still flow from the top-of-stack log_context_middleware:

LogRecord #1
SeverityText: ERROR
EventName: event windmill-api/src/tracing_init.rs:53
Body: Str(response)
Attributes:
     -> method: Str(GET)
     -> uri: Str(/api/workspaces/list)
     -> latency: Str(0)
     -> status: Int(401)

This confirms:

  1. The log_context_middleware registered in windmill-api/src/lib.rs is in the right layer position — its scope wraps TraceLayer, so MyOnResponse::on_response's "response" log (a tracing::error! inside tower-http) fires inside the scope.
  2. The update_log_context(...) call in windmill-api-auth/src/auth.rs is reached and propagates email/username/workspace_id into the same scope before the response hook runs.
  3. The EE LogContextBridge reads the task-local correctly at event time and attaches every field to the exported LogRecord.
  4. The OTLP gRPC exporter, the batch processor, and the collector pipeline all handle the real wire format without issues.

Fully verified end-to-end on top of the earlier unit tests (which cover bridge mechanics) and the otlp_smoke example (which covers the isolated pipeline). Ready to merge pending human review.

This commit updates the EE repository reference after PR #529 was merged in windmill-ee-private.

Previous ee-repo-ref: 45959d063bc941c567488d330b5819601cdd2d3d

New ee-repo-ref: 5d6b713b74fc46735807f5c32883002e8d976fbc

Automated by sync-ee-ref workflow.
@windmill-internal-app
Contributor

🤖 Updated ee-repo-ref.txt to 5d6b713b74fc46735807f5c32883002e8d976fbc after windmill-ee-private PR #529 was merged.

@windmill-internal-app windmill-internal-app Bot changed the title from "[ee] fix: enrich OTEL log records with per-request LogContext" to "fix: enrich OTEL log records with per-request LogContext" Apr 13, 2026
rubenfiszel and others added 2 commits April 13, 2026 21:41
@rubenfiszel
Contributor Author

Follow-up: swap Mutex<LogContext> for ArcSwap<LogContext>

Pushed two commits refactoring the LogContext storage primitive from Arc<Mutex<LogContext>> to ArcSwap<LogContext> based on the reentrancy/correctness concern raised in review.

Why:

  • Eliminates the latent reentrant-deadlock footgun (mutex is not reentrant; a future contributor writing a closure inside update_log_context(|c| …) that itself called update_log_context would deadlock)
  • Removes mutex poisoning as a failure mode entirely (no lock → nothing to poison)
  • Reads drop from ~20 ns (uncontended mutex acquire) to ~1 ns (atomic load)
  • Writes stay roughly the same cost
  • Semantics become "each mutation is an atomic swap of an immutable snapshot" — easier to reason about

Cost: one new direct dep on arc-swap. The crate is ~5k LOC of pure std (no proc-macros, no heavy trait machinery), already a transitive dep in the lockfile (used by tracing-subscriber and others), zero extra compile time in practice.

API change: update_log_context now takes FnOnce(&LogContext) -> LogContext (returns a new owned value via functional record update) instead of FnOnce(&mut LogContext) (mutates in place). Example:

// Before
update_log_context(|c| {
    c.email = Some(email);
    c.username = Some(username);
});

// After
update_log_context(|c| LogContext {
    email: Some(email),
    username: Some(username),
    ..c.clone()
});

current_log_context() now returns Option<Arc<LogContext>> instead of Option<LogContext>. Auto-deref makes field access identical; the Arc is free to clone (refcount increment) instead of cloning all 20 Option<String> fields.

All 12 tests (6 log_context::tests + 5 otel_ee::tests + 1 new update_preserves_other_fields) still pass, and all three feature configurations (cargo check, cargo check --features enterprise,private, cargo check --features enterprise,private,otel) build clean.

Companion EE PR: windmill-labs/windmill-ee-private#530 (adapts attach_log_context to the new signature). Merge order: EE #530 first, then update ee-repo-ref.txt to the new merged commit hash, then this PR.

Follow-up opportunity (separate PR worth having)

While auditing for Arc<RwLock<T>> patterns I found ~10 more lazy_statics that are genuine ArcSwap candidates (read-hot, write-on-settings-reload): WORKER_CONFIG, CUSTOM_TAGS_PER_WORKSPACE, OAUTH_CLIENTS, JWT_SECRET, BASE_URL, HUB_BASE_URL, MIN_VERSION, SMTP_CONFIG, ALL_TAGS, LICENSE_KEY / LICENSE_KEY_ID. Plus 6 that should just be AtomicBool / AtomicI64: LICENSE_KEY_VALID, IS_SECURE, MONITOR_LOGS_ON_OBJECT_STORE, HTTP_ROUTE_WORKSPACED_ROUTE, JOB_RETENTION_SECS, AUDIT_LOG_RETENTION_DAYS. Not touching any of them in this PR — they deserve their own focused cleanup where the before/after is easy to review. WORKER_CONFIG alone likely has measurable perf impact given it's read on every job poll.

This commit updates the EE repository reference after PR #530 was merged in windmill-ee-private.

Previous ee-repo-ref: 45b4d7963a9ebcd583d1a87abe7d07d3d521584a

New ee-repo-ref: be2f3d4d11bb7110200524d7157caab3aac53996

Automated by sync-ee-ref workflow.
@windmill-internal-app
Contributor

🤖 Updated ee-repo-ref.txt to be2f3d4d11bb7110200524d7157caab3aac53996 after windmill-ee-private PR #530 was merged.

@rubenfiszel rubenfiszel merged commit 42d3e8c into main Apr 13, 2026
12 of 14 checks passed
@rubenfiszel rubenfiszel deleted the propagate-span-fields branch April 13, 2026 21:50
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 13, 2026
@rubenfiszel
Contributor Author

/ai-fast review
