fix: enrich OTEL log records with per-request LogContext#8812
fix: enrich OTEL log records with per-request LogContext#8812rubenfiszel merged 7 commits intomainfrom
Conversation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fd69159 to
7d35c0e
Compare
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
End-to-end verification with a real OTEL collectorSpun up LogRecord #0 — HTTP 5xx LogRecord #1 — ad-hoc handler error inside a request LogRecord #2 — worker job failure LogRecord #3 — background task outside any scope (zero-context is still a valid log) Resource attributes ( The smoke test is committed as |
Full axum pipeline verification with a running backendFollowed up the earlier standalone smoke test by running a real windmill backend instance (
The collector received 45 LogRecords during a ~3 second window of traffic. The notable
Sample Compare to the unauthenticated 401 case — This confirms:
Fully verified end-to-end on top of the earlier unit tests (which cover bridge mechanics) and the |
This commit updates the EE repository reference after PR #529 was merged in windmill-ee-private. Previous ee-repo-ref: 45959d063bc941c567488d330b5819601cdd2d3d New ee-repo-ref: 5d6b713b74fc46735807f5c32883002e8d976fbc Automated by sync-ee-ref workflow.
|
🤖 Updated |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Follow-up: swap
|
This commit updates the EE repository reference after PR #530 was merged in windmill-ee-private. Previous ee-repo-ref: 45b4d7963a9ebcd583d1a87abe7d07d3d521584a New ee-repo-ref: be2f3d4d11bb7110200524d7157caab3aac53996 Automated by sync-ee-ref workflow.
|
🤖 Updated |
|
/ai-fast review |
Summary
Exported OTEL
LogRecords were missing the per-request identifiers that live on the surrounding tracing spans inwindmill-api/src/tracing_init.rs(method,uri,email,workspace_id) andwindmill-worker/src/worker.rs(script_path,job_kind,language,worker,job_id, …). The EEopentelemetry-appender-tracingbridge only visits event-level fields, so a failing HTTP request's"response"log in Sentry reads as "something returned 500 somewhere" — with no way to filter by workspace or user.This carries the same identifiers through a tokio task-local (
LogContext) seeded at the request/job entry points, and a new EE log bridge that reads the task-local onon_eventand attaches the fields to each exportedLogRecord. The design mirrors how Java/Go/Python/Node/.NET OTEL stacks all solve this problem (MDC /context.Context/contextvars/AsyncLocalStorage/ILogger.BeginScope) — log enrichment lives in a per-task store that's independent of the trace span tree.Changes
windmill-common
windmill-common/src/log_context.rsmodule:LogContextstruct with the allowlisted fields we promote to log recordsLOG_CONTEXTtokiotask_local!holdingArc<Mutex<LogContext>>with_log_context(ctx, fut)— wrap a future in a LogContext scopecurrent_log_context()— snapshot the active contextupdate_log_context(|c| …)— mutate in place (used by auth after the scope is already entered)spawn_with_log_context(fut)—tokio::spawnwith the current context forwarded as a snapshotwindmill-common (EE, via
otel_ee.rssymlink)OpenTelemetryTracingBridgeininit_logs_bridgewith a newLogContextBridge:on_event, constructs aLogRecord, callsattach_log_contextto readLOG_CONTEXTand push the active fields onto the record, then visits the event fields (event-level values override inherited ones on collision)register_callsitegymnastics. The bridge is ~150 lines including a full event-field visitor..with_filter(otel_logs_filter)wrap intracing_init.rs— no filter interaction bugs, because the bridge doesn't care about span creation.windmill-api
tracing_init.rs: newlog_context_middlewarethat seedsLogContextwithmethod/uri/trace_idand wrapsnext.run(request)inwith_log_context(...).lib.rs: registers the middleware one layer outsideTraceLayer(and insideCatchPanicLayer) soMyOnResponse::on_response's"response"log fires inside the scope.windmill-api-auth/src/auth.rs: alongside the existingSpan::current().record("username"/"email"/"workspace_id")calls, mirrors the same values intoLOG_CONTEXTviaupdate_log_context(...)so logs fired after auth carry user/workspace identifiers.windmill-worker
worker.rs: newlog_context_for_jobhelper that mirrors the fields recorded on the job span fromarc_job, and the job execution call site wrapshandle_queued_job(...).instrument(span)inwith_log_context(log_ctx, ...).worker.rs:1297(wait-time insertion) andworker_utils.rs:367(queue vacuum) — the two existing.in_current_span()/.instrument(current_span)spawn sites now usespawn_with_log_contextto forward the context snapshot across the task boundary. Tracing span instrumentation preserved alongside.Tests
10 new unit tests, all passing:
windmill-common/src/log_context.rs(5)reads_and_mutates_current_context— seed +update_log_context+current_log_contextoutside_scope_returns_none—update_log_contextis a no-op outside a scope, doesn't panicspawn_inherits_snapshot—spawn_with_log_contextforwards the contextbare_spawn_has_no_context— plaintokio::spawndrops context (documents why we need the helper)spawn_snapshot_is_independent_of_parent_mutations— snapshot semanticswindmill-common/src/otel_ee.rs::tests(5)vanilla_bridge_drops_parent_span_fields— regression doc: confirms upstreamOpenTelemetryTracingBridgedrops span fieldsbridge_reads_log_context_on_event— full HTTP-path scenario, verifies method/uri/workspace_id/email/username/traceId all propagatelate_context_mutations_propagate— auth-middleware scenario, mutations after entry visible to the bridgeno_context_is_not_an_error— events outside any scope still emit, just without context fieldsfilter_gates_events_but_context_still_propagates— production-realistic.with_filter(EnvFilter::ERROR); INFO is suppressed while ERROR carries its contextTest plan
cargo test --features enterprise,private,otel -p windmill-common— 381/381 passcargo check— OSS cleancargo check --features enterprise,private— EE no-otel cleancargo check --features enterprise,private,otel— EE+otel cleanOTEL_LOGS_ENABLED=true+ OTLP endpoint. Hit a deliberately-failing route and a failing job. Verify the exported log records in Sentry / Grafana Loki / Datadog containworkspace_id,email,method,urifor the HTTP error andworkspace_id,script_path,job_kind,language,workerfor the job error.INFOevents are still suppressed under the defaultRUST_LOG=errorand onlyERROR/WARNreach OTEL.Performance
on_new_span, no extensions, no visitor. sqlx / hyper / tonic / internal spans pay nothing.tokio::task_local::scopepush at the log_context middleware (~15–30 ns) plus oneArc<Mutex<LogContext>>allocation.ERRORfilter, so handful per minute): one task-local lookup + oneMutex::lock+ oneLogContext::clone— total sub-µs.update_log_contextcall (auth path): oneMutex::lock+ field write — ~50 ns.spawn_with_log_context: one context clone + one newArc<Mutex<…>>. Two call sites in the entire backend today.Companion EE PR
windmill-labs/windmill-ee-private#529
Design notes
This replaces an earlier attempt (commit history overwritten — force-pushed) that used a
tracing_subscriber::Layerwalking parent spans viactx.event_scope(). That approach:.with_filter(ERROR)blockedon_new_spanfor INFO spans, so fields weren't captured in production) that required customregister_callsite/enabledoverrides to work aroundLogContextis cross-ecosystem canonical: it matches Java MDC, Gocontext.Context, Pythoncontextvars, NodeAsyncLocalStorage, and .NETILogger.BeginScope.Generated with Claude Code