Flush OTEL logs with Vercel waitUntil#354
Conversation
PR SummaryLow Risk Overview OpenTelemetry log export is wrapped in Reviewed by Cursor Bugbot for commit 44e4931. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Pull request overview
This PR improves OpenTelemetry log delivery reliability on Vercel by ensuring the OTLP batch log processor gets flushed via Vercel’s waitUntil, and includes a small Next.js patch update plus focused unit coverage for the new processor.
Changes:
- Bump
nextand@next/envfrom16.2.6to16.2.7. - Introduce
VercelWaitUntilLogRecordProcessorto schedule aforceFlush()through Vercel request contextwaitUntil. - Add unit tests validating delegation behavior and “single scheduled flush while pending” semantics.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
tests/unit/vercel-wait-until-log-record-processor.test.ts |
Adds unit coverage for the waitUntil-based log flushing wrapper. |
src/instrumentation.node.ts |
Wraps the OTLP BatchLogRecordProcessor with the new Vercel waitUntil-aware processor. |
src/core/server/observability/vercel-wait-until-log-record-processor.ts |
Implements the wrapper that schedules log flushes via Vercel request context waitUntil. |
package.json |
Updates next and @next/env dependency versions to 16.2.7. |
bun.lock |
Locks updated Next.js-related package versions/checksums. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| try { | ||
| requestContext.waitUntil(async () => { | ||
| try { | ||
| await this.delegate.forceFlush() | ||
| } catch (error) { | ||
| diag.warn( | ||
| 'Failed to flush OpenTelemetry logs in Vercel waitUntil', | ||
| error | ||
| ) | ||
| } finally { | ||
| this.pendingFlushContexts.delete(requestContext) | ||
| } | ||
| }) | ||
| } catch (error) { |
There was a problem hiding this comment.
🔴 The dedupe set marks the context as "pending" for the entire duration of forceFlush(), but BatchLogRecordProcessor.forceFlush() snapshots its internal buffer synchronously and then awaits the OTLP export. Any log emitted during that export window goes into the next batch — but scheduleRequestFlush short-circuits (line 46-48) because the context is still in pendingFlushContexts, so no new waitUntil is registered. When the in-flight flush resolves, Vercel can terminate the runtime before the BatchLogRecordProcessor's 5s timer fires (it's unref()'d), silently dropping that log — the exact failure mode this PR aims to prevent. Fix: delete the context from pendingFlushContexts before awaiting forceFlush, so concurrent emits schedule a fresh waitUntil that picks up late additions.
Extended reasoning...
What the bug is
scheduleRequestFlush at src/core/server/observability/vercel-wait-until-log-record-processor.ts:43-72 uses pendingFlushContexts to dedupe — only one waitUntil(forceFlush) should be scheduled per request context. The intent is sound, but the dedupe flag is released too late: the context is removed from the set only in the finally block at line 62, after forceFlush() resolves. This is a classic flag-after-side-effect race.
Why it manifests
BatchLogRecordProcessor.forceFlush() (in @opentelemetry/sdk-logs) calls _flushAll(), which synchronously snapshots the buffer:
let toFlush = this._finishedLogRecords;
this._finishedLogRecords = []; // buffer swapped to empty
// ... then async OTLP HTTP export happensRecords emitted after this snapshot land in the new (empty) buffer and are not part of the in-flight export. The OTLP HTTP POST typically takes tens to hundreds of milliseconds, which is plenty of time for additional logs to be emitted from concurrent async work (database calls, fetch instrumentation auto-spans, etc.).
Step-by-step proof
- Log A emitted →
delegate.onEmitpushes A →scheduleRequestFlushaddsrequestContexttopendingFlushContexts, callsrequestContext.waitUntil(asyncFn). asyncFnruns;delegate.forceFlush()is invoked. Internally,_flushAllsnapshotstoFlush = [A], resets the buffer to[], and starts the OTLP HTTPS POST.asyncFnis now suspended atawait.- The request handler continues (e.g.
await db.query()orawait fetch(...)). During this window, log B is emitted. delegate.onEmit(B, ctx)pushes B into the now-empty buffer.scheduleRequestFlushruns:pendingFlushContexts.has(requestContext)istrue→ early return at line 46-48. No newwaitUntilis registered.- The OTLP export for [A] resolves.
_flushAllcalls_maybeStartTimerwhich schedules a delayed flush — but the BatchLogRecordProcessor's timer isunref()'d, so it cannot keep the Vercel runtime alive. asyncFn'sfinallyremovesrequestContextfrompendingFlushContextsand the async function resolves.- Vercel observes that all
waitUntilpromises have settled. The runtime is terminated. - Log B is still sitting in
BatchLogRecordProcessor._finishedLogRecordsand is silently dropped.
This is precisely the failure mode the PR title ("flush otel logs with vercel waitUntil") is trying to prevent — under realistic conditions where logs are interleaved with async I/O, the fix is only partial.
Why existing code does not prevent it
The test at tests/unit/vercel-wait-until-log-record-processor.test.ts exercises back-to-back synchronous emissions before the callback runs (processor.onEmit(); processor.onEmit(); await callbacks[0]();). It does not cover emissions that happen during the await this.delegate.forceFlush() window — the exact case this bug describes.
Impact
In Vercel serverless functions handling typical async request workflows, logs emitted while an OTLP flush is in flight are at risk of being lost on runtime termination. The race window is the OTLP HTTP RTT (tens to hundreds of ms), easily long enough to overlap with subsequent log emissions. Auto-instrumentations like FetchInstrumentation that emit telemetry from async span completions further widen the window.
How to fix
Remove the context from pendingFlushContexts before awaiting forceFlush, so any log emitted during the export schedules a fresh waitUntil:
requestContext.waitUntil(async () => {
this.pendingFlushContexts.delete(requestContext)
try {
await this.delegate.forceFlush()
} catch (error) {
diag.warn('Failed to flush OpenTelemetry logs in Vercel waitUntil', error)
}
})With this change, concurrent emits add the context back into the set and register a new waitUntil, extending the runtime lifetime until the late additions are exported.
Summary
Verification
Build note: bun run build was blocked locally by missing billing env values: BILLING_API_URL or NEXT_PUBLIC_STRIPE_PUBLISHABLE_KEY while NEXT_PUBLIC_INCLUDE_BILLING is enabled.