
feat: add adaptive sampling to node sdk #155

Open
jy-tan wants to merge 1 commit into main from adaptive-sampling

Conversation


@jy-tan jy-tan commented Apr 11, 2026

Summary

Add the Node half of adaptive sampling so the SDK can shed root-request recording load locally based on queue, exporter, event-loop, and memory pressure signals while preserving whole-trace semantics and pre-app-start capture.

This keeps admission decisions local to the SDK process and defers backend-driven rarity sampling in favor of app and exporter protection.

Changes

  • Parse recording.sampling.mode, base_rate, and min_rate while keeping legacy recording.sampling_rate behavior and supporting TUSK_SAMPLING_RATE as a compatibility alias alongside canonical TUSK_RECORDING_SAMPLING_RATE
  • Add AdaptiveSamplingController and wire it into TuskDrift with periodic health updates driven by queue fill, dropped spans, export failures/timeouts, event-loop lag, and memory pressure
  • Replace the opaque batching path with a Drift-owned DriftBatchSpanProcessor so queue health, dropped spans, and export failures are first-class control signals for load shedding
  • Add exporter retry and circuit-breaker resilience in the API span adapter and surface exporter health through TdSpanExporter
  • Gate inbound HTTP and Next.js root requests through adaptive admission and run sampled-out requests under explicit no-record context propagation
  • Preserve pre-app-start capture for the first inbound HTTP request before auto-marking the app ready
  • Update the Node docs for adaptive sampling configuration and add focused tests for controller behavior, env-var precedence, batch-processor health, and HTTP pre-app-start admission
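The env-var precedence described in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's actual parser: `resolveSamplingRate` is a hypothetical name, the canonical-wins-over-alias ordering and the default rate of 1 are assumptions.

```typescript
// Hypothetical sketch: the canonical TUSK_RECORDING_SAMPLING_RATE is assumed
// to win over the legacy TUSK_SAMPLING_RATE alias, which in turn overrides
// the rate read from .tusk/config.yaml.
function resolveSamplingRate(
  env: Record<string, string | undefined>,
  configFileRate?: number,
): number {
  const raw = env["TUSK_RECORDING_SAMPLING_RATE"] ?? env["TUSK_SAMPLING_RATE"];
  if (raw !== undefined) {
    const parsed = Number(raw);
    // Ignore malformed or out-of-range values and fall back to the config file.
    if (Number.isFinite(parsed) && parsed >= 0 && parsed <= 1) {
      return parsed;
    }
  }
  return configFileRate ?? 1; // assumed default: record everything
}
```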


@cubic-dev-ai cubic-dev-ai bot left a comment


5 issues found across 18 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/core/tracing/adapters/ApiSpanAdapter.ts">

<violation number="1" location="src/core/tracing/adapters/ApiSpanAdapter.ts:233">
P2: Timeout failures are not counted because the abort check only matches `AbortError`, while this code aborts with `new Error(...)` as the reason.</violation>
</file>

<file name="docs/nextjs-initialization.md">

<violation number="1" location="docs/nextjs-initialization.md:207">
P2: The startup note overstates pre-ready capture semantics. It says all requests before `markAppAsReady()` are always recorded, but this feature is described as preserving only the first inbound pre-app-start request.</violation>
</file>

<file name="src/core/tracing/TdSpanExporter.ts">

<violation number="1" location="src/core/tracing/TdSpanExporter.ts:102">
P2: Preserve `null` when no export latency has been observed instead of coercing it to `0` during aggregation.</violation>
</file>

<file name="src/core/sampling/AdaptiveSamplingController.ts">

<violation number="1" location="src/core/sampling/AdaptiveSamplingController.ts:205">
P2: When adaptive load shedding drives `effectiveRate` to exactly 0 (hot/warm state with `minRate = 0`), the reason is misreported as `"not_sampled"` instead of `"load_shed"`. This breaks observability: operators cannot distinguish "configured rate is zero" from "adaptive controller suppressed all traffic".</violation>
</file>

<file name="src/core/tracing/DriftBatchSpanProcessor.ts">

<violation number="1" location="src/core/tracing/DriftBatchSpanProcessor.ts:51">
P2: `void this.flushOneBatch()` discards the returned promise. If the exporter throws synchronously inside `export()`, the resulting rejection is unhandled and can crash the process. Add a `.catch()` at these fire-and-forget call sites, or wrap the `this.exporter.export(...)` call in a try/catch that calls `resolve()`.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review, or fix all with cubic.

```typescript
      throw new Error(`Remote export failed: ${parsed.message}`);
    }
  } catch (error) {
    if (error instanceof Error && error.name === "AbortError") {
```

@cubic-dev-ai cubic-dev-ai bot Apr 11, 2026


P2: Timeout failures are not counted because the abort check only matches AbortError, while this code aborts with new Error(...) as the reason.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/core/tracing/adapters/ApiSpanAdapter.ts, line 233:

<comment>Timeout failures are not counted because the abort check only matches `AbortError`, while this code aborts with `new Error(...)` as the reason.</comment>

<file context>
@@ -178,4 +167,76 @@ export class ApiSpanAdapter implements SpanExportAdapter {
+        throw new Error(`Remote export failed: ${parsed.message}`);
+      }
+    } catch (error) {
+      if (error instanceof Error && error.name === "AbortError") {
+        this.timeoutCount += 1;
+        throw error;
</file context>
Suggested change

```diff
-if (error instanceof Error && error.name === "AbortError") {
+if (
+  (error instanceof Error && error.name === "AbortError") ||
+  error === controller.signal.reason
+) {
```
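The suggested condition above can also be packaged as a small predicate, which makes it easy to unit-test. This is a sketch under stated assumptions: `isAbortTimeout` is a hypothetical helper, not part of this PR, and `abortReason` stands in for `controller.signal.reason` from the surrounding code.

```typescript
// Hypothetical predicate: treat an error as a timeout abort if it is a
// DOMException-style AbortError, or if it is the exact reason object that
// was passed to controller.abort(new Error(...)).
function isAbortTimeout(error: unknown, abortReason: unknown): boolean {
  return (
    (error instanceof Error && error.name === "AbortError") ||
    (abortReason !== undefined && error === abortReason)
  );
}
```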

- `recording.sampling.mode` comes from `.tusk/config.yaml` and defaults to `fixed`
- `recording.sampling.min_rate` is only used in `adaptive` mode and defaults to `0.001` when omitted

> **Note:** Requests before `TuskDrift.markAppAsReady()` are always recorded. Sampling applies to normal inbound traffic after startup.

@cubic-dev-ai cubic-dev-ai bot Apr 11, 2026


P2: The startup note overstates pre-ready capture semantics. It says all requests before markAppAsReady() are always recorded, but this feature is described as preserving only the first inbound pre-app-start request.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/nextjs-initialization.md, line 207:

<comment>The startup note overstates pre-ready capture semantics. It says all requests before `markAppAsReady()` are always recorded, but this feature is described as preserving only the first inbound pre-app-start request.</comment>

<file context>
@@ -184,17 +184,31 @@ More context on setting up instrumentations for Next.js apps can be found [here]
+   - `recording.sampling.mode` comes from `.tusk/config.yaml` and defaults to `fixed`
+   - `recording.sampling.min_rate` is only used in `adaptive` mode and defaults to `0.001` when omitted
+
+> **Note:** Requests before `TuskDrift.markAppAsReady()` are always recorded. Sampling applies to normal inbound traffic after startup.
 
 ### Method 1: Init Parameter
</file context>
Suggested change

```diff
-> **Note:** Requests before `TuskDrift.markAppAsReady()` are always recorded. Sampling applies to normal inbound traffic after startup.
+> **Note:** The first inbound request before the SDK marks the app ready is always recorded. Sampling applies to normal inbound traffic after startup.
```

Comment on lines +102 to +105

```typescript
lastExportLatencyMs: Math.max(
  accumulator.lastExportLatencyMs ?? 0,
  snapshot.lastExportLatencyMs ?? 0,
),
```

@cubic-dev-ai cubic-dev-ai bot Apr 11, 2026


P2: Preserve null when no export latency has been observed instead of coercing it to 0 during aggregation.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/core/tracing/TdSpanExporter.ts, line 102:

<comment>Preserve `null` when no export latency has been observed instead of coercing it to `0` during aggregation.</comment>

<file context>
@@ -71,6 +80,39 @@ export class TdSpanExporter implements SpanExporter {
+        failureCount: accumulator.failureCount + snapshot.failureCount,
+        timeoutCount: accumulator.timeoutCount + snapshot.timeoutCount,
+        circuitOpen: accumulator.circuitOpen || snapshot.circuitState === "open",
+        lastExportLatencyMs: Math.max(
+          accumulator.lastExportLatencyMs ?? 0,
+          snapshot.lastExportLatencyMs ?? 0,
</file context>
Suggested change

```diff
-lastExportLatencyMs: Math.max(
-  accumulator.lastExportLatencyMs ?? 0,
-  snapshot.lastExportLatencyMs ?? 0,
-),
+lastExportLatencyMs:
+  snapshot.lastExportLatencyMs === null
+    ? accumulator.lastExportLatencyMs
+    : accumulator.lastExportLatencyMs === null
+      ? snapshot.lastExportLatencyMs
+      : Math.max(accumulator.lastExportLatencyMs, snapshot.lastExportLatencyMs),
```
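The nested ternary above can be factored into a small null-preserving max helper, which keeps the aggregation readable. A sketch only: `maxLatency` is a hypothetical name, not an identifier from this PR.

```typescript
// Hypothetical helper: maximum of two nullable latencies, returning null
// only when neither side has observed an export yet.
function maxLatency(a: number | null, b: number | null): number | null {
  if (a === null) return b;
  if (b === null) return a;
  return Math.max(a, b);
}
```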

```typescript
if (effectiveRate <= 0) {
  return {
    shouldRecord: false,
    reason: this.state === "critical_pause" ? "critical_pause" : "not_sampled",
```

@cubic-dev-ai cubic-dev-ai bot Apr 11, 2026


P2: When adaptive load shedding drives effectiveRate to exactly 0 (hot/warm state with minRate = 0), the reason is misreported as "not_sampled" instead of "load_shed". This breaks observability: operators cannot distinguish "configured rate is zero" from "adaptive controller suppressed all traffic".

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/core/sampling/AdaptiveSamplingController.ts, line 205:

<comment>When adaptive load shedding drives `effectiveRate` to exactly 0 (hot/warm state with `minRate = 0`), the reason is misreported as `"not_sampled"` instead of `"load_shed"`. This breaks observability: operators cannot distinguish "configured rate is zero" from "adaptive controller suppressed all traffic".</comment>

<file context>
@@ -0,0 +1,283 @@
+    if (effectiveRate <= 0) {
+      return {
+        shouldRecord: false,
+        reason: this.state === "critical_pause" ? "critical_pause" : "not_sampled",
+        mode: this.config.mode,
+        state: this.state,
</file context>

```typescript
this.config = config;
this.mode = mode;
this.interval = setInterval(() => {
  void this.flushOneBatch();
```

@cubic-dev-ai cubic-dev-ai bot Apr 11, 2026


P2: void this.flushOneBatch() discards the returned promise. If the exporter throws synchronously inside export(), the resulting rejection is unhandled and can crash the process. Add a .catch() at these fire-and-forget call sites, or wrap the this.exporter.export(...) call in a try/catch that calls resolve().

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/core/tracing/DriftBatchSpanProcessor.ts, line 51:

<comment>`void this.flushOneBatch()` discards the returned promise. If the exporter throws synchronously inside `export()`, the resulting rejection is unhandled and can crash the process. Add a `.catch()` at these fire-and-forget call sites, or wrap the `this.exporter.export(...)` call in a try/catch that calls `resolve()`.</comment>

<file context>
@@ -0,0 +1,152 @@
+    this.config = config;
+    this.mode = mode;
+    this.interval = setInterval(() => {
+      void this.flushOneBatch();
+    }, this.config.scheduledDelayMillis);
+    this.interval.unref?.();
</file context>
