feat(cloudflare): add telemetry collection and /metrics endpoint #400
Open
vahidlazio wants to merge 36 commits into
Conversation
Add Prometheus-compatible telemetry to the Cloudflare resolver, matching the same metric names as the WASM providers so they can share dashboards.

- Collect per-flag resolve latency and reason in the fetch handler, deferred to ctx.wait_until to keep the hot path clean
- Include telemetry deltas in WriteFlagLogsRequest via checkpoint()
- Queue consumer accumulates cross-isolate deltas into a cumulative TelemetrySnapshot persisted in KV
- Serve /metrics endpoint reading Prometheus text from KV
- Add serde derives to TelemetrySnapshot and an accumulate_delta() method for reconstructing flat histograms from compressed BucketSpans
- Deployer auto-creates KV namespace (same pattern as queue creation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
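The deferral in the first bullet looks roughly like this (a minimal sketch against the workers-rs API; record_telemetry and the literal values are stand-ins for the PR's actual helpers):

```rust
use worker::*;

// Hypothetical stand-in for the PR's telemetry flush; the real code appends
// a delta to the WriteFlagLogsRequest queued via checkpoint().
async fn record_telemetry(flag: String, elapsed_us: u32, reason: &'static str) {
    let _ = (flag, elapsed_us, reason); // e.g. push onto the flag-logs queue here
}

#[event(fetch)]
async fn fetch(req: Request, _env: Env, ctx: Context) -> Result<Response> {
    let flag = req.path();
    let resp = Response::ok("resolved")?; // hot path: resolve synchronously

    // wait_until keeps the isolate alive past the response without blocking
    // it, so telemetry recording never adds latency to the resolve itself.
    ctx.wait_until(record_telemetry(flag, 1_200, "MATCH"));

    Ok(resp)
}
```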
- Await KV put operations in update_prometheus_kv (were fire-and-forget)
- Guard against negative/oversized BucketSpan offsets in accumulate_delta
- Add race condition comment on KV read-modify-write
- Add CORS headers to /metrics endpoint for consistency
- Add unit tests for accumulate_delta: basic, negative offset, oversized

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allow the indexing_slicing and arithmetic_side_effects lints on the method, since bounds are checked before every index. Use saturating_add for the resize. Re-sync the Go WASM module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
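A sketch of what the guarded method can look like, assuming BucketSpan is a start offset plus a run of per-bucket counts (the field names are assumptions, not the PR's exact definitions):

```rust
/// Sketch of the guarded accumulation. BucketSpan's exact layout is an
/// assumption here: a start offset plus a run of per-bucket counts.
pub struct BucketSpan {
    pub offset: i32,
    pub counts: Vec<u64>,
}

pub struct Histogram {
    buckets: Vec<u64>,
}

impl Histogram {
    #[allow(clippy::indexing_slicing, clippy::arithmetic_side_effects)]
    pub fn accumulate_delta(&mut self, span: &BucketSpan, max_buckets: usize) {
        // Negative offsets would index before bucket 0: reject the span.
        let Ok(start) = usize::try_from(span.offset) else { return };
        for (i, count) in span.counts.iter().enumerate() {
            let idx = start + i;
            // Oversized spans are dropped rather than growing without bound.
            if idx >= max_buckets {
                return;
            }
            if idx >= self.buckets.len() {
                self.buckets.resize(idx.saturating_add(1), 0);
            }
            // Bounds checked above, so direct indexing is safe here.
            self.buckets[idx] = self.buckets[idx].saturating_add(*count);
        }
    }
}
```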
Previously aggregate_batch discarded all telemetry data except the SDK field from the first message. Now it merges latency histograms, resolve rate counters, and gauge fields across all messages in the batch, so the Confidence backend receives aggregated telemetry matching what the WASM providers send.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
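Conceptually the merge is a fold over the batch; a sketch with plain stand-in types (the real messages are protobuf-generated, and the field names here are assumptions):

```rust
use std::collections::HashMap;

// Minimal stand-in for the generated protobuf telemetry (field names assumed).
#[derive(Default, Clone)]
struct Telemetry {
    sdk_id: u32,
    latency_buckets: Vec<u64>,            // per-bucket observation counts
    resolve_counts: HashMap<String, u64>, // resolve reason -> count
}

fn aggregate_batch(messages: &[Telemetry]) -> Telemetry {
    let mut merged = Telemetry::default();
    for msg in messages {
        merged.sdk_id = msg.sdk_id; // gauge-like fields: last writer wins
        // Histograms merge bucket-wise.
        if merged.latency_buckets.len() < msg.latency_buckets.len() {
            merged.latency_buckets.resize(msg.latency_buckets.len(), 0);
        }
        for (i, c) in msg.latency_buckets.iter().enumerate() {
            merged.latency_buckets[i] = merged.latency_buckets[i].saturating_add(*c);
        }
        // Counters merge by summing per key.
        for (reason, c) in &msg.resolve_counts {
            *merged.resolve_counts.entry(reason.clone()).or_default() += *c;
        }
    }
    merged
}
```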
- Use performance.now() instead of Date.now() for better timing resolution
- Rename METRICS_KV binding to CONFIDENCE_METRICS_KV
- Fall back to Date.now() if the performance API is unavailable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…formance.now()

Use web_sys::WorkerGlobalScope::performance() instead of js_sys::Reflect::get() to avoid dynamic JS lookups on the hot path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cloudflare Workers freeze all timer APIs (Date.now, performance.now) during synchronous CPU work as a Spectre mitigation, making it impossible to measure resolve latency internally. Instead, we query Cloudflare's GraphQL analytics API from the queue consumer to get real CPU time percentiles (p25/p50/p75/p90/p99) and distribute them into the same exponential histogram buckets used by all other providers.

This ensures:
- Same metric names (confidence_resolve_latency_microseconds)
- Same histogram format (compatible with shared Grafana dashboards)
- Real CPU time data (~1-2ms p50 for in-memory flag evaluation)

The queue consumer uses a cursor stored in KV (cpu_time_cursor) to avoid double-counting across batches. The deployer now passes CLOUDFLARE_API_TOKEN and CLOUDFLARE_ACCOUNT_ID as Worker env vars.

Also removes the broken internal timer (performance.now/Date.now) and the web-sys dependency that was added for it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use floor() instead of round() for percentile band allocation to prevent assigning more observations than requests
- Track actual observations placed into buckets for _count, to ensure the +Inf bucket >= all cumulative bucket counts
- Revert sum to use weighted distribution across all percentile bands (correct representation of the latency distribution)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single-request CF analytics data points have all percentiles identical (p25=p50=p75=p90=p99). Spreading them across 6 bands inflates the histogram. Now place a single observation at p50 for these cases.

Also fix rounding: use floor() and track remaining to prevent over-allocation across bands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
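A sketch of the allocation logic described above; the band weights and the exponential bucketing helper are illustrative assumptions, not the PR's exact code:

```rust
/// Distribute a batch of requests into histogram buckets from CPU-time
/// percentiles (p25/p50/p75/p90/p99, in microseconds).
fn distribute(requests: u64, percentiles_us: [u64; 5], buckets: &mut [u64]) {
    let [p25, p50, p75, p90, p99] = percentiles_us;

    // Single-request analytics rows report identical percentiles; spreading
    // them over all bands would inflate the histogram, so place one
    // observation at p50 and stop.
    if p25 == p99 {
        add(buckets, p50, requests);
        return;
    }

    // Each band is represented by its upper percentile value (weights assumed).
    let bands = [(p25, 0.25), (p50, 0.25), (p75, 0.25), (p90, 0.15), (p99, 0.10)];
    let mut remaining = requests;
    for (value_us, weight) in bands {
        // floor(), not round(): bands must never allocate more than requests.
        let n = ((requests as f64) * weight).floor() as u64;
        let n = n.min(remaining);
        add(buckets, value_us, n);
        remaining -= n;
    }
    // Whatever flooring left over goes to the median band.
    add(buckets, p50, remaining);
}

fn add(buckets: &mut [u64], value_us: u64, n: u64) {
    // Hypothetical exponential buckets: bucket i covers [2^i, 2^(i+1)) µs,
    // capped at the last bucket.
    let idx = (value_us.max(1).ilog2() as usize).min(buckets.len().saturating_sub(1));
    if let Some(b) = buckets.get_mut(idx) {
        *b = b.saturating_add(n);
    }
}
```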
…rvals

Store the last fetch timestamp in KV (cpu_time_last_fetch_ms) and skip the GraphQL query if less than 10 seconds have elapsed. Prevents excessive API calls when queue batches arrive faster than analytics data updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…f extra KV key

Compare the cursor timestamp to now and skip the CF analytics call if the cursor is less than 10s old. No extra KV reads/writes needed; the cursor we already store doubles as the rate-limit check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
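The check is a single comparison against the cursor we already persist; a sketch against the workers-rs KV API (the cursor's value layout, epoch millis as text, is an assumption):

```rust
use worker::kv::{KvError, KvStore};

const MIN_FETCH_INTERVAL_MS: u64 = 10_000;

/// Skip the CF analytics call when the stored cursor is less than 10s old.
async fn should_fetch_analytics(kv: &KvStore, now_ms: u64) -> Result<bool, KvError> {
    let cursor_ms: Option<u64> = kv
        .get("cpu_time_cursor")
        .text()
        .await?
        .and_then(|s| s.parse().ok());
    Ok(match cursor_ms {
        // The cursor we already keep for pagination doubles as the limiter.
        Some(ts) => now_ms.saturating_sub(ts) >= MIN_FETCH_INTERVAL_MS,
        None => true, // no cursor yet: first fetch is always allowed
    })
}
```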
When the queue consumer fetches new CPU time data from CF analytics (rate-limited to every 10s via cursor age), the latency histogram delta is now included in the WriteFlagLogsRequest sent to the Confidence backend. When no new latency data is available (rate limit or no traffic), only resolve reasons are sent — no stale latency data is re-sent. This ensures the backend's VictoriaMetrics telemetry consumer receives the same latency distribution data as the /metrics endpoint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add new SDK ID for the Cloudflare resolver to all proto definitions and set it in the telemetry data sent by the CF resolver. Also syncs missing IDs 23 (PYTHON_LOCAL_PROVIDER) and 24 (RUST_LOCAL_PROVIDER) from epx-flags-resolver. Without this, the backend receives telemetry with sdk=null, making it invisible on the SDK telemetry Grafana dashboard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The backend's VictoriaMetricsClient skips resolve_rate metrics when resolver_version is empty. Without this, resolve rates from the CF resolver are silently dropped and don't appear on Grafana dashboards. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move CLOUDFLARE_API_TOKEN from plaintext [vars] to an encrypted Worker secret via `env.secret()` and `wrangler secret put` after deploy
- Inject CF_SCRIPT_NAME from WORKER_NAME so prefixed deployments query the correct script in CF analytics
- Merge update_prometheus_kv and update_cpu_time_kv into a single update_metrics_kv to avoid a double KV read-modify-write of "snapshot"
- Clamp u64→u32 truncation on sum/count with .min(u32::MAX as u64)
- Cap histogram bucket index at BUCKET_COUNT-1 to prevent unbounded growth from unexpected CF analytics values
- Add Cache-Control: no-store to the /metrics response
- Export BUCKET_COUNT as pub from the telemetry module

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…etry

When the CF analytics rate limit fires, estimate latency from cached percentiles instead of sending nothing. This keeps the Grafana latency graph continuous between analytics fetches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scheduler.wait(0) unfreezes CF Workers' Spectre-mitigated timers with zero overhead, allowing Date.now() to reflect actual CPU time. This enables inline latency measurement identical to other providers, removing the need for the CF GraphQL analytics API dependency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the entire Cloudflare GraphQL analytics API machinery with inline latency measurement via scheduler.wait(0). This zero-overhead yield unfreezes CF Workers' Spectre-mitigated timers, enabling Date.now()-based CPU time measurement identical to other providers.

Removes ~360 lines: GraphQL query, cursor-based pagination, rate-limiting, percentile-to-histogram distribution, cached percentile estimation, and the CLOUDFLARE_API_TOKEN dependency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
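One way to reach the scheduler API from Rust is a dynamic lookup through js_sys; a sketch of the inline measurement (the helper name and error handling are illustrative, not necessarily the PR's code):

```rust
use js_sys::{Date, Function, Promise, Reflect};
use wasm_bindgen::{JsCast, JsValue};
use wasm_bindgen_futures::JsFuture;

/// Yield to the runtime with scheduler.wait(0) so CF Workers' Spectre-frozen
/// clock advances, then read Date.now(). Returns None if the scheduler API
/// is unavailable (e.g. outside the Workers runtime).
async fn unfrozen_now_ms() -> Option<f64> {
    let scheduler = Reflect::get(&js_sys::global(), &JsValue::from_str("scheduler")).ok()?;
    let wait: Function = Reflect::get(&scheduler, &JsValue::from_str("wait"))
        .ok()?
        .dyn_into()
        .ok()?;
    let promise: Promise = wait
        .call1(&scheduler, &JsValue::from_f64(0.0))
        .ok()?
        .dyn_into()
        .ok()?;
    JsFuture::from(promise).await.ok()?; // zero-delay yield unfreezes the clock
    Some(Date::now())
}

// Usage: sample before and after the synchronous resolve work.
// let start = unfrozen_now_ms().await;
// let result = resolve_flag(/* ... */); // hot path, synchronous CPU work
// let elapsed_ms = match (start, unfrozen_now_ms().await) {
//     (Some(a), Some(b)) => Some((b - a).max(0.0)),
//     _ => None, // scheduler unavailable: report no measurement
// };
```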
Remove deployer vars (CLOUDFLARE_ACCOUNT_ID, CF_SCRIPT_NAME, CLOUDFLARE_API_TOKEN secret) that were only needed for the CF analytics API. Revert BUCKET_COUNT to non-pub. Remove unused imports. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from ebf57e3 to 6afd948
nicklasl reviewed May 12, 2026
- Auth-gate /metrics with ClientSecret header (#9)
- Extract Prometheus content-type to named constant (#6)
- Record 0μs sub-ms observations instead of dropping them (#7)
- Use Option<u32> for elapsed_us: None when scheduler API unavailable, Some(0) for measured sub-ms resolves (#8)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
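The Option<u32> semantics from the last bullet, as a small sketch (the bucketing is a hypothetical stand-in matching the earlier examples):

```rust
/// None: scheduler API unavailable, so no measurement was possible.
/// Some(0): the resolve finished within one 1ms clock tick; it is still a
/// real observation and is recorded at 0µs rather than dropped.
fn record_latency(elapsed_us: Option<u32>, histogram: &mut [u64]) {
    let Some(us) = elapsed_us else { return };
    // Hypothetical exponential bucketing matching the shared providers.
    let idx = (u64::from(us).max(1).ilog2() as usize).min(histogram.len().saturating_sub(1));
    if let Some(bucket) = histogram.get_mut(idx) {
        *bucket += 1;
    }
}
```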
- Document latency measurement (scheduler.wait(0), 1ms precision)
- Document /metrics endpoint auth (ClientSecret header)
- Document KV store usage and the DISABLE_METRICS deployer option
- Add DISABLE_METRICS env var to skip KV creation when scraping is not needed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_METRICS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace DISABLE_METRICS with ENABLE_METRICS. The /metrics endpoint and KV store are now only created when explicitly enabled, reducing default resource usage for customers who don't need Prometheus scraping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace RESOLVE_LOGGER, ASSIGN_LOGGER, TELEMETRY, and LAST_FLUSHED statics with a single thread-local WriteFlagLogsRequest populated per-request. Host callbacks append directly via new public builder functions (build_flag_assigned, build_resolve_log, build_request_telemetry). All cross-request aggregation now happens in the queue consumer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
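Roughly this shape, with a plain struct standing in for the protobuf-generated WriteFlagLogsRequest:

```rust
use std::cell::RefCell;

// Plain stand-in for the protobuf-generated WriteFlagLogsRequest.
#[derive(Default)]
pub struct WriteFlagLogsRequest {
    pub resolve_logs: Vec<String>,
    // ... flag-assigned entries, request telemetry, etc.
}

thread_local! {
    // One pending request per isolate, repopulated on every fetch; no
    // cross-request statics remain.
    static PENDING: RefCell<WriteFlagLogsRequest> =
        RefCell::new(WriteFlagLogsRequest::default());
}

/// Host callback: append a resolve log entry directly to the pending request.
pub fn build_resolve_log(entry: String) {
    PENDING.with(|p| p.borrow_mut().resolve_logs.push(entry));
}

/// Take the accumulated request for the queue at end-of-request; all
/// cross-request aggregation then happens in the queue consumer.
pub fn take_pending() -> WriteFlagLogsRequest {
    PENDING.with(|p| std::mem::take(&mut *p.borrow_mut()))
}
```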
nicklasl approved these changes May 13, 2026
nicklasl reviewed May 13, 2026
Co-authored-by: Nicklas Lundin <nicklasl@spotify.com>
Summary
Add telemetry collection to the Cloudflare resolver with the same metric names as all other providers, enabling shared Grafana dashboards.
Key discovery: scheduler.wait(0) unfreezes CF Workers timers

CF Workers freeze Date.now() and performance.now() during synchronous CPU work (Spectre mitigation), making internal latency measurement appear impossible. We discovered that scheduler.wait(0), a zero-delay yield to the runtime, unfreezes the clock with no measurable overhead. This enables inline CPU time measurement identical to other providers, without any external API dependency.

What's included
- Inline latency measurement via scheduler.wait(0) + Date.now() (1ms resolution, Grafana interpolates to sub-ms)
- /metrics endpoint serving Prometheus exposition format (confidence_resolve_latency_microseconds histogram + confidence_resolves_total counters)
- Telemetry in WriteFlagLogsRequest with SDK ID (SDK_ID_CLOUDFLARE_RESOLVER = 25), resolver version, resolve rates, and latency histogram
- Cache-Control: no-store on the /metrics response

Data flow
What was explored and removed
The initial approach sourced latency from Cloudflare's GraphQL Analytics API (workersInvocationsAdaptive percentiles). This worked but added complexity: cursor-based pagination, rate-limiting, percentile-to-histogram distribution, cached percentile estimation. Once scheduler.wait(0) was proven to unfreeze timers, the entire GraphQL machinery was removed (~360 lines), leaving a clean inline timer identical to other providers.

Test plan
- cargo check -p confidence-cloudflare-resolver compiles clean
- cargo test -p confidence_resolver -- telemetry passes (25 tests)
- /metrics returns valid Prometheus text
- scheduler.wait(0) adds no measurable latency to the resolve path (0ms overhead in 19/20 samples)
- Metric names match the other providers (confidence_resolve_latency_microseconds, confidence_resolves_total)

🤖 Generated with Claude Code