feat(cloudflare): add telemetry collection and /metrics endpoint #400
Open
vahidlazio wants to merge 36 commits into
Conversation
Add Prometheus-compatible telemetry to the Cloudflare resolver, matching the same metric names as the WASM providers so they can share dashboards.

- Collect per-flag resolve latency and reason in the fetch handler, deferred to ctx.wait_until to keep the hot path clean
- Include telemetry deltas in WriteFlagLogsRequest via checkpoint()
- Queue consumer accumulates cross-isolate deltas into a cumulative TelemetrySnapshot persisted in KV
- Serve /metrics endpoint reading Prometheus text from KV
- Add serde derives to TelemetrySnapshot and an accumulate_delta() method for reconstructing flat histograms from compressed BucketSpans
- Deployer auto-creates KV namespace (same pattern as queue creation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
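The deferral in the first bullet looks roughly like this (a minimal sketch against the workers-rs API; record_telemetry and the literal values are stand-ins for the PR's actual helpers):

```rust
use worker::*;

// Hypothetical stand-in for the PR's telemetry flush; the real code appends
// a delta to the WriteFlagLogsRequest queued via checkpoint().
async fn record_telemetry(flag: String, elapsed_us: u32, reason: &'static str) {
    let _ = (flag, elapsed_us, reason); // e.g. push onto the flag-logs queue here
}

#[event(fetch)]
async fn fetch(req: Request, _env: Env, ctx: Context) -> Result<Response> {
    let flag = req.path();
    let resp = Response::ok("resolved")?; // hot path: resolve synchronously

    // wait_until keeps the isolate alive past the response without blocking
    // it, so telemetry recording never adds latency to the resolve itself.
    ctx.wait_until(record_telemetry(flag, 1_200, "MATCH"));

    Ok(resp)
}
```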
- Await KV put operations in update_prometheus_kv (were fire-and-forget)
- Guard against negative/oversized BucketSpan offsets in accumulate_delta
- Add race condition comment on KV read-modify-write
- Add CORS headers to /metrics endpoint for consistency
- Add unit tests for accumulate_delta: basic, negative offset, oversized

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allow the indexing_slicing and arithmetic_side_effects lints on the method, since bounds are checked before every index. Use saturating_add for the resize. Re-sync the Go WASM module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
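A sketch of what the guarded method can look like, assuming BucketSpan is a start offset plus a run of per-bucket counts (the field names are assumptions, not the PR's exact definitions):

```rust
/// Sketch of the guarded accumulation. BucketSpan's exact layout is an
/// assumption here: a start offset plus a run of per-bucket counts.
pub struct BucketSpan {
    pub offset: i32,
    pub counts: Vec<u64>,
}

pub struct Histogram {
    buckets: Vec<u64>,
}

impl Histogram {
    #[allow(clippy::indexing_slicing, clippy::arithmetic_side_effects)]
    pub fn accumulate_delta(&mut self, span: &BucketSpan, max_buckets: usize) {
        // Negative offsets would index before bucket 0: reject the span.
        let Ok(start) = usize::try_from(span.offset) else { return };
        for (i, count) in span.counts.iter().enumerate() {
            let idx = start + i;
            // Oversized spans are dropped rather than growing without bound.
            if idx >= max_buckets {
                return;
            }
            if idx >= self.buckets.len() {
                self.buckets.resize(idx.saturating_add(1), 0);
            }
            // Bounds checked above, so direct indexing is safe here.
            self.buckets[idx] = self.buckets[idx].saturating_add(*count);
        }
    }
}
```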
Previously aggregate_batch discarded all telemetry data except the SDK field from the first message. Now it merges latency histograms, resolve rate counters, and gauge fields across all messages in the batch, so the Confidence backend receives aggregated telemetry matching what the WASM providers send.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
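Conceptually the merge is a fold over the batch; a sketch with plain stand-in types (the real messages are protobuf-generated, and the field names here are assumptions):

```rust
use std::collections::HashMap;

// Minimal stand-in for the generated protobuf telemetry (field names assumed).
#[derive(Default, Clone)]
struct Telemetry {
    sdk_id: u32,
    latency_buckets: Vec<u64>,            // per-bucket observation counts
    resolve_counts: HashMap<String, u64>, // resolve reason -> count
}

fn aggregate_batch(messages: &[Telemetry]) -> Telemetry {
    let mut merged = Telemetry::default();
    for msg in messages {
        merged.sdk_id = msg.sdk_id; // gauge-like fields: last writer wins
        // Histograms merge bucket-wise.
        if merged.latency_buckets.len() < msg.latency_buckets.len() {
            merged.latency_buckets.resize(msg.latency_buckets.len(), 0);
        }
        for (i, c) in msg.latency_buckets.iter().enumerate() {
            merged.latency_buckets[i] = merged.latency_buckets[i].saturating_add(*c);
        }
        // Counters merge by summing per key.
        for (reason, c) in &msg.resolve_counts {
            *merged.resolve_counts.entry(reason.clone()).or_default() += *c;
        }
    }
    merged
}
```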
- Use performance.now() instead of Date.now() for better timing resolution
- Rename METRICS_KV binding to CONFIDENCE_METRICS_KV
- Fall back to Date.now() if the performance API is unavailable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…formance.now()

Use web_sys::WorkerGlobalScope::performance() instead of js_sys::Reflect::get() to avoid dynamic JS lookups on the hot path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cloudflare Workers freeze all timer APIs (Date.now, performance.now) during synchronous CPU work as a Spectre mitigation, making it impossible to measure resolve latency internally. Instead, we query Cloudflare's GraphQL analytics API from the queue consumer to get real CPU time percentiles (p25/p50/p75/p90/p99) and distribute them into the same exponential histogram buckets used by all other providers.

This ensures:
- Same metric names (confidence_resolve_latency_microseconds)
- Same histogram format (compatible with shared Grafana dashboards)
- Real CPU time data (~1-2ms p50 for in-memory flag evaluation)

The queue consumer uses a cursor stored in KV (cpu_time_cursor) to avoid double-counting across batches. The deployer now passes CLOUDFLARE_API_TOKEN and CLOUDFLARE_ACCOUNT_ID as Worker env vars.

Also removes the broken internal timer (performance.now/Date.now) and the web-sys dependency that was added for it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use floor() instead of round() for percentile band allocation to prevent assigning more observations than requests
- Track actual observations placed into buckets for _count, to ensure the +Inf bucket >= all cumulative bucket counts
- Revert sum to use weighted distribution across all percentile bands (correct representation of the latency distribution)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single-request CF analytics data points have all percentiles identical (p25=p50=p75=p90=p99). Spreading them across 6 bands inflates the histogram. Now place a single observation at p50 for these cases.

Also fix rounding: use floor() and track remaining to prevent over-allocation across bands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
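A sketch of the allocation logic described above; the band weights and the exponential bucketing helper are illustrative assumptions, not the PR's exact code:

```rust
/// Distribute a batch of requests into histogram buckets from CPU-time
/// percentiles (p25/p50/p75/p90/p99, in microseconds).
fn distribute(requests: u64, percentiles_us: [u64; 5], buckets: &mut [u64]) {
    let [p25, p50, p75, p90, p99] = percentiles_us;

    // Single-request analytics rows report identical percentiles; spreading
    // them over all bands would inflate the histogram, so place one
    // observation at p50 and stop.
    if p25 == p99 {
        add(buckets, p50, requests);
        return;
    }

    // Each band is represented by its upper percentile value (weights assumed).
    let bands = [(p25, 0.25), (p50, 0.25), (p75, 0.25), (p90, 0.15), (p99, 0.10)];
    let mut remaining = requests;
    for (value_us, weight) in bands {
        // floor(), not round(): bands must never allocate more than requests.
        let n = ((requests as f64) * weight).floor() as u64;
        let n = n.min(remaining);
        add(buckets, value_us, n);
        remaining -= n;
    }
    // Whatever flooring left over goes to the median band.
    add(buckets, p50, remaining);
}

fn add(buckets: &mut [u64], value_us: u64, n: u64) {
    // Hypothetical exponential buckets: bucket i covers [2^i, 2^(i+1)) µs,
    // capped at the last bucket.
    let idx = (value_us.max(1).ilog2() as usize).min(buckets.len().saturating_sub(1));
    if let Some(b) = buckets.get_mut(idx) {
        *b = b.saturating_add(n);
    }
}
```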
…rvals

Store the last fetch timestamp in KV (cpu_time_last_fetch_ms) and skip the GraphQL query if less than 10 seconds have elapsed. Prevents excessive API calls when queue batches arrive faster than analytics data updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…f extra KV key

Compare the cursor timestamp to now and skip the CF analytics call if the cursor is less than 10s old. No extra KV reads/writes needed; the cursor we already store doubles as the rate-limit check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
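The check is a single comparison against the cursor we already persist; a sketch against the workers-rs KV API (the cursor's value layout, epoch millis as text, is an assumption):

```rust
use worker::kv::{KvError, KvStore};

const MIN_FETCH_INTERVAL_MS: u64 = 10_000;

/// Skip the CF analytics call when the stored cursor is less than 10s old.
async fn should_fetch_analytics(kv: &KvStore, now_ms: u64) -> Result<bool, KvError> {
    let cursor_ms: Option<u64> = kv
        .get("cpu_time_cursor")
        .text()
        .await?
        .and_then(|s| s.parse().ok());
    Ok(match cursor_ms {
        // The cursor we already keep for pagination doubles as the limiter.
        Some(ts) => now_ms.saturating_sub(ts) >= MIN_FETCH_INTERVAL_MS,
        None => true, // no cursor yet: first fetch is always allowed
    })
}
```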
When the queue consumer fetches new CPU time data from CF analytics (rate-limited to every 10s via cursor age), the latency histogram delta is now included in the WriteFlagLogsRequest sent to the Confidence backend. When no new latency data is available (rate limit or no traffic), only resolve reasons are sent — no stale latency data is re-sent. This ensures the backend's VictoriaMetrics telemetry consumer receives the same latency distribution data as the /metrics endpoint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add new SDK ID for the Cloudflare resolver to all proto definitions and set it in the telemetry data sent by the CF resolver. Also syncs missing IDs 23 (PYTHON_LOCAL_PROVIDER) and 24 (RUST_LOCAL_PROVIDER) from epx-flags-resolver. Without this, the backend receives telemetry with sdk=null, making it invisible on the SDK telemetry Grafana dashboard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The backend's VictoriaMetricsClient skips resolve_rate metrics when resolver_version is empty. Without this, resolve rates from the CF resolver are silently dropped and don't appear on Grafana dashboards. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move CLOUDFLARE_API_TOKEN from plaintext [vars] to an encrypted Worker secret via `env.secret()` and `wrangler secret put` after deploy
- Inject CF_SCRIPT_NAME from WORKER_NAME so prefixed deployments query the correct script in CF analytics
- Merge update_prometheus_kv and update_cpu_time_kv into a single update_metrics_kv to avoid a double KV read-modify-write of "snapshot"
- Clamp u64→u32 truncation on sum/count with .min(u32::MAX as u64)
- Cap histogram bucket index at BUCKET_COUNT-1 to prevent unbounded growth from unexpected CF analytics values
- Add Cache-Control: no-store to the /metrics response
- Export BUCKET_COUNT as pub from the telemetry module

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…etry

When the CF analytics rate limit fires, estimate latency from cached percentiles instead of sending nothing. This keeps the Grafana latency graph continuous between analytics fetches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scheduler.wait(0) unfreezes CF Workers' Spectre-mitigated timers with zero overhead, allowing Date.now() to reflect actual CPU time. This enables inline latency measurement identical to other providers, removing the need for the CF GraphQL analytics API dependency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the entire Cloudflare GraphQL analytics API machinery with inline latency measurement via scheduler.wait(0). This zero-overhead yield unfreezes CF Workers' Spectre-mitigated timers, enabling Date.now()-based CPU time measurement identical to other providers.

Removes ~360 lines: GraphQL query, cursor-based pagination, rate-limiting, percentile-to-histogram distribution, cached percentile estimation, and the CLOUDFLARE_API_TOKEN dependency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
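One way to reach the scheduler API from Rust is a dynamic lookup through js_sys; a sketch of the inline measurement (the helper name and error handling are illustrative, not necessarily the PR's code):

```rust
use js_sys::{Date, Function, Promise, Reflect};
use wasm_bindgen::{JsCast, JsValue};
use wasm_bindgen_futures::JsFuture;

/// Yield to the runtime with scheduler.wait(0) so CF Workers' Spectre-frozen
/// clock advances, then read Date.now(). Returns None if the scheduler API
/// is unavailable (e.g. outside the Workers runtime).
async fn unfrozen_now_ms() -> Option<f64> {
    let scheduler = Reflect::get(&js_sys::global(), &JsValue::from_str("scheduler")).ok()?;
    let wait: Function = Reflect::get(&scheduler, &JsValue::from_str("wait"))
        .ok()?
        .dyn_into()
        .ok()?;
    let promise: Promise = wait
        .call1(&scheduler, &JsValue::from_f64(0.0))
        .ok()?
        .dyn_into()
        .ok()?;
    JsFuture::from(promise).await.ok()?; // zero-delay yield unfreezes the clock
    Some(Date::now())
}

// Usage: sample before and after the synchronous resolve work.
// let start = unfrozen_now_ms().await;
// let result = resolve_flag(/* ... */); // hot path, synchronous CPU work
// let elapsed_ms = match (start, unfrozen_now_ms().await) {
//     (Some(a), Some(b)) => Some((b - a).max(0.0)),
//     _ => None, // scheduler unavailable: report no measurement
// };
```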
Remove deployer vars (CLOUDFLARE_ACCOUNT_ID, CF_SCRIPT_NAME, CLOUDFLARE_API_TOKEN secret) that were only needed for the CF analytics API. Revert BUCKET_COUNT to non-pub. Remove unused imports. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from ebf57e3 to 6afd948
nicklasl reviewed May 12, 2026
- Auth-gate /metrics with ClientSecret header (#9)
- Extract Prometheus content-type to named constant (#6)
- Record 0μs sub-ms observations instead of dropping them (#7)
- Use Option<u32> for elapsed_us: None when scheduler API unavailable, Some(0) for measured sub-ms resolves (#8)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
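The Option<u32> semantics from the last bullet, as a small sketch (the bucketing is a hypothetical stand-in matching the earlier examples):

```rust
/// None: scheduler API unavailable, so no measurement was possible.
/// Some(0): the resolve finished within one 1ms clock tick; it is still a
/// real observation and is recorded at 0µs rather than dropped.
fn record_latency(elapsed_us: Option<u32>, histogram: &mut [u64]) {
    let Some(us) = elapsed_us else { return };
    // Hypothetical exponential bucketing matching the shared providers.
    let idx = (u64::from(us).max(1).ilog2() as usize).min(histogram.len().saturating_sub(1));
    if let Some(bucket) = histogram.get_mut(idx) {
        *bucket += 1;
    }
}
```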
- Document latency measurement (scheduler.wait(0), 1ms precision)
- Document /metrics endpoint auth (ClientSecret header)
- Document KV store usage and the DISABLE_METRICS deployer option
- Add DISABLE_METRICS env var to skip KV creation when scraping is not needed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_METRICS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace DISABLE_METRICS with ENABLE_METRICS. The /metrics endpoint and KV store are now only created when explicitly enabled, reducing default resource usage for customers who don't need Prometheus scraping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace RESOLVE_LOGGER, ASSIGN_LOGGER, TELEMETRY, and LAST_FLUSHED statics with a single thread-local WriteFlagLogsRequest populated per-request. Host callbacks append directly via new public builder functions (build_flag_assigned, build_resolve_log, build_request_telemetry). All cross-request aggregation now happens in the queue consumer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
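Roughly this shape, with a plain struct standing in for the protobuf-generated WriteFlagLogsRequest:

```rust
use std::cell::RefCell;

// Plain stand-in for the protobuf-generated WriteFlagLogsRequest.
#[derive(Default)]
pub struct WriteFlagLogsRequest {
    pub resolve_logs: Vec<String>,
    // ... flag-assigned entries, request telemetry, etc.
}

thread_local! {
    // One pending request per isolate, repopulated on every fetch; no
    // cross-request statics remain.
    static PENDING: RefCell<WriteFlagLogsRequest> =
        RefCell::new(WriteFlagLogsRequest::default());
}

/// Host callback: append a resolve log entry directly to the pending request.
pub fn build_resolve_log(entry: String) {
    PENDING.with(|p| p.borrow_mut().resolve_logs.push(entry));
}

/// Take the accumulated request for the queue at end-of-request; all
/// cross-request aggregation then happens in the queue consumer.
pub fn take_pending() -> WriteFlagLogsRequest {
    PENDING.with(|p| std::mem::take(&mut *p.borrow_mut()))
}
```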
nicklasl approved these changes May 13, 2026
nicklasl reviewed May 13, 2026
Co-authored-by: Nicklas Lundin <nicklasl@spotify.com>
Summary
Add telemetry collection to the Cloudflare resolver with the same metric names as all other providers, enabling shared Grafana dashboards.
Key discovery: scheduler.wait(0) unfreezes CF Workers timers

CF Workers freeze Date.now() and performance.now() during synchronous CPU work (Spectre mitigation), making internal latency measurement appear impossible. We discovered that scheduler.wait(0), a zero-delay yield to the runtime, unfreezes the clock with no measurable overhead. This enables inline CPU time measurement identical to other providers, without any external API dependency.

What's included
- Inline latency measurement via scheduler.wait(0) + Date.now() (1ms resolution, Grafana interpolates to sub-ms)
- /metrics endpoint serving Prometheus exposition format (confidence_resolve_latency_microseconds histogram + confidence_resolves_total counters)
- Telemetry in WriteFlagLogsRequest with SDK ID (SDK_ID_CLOUDFLARE_RESOLVER = 25), resolver version, resolve rates, and latency histogram
- Cache-Control: no-store on the /metrics response

Data flow
What was explored and removed
The initial approach sourced latency from Cloudflare's GraphQL Analytics API (workersInvocationsAdaptive percentiles). This worked but added complexity: cursor-based pagination, rate-limiting, percentile-to-histogram distribution, cached percentile estimation. Once scheduler.wait(0) was proven to unfreeze timers, the entire GraphQL machinery was removed (~360 lines), leaving a clean inline timer identical to other providers.

Test plan
- cargo check -p confidence-cloudflare-resolver compiles clean
- cargo test -p confidence_resolver -- telemetry passes (25 tests)
- /metrics returns valid Prometheus text
- scheduler.wait(0) adds no measurable latency to the resolve path (0ms overhead in 19/20 samples)
- Metric names match the other providers (confidence_resolve_latency_microseconds, confidence_resolves_total)

🤖 Generated with Claude Code