
feat(cloudflare): add telemetry collection and /metrics endpoint #400

Open

vahidlazio wants to merge 36 commits into main from feat/cloudflare-telemetry-analytics-engine

Conversation

@vahidlazio (Contributor) commented May 7, 2026

Summary

Add telemetry collection to the Cloudflare resolver with the same metric names as all other providers, enabling shared Grafana dashboards.

Key discovery: scheduler.wait(0) unfreezes CF Workers timers

CF Workers freeze Date.now() and performance.now() during synchronous CPU work (Spectre mitigation), making internal latency measurement appear impossible. We discovered that scheduler.wait(0) — a zero-delay yield to the runtime — unfreezes the clock with no measurable overhead. This enables inline CPU time measurement identical to other providers, without any external API dependency.
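A minimal sketch of the measurement pattern in Worker-side TypeScript (the actual hot path is Rust compiled to WASM; resolveFlags and PENDING_METRICS are hypothetical stand-ins for the resolver's internals):

```ts
// Sketch of the timer-unfreeze pattern; every name declared here is a
// hypothetical stand-in, typed loosely to keep the example self-contained.
declare const scheduler: { wait(ms: number): Promise<void> }; // Workers scheduler API
declare function resolveFlags(req: Request): { value: unknown; reason: string };
declare const PENDING_METRICS: { elapsedUs: number; reason: string }[];

export default {
  async fetch(request: Request): Promise<Response> {
    const start = Date.now();              // frozen while synchronous CPU work runs
    const result = resolveFlags(request);  // synchronous, CPU-bound resolve
    await scheduler.wait(0);               // zero-delay yield; the runtime advances the clock
    const elapsedUs = (Date.now() - start) * 1000; // 1ms resolution, stored as microseconds
    PENDING_METRICS.push({ elapsedUs, reason: result.reason }); // drained in wait_until
    return Response.json(result.value);
  },
};
```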

What's included

  • Inline resolve latency via scheduler.wait(0) + Date.now() (1ms resolution, Grafana interpolates to sub-ms)
  • Resolve rate telemetry by reason (MATCH, NO_SEGMENT_MATCH, ERROR, etc.)
  • /metrics endpoint serving Prometheus exposition format (confidence_resolve_latency_microseconds histogram + confidence_resolves_total counters); a sketch of such a handler follows this list
  • Queue-based aggregation — per-isolate deltas accumulated into KV-backed cumulative snapshot
  • Backend telemetryWriteFlagLogsRequest with SDK ID (SDK_ID_CLOUDFLARE_RESOLVER = 25), resolver version, resolve rates, and latency histogram
  • Cache-Control: no-store on /metrics response
  • Deployer auto-creates KV namespace for metrics storage
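As referenced in the /metrics bullet above, a rough sketch of such a handler (the CONFIDENCE_METRICS_KV binding name and the ClientSecret auth gate appear in the commit log below; checkSecret and the "prometheus" key name are assumptions):

```ts
// Sketch of an auth-gated /metrics handler backed by KV; helper names and
// the KV key are assumptions, not the actual code.
const PROMETHEUS_CONTENT_TYPE = "text/plain; version=0.0.4; charset=utf-8";

interface KVNamespace { get(key: string): Promise<string | null>; }
declare function checkSecret(req: Request): boolean; // validates the ClientSecret header

export async function handleMetrics(
  request: Request,
  env: { CONFIDENCE_METRICS_KV: KVNamespace },
): Promise<Response> {
  if (!checkSecret(request)) return new Response("unauthorized", { status: 401 });
  const body = (await env.CONFIDENCE_METRICS_KV.get("prometheus")) ?? "";
  return new Response(body, {
    headers: {
      "Content-Type": PROMETHEUS_CONTENT_TYPE,
      "Cache-Control": "no-store", // per the bullet above: never cache scrapes
    },
  });
}
```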

Data flow

Hot path:     Date.now() → resolve → scheduler.wait(0) → Date.now() → elapsed_us
              Push {elapsed_us, reasons} to PENDING_METRICS
wait_until:   Drain metrics → TELEMETRY.record_latency_us / mark_resolve
              → checkpoint() → Queue
Queue consumer (batched, all isolates):
  ├─ KV: read cumulative → accumulate deltas → write snapshot + prometheus text
  └─ POST aggregated WriteFlagLogsRequest to Confidence backend
/metrics:     Read prometheus text from KV
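The queue-consumer step above, sketched in TypeScript (the real consumer is Rust; the Snapshot shape, accumulate, and renderPrometheus are illustrative):

```ts
// Sketch of cross-isolate accumulation into a KV-backed snapshot. Note the
// read-modify-write is racy under concurrent consumers, a caveat a commit
// below documents in the real code.
interface Snapshot {
  buckets: number[];                  // flat latency histogram, microsecond buckets
  sumUs: number;                      // total observed latency
  count: number;                      // total observations
  resolves: Record<string, number>;   // resolve counts keyed by reason
}
interface KV {
  get(key: string, type: "json"): Promise<Snapshot | null>;
  put(key: string, value: string): Promise<void>;
}

declare function accumulate(into: Snapshot, delta: Snapshot): void;
declare function renderPrometheus(snapshot: Snapshot): string;

export async function consumeBatch(deltas: Snapshot[], kv: KV): Promise<void> {
  const snapshot = (await kv.get("snapshot", "json")) ??
    { buckets: [], sumUs: 0, count: 0, resolves: {} };
  for (const delta of deltas) accumulate(snapshot, delta); // merge per-isolate deltas
  await kv.put("snapshot", JSON.stringify(snapshot));      // cumulative state
  await kv.put("prometheus", renderPrometheus(snapshot));  // pre-rendered /metrics body
}
```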

What was explored and removed

The initial approach sourced latency from Cloudflare's GraphQL Analytics API (workersInvocationsAdaptive percentiles). This worked but added complexity: cursor-based pagination, rate-limiting, percentile-to-histogram distribution, cached percentile estimation. Once scheduler.wait(0) was proven to unfreeze timers, the entire GraphQL machinery was removed (~360 lines), leaving a clean inline timer identical to other providers.

Test plan

  • cargo check -p confidence-cloudflare-resolver compiles clean
  • cargo test -p confidence_resolver -- telemetry passes (25 tests)
  • Deploy to test account and verify /metrics returns valid Prometheus text
  • Verify resolve reasons accumulate correctly (MATCH, NO_SEGMENT_MATCH, ERROR)
  • Verify latency histogram matches warm resolve times (~1-2ms p50)
  • Verify telemetry appears on Grafana dashboard (resolve rate continuous, latency continuous)
  • Verify scheduler.wait(0) adds no measurable latency to resolve path (0ms overhead in 19/20 samples)
  • Verify metric names match other providers (confidence_resolve_latency_microseconds, confidence_resolves_total)

🤖 Generated with Claude Code

@vahidlazio vahidlazio marked this pull request as draft May 7, 2026 09:17
@vahidlazio vahidlazio marked this pull request as ready for review May 7, 2026 11:50
vahidlazio and others added 25 commits May 12, 2026 13:14
Add Prometheus-compatible telemetry to the Cloudflare resolver, matching
the same metric names as the WASM providers so they can share dashboards.

- Collect per-flag resolve latency and reason in the fetch handler,
  deferred to ctx.wait_until to keep the hot path clean
- Include telemetry deltas in WriteFlagLogsRequest via checkpoint()
- Queue consumer accumulates cross-isolate deltas into a cumulative
  TelemetrySnapshot persisted in KV
- Serve /metrics endpoint reading Prometheus text from KV
- Add serde derives to TelemetrySnapshot and accumulate_delta() method
  for reconstructing flat histograms from compressed BucketSpans
- Deployer auto-creates KV namespace (same pattern as queue creation)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Await KV put operations in update_prometheus_kv (were fire-and-forget)
- Guard against negative/oversized BucketSpan offsets in accumulate_delta
- Add race condition comment on KV read-modify-write
- Add CORS headers to /metrics endpoint for consistency
- Add unit tests for accumulate_delta: basic, negative offset, oversized

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
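In rough TypeScript pseudocode, the guarded span expansion described in the two commits above might look like this (the actual method is Rust; span offsets are assumed relative to the end of the previous span, as in compressed histogram encodings):

```ts
// Sketch of accumulate_delta's guards: bail out on a negative absolute offset
// and stop before writing past the fixed bucket range.
interface BucketSpan { offset: number; counts: number[]; }

function accumulateDelta(flat: number[], spans: BucketSpan[], bucketCount: number): void {
  let index = 0;
  for (const span of spans) {
    index += span.offset;                       // offset relative to previous span's end
    if (index < 0) return;                      // guard: negative offset
    for (const count of span.counts) {
      if (index >= bucketCount) return;         // guard: oversized offset or span
      flat[index] = (flat[index] ?? 0) + count; // merge delta into the flat histogram
      index += 1;
    }
  }
}
```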
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allow indexing_slicing and arithmetic_side_effects on the method since
bounds are checked before every index. Use saturating_add for resize.
Re-sync Go WASM module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously aggregate_batch discarded all telemetry data except the SDK
field from the first message. Now it merges latency histograms, resolve
rate counters, and gauge fields across all messages in the batch, so the
Confidence backend receives aggregated telemetry matching what the WASM
providers send.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use performance.now() instead of Date.now() for better timing resolution
- Rename METRICS_KV binding to CONFIDENCE_METRICS_KV
- Fallback to Date.now() if performance API is unavailable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…formance.now()

Use web_sys::WorkerGlobalScope::performance() instead of js_sys::Reflect::get()
to avoid dynamic JS lookups on the hot path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cloudflare Workers freeze all timer APIs (Date.now, performance.now)
during synchronous CPU work as a Spectre mitigation, making it
impossible to measure resolve latency internally.

Instead, we query Cloudflare's GraphQL analytics API from the queue
consumer to get real CPU time percentiles (p25/p50/p75/p90/p99) and
distribute them into the same exponential histogram buckets used by
all other providers. This ensures:

- Same metric names (confidence_resolve_latency_microseconds)
- Same histogram format (compatible with shared Grafana dashboards)
- Real CPU time data (~1-2ms p50 for in-memory flag evaluation)

The queue consumer uses a cursor stored in KV (cpu_time_cursor) to
avoid double-counting across batches. The deployer now passes
CLOUDFLARE_API_TOKEN and CLOUDFLARE_ACCOUNT_ID as Worker env vars.

Also removes the broken internal timer (performance.now/Date.now)
and the web-sys dependency that was added for it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use floor() instead of round() for percentile band allocation to
  prevent assigning more observations than requests
- Track actual observations placed into buckets for _count to ensure
  +Inf bucket >= all cumulative bucket counts
- Revert sum to use weighted distribution across all percentile bands
  (correct representation of the latency distribution)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single-request CF analytics data points have all percentiles identical
(p25=p50=p75=p90=p99). Spreading them across 6 bands inflates the
histogram. Now place a single observation at p50 for these cases.

Also fix rounding: use floor() and track remaining to prevent
over-allocation across bands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rvals

Store last fetch timestamp in KV (cpu_time_last_fetch_ms) and skip
the GraphQL query if less than 10 seconds have elapsed. Prevents
excessive API calls when queue batches arrive faster than analytics
data updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…f extra KV key

Compare cursor timestamp to now — skip CF analytics call if cursor
is less than 10s old. No extra KV reads/writes needed; the cursor
we already store doubles as the rate-limit check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the queue consumer fetches new CPU time data from CF analytics
(rate-limited to every 10s via cursor age), the latency histogram
delta is now included in the WriteFlagLogsRequest sent to the
Confidence backend. When no new latency data is available (rate
limit or no traffic), only resolve reasons are sent — no stale
latency data is re-sent.

This ensures the backend's VictoriaMetrics telemetry consumer
receives the same latency distribution data as the /metrics endpoint.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add new SDK ID for the Cloudflare resolver to all proto definitions
and set it in the telemetry data sent by the CF resolver. Also syncs
missing IDs 23 (PYTHON_LOCAL_PROVIDER) and 24 (RUST_LOCAL_PROVIDER)
from epx-flags-resolver.

Without this, the backend receives telemetry with sdk=null, making
it invisible on the SDK telemetry Grafana dashboard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The backend's VictoriaMetricsClient skips resolve_rate metrics when
resolver_version is empty. Without this, resolve rates from the CF
resolver are silently dropped and don't appear on Grafana dashboards.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move CLOUDFLARE_API_TOKEN from plaintext [vars] to encrypted Worker
  secret via `env.secret()` and `wrangler secret put` after deploy
- Inject CF_SCRIPT_NAME from WORKER_NAME so prefixed deployments query
  the correct script in CF analytics
- Merge update_prometheus_kv and update_cpu_time_kv into a single
  update_metrics_kv to avoid double KV read-modify-write of "snapshot"
- Clamp u64→u32 truncation on sum/count with .min(u32::MAX as u64)
- Cap histogram bucket index at BUCKET_COUNT-1 to prevent unbounded
  growth from unexpected CF analytics values
- Add Cache-Control: no-store to /metrics response
- Export BUCKET_COUNT as pub from telemetry module

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…etry

When CF analytics rate-limit fires, estimate latency from cached
percentiles instead of sending nothing. This keeps the Grafana
latency graph continuous between analytics fetches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
scheduler.wait(0) unfreezes CF Workers' Spectre-mitigated timers with
zero overhead, allowing Date.now() to reflect actual CPU time. This
enables inline latency measurement identical to other providers,
removing the need for the CF GraphQL analytics API dependency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the entire Cloudflare GraphQL analytics API machinery with
inline latency measurement via scheduler.wait(0). This zero-overhead
yield unfreezes CF Workers' Spectre-mitigated timers, enabling
Date.now()-based CPU time measurement identical to other providers.

Removes ~360 lines: GraphQL query, cursor-based pagination,
rate-limiting, percentile-to-histogram distribution, cached
percentile estimation, and the CLOUDFLARE_API_TOKEN dependency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove deployer vars (CLOUDFLARE_ACCOUNT_ID, CF_SCRIPT_NAME,
CLOUDFLARE_API_TOKEN secret) that were only needed for the CF
analytics API. Revert BUCKET_COUNT to non-pub. Remove unused imports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vahidlazio vahidlazio force-pushed the feat/cloudflare-telemetry-analytics-engine branch from ebf57e3 to 6afd948 May 12, 2026 11:16
vahidlazio and others added 7 commits May 12, 2026 14:10
- Auth-gate /metrics with ClientSecret header (#9)
- Extract Prometheus content-type to named constant (#6)
- Record 0μs sub-ms observations instead of dropping them (#7)
- Use Option<u32> for elapsed_us: None when scheduler API
  unavailable, Some(0) for measured sub-ms resolves (#8)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Document latency measurement (scheduler.wait(0), 1ms precision)
- Document /metrics endpoint auth (ClientSecret header)
- Document KV store usage and DISABLE_METRICS deployer option
- Add DISABLE_METRICS env var to skip KV creation when scraping not needed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_METRICS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace DISABLE_METRICS with ENABLE_METRICS. The /metrics endpoint
and KV store are now only created when explicitly enabled, reducing
default resource usage for customers who don't need Prometheus scraping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace RESOLVE_LOGGER, ASSIGN_LOGGER, TELEMETRY, and LAST_FLUSHED
statics with a single thread-local WriteFlagLogsRequest populated
per-request. Host callbacks append directly via new public builder
functions (build_flag_assigned, build_resolve_log, build_request_telemetry).
All cross-request aggregation now happens in the queue consumer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>