Trace latency: add observability and benchmark gates #282

@thorrester

Parent: #281

Goal

Add the observability and benchmark gates needed to make the rest of the DataFusion latency work measurable.

This should be the first implementation step. Without request-level timings, object-store request counts, cache-hit metrics, and repeatable scenarios, a later change can look successful because one local query got faster, even though p95/p99 latency or cold-path behavior got worse.

Scope

  • Add trace/query spans that separate planning, Delta snapshot refresh, pruning, footer reads, object-store range reads, DataFusion execution, and result materialization.
  • Add object-store counters for get, get_range, head, and list, split by table or logical workload where practical.
  • Add cache metrics for RAM object-store range cache hits/misses and result-cache hits/misses.
  • Add a sentinel metric that increments if Delta refresh runs on the HTTP request path. This should normally stay at zero.
  • Add a benchmark harness that records p50/p95/p99, object-store request counts, bytes read, cache hit rates, table/file counts, and storage freshness state.
  • Include benchmark scenarios for cold trace lookup, repeated trace lookup, service-filtered dashboard reads, service-agnostic dashboard reads, small-file partitions, compacted partitions, and warmup behavior.
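
As a rough illustration of the first and fifth bullets, the sketch below accumulates wall-clock time per request phase and reduces a list of request latencies to p50/p95/p99. It is a minimal Python sketch, not the project's instrumentation: `PhaseTimer`, `percentile`, and `latency_summary` are hypothetical names, the phase labels are placeholders for the phases listed above, and the nearest-rank percentile definition is an assumption (the issue does not specify one).

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock seconds per named phase of one request (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so tests can use a fake clock
        self.seconds = defaultdict(float)

    @contextmanager
    def span(self, phase):
        start = self._clock()
        try:
            yield
        finally:
            # Record elapsed time even if the phase raised.
            self.seconds[phase] += self._clock() - start


def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100 (assumed definition)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceil(q * n / 100)
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Reduce per-request latencies to the three percentiles the harness records."""
    return {f"p{q}": percentile(samples_ms, q) for q in (50, 95, 99)}
```

A request handler would wrap each phase, for example `with timer.span("planning"): ...` followed by `with timer.span("footer_reads"): ...`, and the harness would feed the per-request totals into `latency_summary`.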

High-level design

Instrument first, tune second. The benchmark harness should be able to run against local object storage and cloud object storage, but the data it emits should have the same shape in both cases. Prefer JSON artifacts so results can be compared across branches.
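
One way to keep the emitted data the same shape across backends is to serialize a single record type and compare artifacts structurally. The sketch below assumes a flat record; every field name, the `backend` labels, and the `storage_freshness` value format are hypothetical and only mirror the quantities listed under Scope.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BenchmarkResult:
    """One scenario's result; field names are illustrative, not a fixed schema."""
    scenario: str
    backend: str                 # hypothetical labels such as "local" or "cloud"
    latency_ms: dict             # {"p50": ..., "p95": ..., "p99": ...}
    object_store_ops: dict       # {"get": ..., "get_range": ..., "head": ..., "list": ...}
    bytes_read: int
    cache_hit_rate: dict         # per-cache hit rate in [0, 1]
    table_count: int
    file_count: int
    storage_freshness: str       # placeholder for the freshness state
    refresh_on_request_path: int = 0  # sentinel value; expected to stay at zero


def to_json(result):
    """Serialize with sorted keys so artifacts diff cleanly across branches."""
    return json.dumps(asdict(result), indent=2, sort_keys=True)


def shape(value):
    """Replace leaf values with their type names to compare structure only."""
    if isinstance(value, dict):
        return {key: shape(inner) for key, inner in value.items()}
    return type(value).__name__
```

With this, a local run and a cloud run of the same scenario can be checked with `shape(json.loads(a)) == shape(json.loads(b))` before their numbers are compared.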

Use a counting ObjectStore wrapper where possible instead of scattering counters through query code. That keeps object-store behavior visible across trace spans, summaries, GenAI tables, Bifrost tables, and future DataFusion-backed datasets.
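
The wrapper idea can be sketched independently of the real store. The Python below is only an illustration of the pattern, assuming a simplified four-method interface: `InMemoryStore` is a stand-in backend, `CountingStore` and its `label` argument are hypothetical, and the actual implementation would presumably wrap the Rust `ObjectStore` trait instead.

```python
from collections import Counter


class InMemoryStore:
    """Stand-in backend for the sketch; not a real object-store client."""

    def __init__(self, objects):
        self._objects = dict(objects)

    def get(self, path):
        return self._objects[path]

    def get_range(self, path, start, end):
        return self._objects[path][start:end]

    def head(self, path):
        return {"size": len(self._objects[path])}

    def list(self, prefix):
        return sorted(p for p in self._objects if p.startswith(prefix))


class CountingStore:
    """Delegates to an inner store and counts operations and bytes read."""

    def __init__(self, inner, label):
        self._inner = inner
        self.label = label      # e.g. a table or logical workload name
        self.ops = Counter()
        self.bytes_read = 0

    def get(self, path):
        self.ops["get"] += 1
        data = self._inner.get(path)
        self.bytes_read += len(data)
        return data

    def get_range(self, path, start, end):
        self.ops["get_range"] += 1
        data = self._inner.get_range(path, start, end)
        self.bytes_read += len(data)
        return data

    def head(self, path):
        self.ops["head"] += 1
        return self._inner.head(path)

    def list(self, prefix):
        self.ops["list"] += 1
        return self._inner.list(prefix)
```

Because query code only sees the wrapper, each table or workload can get its own `CountingStore` with its own `label`, and the counters end up in the benchmark output without touching the query path.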

Acceptance criteria

  • Benchmark output includes latency distribution, object-store operation counts, bytes read, and cache hit rates.
  • The refresh-on-request-path sentinel exists and is included in the benchmark output.
  • A synthetic request-path refresh increments the sentinel, proving the guard works.
  • A normal read workload keeps the sentinel at zero.
  • The harness can compare before/after results for the later maintenance, warmup, and cache work.
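
The sentinel checks and the before/after comparison could look roughly like the following. This is a hedged sketch: `RefreshSentinel`, `latency_deltas`, and `gate` are hypothetical names, the result dictionaries are assumed to carry `latency_ms` and `refresh_on_request_path` keys, and the zero tolerance is an arbitrary default.

```python
class RefreshSentinel:
    """Counts Delta refreshes that happen on the HTTP request path."""

    def __init__(self):
        self.count = 0

    def record_refresh(self, on_request_path):
        # Background refreshes are expected; only request-path ones are counted.
        if on_request_path:
            self.count += 1


def latency_deltas(before, after, metrics=("p50", "p95", "p99")):
    """Return after minus before per metric; negative means faster."""
    return {m: after["latency_ms"][m] - before["latency_ms"][m] for m in metrics}


def gate(before, after, tolerance_ms=0):
    """List the reasons a change should not be accepted (empty list means pass)."""
    failures = [
        f"{metric} regressed by {delta} ms"
        for metric, delta in latency_deltas(before, after).items()
        if delta > tolerance_ms
    ]
    if after["refresh_on_request_path"] != 0:
        failures.append("Delta refresh ran on the request path")
    return failures
```

For example, a change that moves p50 from 40 ms to 25 ms but p99 from 300 ms to 450 ms would still fail the gate on p99, which is the failure mode described under Goal.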
