This is the operator reference for load-test debugging. Use it when a run has unexpected throughput, latency, error, routing, or compaction behavior.
Every orchestrated run writes artifacts under:
loadtest/results/<utc-timestamp>-<scenario-name>/
Start with these files:
| File | Use |
|---|---|
summary.json |
Final collector rollup, artifact presence, reconciliation warnings, latest broker/process/host totals. |
client-summary.json |
Hammer totals, latency percentiles, per-broker routing counts, and client trace stage summaries. |
client-30s.jsonl |
Interval hammer counts, rates, latency summaries, and client trace stage summaries. |
metrics-30s.jsonl |
Interval broker JSON metrics, process samples, host samples, aggregate totals, and interval deltas. |
logs/http-hammer.log |
Hammer stdout/stderr and fatal client-side failures. |
logs/<broker-id>.log |
Broker stdout/stderr and request trace write failures. |
logs/http-hammer-request-trace.jsonl |
Per-request client trace records keyed by request_id. |
logs/<broker-id>-request-trace.jsonl |
Per-request broker trace records keyed by request_id. |
logs/<broker-id>-batch-flush-trace.jsonl |
Per-flush broker batch trace records with queue-wait summaries and nested writer stage totals. |
When a latency gap appears, join the two request trace files by request_id.
Client trace stages are in milliseconds. Broker trace stages and Prometheus
histograms are in seconds.
| Artifact | Producer | Notes |
|---|---|---|
resolved-config.json |
loadtest.orchestrate |
Effective scenario config for the run. |
stack.json |
loadtest.orchestrate |
Generated stack resources and paths. |
processes.json |
loadtest.orchestrate |
Process metadata, PIDs, running status, and log paths. |
brokers.json |
loadtest.orchestrate |
Broker IDs and HTTP addresses. |
ports.json |
loadtest.orchestrate |
Reserved local TCP ports. |
environment.json |
loadtest.orchestrate |
Captured environment subset used by the run. |
git.json |
loadtest.orchestrate |
Repo revision metadata. |
routing-manifest.json |
loadtest.routing |
Deterministic topic-partition owner sets. |
partition-manifest.json |
loadtest.precreate |
Partition pre-create attempts and per-entry status. |
client-30s.jsonl |
loadtest/http_hammer.go |
One JSON object per hammer interval. |
client-summary.json |
loadtest/http_hammer.go |
Final hammer summary. |
remap-events.jsonl |
loadtest/http_hammer.go |
Routing remaps emitted by owner-set routing. |
metrics-30s.jsonl |
loadtest.metrics_collector |
One JSON object per collector interval. |
summary.json |
loadtest.metrics_collector |
Final collector summary and reconciliation view. |
logs/http-hammer-request-trace.jsonl |
loadtest/http_hammer.go |
Per-request client spans and errors. |
logs/<broker-id>-request-trace.jsonl |
server.leaderless_log_broker |
Per-request broker spans and response summaries. |
logs/<broker-id>-batch-flush-trace.jsonl |
server.leaderless_log_writer |
Per-flush batch trace with waiter queue-wait stats, group shape, and nested writer stage totals. |
logs/http-hammer.log |
loadtest.orchestrate |
Hammer process log. |
logs/metrics-collector.log |
loadtest.orchestrate |
Collector process log. |
logs/<broker-id>.log |
loadtest.orchestrate |
Broker process log. |
logs/minio.log |
loadtest.orchestrate |
MinIO process log. |
logs/oxia-standalone.log |
loadtest.orchestrate |
Standalone Oxia process log. |
logs/oxia-coordinator.log |
loadtest.orchestrate |
Clustered Oxia coordinator log. |
logs/oxia-dataserver-<n>.log |
loadtest.orchestrate |
Clustered Oxia data server logs. |
logs/oxia-settle.log |
loadtest.orchestrate |
Oxia assignment settle progress. |
logs/precreate.log |
loadtest.orchestrate |
Partition pre-create subprocess output. |
logs/prometheus-target-sync-<reason>.log |
loadtest.orchestrate |
Host-loopback Prometheus target sync output, when available. |
Hammer sends X-Llog-Request-Id on every produce request. The same ID appears
in both request trace files.
Client request trace fields:
| Field | Meaning |
|---|---|
request_id |
Correlation key shared with the broker trace. |
broker_id |
Target broker selected by routing. |
started_at_ms |
Client-side request start time. |
status_code |
HTTP status, or 0 when the request failed before a response. |
latency_ms |
Total client-observed request latency. |
topic_partition_count |
Number of topic-partitions in the request group. |
record_count |
Records attempted by this request. |
payload_bytes |
Intended payload bytes. |
stages |
Client-side span timings in milliseconds. |
error_type, error |
Present when the request failed. |
Client trace stages:
| Stage | Meaning |
|---|---|
get_conn |
Time from httptrace.GetConn to GotConn. |
tcp_connect |
TCP connect time for new connections. Absent for reused connections. |
got_conn_to_wrote_headers |
Delay after connection acquisition before headers are written. |
request_write |
Time from headers written to full request written. |
request_write_to_first_byte |
Time from request write completion to first response byte. |
response_body_read |
Time to read the response body after headers arrived. |
round_trip |
Time inside the Go HTTP request. |
client_total |
Total measured by the hammer worker around the request call. |
Broker request trace fields:
| Field | Meaning |
|---|---|
request_id |
Correlation key shared with the client trace. |
broker_id |
Broker process that handled the request. |
method, path, status, result |
HTTP summary. |
started_at_ms |
Broker handler start time. |
duration_seconds |
Broker-observed request duration. |
stages |
Broker-side span timings in seconds. |
request |
Content length, topic-partition count, record count, and payload bytes. |
response |
Success and error counts when a produce response was built. |
error_type, error |
Present for validation or handler errors. |
Broker request stages:
| Stage | Meaning |
|---|---|
read_body |
Read the HTTP request body. |
json_decode |
Decode the JSON request body. |
parse_batches |
Validate and convert JSON into produce batches. |
initialize_partitions |
Run partition initialization for all batches. |
append_shared |
Enqueue and wait for the shared WAL append path. |
produce_total |
Total produce service time; this includes initialization and append time. |
response_encode |
Encode the JSON response. |
response_write |
Write HTTP status, headers, and response bytes. |
request_total |
Full broker handler duration. |
Batch append stages:
| Stage | Meaning |
|---|---|
enqueue_lock_wait |
Time waiting to enter the batcher condition lock. |
enqueue_lock_hold |
Time spent inside the batcher condition lock while enqueueing. |
future_wait |
Time waiting for enqueued requests to complete after leaving the lock. |
Partition initialization results:
| Result | Meaning |
|---|---|
cached |
The topic-partition was already initialized in this broker process. |
success |
The writer initialized the topic-partition during this request. |
error |
Initialization raised an exception. |
Broker and compactor services expose:
GET /metrics
GET /metrics/prometheus
/metrics is the full JSON process-local snapshot. It contains these main
sections:
| Section | Contents |
|---|---|
broker |
Broker identity, host, port, start time, and batch settings. |
http |
Produce request counts, status counts, in-flight requests, accepted records/bytes, request duration totals, and request-stage histograms. |
batching |
Batch enqueue counts, buffer depth, flush counts/durations, partition initialization counts/durations, queue wait, append-stage durations, backpressure, and shared WAL blob totals. |
oxia |
Offset assignment, index writes, control key operations, recoveries, client errors, operation counts, and operation duration totals. |
s3 |
Put counts, uploaded bytes, failures, operation counts, and operation duration totals. |
compactor |
Service identity, discovery state, worker counts, scheduler/claim state, claim counters, and discovery/attempt durations. |
compaction |
Compaction run counts, durations, lag, pending state, recovery counts, bytes, records, stage timings, candidate shape, and index key deletion counts. |
/metrics/prometheus is the scrape surface for Prometheus and Grafana Explore.
Gauges:
| Metric | Labels | Use |
|---|---|---|
leaderless_log_inflight_requests |
none | Current in-flight broker produce requests. |
leaderless_log_batch_waiters_current |
none | Current waiters buffered in the batcher. |
leaderless_log_batch_buffer_payload_bytes |
none | Current uncommitted batch payload bytes, including an in-flight sealed flush plus queued future batches. |
leaderless_log_batch_queued_payload_bytes_current |
none | Current queued payload bytes still waiting for a flush to start. |
leaderless_log_batch_inflight_flush_bytes_current |
none | Current payload bytes in the batch actively being flushed. |
leaderless_log_batch_pending_batches_current |
none | Current count of pending batches not yet handed to the underlying writer. |
leaderless_log_batch_sealed_batches_current |
none | Current count of sealed batches waiting for the flusher. |
leaderless_log_batch_oldest_pending_batch_age_seconds |
none | Age of the oldest batch still pending in the batcher. |
leaderless_log_compactor_known_targets |
none | Known compactor target count. |
leaderless_log_compactor_discovered_targets |
none | Discovered target count beyond static config. |
leaderless_log_compactor_inflight_partitions |
none | Topic-partitions currently held by compactor workers. |
leaderless_log_compactor_workers |
none | Configured compactor worker count. |
leaderless_log_compactor_consecutive_errors |
none | Current consecutive compactor attempt errors. |
leaderless_log_compactions_inflight |
none | Current compact_once runs. |
leaderless_log_compaction_lag_offsets |
topic, partition |
Per-partition lag, only when partition metrics are enabled. |
leaderless_log_compaction_avg_lag_offsets |
none | Average compaction lag across known partitions. |
leaderless_log_compaction_max_lag_offsets |
none | Maximum compaction lag across known partitions. |
leaderless_log_compaction_last_success_timestamp_seconds |
none | Unix timestamp of last successful or recovered compaction. |
leaderless_log_compaction_pending_current |
state |
Current pending compactions by state. |
leaderless_log_compaction_oldest_pending_age_seconds |
none | Age of the oldest pending compaction. |
Counters:
| Metric | Labels | Use |
|---|---|---|
leaderless_log_produce_requests_total |
result |
Accepted vs rejected produce requests. |
leaderless_log_batch_backpressure_rejections_total |
none | Append attempts rejected because total uncommitted payload bytes would exceed the batch buffer cap. |
leaderless_log_batch_flushes_total |
trigger, result |
Batch flush outcomes by trigger. |
leaderless_log_oxia_operations_total |
operation, result |
Oxia operation outcomes. |
leaderless_log_s3_operations_total |
operation, result |
S3 operation outcomes. |
leaderless_log_compactor_attempts_total |
result |
Compactor attempt outcomes. |
leaderless_log_compactor_discovery_scans_total |
none | Compactor discovery scans. |
leaderless_log_compactor_discovery_additions_total |
none | Targets added by discovery. |
leaderless_log_compactor_claim_attempts_total |
result |
Distributed claim outcomes. |
leaderless_log_compaction_runs_total |
result |
Compaction run outcomes. |
leaderless_log_compaction_recoveries_total |
kind |
Compaction recovery events. |
leaderless_log_compacted_records_total |
none | Records included in compacted blobs. |
leaderless_log_compaction_input_bytes_total |
none | Bytes read by compaction. |
leaderless_log_compaction_output_bytes_total |
none | Bytes written by compaction. |
leaderless_log_compaction_index_keys_deleted_total |
none | Index keys deleted by compaction cleanup. |
Histograms and summaries:
| Metric | Labels | Use |
|---|---|---|
leaderless_log_request_stage_duration_seconds |
stage, result |
Broker-side produce request stage timing. |
leaderless_log_batch_queue_wait_seconds |
none | Time waiting in the batcher before flush starts. |
leaderless_log_batch_seal_to_flush_start_seconds |
none | Time between sealing a batch and starting its flush. |
leaderless_log_batch_append_stage_duration_seconds |
stage |
Request-side batch append timing. |
leaderless_log_partition_initialize_duration_seconds |
result |
Partition initialization timing by result. |
leaderless_log_partition_initialize_lock_wait_seconds |
result |
Time waiting for the initialization underlying lock. |
leaderless_log_batch_flush_duration_seconds |
trigger |
Time inside one batch flush. |
leaderless_log_oxia_operation_duration_seconds |
operation, result |
Oxia operation latency. |
leaderless_log_s3_operation_duration_seconds |
operation, result |
S3 operation latency. |
leaderless_log_compactor_scheduler_pass_duration_seconds |
none | Compactor scheduler pass duration. |
leaderless_log_compactor_discovery_duration_seconds |
none | Compactor discovery scan duration. |
leaderless_log_compactor_attempt_duration_seconds |
result |
Compactor attempt duration. |
leaderless_log_compactor_claim_hold_duration_seconds |
none | Time a distributed compactor claim was held. |
leaderless_log_compaction_duration_seconds |
result |
End-to-end compaction duration. |
leaderless_log_compaction_stage_duration_seconds |
stage, result |
Per-stage compaction duration. |
leaderless_log_compaction_candidate_entries |
none | Index entries selected for a compaction candidate. |
leaderless_log_compaction_candidate_records |
none | Records selected for a compaction candidate. |
leaderless_log_compaction_candidate_span_offsets |
none | Offset span selected for a compaction candidate. |
leaderless_log_shared_wal_blob_bytes |
none | Summary of encoded shared WAL blob size. |
The source dashboard is:
grafana/leaderless-log-observability.json
On the host monitoring stack, the provisioned dashboard copy is:
/home/ops/services/monitoring/grafana/provisioning/dashboards/leaderless-log-observability.json
The dashboard includes a REQUEST PATH row with avg and p99 panels for the
request-stage, batch append-stage, partition initialization, and partition
initialization lock-wait histograms. Use Grafana Explore or Prometheus directly
for ad hoc label filtering beyond those panels.
Useful PromQL while investigating the request-path gap:
histogram_quantile(
0.99,
sum by (instance, le, stage, result) (
rate(leaderless_log_request_stage_duration_seconds_bucket{job=~"$job",instance=~"$instance"}[$rate_interval])
)
)
sum by (instance, stage) (
rate(leaderless_log_batch_append_stage_duration_seconds_sum{job=~"$job",instance=~"$instance"}[$rate_interval])
)
/
sum by (instance, stage) (
rate(leaderless_log_batch_append_stage_duration_seconds_count{job=~"$job",instance=~"$instance"}[$rate_interval])
)
sum by (instance, result) (
rate(leaderless_log_partition_initialize_lock_wait_seconds_sum{job=~"$job",instance=~"$instance"}[$rate_interval])
)
/
sum by (instance, result) (
rate(leaderless_log_partition_initialize_lock_wait_seconds_count{job=~"$job",instance=~"$instance"}[$rate_interval])
)
The request trace JSONL files are not in Grafana today unless a separate log shipping path is added.
- Open
summary.jsonand checkwarnings,artifact_presence,reconciliation,latest_totals, andcumulative_totals. - Open
client-summary.jsonand compare throughput, latency,connect_error_count,timeout_count, remaps, unroutable partitions, andclient_trace_stage_ms. - Open
metrics-30s.jsonland compare broker totals and interval deltas against the client intervals. - If broker HTTP latency is high, inspect
leaderless_log_request_stage_duration_secondsbystageandresult. - If
append_shareddominates, inspectleaderless_log_batch_append_stage_duration_seconds,leaderless_log_batch_queue_wait_seconds,leaderless_log_batch_seal_to_flush_start_seconds,leaderless_log_batch_flush_duration_seconds,leaderless_log_batch_pending_batches_current,leaderless_log_batch_sealed_batches_current,leaderless_log_batch_queued_payload_bytes_current,leaderless_log_batch_inflight_flush_bytes_current,leaderless_log_oxia_operation_duration_seconds, andleaderless_log_s3_operation_duration_seconds. Then openlogs/<broker-id>-batch-flush-trace.jsonlto correlate slow flushes with waiter queue-wait, group count, payload bytes, and nested writer stage totals. - If
initialize_partitionsdominates, inspectleaderless_log_partition_initialize_duration_secondsandleaderless_log_partition_initialize_lock_wait_seconds. - If client latency is higher than broker
request_total, joinlogs/http-hammer-request-trace.jsonlandlogs/<broker-id>-request-trace.jsonlbyrequest_idand compare clientrequest_write_to_first_byteagainst brokerrequest_total. - If errors or remaps spike, inspect
remap-events.jsonl,logs/http-hammer.log,logs/<broker-id>.log, and Oxia logs.