Skip to content

Latest commit

 

History

History
197 lines (157 loc) · 14.9 KB

File metadata and controls

197 lines (157 loc) · 14.9 KB

Loadtest

This directory is reserved for the local-process load-test harness described in ../plans/in-progress/plan.md.

Expected layout:

  • HOWRUN.md: loadtest-only operator commands
  • quick_observability.py: one-command local Prometheus and Grafana wrapper for quick broker reviews
  • orchestrate.py: local stack entrypoint for up, status, and down
  • ports.py: deterministic local TCP port reservation helpers
  • processes.py: reusable subprocess lifecycle helpers with per-process logs
  • routing.py: deterministic owner-set routing and routing-manifest.json generation
  • http_hammer.go: stdlib-only HTTP producer load generator
  • run-configs/: committed scenario configs
  • RUN_NOTES.md: human summaries of important runs, hypotheses, and next metrics
  • results/: generated per-run artifacts
  • precreate.py: idempotent topic-partition pre-create tool
  • metrics_collector.py: broker/process/host sampling into metrics-30s.jsonl
  • Python orchestration and metrics code
  • Go HTTP load generator code

Run artifact convention:

  • each run should write to loadtest/results/<utc-timestamp>-<scenario-name>/

Current deterministic topic layout rules:

  • topic names are derived from the scenario name as <scenario_name>-topic-<zero-padded-topic-index>
  • if partitions_per_topic is set, every topic gets that many partitions
  • if only total_partition_count is set, partitions are spread evenly and the remainder is assigned to the lowest topic indexes

The pre-create step writes partition-manifest.json into the run artifact directory with:

  • attempted, created, and already-existing counts
  • per-topic partition counts
  • the full flat topic-partition list and per-entry status

Current storage guidance:

  • harness-managed MinIO is now intended for smoke tests and low-scale local runs, not for authoritative storage performance measurements
  • when LLOG_LOADTEST_MINIO_DATA_DIR_ROOT is unset, orchestrate.py will prefer /dev/shm/llog-tests for small scenarios whose expected produced bytes fit in current tmpfs free space with 1.5x headroom; otherwise it falls back to the run directory on disk
  • set LLOG_LOADTEST_MINIO_DATA_DIR_ROOT=/some/path to pin a specific local MinIO data root and bypass that auto-selection
  • use a real external S3/object-store target for serious performance runs; on this host, local MinIO backed by / can hit filesystem or writeback stalls that are not a reliable proxy for object-store behavior
  • set LLOG_LOADTEST_EXTERNAL_S3=1 plus LLOG_S3_BUCKET to make the harness skip managed MinIO and point brokers/compactors at the external object store; LLOG_S3_ENDPOINT_URL is optional for non-AWS S3-compatible services

The routing manifest writes routing-manifest.json into the run artifact directory with:

  • the scenario name and routing seed
  • the canonical broker id list for the run
  • the effective owner-set size
  • one deterministic owner set per topic-partition

Current orchestrator state:

  • orchestrate.py up can now:
    • create the run directory
    • generate broker ids and full stack port assignments
    • write Oxia cluster/config files under the run artifact directory
    • write routing-manifest.json
    • start MinIO, clustered or standalone Oxia, an optional compactor service, broker processes, and optional read-only consumer processes
    • wait for MinIO health, create the bucket, wait for Oxia assignments, and wait for broker and consumer /health
    • run partition pre-create automatically when precreate_partitions is enabled
  • orchestrate.py run now:
    • starts the full stack
    • starts metrics_collector.py
    • runs http_hammer.go
    • writes a final collector snapshot and summary
    • runs the generic host-loopback Prometheus sync immediately after broker startup when the sync script is present
    • tears the stack down unless keep_stack_running is enabled
  • orchestrate.py precreate re-runs the idempotent partition bootstrap against an existing run directory
  • orchestrate.py status reads stack.json and processes.json and reports live process status
  • orchestrate.py down stops tracked process groups from processes.json
  • orchestrate.py run and orchestrate.py down re-run the same Prometheus sync after teardown so short-lived runs do not leave stale file_sd targets or bridge proxies behind
  • ports.py reserves local TCP ports so future up/run commands do not guess
  • processes.py provides the process-group-aware lifecycle primitives the real stack runner will use
  • the orchestrator writes stack.json, processes.json, brokers.json, ports.json, environment.json, git.json, and routing-manifest.json
  • brokers started by orchestrate.py also export static loadtest scenario metrics so Grafana can show the active run shape live
  • set compactor_enabled=true plus compactor_count / compactor_workers in a scenario to have orchestrate.py start one or more managed server.leaderless_log_compactor_server processes and record them in stack.json, ports.json, and processes.json
  • set consumer_enabled=true or consumer_count>0 in a scenario to have orchestrate.py start one or more managed server.leaderless_log_broker --role read processes and record them in stack.json, ports.json, and processes.json

Current quick observability state:

  • quick_run.sh now delegates to quick_observability.py
  • the quick flow starts Prometheus and Grafana from loadtest/observability/docker-compose.yml
  • Grafana is provisioned with the repo dashboard and a Prometheus datasource with uid prometheus
  • the quick wrapper writes the active broker ports into loadtest/observability/runtime/prometheus-targets.json so Prometheus scrapes the host brokers through host.docker.internal
  • on Linux hosts where the brokers bind loopback-only, the quick wrapper also starts temporary host-network socat bridge containers so Prometheus can reach those broker ports through the Docker host gateway
  • the quick flow leaves the broker stack and the observability containers running after the hammer exits so the dashboard can be inspected immediately
  • ./quick_run.sh down stops the tracked run and tears the observability containers down

Current collector state:

  • metrics_collector.py can scrape every broker GET /metrics
  • it samples tracked local processes with ps
  • it writes metrics-30s.jsonl and summary.json
  • it tolerates missing broker/process/host samples by recording warnings instead of aborting the run
  • summary.json now includes artifact presence, request-path trace presence, batch-flush trace presence, request-path metric rollups, and client-vs-broker reconciliation warnings
  • broker S3 billing estimates are rolled into the sampled totals, including summed request-cost estimates plus max bucket-footprint and monthly-storage-cost gauges
  • Linux host CPU, memory, disk-write, and network totals are sampled directly from /proc
  • Darwin host CPU, memory, and network totals are sampled, but disk-write bytes/sec is still unavailable without extra tooling

Current S3 summary fields include:

  • broker_s3_estimated_request_cost_usd
  • bucket_stored_bytes
  • bucket_stored_gb
  • bucket_object_count
  • estimated_monthly_storage_cost_usd

Aggregation rule:

  • sum request-cost totals across brokers
  • take max for shared bucket-footprint and storage-cost gauges

Current hammer state:

  • http_hammer.go reads resolved-config.json, brokers.json, and routing-manifest.json from one run directory
  • it uses persistent HTTP connections through the Go stdlib client
  • it emits deterministic multi-topic-partition POST /produce traffic
  • it supports single_hot_partition, uniform_spread, and zipf_skew
  • it supports round_robin, least_inflight, and explicit unrestricted_fanout
  • it now paces logical request dispatch from target_mib_per_sec_per_broker instead of tying pacing to synchronous worker loops
  • http_clients_per_broker now acts as the per-broker in-flight request cap for the hammer
  • it writes client-30s.jsonl, client-summary.json, remap-events.jsonl, and logs/http-hammer-request-trace.jsonl
  • the client artifacts now expose configured request bytes, pacing lag, in-flight request counts, request-path trace stages, and achieved-vs-target throughput per broker
  • client-summary.json now reports the configured steady-state window by default and keeps a full_window section with the warmup-inclusive traffic totals for raw debugging
  • broker-side load-test runs now also emit logs/<broker-id>-batch-flush-trace.jsonl with one record per batch flush, including trigger, waiter queue-wait summary, group counts, payload bytes, and nested writer stage totals

Current consume hammer state:

  • consume_hammer/main.go reads resolved-config.json, stack.json, and routing-manifest.json from one run directory
  • when stack.json lists dedicated consumers, the consume hammer targets those read processes; otherwise it falls back to broker /consume endpoints
  • consume scenarios share LLOG_TAIL_CACHE_DIR=<run-dir>/tail-cache across writer brokers and read-only consumers so hot-tail reads can stay on the shared file-backed cache before falling back to Oxia/S3
  • scenarios with broker_locality_mode=paired_broker instead use per-broker cache groups under <run-dir>/tail-cache/<broker-id>; consumer-N is paired with broker-N, the producer hammer samples partitions from that broker's local owner-set pool, and the consume hammer reads from the same paired pool to mimic node-local cache locality
  • before timed traffic starts, the consume hammer bootstraps each active partition to the current tail and then keeps the next fetch offset only in local process memory
  • the consume hammer allows only one in-flight request per topic-partition, so a partition cannot be read concurrently from overlapping local offsets and measured consume throughput cannot outrun current production just by duplicating reads
  • consume-client-summary.json now also reports the steady-state window by default and preserves a full_window section for warmup-inclusive debugging

The complete load-test metrics, logs, traces, and debugging flow are documented in ../PERF_DEBUGGING.md.

Concurrency sizing rule of thumb:

  • first compute request payload bytes as record_size_bytes * topic_partitions_per_request * records_per_topic_partition_per_request
  • then compute target requests/sec per broker as (target_mib_per_sec_per_broker * 1048576) / request_payload_bytes
  • then compute the approximate in-flight requests needed per broker as target_requests_per_sec_per_broker * p50_latency_seconds
  • if client-summary.json or client-30s.jsonl shows inflight_requests.per_broker.<broker>.peak pinned near http_clients_per_broker while pacing_lag_ms keeps growing, the hammer is still concurrency-capped and http_clients_per_broker is too low for that scenario
  • if achieved throughput is still low after the in-flight cap is comfortably above that estimate, the next bottleneck is more likely in the broker, Oxia, MinIO, routing churn, or another downstream path

Example for single-broker-32mib:

  • producer request payload = 1024 * 4 * 32 = 131072 bytes = 128 KiB
  • target producer rate = (32 MiB/s * 1048576) / 131072 = 256 requests/sec
  • at about 590 ms p50 latency, the hammer needs roughly 256 * 0.590 = 151 in-flight requests, and p95 latency needs more than 200
  • the committed scenario uses 512 producer clients to leave headroom above that estimate
  • the consume hammer requests up to 1 MiB per selected partition; at the tail of a 32-partition, 32 MiB/s run, each partition fills at about 1 MiB/s, so the committed scenario uses consume_max_wait_ms=1000 to avoid forcing partial tail reads
  • consume scenarios also provision the file-backed tail cache for about 64 seconds of aggregate producer throughput so the separate read processes can stay on the hot tail without repeated Oxia/S3 fallback; override with LLOG_TAIL_CACHE_MAX_BYTES when testing on a disk-constrained host

Committed run configs:

  • clustered-2b-2c-10m.json
  • committed scenarios now keep rough total topic-partitions at or below 32 * broker_count
  • sanity-clustered.json
  • standalone-control.json
  • single-broker-32mib.json
  • public-s3-demo-5m.json
  • public-s3-demo-5m-scaleout.json
  • scaleout-3b-3c-3comp-100mib-locality.json
  • scaleout-4-brokers-128mib.json
  • density-16-brokers-512mib.json
  • fanout-negative-control.json

Current runtime assumptions:

  • Oxia startup needs either a cached binary at .cache/loadtest/bin/oxia, an oxia binary on PATH, LLOG_LOADTEST_OXIA_BIN=/abs/path/to/oxia, or a sibling oxia-server/ checkout plus go
  • MinIO startup needs either minio on PATH, docker on PATH, or LLOG_LOADTEST_MINIO_BIN=/abs/path/to/minio
  • low-scale local runs will default MinIO data into /dev/shm/llog-tests when the estimated payload footprint fits; larger runs stay on disk unless LLOG_LOADTEST_MINIO_DATA_DIR_ROOT explicitly overrides the location
  • broker and bucket setup use LLOG_LOADTEST_PYTHON when set and otherwise prefer .venv-loadtest/bin/python before falling back to the current interpreter
  • use oxia>=0.2.0; the current repo baseline is oxia==0.2.1 because older 0.1.x clients lacked the session keepalive/recreation behavior the compaction path relies on
  • external S3 runs should set LLOG_LOADTEST_EXTERNAL_S3=1; the harness will validate LLOG_S3_BUCKET access before starting brokers and will keep cleanup scoped to LLOG_ROOT_PREFIX
  • process metrics collection assumes ps is available on the host
  • host Prometheus bridge sync is best-effort and can be overridden or disabled with LLOG_LOADTEST_PROMETHEUS_SYNC_SCRIPT

EC2-oriented workflow:

  • python3 -m loadtest.ec2 bootstrap creates .venv-loadtest, installs requirements.txt, and builds cached Go binaries into .cache/loadtest/bin/ when go is available
  • python3 -m loadtest.ec2 preflight --scenario ... --bucket ... validates the repo Python runtime, launcher resolution, S3 bucket access, and optional observability prerequisites
  • python3 -m loadtest.ec2 smoke --bucket ... runs a short external-S3 smoke scenario and cleans its prefix up by default unless --keep-artifacts is set
  • python3 -m loadtest.ec2 run --scenario ... --bucket ... runs the harness in external-S3 mode without manual export steps when the needed values are already present in .env, the shell environment, or the CLI flags
  • python3 -m loadtest.ec2 report latest prints a concise JSON summary from the latest run directory
  • python3 -m loadtest.ec2 bundle latest creates one tar.gz artifact bundle from the run directory while skipping transient Oxia and MinIO data trees
  • python3 -m loadtest.ec2 cleanup latest cleans only the recorded run prefix and now relies on the current boto3 credential chain instead of stored AWS secrets from the original run

Still pending:

  • real end-to-end validation on a machine with go, oxia, and minio or docker
  • public external-S3 demo checklist and exact commands are tracked in PUBLIC_S3_DEMO.md