[WIP]: InferenceX agentic benchmark v0.3 #964
Bump the SemiAnalysis Weka loader's HF dataset target from the 042026 full-subagent corpus (657 MB / 739 traces) to the no-subagents variant (2.77 GB / 949 traces / 136k requests). Subagent entries are stripped: only top-level main-agent turns remain, so each row produces exactly one conversation downstream (no parent/child fan-out). Plugins.yaml, loader docstrings, and two fixture/constant references in the unit tests follow the rename. The registered tag and class are unchanged, so the inferencex-agentx-mvp scenario binding continues to resolve. Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Two concurrent aiperf processes that miss the cache on the same key
both pay the expensive tokenize+reconstruct cost today. The atomic
os.replace at the end of populate() means only one set of bytes ends
up canonical, but both sides do the work. On a shared filesystem cache
(Lustre / NFS) with N concurrent agentic jobs, that's N redundant
tokenizations.
Add a cross-process populate lock that serializes the miss path:
- lookup -> HIT -> use it (no lock, fast path)
- lookup -> MISS -> acquire flock -> re-lookup -> ...
-> HIT (someone populated while we waited) -> use it
-> MISS -> do the tokenize + populate -> release
Implementation mirrors huggingface_hub's WeakFileLock pattern:
- filelock.FileLock with mode=0o664 so multiple users sharing a cache
directory contend correctly
- SoftFileLock fallback on filesystems without flock support
- INFO log every 10s while waiting so a waiter is visible, not silent
- thread_local=False so release on a different asyncio worker thread
(acquire vs release end up on distinct asyncio.to_thread workers)
still actually drops the OS-level lock
- asyncio.to_thread for the blocking acquire so the event loop is not
blocked
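The miss-path flow above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the code in mmap_cache_lock: it swaps in the stdlib `fcntl.flock` (POSIX only) for `filelock.FileLock`, omits the SoftFileLock fallback, the 10s wait log, and the asyncio offload, and the names `populate_lock` and `get_or_populate` are hypothetical.

```python
import fcntl
import os
import tempfile
import threading
from contextlib import contextmanager


@contextmanager
def populate_lock(cache_dir: str, key: str):
    """Hold an exclusive advisory lock on <cache_dir>/<key>.lock (hypothetical layout)."""
    lock_path = os.path.join(cache_dir, f"{key}.lock")
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o664)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the current holder releases
        yield
    finally:
        os.close(fd)  # closing the descriptor also drops the flock


def get_or_populate(cache_dir: str, key: str, build) -> bytes:
    """Double-checked lookup: lock-free on a hit, serialized on a miss."""
    target = os.path.join(cache_dir, key)

    def lookup():
        try:
            with open(target, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    hit = lookup()
    if hit is not None:
        return hit  # fast path, no lock taken
    with populate_lock(cache_dir, key):
        hit = lookup()  # someone may have populated while we waited
        if hit is not None:
            return hit
        data = build()  # the expensive tokenize + reconstruct step
        fd, tmp = tempfile.mkstemp(dir=cache_dir)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, target)  # atomic publish, as in populate()
        return data
```

With several callers racing on the same key, only the first one through the lock runs `build`; the rest find the entry on the re-lookup.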
Code lives in a new mmap_cache_lock module to keep mmap_cache.py under
the 500-line file-size budget; mmap_cache re-exports acquire_cache_lock
with cache_dir pre-bound so callers see one entry point.
DatasetManager._do_profile_configure now wraps the miss path in the
lock, with the double-checked re-lookup factored into a small helper
to keep the function under the 80-line size budget.
Three new unit tests in test_mmap_cache.py cover: concurrent acquires
on the same key serialize; acquires on different keys run in parallel;
holder beyond timeout causes the waiter to raise filelock.Timeout.
filelock is added as an explicit project dependency. It was already a
transitive via huggingface-hub; declared here so a future HF drop
doesn't silently break the cache lock.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
The single 'seq isl_avg=... osl_avg=...' line hid the distribution
shape that's the main reason to watch sequence lengths mid-run
(spotting long-tail agentic prompts or response truncation). Replace
it with two percentile rows matching the latency / per-user-throughput
row format already used above:
isl p50=123,952 p75=245,124 p90=391,085 p99=720,485 (tokens)
osl p50=261 p75=664 p90=1,614 p99=7,013 (tokens)
Reads p90 off the existing JsonMetricResult schema (already populated
by the accumulator); no extra plumbing. Rows are skipped entirely when
both ISL/OSL metrics are absent so the renderer stays compact on
non-tokenizing endpoints.
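As a rough illustration of the row format, assuming the percentile values arrive as a plain mapping (the real renderer reads them off JsonMetricResult; `format_percentile_row` and `sequence_length_rows` are hypothetical names):

```python
from typing import List, Mapping, Optional

PERCENTILES = ("p50", "p75", "p90", "p99")


def format_percentile_row(
    label: str, stats: Optional[Mapping[str, float]], unit: str = "tokens"
) -> Optional[str]:
    """Render one 'label p50=... p99=... (unit)' row, or None if the metric is absent."""
    if not stats:
        return None
    cells = " ".join(f"{p}={stats[p]:,.0f}" for p in PERCENTILES if p in stats)
    return f"{label} {cells} ({unit})"


def sequence_length_rows(isl, osl) -> List[str]:
    """Return the rows to print; empty when neither metric is available."""
    rows = (format_percentile_row("isl", isl), format_percentile_row("osl", osl))
    return [row for row in rows if row is not None]
```

Returning an empty list when both metrics are missing is what keeps the block compact on non-tokenizing endpoints.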
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
The TQDMProgressUI + Textual dashboard already wire @on_warmup_progress to a tqdm bar, but those UIs are disabled under non-TTY execution (the CI/SLURM bench-driver case), so users have no visibility into warmup progress until the phase either completes or hangs.
Add a per-return INFO log inside AgenticReplayStrategy.handle_credit_return under CreditPhase.WARMUP that fires once per returning warmup credit (success or error), of the shape:
WARMUP 7/40 returned [ok] (lane=6, trace_id=abc123...)
This gives operators a line per completion in benchmark.log even when no UI is active. The lambda log form keeps the per-completion work off the hot path when log level filters out INFO.
Bump ergonomics file-size baseline: the addition pushes agentic_replay.py from 500 -> 510 lines.
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
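The same "skip the formatting when INFO is filtered" idea can be sketched with the stdlib logging module. This is an illustration only: aiperf's lambda log form is its own logger feature, `log_warmup_return` is a hypothetical helper, and the `[error]` status label is an assumption (the commit only shows `[ok]`).

```python
import logging
from typing import Optional

logger = logging.getLogger("warmup_sketch")


def log_warmup_return(
    returned: int, total: int, ok: bool, lane: int, trace_id: str
) -> Optional[str]:
    """Emit one line per returning warmup credit; build the string only if INFO is enabled."""
    if not logger.isEnabledFor(logging.INFO):
        return None  # nothing is formatted on the hot path
    status = "ok" if ok else "error"  # assumed label for the failure case
    line = f"WARMUP {returned}/{total} returned [{status}] (lane={lane}, trace_id={trace_id})"
    logger.info(line)
    return line
```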
The tin (prefill_throughput_per_user) and tout (output_token_throughput_per_user) rows in the realtime block are literally just 1/prefill_time and 1/ITL percentiled across requests, which is the same information conveyed more clearly as the "interactivity" metric (1/tpot) familiar from LLM serving literature. Drop the two per-user-throughput rows and emit a single intvty p50=9 p75=10 p95=15 p99=20 (1/tpot tok/s) row using the same backing metric (output_token_throughput_per_user = 1 / inter_token_latency per request). Aggregate tput_in/tput_out on line 1 are unchanged. Same numbers, clearer label, less noise. Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Surfaces the server's own view of input/output token throughput in the realtime srv row, computed as a running average from the vLLM prometheus counters (delta / elapsed):
vllm:prompt_tokens_total -> tput_in_srv=N/s
vllm:generation_tokens_total -> tput_out_srv=N/s
Useful contrast with the client-side aggregate tput_in/tput_out on line 1: server-side counts work in flight that hasn't returned a response yet, so it tracks the engine's actual ingest/emit rate rather than the rate of completions aiperf has parsed.
Suppressed entirely when the counters aren't exposed (SGLang, non-vLLM servers, or runs where /metrics isn't scraped) so the row stays clean.
Bump ergonomics baseline: file +5 lines (526->531), realtime_snapshot +10 lines (80->90); both fall under existing advisory caps that already accept similar-sized siblings.
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
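A minimal sketch of the delta / elapsed computation, assuming the baseline is the first scrape seen in the run (the commit does not say which baseline the accumulator uses) and that a missing counter is represented as None. `CounterRate` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterRate:
    """Running average rate of a monotonically increasing counter."""

    first_value: Optional[float] = None
    first_time: Optional[float] = None

    def update(self, value: Optional[float], now: float) -> Optional[float]:
        """Return tokens/s since the first scrape, or None when no rate is available."""
        if value is None:
            return None  # counter not exposed: caller suppresses the field
        if self.first_value is None or self.first_time is None:
            self.first_value, self.first_time = value, now
            return None  # need two samples before a rate exists
        elapsed = now - self.first_time
        if elapsed <= 0:
            return None
        return (value - self.first_value) / elapsed
```

One instance per counter (for example one for vllm:prompt_tokens_total and one for vllm:generation_tokens_total) would feed tput_in_srv and tput_out_srv respectively.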
…enarios
Context-overflow errors mid-trajectory are already handled by agentic_replay.handle_credit_return via the separate CreditReturn message path: the trajectory is terminated, the conversation is recycled, and a fresh trajectory spawned in its place. Emitting an error MetricRecordsMessage for the same event double-counts it -- the record shows up in failure tallies, dragging down the per-run success rate even though the overflow is an expected and intentionally-tolerated end-of-trajectory signal in this scenario.
When the active scenario's timing mode is AGENTIC_REPLAY, drop context-overflow records before they enter the metrics pipeline. The filter is gated on scenario timing mode (cached at __init__) and the RequestRecord.context_overflow flag the parser already sets per InferenceX AgentX RFC §7.
The check is a no-op for:
- non-scenario runs (user_config.scenario is None)
- non-agentic scenarios (any future timing_mode != agentic_replay)
- records that aren't context-overflow events (flag stays False)
The existing ContextOverflowCountMetric continues to work for diagnostic purposes outside agentic scenarios.
Signed-off-by: Cam Quilici <cameron@semianalysis.com>
When the active PROFILING-phase failure rate (error_records / total_records) exceeds the user-supplied threshold after a grace floor of max(concurrency, 10) records, broadcast ProfileCancelCommand on the message bus to terminate the run early. The existing cancel handlers in records_manager, timing_manager, server_metrics manager, and gpu_telemetry manager stop their work cleanly; the run finalizes with cancelled=True and exits non-zero via the standard cancel flow. The grace floor exists so a single early failure can't trip a tiny-N threshold (e.g., 1/1 = 100% > 0.5 would abort instantly without this guard). Pairs naturally with the AGENTIC_REPLAY context-overflow drop in record_processor_service: context overflows aren't counted as errors in agentic scenarios, so the threshold measures real failures only (server 5xx, parse errors, malformed responses) and won't trip on the expected end-of-trajectory overflow signal. Default is None (disabled, matching existing behavior). Valid range [0.0, 1.0]. Idempotent: the abort fires at most once per run; if the ProfileCancelCommand publish fails for any reason, the trigger is reset so the next record re-evaluates and re-attempts. Signed-off-by: Cam Quilici <cameron@semianalysis.com>
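The threshold check with its grace floor can be sketched as a pure function. This assumes "after a grace floor" means the check starts once total_records reaches max(concurrency, 10); `should_abort` is a hypothetical name, and the once-per-run latch and publish retry are left out.

```python
from typing import Optional


def should_abort(
    error_records: int,
    total_records: int,
    threshold: Optional[float],
    concurrency: int,
) -> bool:
    """True when the failure rate exceeds the threshold after the grace floor."""
    if threshold is None:
        return False  # feature disabled (the default)
    grace_floor = max(concurrency, 10)
    if total_records < grace_floor:
        return False  # too few records to trust the ratio
    return error_records / total_records > threshold
```

For example, `should_abort(1, 1, 0.5, 1)` is False because of the floor, while `should_abort(6, 10, 0.5, 4)` is True.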
…warmup log
Two coupled changes for AGENTIC_REPLAY scenarios:
1. Configurable trajectory start range. Previously each trajectory's
k_i (start turn) was sampled uniformly from [0, int(0.7 * n)] with
the 0.7 hardcoded in TrajectorySource. Now exposed as two CLI flags
on the LoadGenerator group:
--trajectory-start-min-ratio (default 0.0)
--trajectory-start-max-ratio (default 0.7, preserves prior behavior)
Sampling becomes uniform on [int(min_ratio * n), int(max_ratio * n)],
both clamped to n-2 so the trajectory always retains at least one
profiling turn after warmup. Validated cross-field so min <= max.
Plumbed: loadgen_config -> timing.config.TimingConfig -> phase_orchestrator
-> TrajectorySource constructor. RNG seed still derived from
--random-seed via SHA-256 salt with trace_id, so k_i remains
deterministic per (seed, trace_id).
2. Per-trajectory warmup completion log. The previous WARMUP info line
reported only lane + trace_id. It now also reports start_turn=k_i/N
and the percent of the trace that was warmed.
Per-request token count (ISL) is intentionally NOT included in this
commit -- it would require plumbing prompt_tokens through CreditReturn
(or subscribing AgenticReplayStrategy to MetricRecordsMessage for the
WARMUP phase). Leaving that as a follow-up.
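The start-turn sampling from item 1 might look roughly like this. It is a sketch under assumptions: the exact salt format, how many digest bytes seed the RNG, and the floor at 0 for very short traces are all guesses, and `sample_start_turn` is a hypothetical name rather than the TrajectorySource method.

```python
import hashlib
import random


def sample_start_turn(
    seed: int,
    trace_id: str,
    num_turns: int,
    min_ratio: float = 0.0,
    max_ratio: float = 0.7,
) -> int:
    """Pick k_i uniformly on [int(min_ratio*n), int(max_ratio*n)], clamped to n-2."""
    if min_ratio > max_ratio:
        raise ValueError("trajectory start min ratio must be <= max ratio")
    cap = max(num_turns - 2, 0)  # assumed floor at 0 for traces shorter than 2 turns
    k_min = min(int(min_ratio * num_turns), cap)
    k_max = min(int(max_ratio * num_turns), cap)
    # Derive a per-trace RNG from (seed, trace_id) so k_i is reproducible.
    digest = hashlib.sha256(f"{seed}:{trace_id}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.randint(k_min, k_max)
```

Because the RNG is rebuilt from the same (seed, trace_id) pair on every call, repeated runs pick the same start turn for a given trace.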
Signed-off-by: Cam Quilici <cameron@semianalysis.com>
Adds a one-block info log emitted by TrajectorySource at the end of
__init__, before any dispatch fires. Shows the configured start-range
plus the actual per-trajectory (lane, k_i, num_turns, pct) so the
operator can sanity-check that the configured
--trajectory-start-{min,max}-ratio produced a sensible distribution.
Example:
TrajectorySource: built 14 trajectories from 949 traces
range cfg=[0.25, 0.75] observed pct: min=27% median=51% max=72%
lane=00 start_turn= 6/24 (25%) trace_id=abc...
lane=01 start_turn= 15/22 (68%) trace_id=def...
...
Complements the existing per-trajectory warmup-completion lines. Logged
once at build time so the full distribution is visible upfront without
correlating across credit-return events.
Signed-off-by: Cam Quilici <cameron@semianalysis.com>
The previous "[RecordProcessor] Drop context-overflow records for
AGENTIC_REPLAY scenarios" commit returned early from
record_processor_service._on_inference_results before pushing the
MetricRecordsMessage, which broke the records-side <-> credit-side
counter invariant: RecordsTracker.total_records is compared for
equality against the credit-side final_requests_completed at
end-of-phase, and the drop made the records-side lag the credit-side
by one for every overflow event. The completion barrier never
converged, hanging the run for the full benchmark_grace_period before
timing out + cancelling in-flight credits.
Symptom in the log:
NOTICE All requests have completed, please wait for the results to be
processed (currently 423 of 424 records processed)...
... (30s timeout) ...
WARNING Phase profiling timed out, cancelling all credits.
Fix: don't drop the record. Add a context_overflow_skip flag to
MetricRecordMetadata. RecordProcessor sets it when the record is
context-overflow AND scenario is AGENTIC_REPLAY. RecordsManager
recognizes the flag and:
- Counts the record toward total_records (preserves the invariant)
- Classifies as success in RecordsTracker (so error counters stay at 0)
- Skips error_tracker.increment_error_count_for_phase
- Skips _send_record_to_accumulators (latency/throughput/etc. unaffected)
- Skips _maybe_trigger_failed_request_abort (overflow is not a real
failure for threshold purposes)
Net behavior matches the original intent ("nothing about the context
overload is counted towards metrics whatsoever") without breaking the
end-of-phase completion barrier. End-of-phase completion now matches
credit-side: total_records = success + 0 errors, where success includes
the overflow-skip records.
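The accounting rule can be illustrated with a toy counter. This is a simplified model, not RecordsManager: `PhaseCounts` and `handle_record` are hypothetical, and it assumes every record that is not skipped is forwarded to the accumulators.

```python
from dataclasses import dataclass


@dataclass
class PhaseCounts:
    success: int = 0
    errors: int = 0
    sent_to_accumulators: int = 0

    @property
    def total_records(self) -> int:
        # Compared against the credit-side completed count at end of phase.
        return self.success + self.errors


def handle_record(
    counts: PhaseCounts, is_error: bool, context_overflow_skip: bool = False
) -> None:
    """Count every record, but keep overflow-skip records out of errors and metrics."""
    if context_overflow_skip:
        counts.success += 1  # preserves the records-side total
        return  # no error count, no accumulators, no abort check
    if is_error:
        counts.errors += 1
    else:
        counts.success += 1
    counts.sent_to_accumulators += 1
```

The point of the fix is visible in `total_records`: an overflow still advances the total, so the completion barrier can converge, while the error count and accumulator feed are untouched.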
Signed-off-by: Cam Quilici <cameron@semianalysis.com>
…o cjq/weka-live-assistant-responses # Conflicts: # src/aiperf/dataset/loader/semianalysis_cc_traces_weka.py
Cancel path was incomplete: PhaseRunner.cancel() cancelled the credit-issuance _execution_task but never set all_credits_sent_event / all_credits_returned_event. The runner's outer _wait_for_sending_complete / _wait_for_returning_complete awaits keep blocking on the unset events until the phase timeout elapses (= --benchmark-duration, default 1800s for profiling). Empirical: a --failed-request-threshold-triggered ProfileCancelCommand at T=170s into a 1800s profiling phase causes the run to hang for the remaining ~1630s before _wait_for_sending_complete finally returns via its own timeout. The graceful "if self._was_cancelled: return" branch is reached, but only 27 min later. Set both events at the top of cancel() so the runner's awaits wake immediately. Mirrors the event-set order already present in the except-Exception recovery path (runner.py:363-373) — same correctness guarantee, just on the external-cancel path too. Signed-off-by: Cam Quilici <cjquilici@gmail.com>
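A stripped-down asyncio model of the hang and the fix, assuming the runner simply awaits two events in sequence. `PhaseRunnerSketch` is hypothetical and omits the lifecycle object, credit issuance task, and drain logic of the real runner.

```python
import asyncio


class PhaseRunnerSketch:
    def __init__(self) -> None:
        self.all_credits_sent_event = asyncio.Event()
        self.all_credits_returned_event = asyncio.Event()
        self._was_cancelled = False

    def cancel(self) -> None:
        self._was_cancelled = True
        # Without these two set() calls, run() stays parked until its timeout.
        self.all_credits_sent_event.set()
        self.all_credits_returned_event.set()

    async def run(self, timeout: float) -> str:
        await asyncio.wait_for(self.all_credits_sent_event.wait(), timeout)
        if self._was_cancelled:
            return "cancelled"
        await asyncio.wait_for(self.all_credits_returned_event.wait(), timeout)
        return "completed"


async def demo() -> str:
    runner = PhaseRunnerSketch()
    task = asyncio.create_task(runner.run(timeout=5.0))
    await asyncio.sleep(0)  # let run() reach its first await
    runner.cancel()
    return await task
```

Running `asyncio.run(demo())` returns as soon as `cancel()` is called instead of waiting out the 5 second timeout.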
Three log-ergonomics changes to make long agentic-replay runs in
captured-stdout contexts (CI, srt-slurm, scripted invocations)
readable:
1) EventLoopMonitor "Event loop ... taking too long to run. Overhead:
XXms" downgraded from warning to debug. At sustained conc>=32 on
agentic workloads this fires dozens of times per minute and is not
actionable (inherent async-scheduler overhead under load).
2) callback_handler "Credit return after phase {phase} complete,
credit_id=N, worker=W" downgraded from warning to debug. Every
cancel-triggered phase shutdown (e.g. --failed-request-threshold
trip) emits one such line per in-flight credit — up to
concurrency-many = thousands — flooding the log without being
actionable; the late return is expected under the cancel race.
3) service_config.validate_ui_type now picks UIType.TQDM (not NONE)
when stdout isn't a tty. tqdm's progress bars still render
usefully in tailed logs (carriage returns are preserved by
typical pipe sinks), so users running aiperf from srt-slurm /
gha runners get a visible progress indicator. Explicit
--ui-type none still opts back into the previous silence.
All three changes are config/severity adjustments only; no behavioral
changes to phase orchestration, credit accounting, or UI rendering
beyond the visibility surface.
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
8aad400 introduced an AttributeError on every aiperf invocation in non-TTY contexts (CI, srt-slurm benchmarks). UIType is an extensible enum with members SIMPLE/DASHBOARD/NONE; "TQDM" was never registered. SIMPLE is the tqdm-backed UI per the field docstring on `ui_type`. Repro: any `aiperf profile ...` call where stdout is captured (e.g. redirected to a file) crashes immediately in ServiceConfig.validate_ui_type with `AttributeError: 'UIType' has no attribute 'TQDM'`. Surfaced via InferenceX R26 1p6d shards where the agentic benchmark wrapper saw aiperf exit 1 within seconds of invocation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This reverts commit a6812b0.
This reverts commit 8aad400.
TrajectorySource._log_trajectory_summary previously reported each lane's
start position only as turn index + percentage. With agentic-replay on
weka cc-traces, the useful question is "how many tokens of context does
warmup start at?" — and the answer was invisible until you correlated
trace_ids back to the raw dataset.
Threads input_length (proxy-tokenizer "in" field) end-to-end:
WekaNormalRequest.input_length (already in scope at construction)
-> Turn.input_length (new optional field)
-> TurnMetadata.input_length (new optional field; propagated by
Turn.metadata() and
Conversation.metadata())
-> TrajectorySource summary log
New log shape:
TrajectorySource: built 192 trajectories from 949 traces
range cfg=[0.25, 0.75]
observed pct: min= 0% median= 46% max= 75%
observed tokens: min= 0 median= 58,431 max=187,294
lane=00 start_turn= 15/27 ( 56%) start_tokens= 42,580 trace_id=...
Backward-compatible: input_length is Optional everywhere, defaults to
None. Loaders that don't populate it (synthetic, raw_payload, sharegpt,
etc.) keep working unchanged and the log shows "-" for those lanes.
Only weka_trace's normal-turn path sets it for now.
Split _log_trajectory_summary into _build_trajectory_rows +
_format_observed_stats + _median helper to stay under the
80-line function size cap.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Swaps the HF dataset slug from cc-traces-weka-no-subagents-051226
(949 traces) to cc-traces-weka-no-subagents-051826 (98 traces).
051826 is a stricter filter of the same source:
- v5-only (drops legacy trace_version=4 rows)
- CC ≥ 2.1.139 (drops rows from older CLI versions whose tool-use
semantics differ)
- ≥20 main-agent turns per trace post-strip
- subagent blocks stripped (same as 051226)
InferenceX R29 surfaced a delta-encoding edge case where two of the
949 traces (turns 555-918 and 4752-4753) produced empty delta_messages,
triggering 99.5% of the 366 HTTP-400 validation rejections. The new
051826 corpus should not contain those two traces.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…mary log" This reverts commit 61a9ed8.
The preemptions counter rarely tells you anything actionable in the live log — it's either 0 (boring) or steadily climbing (which the kv_usage + queue=r/w fields already telegraph more usefully). Removing it tightens the per-tick server-side row to the metrics that actually inform intervention decisions: cache hit rates, KV usage, queue depth, and server-side token throughput. The accumulator still scrapes vllm:num_preemptions and sglang:num_retracted_reqs and exposes num_preemptions on the snapshot dict, so downstream consumers (export jsonl, future analysis) keep working unchanged. Just the log surface is trimmed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Try out this PR
Quick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@0f04eb61753858f93f8aef8a5b9f1ae6341b57d5
Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@0f04eb61753858f93f8aef8a5b9f1ae6341b57d5
Last updated for commit: |
        "Broadcasting ProfileCancelCommand to terminate the run."
    )
    try:
        await self.publish(ProfileCancelCommand(service_id=self.service_id))
Publishing ProfileCancelCommand with this service's own service_id means RecordsManager ignores the broadcast in CommandHandlerMixin and never runs _on_profile_cancel_command, so threshold aborts cancel other services without marking records cancelled or producing the intended non-zero partial result. Fix: invoke the local cancel/finalization path explicitly, or send the cancel through a controller/command sender that targets RecordsManager instead of self.
    """
    self._was_cancelled = True
    self._lifecycle.cancel()
    self._progress.all_credits_sent_event.set()
Forcing the phase events during cancel makes run() immediately take the cancelled fast path and freeze completed counts before CancelCredits returns drain, so in-flight cancelled requests can be dropped from final credit stats. Fix: unblock sending but let the normal return-drain path wait for returns or the existing drain timeout before freezing final completed counts.
 Usage: --public-dataset semianalysis_cc_traces_weka
 metadata:
-hf_dataset_name: semianalysisai/cc-traces-weka-042026
+hf_dataset_name: semianalysisai/cc-traces-weka-no-subagents-051826
The existing semianalysis_cc_traces_weka public dataset key now points to the no-subagents 051826 corpus, silently changing an existing dataset alias from the full 042026 traces to a filtered 98-trace subset. Fix: keep semianalysis_cc_traces_weka mapped to semianalysisai/cc-traces-weka-042026 and use only semianalysis_cc_traces_weka_no_subagents or a new key for the filtered corpus.
-Trajectory(
-    conversation_id=source.conversation_id, start_turn_index=k_i
-)
+Trajectory(conversation_id=source.conversation_id, start_turn_index=k_i)
Wrap-filled trajectories still sample start turns from the hardcoded 0..70% range, so when concurrency exceeds the trace pool the extra lanes ignore --trajectory-start-min-ratio and --trajectory-start-max-ratio. Fix: compute k_min and k_max from self._start_min_ratio and self._start_max_ratio in _wrap_fill_lanes just like _build_trajectories.
    # the run would hang. Classify as success so error counters stay at
    # zero (the original "don't count as failure" intent) while keeping
    # the invariant intact.
    if getattr(record_data.metadata, "context_overflow_skip", False):
Returning before _send_record_to_accumulators drops AGENTIC_REPLAY context-overflow responses from context_overflow_count and the total response metrics used by submission_valid, so runs with >1% overflows can be exported as valid. Fix: keep these records out of failed-request/error metrics while still feeding a context-overflow counter and total-response path for scenario submission metadata.
Rebases the contents of #886 onto ajc/inferencex-agentx-mvp (head of #875) instead of main.
Recreated after #886 was auto-closed when the source branch was renamed from cjq/weka-live-assistant-responses → cjq/agentx-v0.3.
🤖 Generated with Claude Code