[WIP]: InferenceX agentic benchmark v0.3 by cquil11 · Pull Request #964 · ai-dynamo/aiperf

cquil11 · 2026-05-19T22:46:01Z

Rebases the contents of #886 onto ajc/inferencex-agentx-mvp (head of #875) instead of main.

Recreated after #886 was auto-closed when the source branch was renamed from cjq/weka-live-assistant-responses to cjq/agentx-v0.3.

🤖 Generated with Claude Code

Bump the SemiAnalysis Weka loader's HF dataset target from the 042026 full-subagent corpus (657 MB / 739 traces) to the no-subagents variant (2.77 GB / 949 traces / 136k requests). Subagent entries are stripped: only top-level main-agent turns remain, so each row produces exactly one conversation downstream (no parent/child fan-out).

Plugins.yaml, loader docstrings, and two fixture/constant references in the unit tests follow the rename. The registered tag and class are unchanged, so the inferencex-agentx-mvp scenario binding continues to resolve.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>

Two concurrent aiperf processes that miss the cache on the same key both pay the expensive tokenize+reconstruct cost today. The atomic os.replace at the end of populate() means only one set of bytes ends up canonical, but both sides do the work. On a shared filesystem cache (Lustre / NFS) with N concurrent agentic jobs, that's N redundant tokenizations.

Add a cross-process populate lock that serializes the miss path:

- lookup -> HIT -> use it (no lock, fast path)
- lookup -> MISS -> acquire flock -> re-lookup -> ...
    -> HIT (someone populated while we waited) -> use it
    -> MISS -> do the tokenize + populate -> release

Implementation mirrors huggingface_hub's WeakFileLock pattern:

- filelock.FileLock with mode=0o664 so multiple users sharing a cache directory contend correctly
- SoftFileLock fallback on filesystems without flock support
- INFO log every 10s while waiting so a waiter is visible, not silent
- thread_local=False so release on a different asyncio worker thread (acquire vs release end up on distinct asyncio.to_thread workers) still actually drops the OS-level lock
- asyncio.to_thread for the blocking acquire so the event loop is not blocked

Code lives in a new mmap_cache_lock module to keep mmap_cache.py under the 500-line file-size budget; mmap_cache re-exports acquire_cache_lock with cache_dir pre-bound so callers see one entry point. DatasetManager._do_profile_configure now wraps the miss path in the lock, with the double-checked re-lookup factored into a small helper to keep the function under the 80-line size budget.

Three new unit tests in test_mmap_cache.py cover:

- concurrent acquires on the same key serialize
- acquires on different keys run in parallel
- holder beyond timeout causes the waiter to raise filelock.Timeout

filelock is added as an explicit project dependency. It was already a transitive dependency via huggingface-hub; declared here so a future HF drop doesn't silently break the cache lock.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
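The miss-path flow described in this commit (fast-path lookup, lock, re-lookup, populate, atomic publish) can be sketched roughly as follows. This is a minimal illustration and not the PR's code: it swaps in the standard library's fcntl.flock (Unix only) for filelock.FileLock so it runs without extra dependencies, and the names get_or_populate, demo, and the .bin/.lock/.tmp file layout are hypothetical.

```python
import fcntl  # Unix-only stand-in for filelock.FileLock (assumption for this sketch)
import os
import tempfile
import threading
import time
from pathlib import Path
from typing import Callable


def get_or_populate(cache_dir: Path, key: str, build: Callable[[], bytes]) -> bytes:
    """Return cached bytes for `key`, running `build` at most once per key."""
    entry = cache_dir / f"{key}.bin"  # hypothetical on-disk layout
    if entry.exists():  # fast path: HIT, no lock taken
        return entry.read_bytes()

    with open(cache_dir / f"{key}.lock", "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the holder releases
        try:
            if entry.exists():  # re-lookup: someone populated while we waited
                return entry.read_bytes()
            data = build()  # stands in for the expensive tokenize + reconstruct
            tmp = cache_dir / f"{key}.tmp"
            tmp.write_bytes(data)
            os.replace(tmp, entry)  # atomic publish of the canonical bytes
            return data
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def demo(workers: int = 4) -> tuple[int, set[bytes]]:
    """Race `workers` threads on one key; return (build_calls, distinct_results)."""
    calls: list[int] = []
    results: list[bytes] = []

    def build() -> bytes:
        calls.append(1)
        time.sleep(0.05)  # simulate slow tokenization
        return b"tokenized"

    with tempfile.TemporaryDirectory() as d:
        threads = [
            threading.Thread(
                target=lambda: results.append(get_or_populate(Path(d), "k", build))
            )
            for _ in range(workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return len(calls), set(results)
```

Each caller opens the lock file separately, so the waiters contend on the lock even inside one process; only the first holder runs build, and the rest hit the cache on the re-lookup.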

The single 'seq isl_avg=... osl_avg=...' line hid the distribution shape that's the main reason to watch sequence lengths mid-run (spotting long-tail agentic prompts or response truncation). Replace it with two percentile rows matching the latency / per-user-throughput row format already used above:

    isl p50=123,952 p75=245,124 p90=391,085 p99=720,485 (tokens)
    osl p50=261 p75=664 p90=1,614 p99=7,013 (tokens)

Reads p90 off the existing JsonMetricResult schema (already populated by the accumulator); no extra plumbing. Rows are skipped entirely when both ISL/OSL metrics are absent so the renderer stays compact on non-tokenizing endpoints.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
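A row of that shape could be produced by something like the sketch below. It assumes the percentiles arrive as a plain mapping keyed by "p50", "p75", "p90", "p99"; the function names are hypothetical and the real renderer reads JsonMetricResult rather than a dict. Here each row is skipped independently when its metric is missing, which also covers the case where both are absent.

```python
from typing import Mapping, Optional

PERCENTILES = ("p50", "p75", "p90", "p99")


def format_percentile_row(
    label: str, stats: Optional[Mapping[str, float]], unit: str = "tokens"
) -> Optional[str]:
    """Format one 'label p50=... p75=... p90=... p99=... (unit)' row, or None if empty."""
    if not stats:
        return None
    # Thousands separators, no decimals, matching the rows shown in the commit message.
    parts = [f"{p}={stats[p]:,.0f}" for p in PERCENTILES if stats.get(p) is not None]
    if not parts:
        return None
    return f"{label} {' '.join(parts)} ({unit})"


def render_sequence_rows(
    isl: Optional[Mapping[str, float]], osl: Optional[Mapping[str, float]]
) -> list[str]:
    """Return the ISL/OSL rows, omitting any row whose metric is absent."""
    rows = [format_percentile_row("isl", isl), format_percentile_row("osl", osl)]
    return [row for row in rows if row is not None]
```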

The TQDMProgressUI + Textual dashboard already wire @on_warmup_progress to a tqdm bar, but those UIs are disabled under non-TTY execution (the CI/SLURM bench-driver case), so users have no visibility into warmup progress until the phase either completes or hangs.

Add a per-return INFO log inside AgenticReplayStrategy.handle_credit_return under CreditPhase.WARMUP that fires once per returning warmup credit (success or error), of the shape:

    WARMUP 7/40 returned [ok] (lane=6, trace_id=abc123...)

This gives operators a line per completion in benchmark.log even when no UI is active. The lambda log form keeps the per-completion work off the hot path when log level filters out INFO.

Bump ergonomics file-size baseline: the addition pushes agentic_replay.py from 500 -> 510 lines.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
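The idea behind the lazy log form can be illustrated with the standard logging module: the message is only built when INFO is enabled. This is a sketch under assumptions, since aiperf's own logger accepts a lambda directly and the stdlib logger does not; lazy_info and warmup_line are hypothetical helpers, and the "[error]" label for failed returns is a guess (the commit message only shows "[ok]").

```python
import logging
from typing import Callable


def warmup_line(returned: int, total: int, ok: bool, lane: int, trace_id: str) -> str:
    """Build a line shaped like the WARMUP example in the commit message."""
    status = "ok" if ok else "error"  # "error" label is an assumption
    return f"WARMUP {returned}/{total} returned [{status}] (lane={lane}, trace_id={trace_id})"


def lazy_info(logger: logging.Logger, make_message: Callable[[], str]) -> None:
    """Call make_message only if INFO would actually be emitted."""
    if logger.isEnabledFor(logging.INFO):
        logger.info(make_message())


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("agentic_replay_sketch")
    lazy_info(log, lambda: warmup_line(7, 40, True, 6, "abc123"))
```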

The tin (prefill_throughput_per_user) and tout (output_token_throughput_per_user) rows in the realtime block are literally just 1/prefill_time and 1/ITL percentiled across requests, which is the same information conveyed more clearly as the "interactivity" metric (1/tpot) familiar from LLM serving literature.

Drop the two per-user-throughput rows and emit a single

    intvty p50=9 p75=10 p95=15 p99=20 (1/tpot tok/s)

row using the same backing metric (output_token_throughput_per_user = 1 / inter_token_latency per request). Aggregate tput_in/tput_out on line 1 are unchanged.

Same numbers, clearer label, less noise.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
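As a worked illustration of the metric: a request with an inter-token latency of 100 ms has an interactivity of 10 tok/s. The helper below is a hypothetical one-liner that assumes latency is given in seconds; it is not aiperf's metric class.

```python
def interactivity(inter_token_latency_s: float) -> float:
    """Per-request interactivity in tokens/s: the reciprocal of inter-token latency."""
    if inter_token_latency_s <= 0:
        raise ValueError("inter-token latency must be positive")
    return 1.0 / inter_token_latency_s
```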

Surfaces the server's own view of input/output token throughput in the realtime srv row, computed as a running average from the vLLM prometheus counters (delta / elapsed):

    vllm:prompt_tokens_total     -> tput_in_srv=N/s
    vllm:generation_tokens_total -> tput_out_srv=N/s

Useful contrast with the client-side aggregate tput_in/tput_out on line 1: server-side counts work in flight that hasn't returned a response yet, so it tracks the engine's actual ingest/emit rate rather than the rate of completions aiperf has parsed.

Suppressed entirely when the counters aren't exposed (SGLang, non-vLLM servers, or runs where /metrics isn't scraped) so the row stays clean.

Bump ergonomics baseline: file +5 lines (526->531), realtime_snapshot +10 lines (80->90); both fall under existing advisory caps that already accept similar-sized siblings.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
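The delta / elapsed computation might look like the sketch below. It assumes the running average is measured from the first scrape that exposed the counter; the commit message does not say whether the baseline is the first or the previous sample, and CounterRate is a hypothetical name.

```python
from typing import Optional, Tuple


class CounterRate:
    """Running average rate of a monotonically increasing counter (delta / elapsed)."""

    def __init__(self) -> None:
        self._first: Optional[Tuple[float, float]] = None  # (timestamp_s, counter value)

    def update(self, timestamp_s: float, value: Optional[float]) -> Optional[float]:
        """Return tokens/s since the first sample, or None if no rate is available."""
        if value is None:
            return None  # counter not exposed: the caller suppresses the field
        if self._first is None:
            self._first = (timestamp_s, value)  # baseline sample, no rate yet
            return None
        t0, v0 = self._first
        elapsed = timestamp_s - t0
        if elapsed <= 0:
            return None
        return (value - v0) / elapsed
```

Feeding it successive scrapes of vllm:prompt_tokens_total would yield the tput_in_srv value; a second instance fed vllm:generation_tokens_total would yield tput_out_srv.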

…enarios

Context-overflow errors mid-trajectory are already handled by agentic_replay.handle_credit_return via the separate CreditReturn message path: the trajectory is terminated, the conversation is recycled, and a fresh trajectory spawned in its place. Emitting an error MetricRecordsMessage for the same event double-counts it -- the record shows up in failure tallies, dragging down the per-run success rate even though the overflow is an expected and intentionally-tolerated end-of-trajectory signal in this scenario.

When the active scenario's timing mode is AGENTIC_REPLAY, drop context-overflow records before they enter the metrics pipeline. The filter is gated on scenario timing mode (cached at __init__) and the RequestRecord.context_overflow flag the parser already sets per InferenceX AgentX RFC §7.

The check is a no-op for:

- non-scenario runs (user_config.scenario is None)
- non-agentic scenarios (any future timing_mode != agentic_replay)
- records that aren't context-overflow events (flag stays False)

The existing ContextOverflowCountMetric continues to work for diagnostic purposes outside agentic scenarios.

Signed-off-by: Cam Quilici <cameron@semianalysis.com>

When the active PROFILING-phase failure rate (error_records / total_records) exceeds the user-supplied threshold after a grace floor of max(concurrency, 10) records, broadcast ProfileCancelCommand on the message bus to terminate the run early. The existing cancel handlers in records_manager, timing_manager, server_metrics manager, and gpu_telemetry manager stop their work cleanly; the run finalizes with cancelled=True and exits non-zero via the standard cancel flow.

The grace floor exists so a single early failure can't trip a tiny-N threshold (e.g., 1/1 = 100% > 0.5 would abort instantly without this guard).

Pairs naturally with the AGENTIC_REPLAY context-overflow drop in record_processor_service: context overflows aren't counted as errors in agentic scenarios, so the threshold measures real failures only (server 5xx, parse errors, malformed responses) and won't trip on the expected end-of-trajectory overflow signal.

Default is None (disabled, matching existing behavior). Valid range [0.0, 1.0].

Idempotent: the abort fires at most once per run; if the ProfileCancelCommand publish fails for any reason, the trigger is reset so the next record re-evaluates and re-attempts.

Signed-off-by: Cam Quilici <cameron@semianalysis.com>
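The threshold check with its grace floor can be sketched as a pure function. This is an illustration only: should_abort is a hypothetical name, and whether the floor comparison is inclusive (total_records >= floor) is an assumption, since the commit message only says "after a grace floor of max(concurrency, 10) records".

```python
from typing import Optional


def should_abort(
    error_records: int,
    total_records: int,
    threshold: Optional[float],
    concurrency: int,
) -> bool:
    """Return True when the failure rate exceeds the threshold past the grace floor."""
    if threshold is None:  # default: feature disabled
        return False
    grace_floor = max(concurrency, 10)
    if total_records < grace_floor:  # too few records to judge (assumed inclusive floor)
        return False
    return error_records / total_records > threshold
```

With this guard, the 1/1 case from the commit message does not abort, while 6 failures out of 10 records at concurrency 4 and a threshold of 0.5 does.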

…warmup log

Two coupled changes for AGENTIC_REPLAY scenarios:

1. Configurable trajectory start range. Previously each trajectory's k_i (start turn) was sampled uniformly from [0, int(0.7 * n)] with the 0.7 hardcoded in TrajectorySource. Now exposed as two CLI flags on the LoadGenerator group:

       --trajectory-start-min-ratio (default 0.0)
       --trajectory-start-max-ratio (default 0.7, preserves prior behavior)

   Sampling becomes uniform on [int(min_ratio * n), int(max_ratio * n)], both clamped to n-2 so the trajectory always retains at least one profiling turn after warmup. Validated cross-field so min <= max.

   Plumbed: loadgen_config -> timing.config.TimingConfig -> phase_orchestrator -> TrajectorySource constructor. RNG seed still derived from --random-seed via SHA-256 salt with trace_id, so k_i remains deterministic per (seed, trace_id).

2. Per-trajectory warmup completion log. The previous WARMUP info line reported only lane + trace_id. It now also reports start_turn=k_i/N and the percent of the trace that was warmed.

Per-request token count (ISL) is intentionally NOT included in this commit -- it would require plumbing prompt_tokens through CreditReturn (or subscribing AgenticReplayStrategy to MetricRecordsMessage for the WARMUP phase). Leaving that as a follow-up.

Signed-off-by: Cam Quilici <cameron@semianalysis.com>
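The sampling rule in item 1 can be sketched as below. The bounds follow the commit message (int(ratio * n), clamped to n-2); the exact way the seed and trace_id are combined before hashing is an assumption, as are the function names and the extra clamp to 0 for very short traces.

```python
import hashlib
import random
from typing import Tuple


def start_turn_bounds(n: int, min_ratio: float = 0.0, max_ratio: float = 0.7) -> Tuple[int, int]:
    """Inclusive [k_min, k_max] for the start turn of an n-turn trace."""
    if not 0.0 <= min_ratio <= max_ratio <= 1.0:
        raise ValueError("require 0.0 <= min_ratio <= max_ratio <= 1.0")
    cap = max(n - 2, 0)  # clamp to n-2; the floor at 0 is an assumption
    return min(int(min_ratio * n), cap), min(int(max_ratio * n), cap)


def sample_start_turn(
    random_seed: int, trace_id: str, n: int, min_ratio: float = 0.0, max_ratio: float = 0.7
) -> int:
    """Sample k_i uniformly from the bounds, deterministically per (seed, trace_id)."""
    # Hypothetical salting scheme: hash "seed:trace_id" with SHA-256 and seed a local RNG.
    digest = hashlib.sha256(f"{random_seed}:{trace_id}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    k_min, k_max = start_turn_bounds(n, min_ratio, max_ratio)
    return rng.randint(k_min, k_max)  # randint is inclusive on both ends
```

For a 24-turn trace with ratios 0.25 and 0.75, the bounds come out as turns 6 through 18, and repeated calls with the same seed and trace_id return the same k_i.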

Adds a one-block info log emitted by TrajectorySource at the end of __init__, before any dispatch fires. Shows the configured start-range plus the actual per-trajectory (lane, k_i, num_turns, pct) so the operator can sanity-check that the configured --trajectory-start-{min,max}-ratio produced a sensible distribution.

Example:

    TrajectorySource: built 14 trajectories from 949 traces
      range cfg=[0.25, 0.75]
      observed pct: min=27% median=51% max=72%
      lane=00 start_turn= 6/24 (25%) trace_id=abc...
      lane=01 start_turn= 15/22 (68%) trace_id=def...
      ...

Complements the existing per-trajectory warmup-completion lines. Logged once at build time so the full distribution is visible upfront without correlating across credit-return events.

Signed-off-by: Cam Quilici <cameron@semianalysis.com>

The previous "[RecordProcessor] Drop context-overflow records for AGENTIC_REPLAY scenarios" commit returned early from record_processor_service._on_inference_results before pushing the MetricRecordsMessage, which broke the records-side <-> credit-side counter invariant: RecordsTracker.total_records is compared for equality against the credit-side final_requests_completed at end-of-phase, and the drop made the records-side lag the credit-side by one for every overflow event. The completion barrier never converged, hanging the run for the full benchmark_grace_period before timing out + cancelling in-flight credits.

Symptom in the log:

    NOTICE All requests have completed, please wait for the results to be processed (currently 423 of 424 records processed)...
    ... (30s timeout) ...
    WARNING Phase profiling timed out, cancelling all credits.

Fix: don't drop the record. Add a context_overflow_skip flag to MetricRecordMetadata. RecordProcessor sets it when the record is context-overflow AND scenario is AGENTIC_REPLAY. RecordsManager recognizes the flag and:

- Counts the record toward total_records (preserves the invariant)
- Classifies as success in RecordsTracker (so error counters stay at 0)
- Skips error_tracker.increment_error_count_for_phase
- Skips _send_record_to_accumulators (latency/throughput/etc. unaffected)
- Skips _maybe_trigger_failed_request_abort (overflow is not a real failure for threshold purposes)

Net behavior matches the original intent ("nothing about the context overload is counted towards metrics whatsoever") without breaking the end-of-phase completion barrier. End-of-phase completion now matches credit-side: total_records = success + 0 errors, where success includes the overflow-skip records.

Signed-off-by: Cam Quilici <cameron@semianalysis.com>
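The counting rules in this fix can be modeled with a small stand-in tracker. This is a simplified sketch, not RecordsManager itself: RecordSketch and TrackerSketch are hypothetical, and the choice to forward every non-skipped record (including errors) to the accumulators is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecordSketch:
    is_error: bool = False
    context_overflow_skip: bool = False  # set upstream for overflow + AGENTIC_REPLAY


@dataclass
class TrackerSketch:
    total_records: int = 0
    success_records: int = 0
    error_records: int = 0
    accumulated: List[RecordSketch] = field(default_factory=list)

    def on_record(self, record: RecordSketch) -> None:
        self.total_records += 1  # always counted, so the completion barrier converges
        if record.context_overflow_skip:
            self.success_records += 1  # classified as success; error counters untouched
            return  # skip accumulators and the failed-request threshold check
        if record.is_error:
            self.error_records += 1
        else:
            self.success_records += 1
        self.accumulated.append(record)  # assumption: all non-skipped records accumulate
```

The key property is that total_records advances for every record, so it can still equal the credit-side completed count at end of phase.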

…o cjq/weka-live-assistant-responses # Conflicts: # src/aiperf/dataset/loader/semianalysis_cc_traces_weka.py

Cancel path was incomplete: PhaseRunner.cancel() cancelled the credit-issuance _execution_task but never set all_credits_sent_event / all_credits_returned_event. The runner's outer _wait_for_sending_complete / _wait_for_returning_complete awaits keep blocking on the unset events until the phase timeout elapses (= --benchmark-duration, default 1800s for profiling).

Empirical: a --failed-request-threshold-triggered ProfileCancelCommand at T=170s into a 1800s profiling phase causes the run to hang for the remaining ~1630s before _wait_for_sending_complete finally returns via its own timeout. The graceful "if self._was_cancelled: return" branch is reached, but only 27 min later.

Set both events at the top of cancel() so the runner's awaits wake immediately. Mirrors the event-set order already present in the except-Exception recovery path (runner.py:363-373): same correctness guarantee, just on the external-cancel path too.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
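The effect of setting both events in cancel() can be demonstrated with a toy asyncio runner. This is a sketch under assumptions, not PhaseRunner: PhaseRunnerSketch and demo are hypothetical, the waits use asyncio.wait_for as a stand-in for the runner's own timeout handling, and the timeout is shortened to 5s.

```python
import asyncio
import time
from typing import Tuple


class PhaseRunnerSketch:
    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.all_credits_sent_event = asyncio.Event()
        self.all_credits_returned_event = asyncio.Event()
        self.was_cancelled = False

    def cancel(self) -> None:
        self.was_cancelled = True
        # Setting both events wakes the awaits in run() right away instead of
        # leaving them blocked until the phase timeout elapses.
        self.all_credits_sent_event.set()
        self.all_credits_returned_event.set()

    async def run(self) -> str:
        try:
            await asyncio.wait_for(self.all_credits_sent_event.wait(), self.timeout_s)
            await asyncio.wait_for(self.all_credits_returned_event.wait(), self.timeout_s)
        except asyncio.TimeoutError:
            return "timed_out"
        return "cancelled" if self.was_cancelled else "completed"


async def demo() -> Tuple[str, float]:
    """Cancel shortly after start; return (outcome, seconds from cancel to return)."""
    runner = PhaseRunnerSketch(timeout_s=5.0)
    task = asyncio.create_task(runner.run())
    await asyncio.sleep(0.01)  # let run() reach its first await
    start = time.monotonic()
    runner.cancel()
    outcome = await task
    return outcome, time.monotonic() - start
```

Without the two set() calls in cancel(), run() in this toy would only return after the 5s timeout, which mirrors the ~1630s hang described above.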

Three log-ergonomics changes to make long agentic-replay runs in captured-stdout contexts (CI, srt-slurm, scripted invocations) readable:

1) EventLoopMonitor "Event loop ... taking too long to run. Overhead: XXms" downgraded from warning to debug. At sustained conc>=32 on agentic workloads this fires dozens of times per minute and is not actionable (inherent async-scheduler overhead under load).

2) callback_handler "Credit return after phase {phase} complete, credit_id=N, worker=W" downgraded from warning to debug. Every cancel-triggered phase shutdown (e.g. --failed-request-threshold trip) emits one such line per in-flight credit (up to concurrency-many = thousands), flooding the log without being actionable; the late return is expected under the cancel race.

3) service_config.validate_ui_type now picks UIType.TQDM (not NONE) when stdout isn't a tty. tqdm's progress bars still render usefully in tailed logs (carriage returns are preserved by typical pipe sinks), so users running aiperf from srt-slurm / gha runners get a visible progress indicator. Explicit --ui-type none still opts back into the previous silence.

All three changes are config/severity adjustments only; no behavioral changes to phase orchestration, credit accounting, or UI rendering beyond the visibility surface.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>

8aad400 introduced an AttributeError on every aiperf invocation in non-TTY contexts (CI, srt-slurm benchmarks). UIType is an extensible enum with members SIMPLE/DASHBOARD/NONE; "TQDM" was never registered. SIMPLE is the tqdm-backed UI per the field docstring on `ui_type`.

Repro: any `aiperf profile ...` call where stdout is captured (e.g. redirected to a file) crashes immediately in ServiceConfig.validate_ui_type with `AttributeError: 'UIType' has no attribute 'TQDM'`.

Surfaced via InferenceX R26 1p6d shards where the agentic benchmark wrapper saw aiperf exit 1 within seconds of invocation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

This reverts commit a6812b0.

This reverts commit 8aad400.

TrajectorySource._log_trajectory_summary previously reported each lane's start position only as turn index + percentage. With agentic-replay on weka cc-traces, the useful question is "how many tokens of context does warmup start at?", and the answer was invisible until you correlated trace_ids back to the raw dataset.

Threads input_length (proxy-tokenizer "in" field) end-to-end:

    WekaNormalRequest.input_length (already in scope at construction)
      -> Turn.input_length (new optional field)
      -> TurnMetadata.input_length (new optional field; propagated by Turn.metadata() and Conversation.metadata())
      -> TrajectorySource summary log

New log shape:

    TrajectorySource: built 192 trajectories from 949 traces
      range cfg=[0.25, 0.75]
      observed pct: min= 0% median= 46% max= 75%
      observed tokens: min= 0 median= 58,431 max=187,294
      lane=00 start_turn= 15/27 ( 56%) start_tokens= 42,580 trace_id=...

Backward-compatible: input_length is Optional everywhere, defaults to None. Loaders that don't populate it (synthetic, raw_payload, sharegpt, etc.) keep working unchanged and the log shows "-" for those lanes. Only weka_trace's normal-turn path sets it for now.

Split _log_trajectory_summary into _build_trajectory_rows + _format_observed_stats + _median helper to stay under the 80-line function size cap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Swaps the HF dataset slug from cc-traces-weka-no-subagents-051226 (949 traces) to cc-traces-weka-no-subagents-051826 (98 traces). 051826 is a stricter filter of the same source:

- v5-only (drops legacy trace_version=4 rows)
- CC ≥ 2.1.139 (drops rows from older CLI versions whose tool-use semantics differ)
- ≥20 main-agent turns per trace post-strip
- subagent blocks stripped (same as 051226)

InferenceX R29 surfaced a delta-encoding edge case where two of the 949 traces (turns 555-918 and 4752-4753) produced empty delta_messages, triggering 99.5% of the 366 HTTP-400 validation rejections. The new 051826 corpus should not contain those two traces.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…mary log" This reverts commit 61a9ed8.

The preemptions counter rarely tells you anything actionable in the live log: it's either 0 (boring) or steadily climbing (which the kv_usage + queue=r/w fields already telegraph more usefully). Removing it tightens the per-tick server-side row to the metrics that actually inform intervention decisions: cache hit rates, KV usage, queue depth, and server-side token throughput.

The accumulator still scrapes vllm:num_preemptions and sglang:num_retracted_reqs and exposes num_preemptions on the snapshot dict, so downstream consumers (export jsonl, future analysis) keep working unchanged. Just the log surface is trimmed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

copy-pr-bot · 2026-05-19T22:46:04Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-19T22:46:11Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@0f04eb61753858f93f8aef8a5b9f1ae6341b57d5

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@0f04eb61753858f93f8aef8a5b9f1ae6341b57d5

Last updated for commit: 0f04eb6

for more information, see https://pre-commit.ci

dynamo-ops · 2026-05-19T22:58:43Z

+            "Broadcasting ProfileCancelCommand to terminate the run."
+        )
+        try:
+            await self.publish(ProfileCancelCommand(service_id=self.service_id))


Publishing ProfileCancelCommand with this service's own service_id means RecordsManager ignores the broadcast in CommandHandlerMixin and never runs _on_profile_cancel_command, so threshold aborts cancel other services without marking records cancelled or producing the intended non-zero partial result. Fix: invoke the local cancel/finalization path explicitly, or send the cancel through a controller/command sender that targets RecordsManager instead of self.

dynamo-ops · 2026-05-19T22:58:43Z

+        """
        self._was_cancelled = True
        self._lifecycle.cancel()
+        self._progress.all_credits_sent_event.set()


Forcing the phase events during cancel makes run() immediately take the cancelled fast path and freeze completed counts before CancelCredits returns drain, so in-flight cancelled requests can be dropped from final credit stats. Fix: unblock sending but let the normal return-drain path wait for returns or the existing drain timeout before freezing final completed counts.

dynamo-ops · 2026-05-19T22:58:43Z

      Usage: --public-dataset semianalysis_cc_traces_weka
    metadata:
-      hf_dataset_name: semianalysisai/cc-traces-weka-042026
+      hf_dataset_name: semianalysisai/cc-traces-weka-no-subagents-051826


The existing semianalysis_cc_traces_weka public dataset key now points to the no-subagents 051826 corpus, silently changing an existing dataset alias from the full 042026 traces to a filtered 98-trace subset. Fix: keep semianalysis_cc_traces_weka mapped to semianalysisai/cc-traces-weka-042026 and use only semianalysis_cc_traces_weka_no_subagents or a new key for the filtered corpus.

dynamo-ops · 2026-05-19T22:58:43Z

-                Trajectory(
-                    conversation_id=source.conversation_id, start_turn_index=k_i
-                )
+                Trajectory(conversation_id=source.conversation_id, start_turn_index=k_i)


Wrap-filled trajectories still sample start turns from the hardcoded 0..70% range, so when concurrency exceeds the trace pool the extra lanes ignore --trajectory-start-min-ratio and --trajectory-start-max-ratio. Fix: compute k_min and k_max from self._start_min_ratio and self._start_max_ratio in _wrap_fill_lanes just like _build_trajectories.

dynamo-ops · 2026-05-19T23:09:23Z

+        # the run would hang. Classify as success so error counters stay at
+        # zero (the original "don't count as failure" intent) while keeping
+        # the invariant intact.
+        if getattr(record_data.metadata, "context_overflow_skip", False):


Returning before _send_record_to_accumulators drops AGENTIC_REPLAY context-overflow responses from context_overflow_count and the total response metrics used by submission_valid, so runs with >1% overflows can be exported as valid. Fix: keep these records out of failed-request/error metrics while still feeding a context-overflow counter and total-response path for scenario submission metadata.

cquil11 and others added 21 commits May 12, 2026 15:12

Merge remote-tracking branch 'upstream/ajc/inferencex-agentx-mvp' int…

929aa76

…o cjq/weka-live-assistant-responses # Conflicts: # src/aiperf/dataset/loader/semianalysis_cc_traces_weka.py

Revert "fix: UIType.TQDM does not exist — use UIType.SIMPLE"

4eb1a73

This reverts commit a6812b0.

Revert "quiet noisy phase-shutdown warnings + tqdm default in non-tty"

2f30ea8

This reverts commit 8aad400.

Revert "trajectory_source: surface per-lane start-token counts in sum…

90c93ab

…mary log" This reverts commit 61a9ed8.

[pre-commit.ci] auto fixes from pre-commit.com hooks

0f04eb6

for more information, see https://pre-commit.ci

cquil11 changed the title ~~Cjq/weka live assistant responses~~ [WIP]: InferenceX agentic benchmark v0.3 May 19, 2026

cquil11 wants to merge 22 commits into ai-dynamo:ajc/inferencex-agentx-mvp from cquil11:cjq/agentx-v0.3
