
[WIP]: InferenceX agentic benchmark v0.3#964

Open
cquil11 wants to merge 22 commits into ai-dynamo:ajc/inferencex-agentx-mvp from cquil11:cjq/agentx-v0.3


cquil11 commented May 19, 2026

Rebases the contents of #886 onto ajc/inferencex-agentx-mvp (head of #875) instead of main.

Recreated after #886 was auto-closed when the source branch was renamed from cjq/weka-live-assistant-responses to cjq/agentx-v0.3.

🤖 Generated with Claude Code

cquil11 and others added 21 commits May 12, 2026 15:12
Bump the SemiAnalysis Weka loader's HF dataset target from the 042026
full-subagent corpus (657 MB / 739 traces) to the no-subagents variant
(2.77 GB / 949 traces / 136k requests). Subagent entries are stripped:
only top-level main-agent turns remain, so each row produces exactly
one conversation downstream (no parent/child fan-out).

Plugins.yaml, loader docstrings, and two fixture/constant references in
the unit tests follow the rename. The registered tag and class are
unchanged, so the inferencex-agentx-mvp scenario binding continues to
resolve.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Two concurrent aiperf processes that miss the cache on the same key
both pay the expensive tokenize+reconstruct cost today. The atomic
os.replace at the end of populate() means only one set of bytes ends
up canonical, but both sides do the work. On a shared filesystem cache
(Lustre / NFS) with N concurrent agentic jobs, that's N redundant
tokenizations.

Add a cross-process populate lock that serializes the miss path:

  - lookup -> HIT -> use it (no lock, fast path)
  - lookup -> MISS -> acquire flock -> re-lookup -> ...
      -> HIT (someone populated while we waited) -> use it
      -> MISS -> do the tokenize + populate -> release

Implementation mirrors huggingface_hub's WeakFileLock pattern:
  - filelock.FileLock with mode=0o664 so multiple users sharing a cache
    directory contend correctly
  - SoftFileLock fallback on filesystems without flock support
  - INFO log every 10s while waiting so a waiter is visible, not silent
  - thread_local=False so that a release issued from a different asyncio
    worker thread (acquire and release can land on distinct
    asyncio.to_thread workers) still drops the OS-level lock
  - asyncio.to_thread for the blocking acquire so the event loop is not
    blocked
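
A minimal sketch of the double-checked miss path described above. It is not the code from this PR: a `threading.Lock` stands in for the cross-process `filelock.FileLock`, the cache is a plain dict, and the names `PopulateOnce`, `get`, and `run_demo` are hypothetical.

```python
import threading


class PopulateOnce:
    """Double-checked populate: only the first miss does the expensive work."""

    def __init__(self):
        self._cache = {}
        # Stand-in for the cross-process file lock (assumption for this sketch).
        self._lock = threading.Lock()
        self.populate_calls = 0

    def get(self, key, build):
        value = self._cache.get(key)  # lookup -> HIT: fast path, no lock
        if value is not None:
            return value
        with self._lock:  # lookup -> MISS: serialize populators
            value = self._cache.get(key)  # re-lookup after waiting
            if value is None:
                self.populate_calls += 1
                value = build(key)  # the expensive tokenize + populate step
                self._cache[key] = value
            return value


def run_demo(workers=8):
    """Race several workers on one key and report how many populated."""
    cache = PopulateOnce()
    results = []

    def worker():
        results.append(cache.get("tok", lambda k: f"tokens-for-{k}"))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cache.populate_calls, results
```

However the workers interleave, the re-lookup under the lock means `build` runs once per key; every other worker takes either the fast path or the post-wait HIT branch.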

Code lives in a new mmap_cache_lock module to keep mmap_cache.py under
the 500-line file-size budget; mmap_cache re-exports acquire_cache_lock
with cache_dir pre-bound so callers see one entry point.

DatasetManager._do_profile_configure now wraps the miss path in the
lock, with the double-checked re-lookup factored into a small helper
to keep the function under the 80-line size budget.

Three new unit tests in test_mmap_cache.py cover: concurrent acquires
on the same key serialize; acquires on different keys run in parallel;
holder beyond timeout causes the waiter to raise filelock.Timeout.

filelock is added as an explicit project dependency. It was already a
transitive via huggingface-hub; declared here so a future HF drop
doesn't silently break the cache lock.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
The single 'seq isl_avg=... osl_avg=...' line hid the distribution
shape that's the main reason to watch sequence lengths mid-run
(spotting long-tail agentic prompts or response truncation). Replace
it with two percentile rows matching the latency / per-user-throughput
row format already used above:

    isl  p50=123,952 p75=245,124 p90=391,085 p99=720,485 (tokens)
    osl  p50=261     p75=664     p90=1,614   p99=7,013   (tokens)

Reads p90 off the existing JsonMetricResult schema (already populated
by the accumulator); no extra plumbing. Rows are skipped entirely when
both ISL/OSL metrics are absent so the renderer stays compact on
non-tokenizing endpoints.
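
A rough sketch of how such a row could be rendered. `format_percentile_row` is a hypothetical helper, not the renderer in this PR, and the column padding shown in the osl example above is not reproduced.

```python
def format_percentile_row(label, percentiles, unit="tokens"):
    """Render one 'label p50=... p75=...' row, or None when there is no data."""
    if not percentiles:
        return None  # caller skips the row entirely
    cells = " ".join(f"{name}={value:,.0f}" for name, value in percentiles.items())
    return f"{label:<4} {cells} ({unit})"


isl_row = format_percentile_row(
    "isl", {"p50": 123952, "p75": 245124, "p90": 391085, "p99": 720485}
)
```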

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
The TQDMProgressUI + Textual dashboard already wire @on_warmup_progress
to a tqdm bar, but those UIs are disabled under non-TTY execution (the
CI/SLURM bench-driver case), so users have no visibility into warmup
progress until the phase either completes or hangs.

Add a per-return INFO log inside AgenticReplayStrategy.handle_credit_return
under CreditPhase.WARMUP that fires once per returning warmup credit
(success or error), of the shape:

  WARMUP 7/40 returned [ok] (lane=6, trace_id=abc123...)

This gives operators a line per completion in benchmark.log even when
no UI is active. The lambda log form keeps the per-completion work off
the hot path when log level filters out INFO.
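
For illustration, the lazy-log idea can be sketched with the standard `logging` module. `lazy_info` and `format_warmup_line` are hypothetical names, and the `[error]` label for failed credits is an assumption; only the `[ok]` form appears above.

```python
import logging


def format_warmup_line(returned, total, ok, lane, trace_id):
    """Build the per-credit warmup line in the shape shown above."""
    status = "ok" if ok else "error"  # "error" label is assumed
    return f"WARMUP {returned}/{total} returned [{status}] (lane={lane}, trace_id={trace_id})"


def lazy_info(logger, make_message):
    """Call make_message only if INFO is enabled, keeping formatting off the hot path."""
    if logger.isEnabledFor(logging.INFO):
        logger.info(make_message())
```

With the logger at WARNING, the callable is never invoked, so no string formatting happens per returning credit.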

Bump ergonomics file-size baseline: the addition pushes
agentic_replay.py from 500 -> 510 lines.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
The tin (prefill_throughput_per_user) and tout
(output_token_throughput_per_user) rows in the realtime block are
literally just 1/prefill_time and 1/ITL percentiled across requests,
which is the same information conveyed more clearly as the
"interactivity" metric (1/tpot) familiar from LLM serving literature.

Drop the two per-user-throughput rows and emit a single

  intvty p50=9      p75=10     p95=15     p99=20     (1/tpot tok/s)

row using the same backing metric (output_token_throughput_per_user
= 1 / inter_token_latency per request). Aggregate tput_in/tput_out on
line 1 are unchanged.

Same numbers, clearer label, less noise.
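
As a small worked sketch of the relationship: interactivity per request is the reciprocal of its inter-token latency, then percentiled across requests. The nearest-rank percentile below is an assumption for illustration; the accumulator in aiperf may interpolate differently.

```python
import math


def interactivity(inter_token_latencies_s):
    """Per-request interactivity in tok/s: 1 / inter-token latency (seconds)."""
    return [1.0 / itl for itl in inter_token_latencies_s if itl > 0]


def percentile(values, pct):
    """Nearest-rank percentile (assumed method, chosen for simplicity)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# ITLs of 100 ms, 50 ms, and 200 ms; the zero entry is ignored.
vals = interactivity([0.1, 0.05, 0.2, 0.0])
```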

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Surfaces the server's own view of input/output token throughput in
the realtime srv row, computed as a running average from the vLLM
prometheus counters (delta / elapsed):

  vllm:prompt_tokens_total      -> tput_in_srv=N/s
  vllm:generation_tokens_total  -> tput_out_srv=N/s

Useful contrast with the client-side aggregate tput_in/tput_out on
line 1: server-side counts work in flight that hasn't returned a
response yet, so it tracks the engine's actual ingest/emit rate
rather than the rate of completions aiperf has parsed.

Suppressed entirely when the counters aren't exposed (SGLang,
non-vLLM servers, or runs where /metrics isn't scraped) so the row
stays clean.
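
A sketch of the delta / elapsed computation, under the assumption that the first scraped counter value and the latest one are both available. `running_rate` and `format_srv_tput` are hypothetical helper names.

```python
from typing import Optional


def running_rate(
    first: Optional[float], current: Optional[float], elapsed_s: float
) -> Optional[float]:
    """Running average rate from a monotonically increasing counter."""
    if first is None or current is None or elapsed_s <= 0:
        return None  # counter not exposed, or no time has elapsed yet
    return (current - first) / elapsed_s


def format_srv_tput(prompt_rate: Optional[float], gen_rate: Optional[float]) -> str:
    """Render only the fields that have data; an empty string means suppress."""
    parts = []
    if prompt_rate is not None:
        parts.append(f"tput_in_srv={prompt_rate:,.0f}/s")
    if gen_rate is not None:
        parts.append(f"tput_out_srv={gen_rate:,.0f}/s")
    return " ".join(parts)
```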

Bump ergonomics baseline: file +5 lines (526->531), realtime_snapshot
+10 lines (80->90); both fall under existing advisory caps that
already accept similar-sized siblings.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
…enarios

Context-overflow errors mid-trajectory are already handled by
agentic_replay.handle_credit_return via the separate CreditReturn
message path: the trajectory is terminated, the conversation is
recycled, and a fresh trajectory spawned in its place. Emitting an
error MetricRecordsMessage for the same event double-counts it -- the
record shows up in failure tallies, dragging down the per-run success
rate even though the overflow is an expected and intentionally-tolerated
end-of-trajectory signal in this scenario.

When the active scenario's timing mode is AGENTIC_REPLAY, drop
context-overflow records before they enter the metrics pipeline. The
filter is gated on scenario timing mode (cached at __init__) and the
RequestRecord.context_overflow flag the parser already sets per
InferenceX AgentX RFC §7.

The check is a no-op for:
  - non-scenario runs (user_config.scenario is None)
  - non-agentic scenarios (any future timing_mode != agentic_replay)
  - records that aren't context-overflow events (flag stays False)

The existing ContextOverflowCountMetric continues to work for
diagnostic purposes outside agentic scenarios.

Signed-off-by: Cam Quilici <cameron@semianalysis.com>
When the active PROFILING-phase failure rate (error_records /
total_records) exceeds the user-supplied threshold after a grace
floor of max(concurrency, 10) records, broadcast ProfileCancelCommand
on the message bus to terminate the run early. The existing cancel
handlers in records_manager, timing_manager, server_metrics manager,
and gpu_telemetry manager stop their work cleanly; the run finalizes
with cancelled=True and exits non-zero via the standard cancel flow.

The grace floor exists so a single early failure can't trip a tiny-N
threshold (e.g., 1/1 = 100% > 0.5 would abort instantly without this
guard).

Pairs naturally with the AGENTIC_REPLAY context-overflow drop in
record_processor_service: context overflows aren't counted as errors
in agentic scenarios, so the threshold measures real failures only
(server 5xx, parse errors, malformed responses) and won't trip on the
expected end-of-trajectory overflow signal.

Default is None (disabled, matching existing behavior). Valid range
[0.0, 1.0]. Idempotent: the abort fires at most once per run; if the
ProfileCancelCommand publish fails for any reason, the trigger is
reset so the next record re-evaluates and re-attempts.
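
The trigger condition can be sketched as a pure function. This is an illustration, not the RecordsManager code; it assumes the grace floor is satisfied once total_records reaches max(concurrency, 10), and `should_abort` is a hypothetical name.

```python
from typing import Optional


def should_abort(
    error_records: int,
    total_records: int,
    threshold: Optional[float],
    concurrency: int,
) -> bool:
    """True when the failure rate exceeds the threshold past the grace floor."""
    if threshold is None:
        return False  # feature disabled (the default)
    grace_floor = max(concurrency, 10)
    if total_records < grace_floor:
        return False  # too few records to judge, e.g. 1/1 = 100%
    return error_records / total_records > threshold
```

The comparison is strict, so a failure rate exactly equal to the threshold does not trip the abort in this sketch.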

Signed-off-by: Cam Quilici <cameron@semianalysis.com>
…warmup log

Two coupled changes for AGENTIC_REPLAY scenarios:

1. Configurable trajectory start range. Previously each trajectory's
   k_i (start turn) was sampled uniformly from [0, int(0.7 * n)] with
   the 0.7 hardcoded in TrajectorySource. Now exposed as two CLI flags
   on the LoadGenerator group:

     --trajectory-start-min-ratio  (default 0.0)
     --trajectory-start-max-ratio  (default 0.7, preserves prior behavior)

   Sampling becomes uniform on [int(min_ratio * n), int(max_ratio * n)],
   both clamped to n-2 so the trajectory always retains at least one
   profiling turn after warmup. Validated cross-field so min <= max.
   Plumbed: loadgen_config -> timing.config.TimingConfig -> phase_orchestrator
   -> TrajectorySource constructor. RNG seed still derived from
   --random-seed via SHA-256 salt with trace_id, so k_i remains
   deterministic per (seed, trace_id).

2. Per-trajectory warmup completion log. The previous WARMUP info line
   reported only lane + trace_id. It now also reports start_turn=k_i/N
   and the percent of the trace that was warmed.

   Per-request token count (ISL) is intentionally NOT included in this
   commit -- it would require plumbing prompt_tokens through CreditReturn
   (or subscribing AgenticReplayStrategy to MetricRecordsMessage for the
   WARMUP phase). Leaving that as a follow-up.
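
A sketch of the start-turn sampling from item 1, under stated assumptions: the exact way the seed and trace_id are combined before hashing is not given above, so the `f"{seed}:{trace_id}"` salt and the `sample_start_turn` name are hypothetical.

```python
import hashlib
import random


def sample_start_turn(n, trace_id, seed, min_ratio=0.0, max_ratio=0.7):
    """Sample k_i uniformly on [int(min_ratio*n), int(max_ratio*n)], clamped to n-2."""
    if not 0.0 <= min_ratio <= max_ratio <= 1.0:
        raise ValueError("require 0.0 <= min_ratio <= max_ratio <= 1.0")
    cap = max(n - 2, 0)  # keep at least one profiling turn after warmup
    k_min = min(int(min_ratio * n), cap)
    k_max = min(int(max_ratio * n), cap)
    # Deterministic per (seed, trace_id): SHA-256 digest seeds a private RNG.
    digest = hashlib.sha256(f"{seed}:{trace_id}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.randint(k_min, k_max)
```

For a 24-turn trace with ratios 0.25 and 0.75, the sampled start turn falls in [6, 18]; repeating the call with the same seed and trace_id returns the same k_i.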

Signed-off-by: Cam Quilici <cameron@semianalysis.com>
Adds a one-block info log emitted by TrajectorySource at the end of
__init__, before any dispatch fires. Shows the configured start-range
plus the actual per-trajectory (lane, k_i, num_turns, pct) so the
operator can sanity-check that the configured
--trajectory-start-{min,max}-ratio produced a sensible distribution.

Example:

    TrajectorySource: built 14 trajectories from 949 traces
      range cfg=[0.25, 0.75]  observed pct: min=27% median=51% max=72%
        lane=00  start_turn=  6/24  (25%)  trace_id=abc...
        lane=01  start_turn= 15/22  (68%)  trace_id=def...
        ...

Complements the existing per-trajectory warmup-completion lines. Logged
once at build time so the full distribution is visible upfront without
correlating across credit-return events.

Signed-off-by: Cam Quilici <cameron@semianalysis.com>
The previous "[RecordProcessor] Drop context-overflow records for
AGENTIC_REPLAY scenarios" commit returned early from
record_processor_service._on_inference_results before pushing the
MetricRecordsMessage, which broke the records-side <-> credit-side
counter invariant: RecordsTracker.total_records is compared for
equality against the credit-side final_requests_completed at
end-of-phase, and the drop made the records-side lag the credit-side
by one for every overflow event. The completion barrier never
converged, hanging the run for the full benchmark_grace_period before
timing out + cancelling in-flight credits.

Symptom in the log:

  NOTICE All requests have completed, please wait for the results to be
         processed (currently 423 of 424 records processed)...
  ... (30s timeout) ...
  WARNING Phase profiling timed out, cancelling all credits.

Fix: don't drop the record. Add a context_overflow_skip flag to
MetricRecordMetadata. RecordProcessor sets it when the record is
context-overflow AND scenario is AGENTIC_REPLAY. RecordsManager
recognizes the flag and:

- Counts the record toward total_records (preserves the invariant)
- Classifies as success in RecordsTracker (so error counters stay at 0)
- Skips error_tracker.increment_error_count_for_phase
- Skips _send_record_to_accumulators (latency/throughput/etc. unaffected)
- Skips _maybe_trigger_failed_request_abort (overflow is not a real
  failure for threshold purposes)

Net behavior matches the original intent ("nothing about the context
overload is counted towards metrics whatsoever") without breaking the
end-of-phase completion barrier. End-of-phase completion now matches
credit-side: total_records = success + 0 errors, where success includes
the overflow-skip records.
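
The bookkeeping above, reduced to a toy tally. This is a sketch of the invariant only; `RecordsTally` and `on_record` are hypothetical, and whether ordinary error records reach the accumulators is assumed here rather than taken from the PR.

```python
from dataclasses import dataclass


@dataclass
class RecordsTally:
    total_records: int = 0
    success: int = 0
    errors: int = 0
    sent_to_accumulators: int = 0


def on_record(tally, *, is_error, context_overflow_skip=False):
    """Count every record; route overflow-skip records around the error paths."""
    tally.total_records += 1  # always counted, so records match credits
    if context_overflow_skip:
        tally.success += 1  # classified as success; error counters untouched
        return  # skip error tracker, accumulators, and abort check
    if is_error:
        tally.errors += 1
    else:
        tally.success += 1
    tally.sent_to_accumulators += 1  # assumed: non-skipped records are accumulated
```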

Signed-off-by: Cam Quilici <cameron@semianalysis.com>
…o cjq/weka-live-assistant-responses

# Conflicts:
#	src/aiperf/dataset/loader/semianalysis_cc_traces_weka.py
Cancel path was incomplete: PhaseRunner.cancel() cancelled the
credit-issuance _execution_task but never set
all_credits_sent_event / all_credits_returned_event. The runner's
outer _wait_for_sending_complete / _wait_for_returning_complete
awaits keep blocking on the unset events until the phase timeout
elapses (= --benchmark-duration, default 1800s for profiling).

Empirical: a --failed-request-threshold-triggered
ProfileCancelCommand at T=170s into a 1800s profiling phase causes
the run to hang for the remaining ~1630s before _wait_for_sending_complete
finally returns via its own timeout. The graceful "if self._was_cancelled:
return" branch is reached, but only 27 min later.

Set both events at the top of cancel() so the runner's awaits wake
immediately. Mirrors the event-set order already present in the
except-Exception recovery path (runner.py:363-373) — same correctness
guarantee, just on the external-cancel path too.
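
A stripped-down asyncio sketch of the fix, assuming a runner that waits on two events with a long timeout. `PhaseRunnerSketch` and `demo` are hypothetical and omit the credit-issuance task entirely.

```python
import asyncio


class PhaseRunnerSketch:
    def __init__(self):
        self.all_credits_sent_event = asyncio.Event()
        self.all_credits_returned_event = asyncio.Event()
        self.was_cancelled = False

    def cancel(self):
        self.was_cancelled = True
        # Setting both events wakes the awaits in run() immediately,
        # instead of leaving them blocked until the phase timeout.
        self.all_credits_sent_event.set()
        self.all_credits_returned_event.set()

    async def run(self, timeout):
        await asyncio.wait_for(self.all_credits_sent_event.wait(), timeout)
        await asyncio.wait_for(self.all_credits_returned_event.wait(), timeout)
        return "cancelled" if self.was_cancelled else "completed"


async def demo():
    runner = PhaseRunnerSketch()
    task = asyncio.create_task(runner.run(timeout=5.0))
    await asyncio.sleep(0.01)  # let run() block on the first event
    runner.cancel()
    return await task
```

Without the two `set()` calls in `cancel()`, `run()` in this sketch would sit on the first `wait_for` until its timeout elapsed, which is the hang described above.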

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Three log-ergonomics changes to make long agentic-replay runs in
captured-stdout contexts (CI, srt-slurm, scripted invocations)
readable:

1) EventLoopMonitor "Event loop ... taking too long to run. Overhead:
   XXms" downgraded from warning to debug. At sustained conc>=32 on
   agentic workloads this fires dozens of times per minute and is not
   actionable (inherent async-scheduler overhead under load).

2) callback_handler "Credit return after phase {phase} complete,
   credit_id=N, worker=W" downgraded from warning to debug. Every
   cancel-triggered phase shutdown (e.g. --failed-request-threshold
   trip) emits one such line per in-flight credit — up to
   concurrency-many = thousands — flooding the log without being
   actionable; the late return is expected under the cancel race.

3) service_config.validate_ui_type now picks UIType.TQDM (not NONE)
   when stdout isn't a tty. tqdm's progress bars still render
   usefully in tailed logs (carriage returns are preserved by
   typical pipe sinks), so users running aiperf from srt-slurm /
   gha runners get a visible progress indicator. Explicit
   --ui-type none still opts back into the previous silence.

All three changes are config/severity adjustments only; no behavioral
changes to phase orchestration, credit accounting, or UI rendering
beyond the visibility surface.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
8aad400 introduced an AttributeError on every aiperf invocation in
non-TTY contexts (CI, srt-slurm benchmarks). UIType is an extensible
enum with members SIMPLE/DASHBOARD/NONE; "TQDM" was never registered.
SIMPLE is the tqdm-backed UI per the field docstring on `ui_type`.

Repro: any `aiperf profile ...` call where stdout is captured (e.g.
redirected to a file) crashes immediately in
ServiceConfig.validate_ui_type with
`AttributeError: 'UIType' has no attribute 'TQDM'`. Surfaced via
InferenceX R26 1p6d shards where the agentic benchmark wrapper saw
aiperf exit 1 within seconds of invocation.
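
The failure mode can be reproduced in miniature with a plain `enum.Enum` standing in for aiperf's extensible enum. The `default_ui_type` helper and the DASHBOARD default for TTY runs are assumptions for illustration only.

```python
from enum import Enum


class UIType(Enum):
    """Stand-in with the three members named above; there is no TQDM member."""

    SIMPLE = "simple"
    DASHBOARD = "dashboard"
    NONE = "none"


def default_ui_type(stdout_is_tty: bool) -> UIType:
    """Pick a UI when the user has not chosen one (TTY default is assumed)."""
    # Referencing UIType.TQDM here would raise AttributeError at call time.
    return UIType.DASHBOARD if stdout_is_tty else UIType.SIMPLE
```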

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
TrajectorySource._log_trajectory_summary previously reported each lane's
start position only as turn index + percentage. With agentic-replay on
weka cc-traces, the useful question is "how many tokens of context does
warmup start at?" — and the answer was invisible until you correlated
trace_ids back to the raw dataset.

Threads input_length (proxy-tokenizer "in" field) end-to-end:
  WekaNormalRequest.input_length  (already in scope at construction)
    -> Turn.input_length            (new optional field)
      -> TurnMetadata.input_length  (new optional field; propagated by
                                     Turn.metadata() and
                                     Conversation.metadata())
        -> TrajectorySource summary log

New log shape:

  TrajectorySource: built 192 trajectories from 949 traces
    range cfg=[0.25, 0.75]
      observed pct:    min=  0% median= 46% max= 75%
      observed tokens: min=     0 median= 58,431 max=187,294
      lane=00  start_turn= 15/27  ( 56%)  start_tokens= 42,580  trace_id=...

Backward-compatible: input_length is Optional everywhere, defaults to
None. Loaders that don't populate it (synthetic, raw_payload, sharegpt,
etc.) keep working unchanged and the log shows "-" for those lanes.
Only weka_trace's normal-turn path sets it for now.

Split _log_trajectory_summary into _build_trajectory_rows +
_format_observed_stats + _median helper to stay under the
80-line function size cap.
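
A sketch of how an optional input_length might flow into the summary. The dataclass here is a minimal stand-in, not the real `TurnMetadata`, and `format_start_tokens` and `observed_tokens` are hypothetical helpers.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional, Tuple


@dataclass
class TurnMetadata:
    input_length: Optional[int] = None  # None for loaders that do not set it


def format_start_tokens(meta: TurnMetadata) -> str:
    """Render the per-lane start_tokens cell, using '-' when unknown."""
    return "-" if meta.input_length is None else f"{meta.input_length:,}"


def observed_tokens(metas: List[TurnMetadata]) -> Optional[Tuple[int, float, int]]:
    """Return (min, median, max) over lanes with a known input_length."""
    known = [m.input_length for m in metas if m.input_length is not None]
    if not known:
        return None
    return min(known), median(known), max(known)
```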

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Swaps the HF dataset slug from cc-traces-weka-no-subagents-051226
(949 traces) to cc-traces-weka-no-subagents-051826 (98 traces).

051826 is a stricter filter of the same source:
  - v5-only (drops legacy trace_version=4 rows)
  - CC ≥ 2.1.139 (drops rows from older CLI versions whose tool-use
    semantics differ)
  - ≥20 main-agent turns per trace post-strip
  - subagent blocks stripped (same as 051226)

InferenceX R29 surfaced a delta-encoding edge case where two of the
949 traces (turns 555-918 and 4752-4753) produced empty delta_messages,
triggering 99.5% of the 366 HTTP-400 validation rejections. The new
051826 corpus should not contain those two traces.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The preemptions counter rarely tells you anything actionable in the live
log — it's either 0 (boring) or steadily climbing (which the kv_usage +
queue=r/w fields already telegraph more usefully). Removing it tightens
the per-tick server-side row to the metrics that actually inform
intervention decisions: cache hit rates, KV usage, queue depth, and
server-side token throughput.

The accumulator still scrapes vllm:num_preemptions and
sglang:num_retracted_reqs and exposes num_preemptions on the snapshot
dict, so downstream consumers (export jsonl, future analysis) keep
working unchanged. Just the log surface is trimmed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


github-actions Bot commented May 19, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@0f04eb61753858f93f8aef8a5b9f1ae6341b57d5

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@0f04eb61753858f93f8aef8a5b9f1ae6341b57d5

Last updated for commit: 0f04eb6

cquil11 changed the title from "Cjq/weka live assistant responses" to "[WIP]: InferenceX agentic benchmark v0.3" on May 19, 2026
    "Broadcasting ProfileCancelCommand to terminate the run."
)
try:
    await self.publish(ProfileCancelCommand(service_id=self.service_id))

Publishing ProfileCancelCommand with this service's own service_id means RecordsManager ignores the broadcast in CommandHandlerMixin and never runs _on_profile_cancel_command, so threshold aborts cancel other services without marking records cancelled or producing the intended non-zero partial result. Fix: invoke the local cancel/finalization path explicitly, or send the cancel through a controller/command sender that targets RecordsManager instead of self.

"""
self._was_cancelled = True
self._lifecycle.cancel()
self._progress.all_credits_sent_event.set()

Forcing the phase events during cancel makes run() immediately take the cancelled fast path and freeze completed counts before CancelCredits returns drain, so in-flight cancelled requests can be dropped from final credit stats. Fix: unblock sending but let the normal return-drain path wait for returns or the existing drain timeout before freezing final completed counts.

  Usage: --public-dataset semianalysis_cc_traces_weka
  metadata:
-   hf_dataset_name: semianalysisai/cc-traces-weka-042026
+   hf_dataset_name: semianalysisai/cc-traces-weka-no-subagents-051826

The existing semianalysis_cc_traces_weka public dataset key now points to the no-subagents 051826 corpus, silently changing an existing dataset alias from the full 042026 traces to a filtered 98-trace subset. Fix: keep semianalysis_cc_traces_weka mapped to semianalysisai/cc-traces-weka-042026 and use only semianalysis_cc_traces_weka_no_subagents or a new key for the filtered corpus.

- Trajectory(
-     conversation_id=source.conversation_id, start_turn_index=k_i
- )
+ Trajectory(conversation_id=source.conversation_id, start_turn_index=k_i)

Wrap-filled trajectories still sample start turns from the hardcoded 0..70% range, so when concurrency exceeds the trace pool the extra lanes ignore --trajectory-start-min-ratio and --trajectory-start-max-ratio. Fix: compute k_min and k_max from self._start_min_ratio and self._start_max_ratio in _wrap_fill_lanes just like _build_trajectories.

# the run would hang. Classify as success so error counters stay at
# zero (the original "don't count as failure" intent) while keeping
# the invariant intact.
if getattr(record_data.metadata, "context_overflow_skip", False):

Returning before _send_record_to_accumulators drops AGENTIC_REPLAY context-overflow responses from context_overflow_count and the total response metrics used by submission_valid, so runs with >1% overflows can be exported as valid. Fix: keep these records out of failed-request/error metrics while still feeding a context-overflow counter and total-response path for scenario submission metadata.

3 participants