Skip to content

Cjq/weka live assistant responses#886

Closed
cquil11 wants to merge 136 commits into
ai-dynamo:mainfrom
cquil11:cjq/weka-live-assistant-responses
Closed

Cjq/weka live assistant responses#886
cquil11 wants to merge 136 commits into
ai-dynamo:mainfrom
cquil11:cjq/weka-live-assistant-responses

Conversation

@cquil11
Copy link
Copy Markdown

@cquil11 cquil11 commented May 5, 2026

No description provided.

ajcasagrande and others added 13 commits May 1, 2026 15:11
…y, DAG benchmarks, and reproducible payload replay

Adds four major, interlocking input/timing/scenario capabilities to AIPerf, plus
the supporting plumbing (DAG conversation branching, scenario validator
framework, prerequisite system, hash-ID-aware synthesis, ISL-budget
compensation, raw-payload endpoint, and an extensive test corpus).

This is the first AIPerf implementation of the SemiAnalysis InferenceX
AgentX-MVP benchmark and lays the foundation for byte-exact replay of the
public Weka KV-cache-tester corpus.

Headline features
-----------------

1. InferenceX AgentX MVP scenario (`--scenario inferencex-agentx-mvp`)
   - Single-flag preset that bundles every AgentX-MVP submission rule:
     fixed-schedule replay, no early stop, cache warmup, inter-turn delay
     cap (60s), Weka trace input, and submission-validity stamping.
   - Locks conflicting flags at config time via a new scenario-validator
     framework (`src/aiperf/common/scenario/`, `src/aiperf/common/validators/`).
   - Stamps a `submission_valid` field onto both per-run `profile_export.json`
     and aggregate output (`--num-profile-runs >= 2`) so reviewers can verify
     compliance at a glance.
   - Documented end-to-end in `docs/tutorials/agentx-mvp.md`.

2. Weka agentic-coding trace loader
   - New `weka_trace` input type replays real Claude Code coding sessions
     captured by the public Weka kv-cache-tester project, preserving:
     - Per-request timing (`type: "n"` and `type: "s"` streaming requests).
     - Cache-block hash IDs for KV-cache-aware replay.
     - Subagent topology (`type: "subagent"`) as nested SPAWN children.
   - Per-trace model rewriting: traces map onto whatever `--model` values
     the user supplies, regardless of the trace's recorded model names.
   - Parallel JSON->DAG conversion path (`weka_parallel_convert.py`,
     `weka_synth_buf.py`) for fast load of large corpora.
   - Byte-exact corpus tests + adversarial coverage
     (`tests/unit/dataset/loader/test_weka_trace*.py`,
     `tests/component_integration/dataset/test_weka_trace*`).
   - Auxiliary tooling: `tools/weka_byte_exact_verify.py`,
     `tools/weka_trace_inspect.py`, `tools/weka_loader_drift_audit.py`.
   - Companion loader for SemiAnalysis CC traces
     (`semianalysis_cc_traces_weka.py`).
   - Full tutorial: `docs/tutorials/weka-trace.md`.

3. DAG benchmarks: branching conversations
   - New `dag_jsonl` input type expressing one conversation per JSONL line
     with `forks` (KV-cache-locality children that inherit history and pin
     to the parent worker) and `spawns` (independent subagent children with
     fresh history that may land on any worker).
   - Branch orchestrator + trajectory source
     (`src/aiperf/timing/branch_orchestrator.py`,
     `src/aiperf/timing/trajectory_source.py`) drive the conversation graph,
     including SPAWN_JOIN prerequisites that gate parent turns on completed
     subagent trees.
   - Prerequisites/branch metadata layer
     (`src/aiperf/common/models/prerequisites.py`,
     `branch.py`, `branch_stats.py`).
   - Tutorial + reference: `docs/benchmark-modes/dag.md`.
   - Comprehensive test pyramid: unit, adversarial, fan-in, multi-gate,
     pre-session, pathological-timing, full-topology integration, and
     spawn-mode integration suites.

4. Reproducible payload replay
   - `inputs_json` loader replays the `inputs.json` artifact AIPerf already
     produces, enabling byte-faithful re-runs and cross-server comparisons
     (`docs/tutorials/inputs-json-replay.md`).
   - `raw_payload` loader + `raw` endpoint replay arbitrary pre-built JSONL
     request bodies verbatim, including multi-turn directory mode for
     non-standard APIs (`docs/tutorials/raw-payload-replay.md`,
     `src/aiperf/endpoints/raw_endpoint.py`).

Supporting changes
------------------

- New timing strategies: `agentic_replay` (drives the trajectory source for
  Weka/DAG/AgentX runs) and `cache_bust` (collision-free cache-busting prefix
  injection with adversarial coverage).
- New `coding_content` synthetic generator + `hash_ids_synthesis` loader for
  KV-cache-aware synthetic workloads matching real coding-session shapes.
- ISL budget compensation
  (`docs/reference/isl-budget-compensation.md`,
  `tests/unit/dataset/composer/test_isl_budget_compensation.py`) and ISL
  tokenization reference (`docs/reference/isl-tokenization.md`).
- New `context_overflow_count` metric with runtime gating; ergonomics and
  ruff baselines refreshed.
- Inter-turn delay cap utility (`_delay_cap.py`) shared across multi-turn
  loaders so AgentX-MVP and Weka traces honor the 60s cap consistently.
- Endpoints gain a shared `response_mixin` and an explicit `raw` endpoint;
  parser/exporter touch-ups for the new `submission_valid` field and DAG
  metadata tagging.
- Credit pipeline: per-session field plumbing, join-turn dispatch, sticky
  router updates, and structs depth tests for branched topologies.
- Records manager / inference client / worker updated for cache-bust
  injection paths and DAG/Weka metadata tagging.
- Three-file sync (CLAUDE.md, copilot-instructions, cursor rules) refreshed
  alongside the docs index, plugin registry, and CLI/env-var generated docs.

Tests
-----

- ~150 new test files spanning unit, component-integration, and full
  integration tiers — covering DAG topology pathology, cache-bust collision
  freedom, AgentX scenario lockdown, Weka byte-exact drift, prerequisite
  metadata adversarial cases, raw-payload adversarial inputs, trajectory
  source extended adversarial cases, and orchestrator-validator integration.
- New fixture corpora under `tests/fixtures/dag/`,
  `tests/fixtures/weka_traces*/`.

Docs
----

- Four new tutorials: AgentX-MVP, Weka trace replay, inputs-json replay,
  raw-payload replay.
- New benchmark-mode page for DAG conversations.
- New reference pages for ISL-budget compensation and ISL tokenization.
- Architecture, plugin-system, timing-modes, trace-replay, conversation-
  context-mode, benchmark-datasets, genai-perf-feature-comparison, CLI
  options, and environment-variables docs all refreshed for the new
  scenarios, loaders, strategies, and metrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Makefile: add `not slow` to unit/integration/component-integration/coverage/CI test selectors so slow-marked corpus tests don't OOM xdist workers.
- DatasetManager: capture raw_payload state BEFORE `_preformat_payloads` runs via `_all_turns_source_loaded_payloads`; otherwise synthesized turns gain raw_payload and falsely trip the mooncake payload-mode inputs.json skip.
- test_server_metrics: the mock's vllm/sglang endpoints don't expose labelled series, so assert label-column discovery is well-formed rather than non-empty.
- DAG mock helpers (`_start_pre`, `_start`) accept `**_kw` so the new `cache_bust_marker` / `cache_bust_target` kwargs from `BranchOrchestrator` no longer break ~75 timing tests across four files.
- test_weka_trace_v1_adversarial.py, test_weka_trace_integration.py: MagicMock `prompt_generator` now exposes real `_corpus_size`, `_tokenized_corpus`, and `tokenizer.decode`, and the user_config carries a real `inter_turn_delay_cap_seconds` so the partial-tail synth + delay-cap paths can run.
- test_weka_trace_byte_exact_drift.py: `real_qwen_tokenizer` cleanly `pytest.skip`s when the Qwen3-0.6B HF cache is missing; the module-scoped `real_prompt_generator` self-seeds RNG (function-scoped reset_random_generator runs after module fixtures).
- test_cli_help.py::test_profile_help_does_not_show_parameters: new `Groups.SCENARIO` adopts `--scenario` / `--unsafe-override` so they leave the default Parameters group.
- test_cache_bust*.py: pass `target=` as a keyword now that `build_cache_bust_marker` is keyword-only past trace_id.
- tokenizer._is_hf_cached requires at least one tokenizer file (`tokenizer.json` / `tokenizer.model` / `vocab.json` / `tokenizer_config.json`) in the snapshot so a partial cache from an interrupted download stops triggering offline-only loads.
- inference_client._send_request_to_transport validates pre-serialised `payload_bytes` via `orjson.loads` so invalid JSON is converted to an error RequestRecord before the transport call.
- ConversationSource.start_branch_child and cache_bust.build_cache_bust_marker: added `*,` separator to satisfy the keyword-only-args ergonomics rule.
- tests/integration/conftest.py: when pre-cache fails, also purge the partial `models--<name>/` directory so subsequent subprocess calls don't short-circuit into offline mode against a broken snapshot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…ealtime stats overhaul

Builds the second half of the InferenceX AgentX MVP work on top of the
initial commit. Three large additive themes plus a series of correctness
fixes, all of which preserve the public output contract that
``profile_export_aiperf*`` consumers care about.

Replaces the multi-stage ``MetricResultsProcessor`` +
``TimesliceMetricResultsProcessor`` pipeline with a single
``MetricsAccumulator`` that owns a NaN-sparse columnar store keyed by
``session_num``:

- ``aiperf.metrics.column_store.ColumnStore`` — numpy-backed numeric +
  string + categorical columns, ragged list-valued columns delegated to
  a configurable backend (``RaggedSeries`` for exact replay,
  ``TDigestListMetricAggregator`` for bounded-memory sketch). Selected
  via ``Environment.METRICS.LIST_BACKEND`` (default ``ragged``;
  ``tdigest`` is the opt-in escape hatch for >1M-request runs).
- ``aiperf.metrics.ragged_series.RaggedSeries`` — flat values + offsets +
  record_indices arrays; supports per-record replay, which the ICL-aware
  sweepline algorithms consume.
- ``aiperf.metrics.accumulator.MetricsAccumulator`` — ingests every
  RECORD / AGGREGATE / DERIVED metric, computes time-windowed results
  for both the overall run and per-timeslice, and emits a typed
  ``AccumulatorMetricsSummary``.
- ``aiperf.common.accumulator_protocols`` — ``AccumulatorProtocol``,
  ``StreamExporterProtocol``, ``AnalyzerProtocol``, ``SummaryContext``,
  ``ExportContext``. Replaces the old ``RESULTS_PROCESSOR`` plugin
  category split by ``accumulator`` / ``stream_exporter`` / ``analyzer``
  in ``plugins.yaml``.
- ``records_manager`` rewired: per-record fan-out to
  ``_metric_record_accumulators`` / ``_metric_record_stream_exporters``,
  static plugin lookup once at init, summarize/finalize/analyzer pipeline
  driving ``ProcessRecordsResultMessage`` + ``ProcessAllResultsMessage``.

``aiperf.analysis.sweepline*`` adds vectorized step-function algorithms
producing two parallel families of time-weighted aggregate metrics:

- ``effective_*`` — time-weighted over the entire run window:
  ``effective_concurrency``, ``effective_decode_concurrency``,
  ``effective_prefill_concurrency``, ``effective_decode_throughput``,
  ``effective_prefill_throughput``, ``effective_total_throughput``,
  ``effective_decode_throughput_per_user``,
  ``effective_prefill_throughput_per_user``, ``tokens_in_flight``.
- ``active_*`` — time-weighted only over segments where the relevant
  phase has at least one in-flight request, surfacing intensity while
  the phase is happening:
  ``active_decode_throughput``, ``active_prefill_throughput``,
  ``active_total_throughput``, ``active_decode_throughput_per_user``,
  ``active_prefill_throughput_per_user``.

ICL-aware variants of throughput and tokens-in-flight use per-chunk
inter-chunk-latency timing when the configured backend exposes per-record
replay (i.e. ``RaggedSeries``); otherwise they fall back to request-level
(start_ns, generation_start_ns, end_ns) timing.

- ``credit_to_start_latency`` (``request_start_ns − credit_issued_ns``)
  — controller-queue wait, surfaces saturation.
- ``effective_latency`` (``end_ns − credit_issued_ns``) —
  coordinated-omission-aware latency a saturating user actually
  perceives.
- ``completed_request_count`` — successful + failed completions, distinct
  from ``request_count`` which only counts successes used for latency
  distributions.
- ``request_error_rate`` — % of completed requests that errored. Pairs
  with the new ``adj_*`` percentile band on flagged latency metrics.
- ``context_overflow_count`` — runtime context-overflow detections, used
  by the InferenceX AgentX scenario to flip ``submission_valid=false``
  when the rate exceeds 1%.
- ``adj_<tag>`` family — error-adjusted percentile band for any latency
  metric flagged with
  ``MetricFlags.PERCENTILE_INCLUDES_FAILED_REQUESTS``. Errored requests
  contribute ``+inf`` to the success-only distribution so the band flips
  to inf at the failure rate.

Multiple correctness fixes against the DAG benchmark mode added in the
initial commit:

- ``branch_orchestrator``: restored informative validator error messages,
  enforced duplicate-prereq rejection at validator load time, fixed
  pre-session over-count (truncation, not error), fixed gather-result
  triage to distinguish exception/saturated/none, restored
  ``on_child_stopped`` so the request-count cap unwinds the parent join.
- Agentic replay: warmup target uses ``trajectory_count`` not
  concurrency; concurrency > trajectory count is rejected up front;
  cross-phase trajectory state continuity is exercised end-to-end.
- ``AIPERF_DAG_FAIL_FAST`` moved to ``Environment.DAG.FAIL_FAST``.
- Weka delta-encoded conversations with per-turn ``reset_context``
  flagging non-monotonic LCP cuts; per-trace cache + RNG reset prevent
  cross-trace hash-id aliasing inflating KV-cache hit rates.

- ``aiperf-mock-server`` returns orjson responses; sentencepiece SWIG
  warning silenced.
- Smoke harness uses ``profile_export_aiperf.json`` filename.

- Always-on realtime metrics with a reusable
  ``ConsoleMetricsExporter``-backed table renderer; replaces the prior
  Rich realtime table with a compact 4-line log block under non-dashboard
  UIs.
- ``--stats-interval`` CLI flag overrides
  ``REALTIME_METRICS_INTERVAL``; the env var auto-defaults by UI type.
- Realtime-metrics filter via ``NO_CONSOLE`` flag (was a hardcoded
  allowlist).
- Suppressed under ``--ui dashboard`` to avoid double-rendering.

- Auto-disable on incompatible endpoint; probe ``/prometheus/metrics``
  fallback path.
- Catch concurrent-scrape race in the data collector; translate probe
  failures so an unreachable Prometheus endpoint no longer aborts the
  run.

- ``audio_duration``: read from ``RecordContext``, not stripped turns.
- ``DerivedSumMetric`` simplified to read scalar sum from results dict.
- Timeslice results consolidated from ``dict[int, ...]`` to
  ``list[TimesliceResult]`` with window bounds + completion flag bundled.
- t-digest aggregator port for ``InterChunkLatencyMetric`` (ai-dynamo#865 merged
  in).
- Pytest: ``slow`` marker is opt-in, global ``--timeout=300``.
- Plugin schema/yaml updates + regenerated CLI/env docs.

Unit tests: 10705 passed, 3 skipped (``-n auto``, 28s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Expands AIPerf's API-usage-field metrics to cover the full surface area
exposed by every major LLM provider — OpenAI, Anthropic, Google Gemini /
Vertex AI, AWS Bedrock, DeepSeek, Mistral, Cohere v1 + v2, IBM watsonx —
plus shape-compatible passthroughs (vLLM, Together, Fireworks, Cerebras,
AI21, SambaNova, Groq, Bailian/DashScope, Replicate, Azure OpenAI, xAI
Grok REST). Adds a MetricConsoleGroup enum for rendering related metrics
in dedicated console sections, and updates the mock server to emit the
full OpenAI-shape usage so the new metrics can be exercised end-to-end.

OpenAI extras:
- usage_prompt_audio_tokens, usage_completion_audio_tokens
- usage_accepted_prediction_tokens, usage_rejected_prediction_tokens
- TotalUsageReasoningTokensMetric (per-record existed; total was missing)

Cache split (replaces unshipped `usage_prompt_cached_tokens`):
- usage_prompt_cache_read_tokens — unifies OpenAI nested
  prompt_tokens_details.cached_tokens, Anthropic cache_read_input_tokens,
  DeepSeek prompt_cache_hit_tokens, Gemini cachedContentTokenCount,
  Bedrock cacheReadInputTokens, Cohere v1+v2 cached_tokens.
- usage_prompt_cache_write_tokens — Anthropic cache_creation_input_tokens
  + Bedrock cacheWriteInputTokens. LARGER_IS_BETTER intentionally omitted:
  writes cost more than ordinary input tokens but enable cheap reads later.
- usage_prompt_cache_miss_tokens — DeepSeek-specific.

Vendor-specific outliers:
- usage_tool_use_prompt_tokens — Gemini's toolUsePromptTokenCount.
- usage_prompt_audio_seconds — Mistral's prompt_audio_seconds (duration
  in seconds, MetricTimeUnit.SECONDS, distinct from usage_prompt_audio_tokens).

Usage.__init__ unwraps recognized vendor wrappers on construction:
- Gemini / Vertex AI: usageMetadata envelope (camelCase wire).
- Cohere v1: meta envelope (response root) — lifts meta.tokens.* and
  meta.cached_tokens.
- Cohere v2: top-level tokens sub-dict (no meta wrapper).

Synonym precedence centralized in 9 ClassVar *_KEYS lists ordered by
descending vendor count (most common first).

billed_units intentionally NOT modelled — Cohere-specific accounting
filter, not what the model processed; preserved on the dict for billing
reconciliation.

- ParsedResponseRecord.final_usage (cached_property) — single walkback
  over the responses, returns the last non-empty Usage chunk. Cached so
  every metric reads it once; one walk per record total.
- find_last_non_empty_usage helper shared between the record-level
  cached_property and InferenceResultParser._compute_server_token_counts
  (which now walks responses ONCE instead of three times, with
  cross-field consistency).
- BaseUsageRecordMetric[T] — declarative base class. Subclasses set
  usage_field and missing_message; no per-metric _parse_record loop.
  Net -175 lines of repetition removed.

Streaming contract: "last non-empty chunk wins" instead of per-key
merge. Real vendors don't change shape mid-stream and don't explicitly
null fields they previously set, so per-field walkback is solving a
non-problem.

Adds MetricConsoleGroup enum (USAGE, CACHE, PREDICTION, AUDIO, REASONING,
DEFAULT, NONE) replacing the legacy MetricFlags.NO_CONSOLE binary
visibility flag with a richer grouping mechanism that lets the console
exporter render related metrics in dedicated sections. Every metric now
sets a console_group class attribute and a display_order for stable
ordering. Usage metrics graduate from hidden to MetricConsoleGroup.USAGE
(visible in their own console section).

Also introduces MetricFlags.TOKENIZES_INPUT_ONLY for prompt-side metrics
(parallel to PRODUCES_TOKENS_ONLY for completion-side metrics).

TokenizedText.create_usage now populates the OpenAI-compatible usage
shape so the new metrics can be exercised end-to-end:

  prompt_tokens_details:
    cached_tokens         (30-60% of prompt, deterministic from prompt hash)
  completion_tokens_details:
    reasoning_tokens      (the actual budget allocated by
                           _generate_reasoning_tokens — non-zero only for
                           reasoning models like gpt-oss / qwen)
    accepted_prediction_tokens  (5-20% of completion when present)
    rejected_prediction_tokens  (2-10% of completion when present)

audio_tokens intentionally omitted from both details objects — the mock
has no audio generation pipeline. All values derived deterministically
from the prompt text hash so a given input yields the same usage on
every run.

End-to-end smoke run (30 requests, mock-model): Usage Prompt Cache Read
~250 avg (range 164-333), Usage Accepted Prediction Tokens ~66 avg
(22-138), Usage Rejected Prediction Tokens ~36 avg (9-64), plus their
Total* aggregates. Reasoning model (gpt-oss with reasoning_effort=high,
max_tokens=400) produces reasoning_tokens=400 (capped budget).

Cross-checked every modelled synonym against the actual Python SDK
source for: openai-python, anthropic-sdk-python, google-genai,
groq-python, together-python, cohere-python (v1+v2), client-python
(Mistral), vllm, ai21-python, cerebras-cloud-sdk-python, sambanova-
python, dashscope-sdk-python (Bailian), python-aiplatform (Vertex),
fw-ai-external/python-sdk (Fireworks), xai-sdk-python. AWS Bedrock and
DeepSeek verified via official API docs.

Three real bugs found and fixed during verification: Cohere v1
meta.cached_tokens not lifted, Cohere v2 envelope not handled, Mistral
prompt_audio_seconds: {} sentinel could crash float().

| Module | Responsibility |
|---|---|
| usage_metrics.py | Basic + reasoning + audio + prediction (per-record) |
| usage_cache_metrics.py | Cache read / write / miss |
| usage_extras_metrics.py | Vendor-specific outliers (tool-use, audio seconds) |
| usage_total_metrics.py | All DerivedSumMetric totals |
| base_usage_record_metric.py | Declarative metric base |

Auto-registration via metric_registry types/*.py glob — no plugin-
registry edits.

- 9120 unit tests passing (267 net new tests over baseline 8861).
- Three layers of usage coverage:
  - Happy-path metric tests in tests/unit/metrics/test_usage_metrics.py.
  - Usage model tests in tests/unit/endpoints/test_usage_parsing.py.
  - Adversarial / property-based / direct-access tests in a new
    tests/unit/common/models/test_usage_models_adversarial.py:
    14 verbatim vendor fixtures, envelope edge cases, type pollution,
    synonym precedence, mutability/copy/pickle round-trips, hypothesis
    property tests, streaming edge cases, cross-record isolation,
    end-to-end metric dispatch.
- Mock-server tests updated for new usage shape (3 new tests covering
  determinism, cache-hit proportionality, zero-completion edge case).

Microbenchmarked the metric-extraction hot path: ~4.7 μs per record
(cold) / ~160 ns per metric (warm); ~470 ms total over a 100K-record
benchmark — invisible against per-request HTTP cost.

- make check-ergonomics: 0 new violations
- make check-ruff-baselined: 0 new violations
- pre-commit: clean
- LLM-ergonomics review findings addressed
- All build-matrix CI green (Ubuntu/macOS × Python 3.10-3.13)
- Integration tests passing on Python 3.10-3.13

- docs/metrics-reference.md updated with every new metric and its
  vendor-source notes.
- New docs/reference/vendor-usage-fields.md — comprehensive vendor
  reference catalogue with direct GitHub links to every cited SDK file,
  per-vendor verification details, exhaustive unmodelled-extras catalog,
  an 8-step "adding a new vendor" checklist, and dated change history.
- Four-file agent-instructions sync (AGENTS.md, CLAUDE.md, copilot,
  cursor) for the new console_group convention.

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
(cherry picked from commit 87c8e00)
Introduces two new MetricConsoleGroup values to give the sweep-line
analyzer and derived-latency outputs their own console sections:
- EFFECTIVE: full-window time-weighted (effective_*, tokens_in_flight,
  effective_latency)
- ACTIVE: phase-active-only weighted (active_*_throughput variants)

Analyzer-injected MetricResults aren't in MetricRegistry, so the
cherry-pick's console-grouping logic was hiding them entirely (regression
vs. pre-cherry HEAD). Adds an optional `console_group` field on
MetricResult (excluded from JSON dumps) so analyzer outputs can carry
their group inline; the console exporter falls back to it when the tag
isn't registered, defaulting to DEFAULT for unknown plugin tags.

Also flips the credit-related metrics to console_group=NONE
(credit_to_start_latency analyzer output, CreditDropLatencyMetric) —
they're internal diagnostics, not part of the user-facing console table.

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…lic dumps

Two console-grouping followups merged together:

- Render the DEFAULT group last in the end-of-run table sequence so the
  most-important metrics anchor the bottom of the screen (after the
  EFFECTIVE / ACTIVE / USAGE / CACHE / PREDICTION / AUDIO / REASONING
  sections).

- Make MetricResult.console_group survive the records-manager -> exporter
  IPC while still excluding it from user-facing JSON / CSV / REST API
  dumps. The field originally had exclude=True so it stayed out of public
  dumps, but model_dump() is also what IPC uses to serialize, so the
  field was silently dropped before reaching the console exporter and
  analyzer-injected results all collapsed into the DEFAULT table instead
  of EFFECTIVE/ACTIVE. Switched to a context-aware @model_serializer
  that drops the field unless context={"include_internal": True} is
  passed. BaseMessage.to_json_bytes opts in for ZMQ message passing;
  metrics_json_exporter (via to_json_result), accumulator CSV (via
  model_dump), and the REST /metrics route (via model_dump) all drop
  the field.

Verified with `aiperf profile` against the in-repo mock server:
profile_export_aiperf.json and profile_export.jsonl contain zero
console_group occurrences while the EFFECTIVE/ACTIVE console tables
still render correctly with DEFAULT last.

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…-mvp

Brings in 4 commits from main:
- t-digest aggregator for InterChunkLatencyMetric (ai-dynamo#865)
- cyclopts ForwardRef on 3.14, cap requires-python <3.14 (ai-dynamo#878/ai-dynamo#879)
- expose count and sum in profile_export_aiperf.json (schema 1.1) (ai-dynamo#881)
- server-metrics auto-disable on incompatible endpoint + /prometheus/metrics fallback (ai-dynamo#877)

Conflict resolutions:
- Took main's `_CLI_GROUP` removal (cyclopts 3.14 fix), preserving HEAD's
  scenario-validator helpers (`_use_think_time_only_explicitly_set`,
  `_inter_turn_delay_cap_explicitly_set`) on InputConfig / LoadGeneratorConfig.
  Auto-merge missed three stale `_CLI_GROUP` refs in
  prompt_config.py / service_config.py / tokenizer_config.py — inlined.
- Kept HEAD's `MetricAggregator = MetricSeriesProtocol` alias in metric_dicts.py
  (API-compatible with main's new Protocol class definition; avoids redefining).
- Kept HEAD's full TDigestListMetricAggregator (extra `add_for_record`,
  `__len__`, `SUPPORTS_PER_RECORD_REPLAY` are required by our metrics-accumulator
  pipeline and the ColumnStore list-backend contract).
- Kept HEAD's classmethod `DerivedSumMetric._derive_value` returning the scalar
  directly. Our pipeline stores the running sum scalar in `scalar_dict[tag]`
  (MetricsAccumulator._collect_scalars_and_arrays at accumulator.py:264/304/315),
  so main's MetricAggregator-based version always raised
  "X is not a MetricAggregator" inside `_resolve_derived_metrics` — silently
  swallowed by the broad `except Exception`, so every Total* metric dropped out
  of the export.
- Confirmed deletion of post_processors/metric_results_processor.py and its test
  (replaced by the metrics-accumulator pipeline in 9616b77).
- Kept HEAD's two new fields under _MetricsSettings (LIST_BACKEND,
  EXPORT_FLUSH_INTERVAL) and both new doc index entries (Vendor Usage Field +
  JSON Export Schema).

Verified end-to-end against the in-repo mock server: total_isl, total_osl,
total_output_tokens, total_token_throughput populate correctly in
profile_export_aiperf.json. 11,023 unit tests pass.

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Design doc for ajc/dag5 — a DAG-only branch from origin/main that combines
the advanced DAG framework (fan-in, multi-gate, K-delayed-join, pre-session
SPAWN, prereqs) from ajc/inferencex-agentx-mvp with the targeted endpoint
and behavior refinements from ajc/dag4, deliberately omitting AgentX, Weka,
cache-bust, and agentic-replay.

Captures branch baseline, in/out-of-scope boundary, five-layer architecture,
FORK/SPAWN data flow, --request-count semantics decision (dag4 wins), error
handling, tier-subfolder testing strategy, and plugin/docs touchpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Markdown-accuracy audit caught two attribution inversions and several path
errors:

- TimingManager._on_dataset_configuration_failed + _wait_for_dataset_or_failure
  is on CURRENT (src/aiperf/timing/manager.py:95/:140), not dag4.
- DagJsonlLoader.can_load non-dict guard is on CURRENT
  (dag_jsonl.py:122-127), not dag4.
- BranchStats lives in common/models/branch_stats.py, not branch.py.
- ConversationBranchMode lives in common/enums/enums.py, not branch.py.
- DagSpawn lives in dataset/loader/dag_jsonl_models.py.
- prerequisite.py → prerequisites.py (plural).
- request_rate at src/aiperf/timing/strategies/request_rate.py.
- TimingManager file is timing/manager.py, not timing_manager.py.
- trajectory_source.py at src/aiperf/timing/, not dataset/loader/.
- agentic_replay.py at src/aiperf/timing/strategies/.
- worker.py in workers/ (plural).
- test_branch_orchestrator_* are unit tests on current (tests/unit/timing/),
  not component_integration; count is 9 (8 named + base).
- test_dag_hard_cap.py and test_dag_multi_root_payload_bytes.py live at
  tests/component_integration/timing/.
- Three-file sync per project CLAUDE.md, not four (AGENTS.md noted as
  optional).
- Clarify Conversation.metadata() projection is the differing call (not
  Turn.metadata, which projects on both branches).
- Clarify dispatch_timing default is "post"; "pre" reserved for SPAWN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Plan 1 of 2 for ajc/dag5. 19 TDD-style tasks porting data models, endpoint
refactor, and sister loaders (inputs_json, raw_payload, mooncake_trace
payload mode) onto a fresh branch from origin/main. Plan 2 will cover the
DAG runtime (BranchOrchestrator, ConversationSource child builders, FORK
pin refcount, request-count semantics, dag_jsonl loader).

Includes spec-coverage table mapping every in-scope item to a task, plus a
"Deferred to Plan 2" section enumerating runtime items intentionally
excluded.

Notable adaptations from the spec (called out in their tasks):
- Task 6 ConversationBranchInfo replaces dag4's is_background+subagent_type
  shape with the spec-mandated dispatch_timing Literal; SubagentType is the
  AgentX surface, out of scope.
- Task 7 BranchStats uses spec-mandated joins_suppressed counter (dag4
  used children_truncated).
- Task 9 defers the RecordContext/RequestInfo split to Plan 2; Plan 1 only
  adds Turn.raw_payload since the loaders need it.
- Task 11 bundles build_assistant_turn + build_messages skeleton +
  extract_payload_inputs; they share imports and the chat/responses
  refactors depend on all three.
- Task 14 adds EndpointType.RAW enum value (required for plugin discovery).
- Task 15 adds BaseRawPayloadLoader and get_preferred_sampling_strategy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…ages

Sharpen wording around cache-bust marker placement (system message vs first
user turn), weka_trace loader invocation (`--custom-dataset-type` is required),
warmup turn sampling, and `--unsafe-override` semantics. Update troubleshooting
error strings to match current exception messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace user-facing `--timing-mode agentic_replay` references with the
scenario-driven phrasing ("agentic_replay timing mode, set today by
--scenario inferencex-agentx-mvp") across docs, the cache-bust validator
error, and the scenario-lock violation message. Also tighten several
adjacent docs accuracy fixes: --num-conversations rename + aliases,
warmup heads-up callout for trajectory-based warmup, concurrency_burst
arrival-pattern note, weka-trace SPAWN/SPAWN_JOIN nuances, marker probe
4-tuple, multimodal cache-bust injection details, and refreshed sample
output blocks reflecting current NOTICE phase logging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Pre-canned mode preserves recorded hash_ids byte-for-byte but emits the
trace's synthesized assistant text on every turn, which invalidates the
server's just-generated KV blocks at each turn boundary and artificially
deflates measured cache-hit rate.

Set AIPERF_DATASET_WEKA_LIVE_ASSISTANT_RESPONSES=1 to flip the trade-off:
the loader emits user-only deltas and the worker threads the server's
live assistant response back into the session via DELTAS_WITHOUT_RESPONSES.
The synth_buf still tracks assistant segments internally for LCP and
truncation correctness; only the wire emission filters them out.

Trade-off: hash-id fidelity past turn 0 is no longer preserved (server
output length drifts from the trace's recorded output_length, shifting
subsequent user-turn block alignment), but cache reuse pattern matches
what a real agentic user would experience.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@a61553fdc3586f7f0ea9f934ef90f77d8e5e14ae

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@a61553fdc3586f7f0ea9f934ef90f77d8e5e14ae

Last updated for commit: a61553fBrowse code

cquil11 and others added 15 commits May 5, 2026 15:34
When a non-final turn returns with a context-length error from the server,
recycle the trajectory immediately instead of dispatching subsequent turns.
The cumulative prompt only grows as the conversation continues — once
turn N has overflowed, every later turn's prompt will too. Continuing to
dispatch them just wastes compute and inflates the run's overflow rate.

Mirrors kv-cache-tester's "user truncated" semantics: the first
context-length error removes the user from the active pool.

Implementation:
- TimingStrategyProtocol.handle_credit_return now accepts an `error`
  keyword (default None). Other strategies (request_rate, fixed_schedule,
  user_centric_rate) accept and ignore it.
- AgenticReplayStrategy uses the existing AgentX context-overflow
  classifier (is_context_overflow_response) to detect the terminal case
  and routes to the same _spawn_from_recycle_or_id path the final-turn
  branch uses.
- CreditCallbackHandler threads credit_return.error through both
  strategy dispatch sites.

Final-turn overflows already recycled; the new short-circuit only fires
on non-final turns. Generic 500-class errors don't trigger it. WARMUP
returns remain no-ops at the strategy level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
The realtime-stats reporter background task crashed every 30s with
AttributeError because _render_realtime_block was annotated as taking
CreditPhaseStats but the caller (_report_realtime_metrics) actually
passes PhaseRecordsStats from create_stats_for_phase.

Use the records-side fields that exist on PhaseRecordsStats:
- requests_completed -> total_records
- requests_elapsed_time -> records_elapsed_time
- request_errors -> error_records

Drop in_flight_requests from the rendered line since it's a credit-side
concept the records side doesn't have access to. Replaced with an "ok="
counter (success_records) so the line still shows where errors came from.

Tests updated to construct PhaseRecordsStats instead of CreditPhaseStats.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
…SL/OSL avg

The kv-cache-tester assessment block tracked input AND output throughput
plus ISL/OSL means at every poll. aiperf's realtime line was missing
both — only output_token_throughput and request rates were shown.

Adds:
- tput_in=N/s on the headline (alongside tput_out)
- new "seq isl_avg=N osl_avg=N" row when MetricResults publish those
  per-record metrics

Both pull from MetricResults the records_manager already aggregates;
no new plumbing or inter-process state required. Unchanged behavior
when the underlying metric isn't published (renders "-").

Server-side cache-hit and KV-usage live tracking would require the
records_manager to subscribe to ServerMetricsRecordMessage events from
the server-metrics manager — separate change, not in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
The realtime block was pointing at the per-record list metric
``inter_chunk_latency`` (gaps between every streamed chunk), and
list-valued metrics don't aggregate into displayable percentiles in
the realtime path. Result: the ``itl`` row showed ``-`` for p50/p95/p99
mid-run even when the per-record JSONL had real values.

Switch to the scalar ``inter_token_latency`` metric (avg gap across the
response, computed per record). Same number that lands in
profile_export_aiperf.json's aggregate; just the realtime fast path
needs the scalar version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
aiperf doesn't aggregate input throughput as its own metric, so
``tput_in=-`` rendered dashes mid-run. Switch to ``total_token_throughput``
(input + output tokens / wall-clock) and rename the field to
``tput_total``. Matches the "Total throughput" assessment field
kv-cache-tester showed; output throughput stays alongside it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
aiperf had system-level total_token_throughput and output_token_throughput
but no input-side analog, so prefill TPS could only be inferred via
subtraction. For long-context agentic workloads (this corpus has
80k-token avg ISL) input throughput dominates the bandwidth and tracking
it as its own metric matches what most production benchmarks expect.

New ``input_token_throughput`` metric mirrors ``output_token_throughput``
exactly:
  input_token_throughput = TotalInputSequenceLength / BenchmarkDuration

Auto-discovered via the standard types/ scan. Realtime renderer flips
the dashed ``tput_in=-`` field back to populated (``tput_in=N/s``) since
the metric is now collected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Adds a "srv" row to the realtime stats block showing live server-side
metrics scraped from the inference server's /metrics endpoint:
- prefix_cache_hit_rate (cumulative across the run)
- external_prefix_cache_hit_rate (when CPU offload is active)
- kv_cache_usage_pct (latest gauge, max across endpoints)
- num_preemptions (cumulative count)

Implementation:
- New ServerMetricsAccumulator.realtime_snapshot() walks the live
  hierarchy and returns a flat dict of these metrics.
- _render_realtime_block accepts an optional server_snapshot kwarg and
  renders a "srv" row when keys are present.
- _report_realtime_metrics queries the accumulator best-effort
  (silent fallback when --no-server-metrics or no records yet).

Matches the cumulative server-metrics display kv-cache-tester showed in
its assessment block. The rest of the metrics needed for full parity
(per-interval rates, GPU/CPU offload bandwidth) are still available in
server_metrics_export.json for offline analysis; the realtime row
focuses on the most actionable signals at glance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
…group)

The metric was being computed but ``filter_display_metrics`` dropped it
because I set ``console_group = MetricConsoleGroup.NONE`` while copying
the ``total_token_throughput`` template. Realtime rendered ``tput_in=-``
even though the value was available.

Removed the override so the metric uses the default console group
(DEFAULT) and flows through the realtime display path. Mirrors
``OutputTokenThroughputMetric``, which also relies on the default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
…e tail

`CreditCallbackHandler._maybe_signal_dag_completion` (deferred guard at
end of `on_credit_return`) and `_credit_will_dispatch_children` (inline
defer) only run inside credit-return callbacks. Under `--concurrency
>= 2`, the orchestrator's last drain step (`_handle_child_done`
decrement, `dispatch_join_turn` returning False under cap, all-children-
rolled-back path in `_spawn_children_and_register_gates`) can land
between credit returns, leaving `all_credits_returned_event` unset and
forcing the phase to wait for either the pre-wait short-circuit on the
next entry into `_wait_for_returning_complete` or the drain-timeout
backstop.

The pre-wait short-circuit catches it most of the time, but when it
doesn't (e.g. runner enters `_wait_for_returning_complete` BEFORE all
credits return, then awaits the event), the drain-timeout backstop is
the only escape — costing a full grace period of wall-clock per
occurrence.

Fix mirrors the dag5 branch's commit 7cd4180b7: orchestrator publishes
`set_drain_observer(callback)`. The three drain points —
`_handle_child_done`, `_handle_child_errored_fail_fast`, and the
no-children-landed path inside `_spawn_children_and_register_gates`
(where every dispatch_first_turn refused at the cap) — call
`_notify_drain()` after their state mutations are complete.
`CreditCallbackHandler.set_branch_orchestrator` registers
`_on_orchestrator_drain`, which iterates active phase handlers and
re-runs the deferred check for each. Idempotent (no-ops on already-set
event or disagreeing predicate). Observer detached in `cleanup()`.

Net effect: the race is closed at the source. The pre-wait short-
circuit and drain-timeout backstop become true safety nets rather than
the primary completion path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Adds six unit tests for the drain-observer wiring on
CreditCallbackHandler.set_branch_orchestrator and the closure it
registers via BranchOrchestrator.set_drain_observer:

- registration: attach calls set_drain_observer with a callable; detach
  calls set_drain_observer(None)
- happy path: callback fires + counters say all returned + orchestrator
  predicate clean → event set
- defer when has_pending_branch_work=True
- defer when check_all_returned_or_cancelled=False
- skip phase handlers whose lifecycle.is_complete (event already
  finalized via normal end-of-phase path)
- idempotent: multiple callback invocations after event already set
  remain a no-op

These tests would have failed pre-fix (the observer wasn't registered)
and lock in the contract for the closure logic in
CreditCallbackHandler._on_orchestrator_drain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
The except block bound ``Exception`` without a name but the lazy log
lambda referenced ``exc``. In production (INFO level) the lambda is
never invoked so this would silently swallow the underlying error; in
debug builds it raises NameError when formatting, masking the actual
exception from realtime_snapshot.

Bind ``exc`` correctly via ``except Exception as exc`` and capture it
in the lambda default-arg so debug builds surface the real failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Registers ``faulthandler.SIGUSR1`` on every aiperf process (system
controller + every subprocess service via bootstrap_and_run_service).
``kill -USR1 <pid>`` from outside writes the full Python traceback for
every thread to stderr without disturbing the process otherwise.

Crucial for debugging hangs in production-like deployments where py-spy
isn't available on the runner host. Procedure for next hang:

  ssh <node>
  pgrep -af aiperf       # list every aiperf process
  for pid in $(pgrep -f 'aiperf|system_controller|worker_'); do
      kill -USR1 $pid
  done
  cat /workspace/results/server.log /workspace/results/trace_replay/logs/*.log

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
ajcasagrande and others added 27 commits May 11, 2026 19:39
The post-build 'concurrency > pool' guard is replaced by automatic
wrap-fill in TrajectorySource. The error class is no longer reachable;
remove it and convert each test that asserted it into a positive
assertion that wrap-fill produces the requested concurrency.

Empty-pool cases still raise EmptyTracePoolError (separate class).

Also fix test_recycle_pass_dict_grows_only_to_pool_size: it seeded the
double-recycle guard with a trace_id, but the guard now keys on
correlation_id (Task 5). Update the seed to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Component-integration test for pool < concurrency: 1-trace pool, 4-way
concurrency. Validates wrap-fill activates, lanes get decorrelated k_i,
per-lane cache-bust markers are distinct, and the recycle loop completes
without tripping the double-recycle guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…ation helpers

Task 5 re-keyed _in_flight_recycled from trace_id to correlation_id. Five
component-integration tests had _make_credit helpers that defaulted
x_correlation_id='xcorr' across all credits — that worked under trace-id
keying but now correctly trips the guard when the same correlation_id is
fed in twice.

Fix the helpers to mint a unique correlation_id per credit (uuid4 if no
explicit value is passed). Preserves the deliberate "same correlation_id
twice" assertions that test the guard itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
A ~3-page external-facing whitepaper for ML perf practitioners explaining
the EFFECTIVE and ACTIVE MetricConsoleGroup families and the CO-aware
effective_latency metric. Includes a real worked example from a mock-server
run showing the ~19x gap between effective_prefill_throughput and
active_prefill_throughput when prefill is sparse in the run window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Register a new --public-dataset tag pointing at the same
SemiAnalysisCCTracesWekaLoader class, sourced from
semianalysisai/cc-traces-weka-no-subagents-051226 (949 traces,
136.1k requests). Same WekaTrace schema as the 042026 corpus, just
with all WekaSubagentEntry blocks stripped, so reconstruction goes
through the existing WekaTraceLoader path unchanged.

Smoke test (5 rows) confirms all rows validate as WekaTrace, 519
requests with type='s' only, 0 subagent entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…26 variant

Cut the InferenceX AgentX-MVP scenario over from the 042026 corpus (739
traces with full subagent fan-out) to the 051226 no-subagents variant
(949 traces, main-agent linear streams only). The new variant is now
the AgentX MVP default everywhere:

- Scenario require_loader: ("semianalysis_cc_traces_weka_no_subagents",
  "weka_trace") instead of ("semianalysis_cc_traces_weka", "weka_trace").
- agentx-mvp.md tutorial: dataset name, trace count (739 -> 949),
  --num-dataset-entries default, Subagents section rewritten to note
  the variant strips them, troubleshooting sections updated.
- weka-trace.md: HF section now documents both variants with a
  selection table; no-subagents flagged as AgentX MVP default.
- README tutorial index: both HF variants mentioned, no-subagents
  flagged as AgentX MVP default.
- Loader docstring (module + class): generalized to describe both
  variants registered against the same class.
- Scenario validator tests, scenario registry test, public-composer
  fixture, and loader test fixture all updated to the new tag/HF name.
- Loader test gains a separate pin assertion so the 042026 plugin
  entry stays anchored to its HF name (and the no-subagents entry to
  its own) under future registry edits.

The legacy semianalysis_cc_traces_weka plugin entry remains registered
unchanged - it's just no longer the AgentX MVP default. Anyone wanting
to replay the with-subagents corpus can still pass --public-dataset
semianalysis_cc_traces_weka explicitly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Bump the SemiAnalysis Weka loader's HF dataset target from the 042026
full-subagent corpus (657 MB / 739 traces) to the no-subagents variant
(2.77 GB / 949 traces / 136k requests). Subagent entries are stripped:
only top-level main-agent turns remain, so each row produces exactly
one conversation downstream (no parent/child fan-out).

Plugins.yaml, loader docstrings, and two fixture/constant references in
the unit tests follow the rename. The registered tag and class are
unchanged, so the inferencex-agentx-mvp scenario binding continues to
resolve.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Two concurrent aiperf processes that miss the cache on the same key
both pay the expensive tokenize+reconstruct cost today. The atomic
os.replace at the end of populate() means only one set of bytes ends
up canonical, but both sides do the work. On a shared filesystem cache
(Lustre / NFS) with N concurrent agentic jobs, that's N redundant
tokenizations.

Add a cross-process populate lock that serializes the miss path:

  - lookup -> HIT -> use it (no lock, fast path)
  - lookup -> MISS -> acquire flock -> re-lookup -> ...
      -> HIT (someone populated while we waited) -> use it
      -> MISS -> do the tokenize + populate -> release

Implementation mirrors huggingface_hub's WeakFileLock pattern:
  - filelock.FileLock with mode=0o664 so multiple users sharing a cache
    directory contend correctly
  - SoftFileLock fallback on filesystems without flock support
  - INFO log every 10s while waiting so a waiter is visible, not silent
  - thread_local=False so release on a different asyncio worker thread
    (acquire vs release end up on distinct asyncio.to_thread workers)
    still actually drops the OS-level lock
  - asyncio.to_thread for the blocking acquire so the event loop is not
    blocked

Code lives in a new mmap_cache_lock module to keep mmap_cache.py under
the 500-line file-size budget; mmap_cache re-exports acquire_cache_lock
with cache_dir pre-bound so callers see one entry point.

DatasetManager._do_profile_configure now wraps the miss path in the
lock, with the double-checked re-lookup factored into a small helper
to keep the function under the 80-line size budget.

Three new unit tests in test_mmap_cache.py cover: concurrent acquires
on the same key serialize; acquires on different keys run in parallel;
holder beyond timeout causes the waiter to raise filelock.Timeout.

filelock is added as an explicit project dependency. It was already a
transitive via huggingface-hub; declared here so a future HF drop
doesn't silently break the cache lock.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
The single 'seq isl_avg=... osl_avg=...' line hid the distribution
shape that's the main reason to watch sequence lengths mid-run
(spotting long-tail agentic prompts or response truncation). Replace
it with two percentile rows matching the latency / per-user-throughput
row format already used above:

    isl  p50=123,952 p75=245,124 p90=391,085 p99=720,485 (tokens)
    osl  p50=261     p75=664     p90=1,614   p99=7,013   (tokens)

Reads p90 off the existing JsonMetricResult schema (already populated
by the accumulator); no extra plumbing. Rows are skipped entirely when
both ISL/OSL metrics are absent so the renderer stays compact on
non-tokenizing endpoints.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
The TQDMProgressUI + Textual dashboard already wire @on_warmup_progress
to a tqdm bar, but those UIs are disabled under non-TTY execution (the
CI/SLURM bench-driver case), so users have no visibility into warmup
progress until the phase either completes or hangs.

Add a per-return INFO log inside AgenticReplayStrategy.handle_credit_return
under CreditPhase.WARMUP that fires once per returning warmup credit
(success or error), of the shape:

  WARMUP 7/40 returned [ok] (lane=6, trace_id=abc123...)

This gives operators a line per completion in benchmark.log even when
no UI is active. The lambda log form keeps the per-completion work off
the hot path when log level filters out INFO.

Bump ergonomics file-size baseline: the addition pushes
agentic_replay.py from 500 -> 510 lines.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
The tin (prefill_throughput_per_user) and tout
(output_token_throughput_per_user) rows in the realtime block are
literally just 1/prefill_time and 1/ITL percentiled across requests,
which is the same information conveyed more clearly as the
"interactivity" metric (1/tpot) familiar from LLM serving literature.

Drop the two per-user-throughput rows and emit a single

  intvty p50=9      p75=10     p95=15     p99=20     (1/tpot tok/s)

row using the same backing metric (output_token_throughput_per_user
= 1 / inter_token_latency per request). Aggregate tput_in/tput_out on
line 1 are unchanged.

Same numbers, clearer label, less noise.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Surfaces the server's own view of input/output token throughput in
the realtime srv row, computed as a running average from the vLLM
prometheus counters (delta / elapsed):

  vllm:prompt_tokens_total      -> tput_in_srv=N/s
  vllm:generation_tokens_total  -> tput_out_srv=N/s

Useful contrast with the client-side aggregate tput_in/tput_out on
line 1: server-side counts work in flight that hasn't returned a
response yet, so it tracks the engine's actual ingest/emit rate
rather than the rate of completions aiperf has parsed.

Suppressed entirely when the counters aren't exposed (SGLang,
non-vLLM servers, or runs where /metrics isn't scraped) so the row
stays clean.

Bump ergonomics baseline: file +5 lines (526->531), realtime_snapshot
+10 lines (80->90); both fall under existing advisory caps that
already accept similar-sized siblings.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
…enarios

Context-overflow errors mid-trajectory are already handled by
agentic_replay.handle_credit_return via the separate CreditReturn
message path: the trajectory is terminated, the conversation is
recycled, and a fresh trajectory spawned in its place. Emitting an
error MetricRecordsMessage for the same event double-counts it -- the
record shows up in failure tallies, dragging down the per-run success
rate even though the overflow is an expected and intentionally-tolerated
end-of-trajectory signal in this scenario.

When the active scenario's timing mode is AGENTIC_REPLAY, drop
context-overflow records before they enter the metrics pipeline. The
filter is gated on scenario timing mode (cached at __init__) and the
RequestRecord.context_overflow flag the parser already sets per
InferenceX AgentX RFC §7.

The check is a no-op for:
  - non-scenario runs (user_config.scenario is None)
  - non-agentic scenarios (any future timing_mode != agentic_replay)
  - records that aren't context-overflow events (flag stays False)

The existing ContextOverflowCountMetric continues to work for
diagnostic purposes outside agentic scenarios.

Signed-off-by: Cam Quilici <cameron@semianalysis.com>
When the active PROFILING-phase failure rate (error_records /
total_records) exceeds the user-supplied threshold after a grace
floor of max(concurrency, 10) records, broadcast ProfileCancelCommand
on the message bus to terminate the run early. The existing cancel
handlers in records_manager, timing_manager, server_metrics manager,
and gpu_telemetry manager stop their work cleanly; the run finalizes
with cancelled=True and exits non-zero via the standard cancel flow.

The grace floor exists so a single early failure can't trip a tiny-N
threshold (e.g., 1/1 = 100% > 0.5 would abort instantly without this
guard).

Pairs naturally with the AGENTIC_REPLAY context-overflow drop in
record_processor_service: context overflows aren't counted as errors
in agentic scenarios, so the threshold measures real failures only
(server 5xx, parse errors, malformed responses) and won't trip on the
expected end-of-trajectory overflow signal.

Default is None (disabled, matching existing behavior). Valid range
[0.0, 1.0]. Idempotent: the abort fires at most once per run; if the
ProfileCancelCommand publish fails for any reason, the trigger is
reset so the next record re-evaluates and re-attempts.

Signed-off-by: Cam Quilici <cameron@semianalysis.com>
…warmup log

Two coupled changes for AGENTIC_REPLAY scenarios:

1. Configurable trajectory start range. Previously each trajectory's
   k_i (start turn) was sampled uniformly from [0, int(0.7 * n)] with
   the 0.7 hardcoded in TrajectorySource. Now exposed as two CLI flags
   on the LoadGenerator group:

     --trajectory-start-min-ratio  (default 0.0)
     --trajectory-start-max-ratio  (default 0.7, preserves prior behavior)

   Sampling becomes uniform on [int(min_ratio * n), int(max_ratio * n)],
   both clamped to n-2 so the trajectory always retains at least one
   profiling turn after warmup. Validated cross-field so min <= max.
   Plumbed: loadgen_config -> timing.config.TimingConfig -> phase_orchestrator
   -> TrajectorySource constructor. RNG seed still derived from
   --random-seed via SHA-256 salt with trace_id, so k_i remains
   deterministic per (seed, trace_id).

2. Per-trajectory warmup completion log. The previous WARMUP info line
   reported only lane + trace_id. It now also reports start_turn=k_i/N
   and the percent of the trace that was warmed.

   Per-request token count (ISL) is intentionally NOT included in this
   commit -- it would require plumbing prompt_tokens through CreditReturn
   (or subscribing AgenticReplayStrategy to MetricRecordsMessage for the
   WARMUP phase). Leaving that as a follow-up.

Signed-off-by: Cam Quilici <cameron@semianalysis.com>
Adds a one-block info log emitted by TrajectorySource at the end of
__init__, before any dispatch fires. Shows the configured start-range
plus the actual per-trajectory (lane, k_i, num_turns, pct) so the
operator can sanity-check that the configured
--trajectory-start-{min,max}-ratio produced a sensible distribution.

Example:

    TrajectorySource: built 14 trajectories from 949 traces
      range cfg=[0.25, 0.75]  observed pct: min=27% median=51% max=72%
        lane=00  start_turn=  6/24  (25%)  trace_id=abc...
        lane=01  start_turn= 15/22  (68%)  trace_id=def...
        ...

Complements the existing per-trajectory warmup-completion lines. Logged
once at build time so the full distribution is visible upfront without
correlating across credit-return events.

Signed-off-by: Cam Quilici <cameron@semianalysis.com>
The previous "[RecordProcessor] Drop context-overflow records for
AGENTIC_REPLAY scenarios" commit returned early from
record_processor_service._on_inference_results before pushing the
MetricRecordsMessage, which broke the records-side <-> credit-side
counter invariant: RecordsTracker.total_records is compared for
equality against the credit-side final_requests_completed at
end-of-phase, and the drop made the records-side lag the credit-side
by one for every overflow event. The completion barrier never
converged, hanging the run for the full benchmark_grace_period before
timing out + cancelling in-flight credits.

Symptom in the log:

  NOTICE All requests have completed, please wait for the results to be
         processed (currently 423 of 424 records processed)...
  ... (30s timeout) ...
  WARNING Phase profiling timed out, cancelling all credits.

Fix: don't drop the record. Add a context_overflow_skip flag to
MetricRecordMetadata. RecordProcessor sets it when the record is
context-overflow AND scenario is AGENTIC_REPLAY. RecordsManager
recognizes the flag and:

- Counts the record toward total_records (preserves the invariant)
- Classifies as success in RecordsTracker (so error counters stay at 0)
- Skips error_tracker.increment_error_count_for_phase
- Skips _send_record_to_accumulators (latency/throughput/etc. unaffected)
- Skips _maybe_trigger_failed_request_abort (overflow is not a real
  failure for threshold purposes)

Net behavior matches the original intent ("nothing about the context
overload is counted towards metrics whatsoever") without breaking the
end-of-phase completion barrier. End-of-phase completion now matches
credit-side: total_records = success + 0 errors, where success includes
the overflow-skip records.

Signed-off-by: Cam Quilici <cameron@semianalysis.com>
…o cjq/weka-live-assistant-responses

# Conflicts:
#	src/aiperf/dataset/loader/semianalysis_cc_traces_weka.py
Cancel path was incomplete: PhaseRunner.cancel() cancelled the
credit-issuance _execution_task but never set
all_credits_sent_event / all_credits_returned_event. The runner's
outer _wait_for_sending_complete / _wait_for_returning_complete
awaits keep blocking on the unset events until the phase timeout
elapses (= --benchmark-duration, default 1800s for profiling).

Empirical: a --failed-request-threshold-triggered
ProfileCancelCommand at T=170s into a 1800s profiling phase causes
the run to hang for the remaining ~1630s before _wait_for_sending_complete
finally returns via its own timeout. The graceful "if self._was_cancelled:
return" branch is reached, but only 27 min later.

Set both events at the top of cancel() so the runner's awaits wake
immediately. Mirrors the event-set order already present in the
except-Exception recovery path (runner.py:363-373) — same correctness
guarantee, just on the external-cancel path too.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Two log-ergonomics changes to make long agentic-replay runs in
captured-stdout contexts (CI, srt-slurm, scripted invocations)
readable:

1) EventLoopMonitor "Event loop ... taking too long to run. Overhead:
   XXms" downgraded from warning to debug. At sustained conc>=32 on
   agentic workloads this fires dozens of times per minute and is not
   actionable (inherent async-scheduler overhead under load).

2) callback_handler "Credit return after phase {phase} complete,
   credit_id=N, worker=W" downgraded from warning to debug. Every
   cancel-triggered phase shutdown (e.g. --failed-request-threshold
   trip) emits one such line per in-flight credit — up to
   concurrency-many = thousands — flooding the log without being
   actionable; the late return is expected under the cancel race.

3) service_config.validate_ui_type now picks UIType.TQDM (not NONE)
   when stdout isn't a tty. tqdm's progress bars still render
   usefully in tailed logs (carriage returns are preserved by
   typical pipe sinks), so users running aiperf from srt-slurm /
   gha runners get a visible progress indicator. Explicit
   --ui-type none still opts back into the previous silence.

All three changes are config/severity adjustments only; no behavioral
changes to phase orchestration, credit accounting, or UI rendering
beyond the visibility surface.

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
8aad400 introduced an AttributeError on every aiperf invocation in
non-TTY contexts (CI, srt-slurm benchmarks). UIType is an extensible
enum with members SIMPLE/DASHBOARD/NONE; "TQDM" was never registered.
SIMPLE is the tqdm-backed UI per the field docstring on `ui_type`.

Repro: any `aiperf profile ...` call where stdout is captured (e.g.
redirected to a file) crashes immediately in
ServiceConfig.validate_ui_type with
`AttributeError: 'UIType' has no attribute 'TQDM'`. Surfaced via
InferenceX R26 1p6d shards where the agentic benchmark wrapper saw
aiperf exit 1 within seconds of invocation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
TrajectorySource._log_trajectory_summary previously reported each lane's
start position only as turn index + percentage. With agentic-replay on
weka cc-traces, the useful question is "how many tokens of context does
warmup start at?" — and the answer was invisible until you correlated
trace_ids back to the raw dataset.

Threads input_length (proxy-tokenizer "in" field) end-to-end:
  WekaNormalRequest.input_length  (already in scope at construction)
    -> Turn.input_length            (new optional field)
      -> TurnMetadata.input_length  (new optional field; propagated by
                                     Turn.metadata() and
                                     Conversation.metadata())
        -> TrajectorySource summary log

New log shape:

  TrajectorySource: built 192 trajectories from 949 traces
    range cfg=[0.25, 0.75]
      observed pct:    min=  0% median= 46% max= 75%
      observed tokens: min=     0 median= 58,431 max=187,294
      lane=00  start_turn= 15/27  ( 56%)  start_tokens= 42,580  trace_id=...

Backward-compatible: input_length is Optional everywhere, defaults to
None. Loaders that don't populate it (synthetic, raw_payload, sharegpt,
etc.) keep working unchanged and the log shows "-" for those lanes.
Only weka_trace's normal-turn path sets it for now.

Split _log_trajectory_summary into _build_trajectory_rows +
_format_observed_stats + _median helper to stay under the
80-line function size cap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Swaps the HF dataset slug from cc-traces-weka-no-subagents-051226
(949 traces) to cc-traces-weka-no-subagents-051826 (98 traces).

051826 is a stricter filter of the same source:
  - v5-only (drops legacy trace_version=4 rows)
  - CC ≥ 2.1.139 (drops rows from older CLI versions whose tool-use
    semantics differ)
  - ≥20 main-agent turns per trace post-strip
  - subagent blocks stripped (same as 051226)

InferenceX R29 surfaced a delta-encoding edge case where two of the
949 traces (turns 555-918 and 4752-4753) produced empty delta_messages,
triggering 99.5% of the 366 HTTP-400 validation rejections. The new
051826 corpus should not contain those two traces.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The preemptions counter rarely tells you anything actionable in the live
log — it's either 0 (boring) or steadily climbing (which the kv_usage +
queue=r/w fields already telegraph more usefully). Removing it tightens
the per-tick server-side row to the metrics that actually inform
intervention decisions: cache hit rates, KV usage, queue depth, and
server-side token throughput.

The accumulator still scrapes vllm:num_preemptions and
sglang:num_retracted_reqs and exposes num_preemptions on the snapshot
dict, so downstream consumers (export jsonl, future analysis) keep
working unchanged. Just the log surface is trimmed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@cquil11 cquil11 closed this May 19, 2026
@cquil11 cquil11 deleted the cjq/weka-live-assistant-responses branch May 19, 2026 22:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants