Cjq/weka live assistant responses by cquil11 · Pull Request #886 · ai-dynamo/aiperf

cquil11 · 2026-05-05T17:10:13Z

No description provided.

…y, DAG benchmarks, and reproducible payload replay Adds four major, interlocking input/timing/scenario capabilities to AIPerf, plus the supporting plumbing (DAG conversation branching, scenario validator framework, prerequisite system, hash-ID-aware synthesis, ISL-budget compensation, raw-payload endpoint, and an extensive test corpus). This is the first AIPerf implementation of the SemiAnalysis InferenceX AgentX-MVP benchmark and lays the foundation for byte-exact replay of the public Weka KV-cache-tester corpus. Headline features ----------------- 1. InferenceX AgentX MVP scenario (`--scenario inferencex-agentx-mvp`) - Single-flag preset that bundles every AgentX-MVP submission rule: fixed-schedule replay, no early stop, cache warmup, inter-turn delay cap (60s), Weka trace input, and submission-validity stamping. - Locks conflicting flags at config time via a new scenario-validator framework (`src/aiperf/common/scenario/`, `src/aiperf/common/validators/`). - Stamps a `submission_valid` field onto both per-run `profile_export.json` and aggregate output (`--num-profile-runs >= 2`) so reviewers can verify compliance at a glance. - Documented end-to-end in `docs/tutorials/agentx-mvp.md`. 2. Weka agentic-coding trace loader - New `weka_trace` input type replays real Claude Code coding sessions captured by the public Weka kv-cache-tester project, preserving: - Per-request timing (`type: "n"` and `type: "s"` streaming requests). - Cache-block hash IDs for KV-cache-aware replay. - Subagent topology (`type: "subagent"`) as nested SPAWN children. - Per-trace model rewriting: traces map onto whatever `--model` values the user supplies, regardless of the trace's recorded model names. - Parallel JSON->DAG conversion path (`weka_parallel_convert.py`, `weka_synth_buf.py`) for fast load of large corpora. - Byte-exact corpus tests + adversarial coverage (`tests/unit/dataset/loader/test_weka_trace*.py`, `tests/component_integration/dataset/test_weka_trace*`). - Auxiliary tooling: `tools/weka_byte_exact_verify.py`, `tools/weka_trace_inspect.py`, `tools/weka_loader_drift_audit.py`. - Companion loader for SemiAnalysis CC traces (`semianalysis_cc_traces_weka.py`). - Full tutorial: `docs/tutorials/weka-trace.md`. 3. DAG benchmarks: branching conversations - New `dag_jsonl` input type expressing one conversation per JSONL line with `forks` (KV-cache-locality children that inherit history and pin to the parent worker) and `spawns` (independent subagent children with fresh history that may land on any worker). - Branch orchestrator + trajectory source (`src/aiperf/timing/branch_orchestrator.py`, `src/aiperf/timing/trajectory_source.py`) drive the conversation graph, including SPAWN_JOIN prerequisites that gate parent turns on completed subagent trees. - Prerequisites/branch metadata layer (`src/aiperf/common/models/prerequisites.py`, `branch.py`, `branch_stats.py`). - Tutorial + reference: `docs/benchmark-modes/dag.md`. - Comprehensive test pyramid: unit, adversarial, fan-in, multi-gate, pre-session, pathological-timing, full-topology integration, and spawn-mode integration suites. 4. Reproducible payload replay - `inputs_json` loader replays the `inputs.json` artifact AIPerf already produces, enabling byte-faithful re-runs and cross-server comparisons (`docs/tutorials/inputs-json-replay.md`). - `raw_payload` loader + `raw` endpoint replay arbitrary pre-built JSONL request bodies verbatim, including multi-turn directory mode for non-standard APIs (`docs/tutorials/raw-payload-replay.md`, `src/aiperf/endpoints/raw_endpoint.py`). Supporting changes ------------------ - New timing strategies: `agentic_replay` (drives the trajectory source for Weka/DAG/AgentX runs) and `cache_bust` (collision-free cache-busting prefix injection with adversarial coverage). - New `coding_content` synthetic generator + `hash_ids_synthesis` loader for KV-cache-aware synthetic workloads matching real coding-session shapes. - ISL budget compensation (`docs/reference/isl-budget-compensation.md`, `tests/unit/dataset/composer/test_isl_budget_compensation.py`) and ISL tokenization reference (`docs/reference/isl-tokenization.md`). - New `context_overflow_count` metric with runtime gating; ergonomics and ruff baselines refreshed. - Inter-turn delay cap utility (`_delay_cap.py`) shared across multi-turn loaders so AgentX-MVP and Weka traces honor the 60s cap consistently. - Endpoints gain a shared `response_mixin` and an explicit `raw` endpoint; parser/exporter touch-ups for the new `submission_valid` field and DAG metadata tagging. - Credit pipeline: per-session field plumbing, join-turn dispatch, sticky router updates, and structs depth tests for branched topologies. - Records manager / inference client / worker updated for cache-bust injection paths and DAG/Weka metadata tagging. - Three-file sync (CLAUDE.md, copilot-instructions, cursor rules) refreshed alongside the docs index, plugin registry, and CLI/env-var generated docs. Tests ----- - ~150 new test files spanning unit, component-integration, and full integration tiers — covering DAG topology pathology, cache-bust collision freedom, AgentX scenario lockdown, Weka byte-exact drift, prerequisite metadata adversarial cases, raw-payload adversarial inputs, trajectory source extended adversarial cases, and orchestrator-validator integration. - New fixture corpora under `tests/fixtures/dag/`, `tests/fixtures/weka_traces*/`. Docs ---- - Four new tutorials: AgentX-MVP, Weka trace replay, inputs-json replay, raw-payload replay. - New benchmark-mode page for DAG conversations. - New reference pages for ISL-budget compensation and ISL tokenization. - Architecture, plugin-system, timing-modes, trace-replay, conversation- context-mode, benchmark-datasets, genai-perf-feature-comparison, CLI options, and environment-variables docs all refreshed for the new scenarios, loaders, strategies, and metrics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Makefile: add `not slow` to unit/integration/component-integration/coverage/CI test selectors so slow-marked corpus tests don't OOM xdist workers. - DatasetManager: capture raw_payload state BEFORE `_preformat_payloads` runs via `_all_turns_source_loaded_payloads`; otherwise synthesized turns gain raw_payload and falsely trip the mooncake payload-mode inputs.json skip. - test_server_metrics: the mock's vllm/sglang endpoints don't expose labelled series, so assert label-column discovery is well-formed rather than non-empty. - DAG mock helpers (`_start_pre`, `_start`) accept `**_kw` so the new `cache_bust_marker` / `cache_bust_target` kwargs from `BranchOrchestrator` no longer break ~75 timing tests across four files. - test_weka_trace_v1_adversarial.py, test_weka_trace_integration.py: MagicMock `prompt_generator` now exposes real `_corpus_size`, `_tokenized_corpus`, and `tokenizer.decode`, and the user_config carries a real `inter_turn_delay_cap_seconds` so the partial-tail synth + delay-cap paths can run. - test_weka_trace_byte_exact_drift.py: `real_qwen_tokenizer` cleanly `pytest.skip`s when the Qwen3-0.6B HF cache is missing; the module-scoped `real_prompt_generator` self-seeds RNG (function-scoped reset_random_generator runs after module fixtures). - test_cli_help.py::test_profile_help_does_not_show_parameters: new `Groups.SCENARIO` adopts `--scenario` / `--unsafe-override` so they leave the default Parameters group. - test_cache_bust*.py: pass `target=` as a keyword now that `build_cache_bust_marker` is keyword-only past trace_id. - tokenizer._is_hf_cached requires at least one tokenizer file (`tokenizer.json` / `tokenizer.model` / `vocab.json` / `tokenizer_config.json`) in the snapshot so a partial cache from an interrupted download stops triggering offline-only loads. - inference_client._send_request_to_transport validates pre-serialised `payload_bytes` via `orjson.loads` so invalid JSON is converted to an error RequestRecord before the transport call. - ConversationSource.start_branch_child and cache_bust.build_cache_bust_marker: added `*,` separator to satisfy the keyword-only-args ergonomics rule. - tests/integration/conftest.py: when pre-cache fails, also purge the partial `models--<name>/` directory so subsequent subprocess calls don't short-circuit into offline mode against a broken snapshot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

…ealtime stats overhaul Builds the second half of the InferenceX AgentX MVP work on top of the initial commit. Three large additive themes plus a series of correctness fixes, all of which preserve the public output contract that ``profile_export_aiperf*`` consumers care about. Replaces the multi-stage ``MetricResultsProcessor`` + ``TimesliceMetricResultsProcessor`` pipeline with a single ``MetricsAccumulator`` that owns a NaN-sparse columnar store keyed by ``session_num``: - ``aiperf.metrics.column_store.ColumnStore`` — numpy-backed numeric + string + categorical columns, ragged list-valued columns delegated to a configurable backend (``RaggedSeries`` for exact replay, ``TDigestListMetricAggregator`` for bounded-memory sketch). Selected via ``Environment.METRICS.LIST_BACKEND`` (default ``ragged``; ``tdigest`` is the opt-in escape hatch for >1M-request runs). - ``aiperf.metrics.ragged_series.RaggedSeries`` — flat values + offsets + record_indices arrays; supports per-record replay, which the ICL-aware sweepline algorithms consume. - ``aiperf.metrics.accumulator.MetricsAccumulator`` — ingests every RECORD / AGGREGATE / DERIVED metric, computes time-windowed results for both the overall run and per-timeslice, and emits a typed ``AccumulatorMetricsSummary``. - ``aiperf.common.accumulator_protocols`` — ``AccumulatorProtocol``, ``StreamExporterProtocol``, ``AnalyzerProtocol``, ``SummaryContext``, ``ExportContext``. Replaces the old ``RESULTS_PROCESSOR`` plugin category split by ``accumulator`` / ``stream_exporter`` / ``analyzer`` in ``plugins.yaml``. - ``records_manager`` rewired: per-record fan-out to ``_metric_record_accumulators`` / ``_metric_record_stream_exporters``, static plugin lookup once at init, summarize/finalize/analyzer pipeline driving ``ProcessRecordsResultMessage`` + ``ProcessAllResultsMessage``. ``aiperf.analysis.sweepline*`` adds vectorized step-function algorithms producing two parallel families of time-weighted aggregate metrics: - ``effective_*`` — time-weighted over the entire run window: ``effective_concurrency``, ``effective_decode_concurrency``, ``effective_prefill_concurrency``, ``effective_decode_throughput``, ``effective_prefill_throughput``, ``effective_total_throughput``, ``effective_decode_throughput_per_user``, ``effective_prefill_throughput_per_user``, ``tokens_in_flight``. - ``active_*`` — time-weighted only over segments where the relevant phase has at least one in-flight request, surfacing intensity while the phase is happening: ``active_decode_throughput``, ``active_prefill_throughput``, ``active_total_throughput``, ``active_decode_throughput_per_user``, ``active_prefill_throughput_per_user``. ICL-aware variants of throughput and tokens-in-flight use per-chunk inter-chunk-latency timing when the configured backend exposes per-record replay (i.e. ``RaggedSeries``); otherwise they fall back to request-level (start_ns, generation_start_ns, end_ns) timing. - ``credit_to_start_latency`` (``request_start_ns − credit_issued_ns``) — controller-queue wait, surfaces saturation. - ``effective_latency`` (``end_ns − credit_issued_ns``) — coordinated-omission-aware latency a saturating user actually perceives. - ``completed_request_count`` — successful + failed completions, distinct from ``request_count`` which only counts successes used for latency distributions. - ``request_error_rate`` — % of completed requests that errored. Pairs with the new ``adj_*`` percentile band on flagged latency metrics. - ``context_overflow_count`` — runtime context-overflow detections, used by the InferenceX AgentX scenario to flip ``submission_valid=false`` when the rate exceeds 1%. - ``adj_<tag>`` family — error-adjusted percentile band for any latency metric flagged with ``MetricFlags.PERCENTILE_INCLUDES_FAILED_REQUESTS``. Errored requests contribute ``+inf`` to the success-only distribution so the band flips to inf at the failure rate. Multiple correctness fixes against the DAG benchmark mode added in the initial commit: - ``branch_orchestrator``: restored informative validator error messages, enforced duplicate-prereq rejection at validator load time, fixed pre-session over-count (truncation, not error), fixed gather-result triage to distinguish exception/saturated/none, restored ``on_child_stopped`` so the request-count cap unwinds the parent join. - Agentic replay: warmup target uses ``trajectory_count`` not concurrency; concurrency > trajectory count is rejected up front; cross-phase trajectory state continuity is exercised end-to-end. - ``AIPERF_DAG_FAIL_FAST`` moved to ``Environment.DAG.FAIL_FAST``. - Weka delta-encoded conversations with per-turn ``reset_context`` flagging non-monotonic LCP cuts; per-trace cache + RNG reset prevent cross-trace hash-id aliasing inflating KV-cache hit rates. - ``aiperf-mock-server`` returns orjson responses; sentencepiece SWIG warning silenced. - Smoke harness uses ``profile_export_aiperf.json`` filename. - Always-on realtime metrics with a reusable ``ConsoleMetricsExporter``-backed table renderer; replaces the prior Rich realtime table with a compact 4-line log block under non-dashboard UIs. - ``--stats-interval`` CLI flag overrides ``REALTIME_METRICS_INTERVAL``; the env var auto-defaults by UI type. - Realtime-metrics filter via ``NO_CONSOLE`` flag (was a hardcoded allowlist). - Suppressed under ``--ui dashboard`` to avoid double-rendering. - Auto-disable on incompatible endpoint; probe ``/prometheus/metrics`` fallback path. - Catch concurrent-scrape race in the data collector; translate probe failures so an unreachable Prometheus endpoint no longer aborts the run. - ``audio_duration``: read from ``RecordContext``, not stripped turns. - ``DerivedSumMetric`` simplified to read scalar sum from results dict. - Timeslice results consolidated from ``dict[int, ...]`` to ``list[TimesliceResult]`` with window bounds + completion flag bundled. - t-digest aggregator port for ``InterChunkLatencyMetric`` (ai-dynamo#865 merged in). - Pytest: ``slow`` marker is opt-in, global ``--timeout=300``. - Plugin schema/yaml updates + regenerated CLI/env docs. Unit tests: 10705 passed, 3 skipped (``-n auto``, 28s). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Expands AIPerf's API-usage-field metrics to cover the full surface area exposed by every major LLM provider — OpenAI, Anthropic, Google Gemini / Vertex AI, AWS Bedrock, DeepSeek, Mistral, Cohere v1 + v2, IBM watsonx — plus shape-compatible passthroughs (vLLM, Together, Fireworks, Cerebras, AI21, SambaNova, Groq, Bailian/DashScope, Replicate, Azure OpenAI, xAI Grok REST). Adds a MetricConsoleGroup enum for rendering related metrics in dedicated console sections, and updates the mock server to emit the full OpenAI-shape usage so the new metrics can be exercised end-to-end. OpenAI extras: - usage_prompt_audio_tokens, usage_completion_audio_tokens - usage_accepted_prediction_tokens, usage_rejected_prediction_tokens - TotalUsageReasoningTokensMetric (per-record existed; total was missing) Cache split (replaces unshipped `usage_prompt_cached_tokens`): - usage_prompt_cache_read_tokens — unifies OpenAI nested prompt_tokens_details.cached_tokens, Anthropic cache_read_input_tokens, DeepSeek prompt_cache_hit_tokens, Gemini cachedContentTokenCount, Bedrock cacheReadInputTokens, Cohere v1+v2 cached_tokens. - usage_prompt_cache_write_tokens — Anthropic cache_creation_input_tokens + Bedrock cacheWriteInputTokens. LARGER_IS_BETTER intentionally omitted: writes cost more than ordinary input tokens but enable cheap reads later. - usage_prompt_cache_miss_tokens — DeepSeek-specific. Vendor-specific outliers: - usage_tool_use_prompt_tokens — Gemini's toolUsePromptTokenCount. - usage_prompt_audio_seconds — Mistral's prompt_audio_seconds (duration in seconds, MetricTimeUnit.SECONDS, distinct from usage_prompt_audio_tokens). Usage.__init__ unwraps recognized vendor wrappers on construction: - Gemini / Vertex AI: usageMetadata envelope (camelCase wire). - Cohere v1: meta envelope (response root) — lifts meta.tokens.* and meta.cached_tokens. - Cohere v2: top-level tokens sub-dict (no meta wrapper). Synonym precedence centralized in 9 ClassVar *_KEYS lists ordered by descending vendor count (most common first). billed_units intentionally NOT modelled — Cohere-specific accounting filter, not what the model processed; preserved on the dict for billing reconciliation. - ParsedResponseRecord.final_usage (cached_property) — single walkback over the responses, returns the last non-empty Usage chunk. Cached so every metric reads it once; one walk per record total. - find_last_non_empty_usage helper shared between the record-level cached_property and InferenceResultParser._compute_server_token_counts (which now walks responses ONCE instead of three times, with cross-field consistency). - BaseUsageRecordMetric[T] — declarative base class. Subclasses set usage_field and missing_message; no per-metric _parse_record loop. Net -175 lines of repetition removed. Streaming contract: "last non-empty chunk wins" instead of per-key merge. Real vendors don't change shape mid-stream and don't explicitly null fields they previously set, so per-field walkback is solving a non-problem. Adds MetricConsoleGroup enum (USAGE, CACHE, PREDICTION, AUDIO, REASONING, DEFAULT, NONE) replacing the legacy MetricFlags.NO_CONSOLE binary visibility flag with a richer grouping mechanism that lets the console exporter render related metrics in dedicated sections. Every metric now sets a console_group class attribute and a display_order for stable ordering. Usage metrics graduate from hidden to MetricConsoleGroup.USAGE (visible in their own console section). Also introduces MetricFlags.TOKENIZES_INPUT_ONLY for prompt-side metrics (parallel to PRODUCES_TOKENS_ONLY for completion-side metrics). TokenizedText.create_usage now populates the OpenAI-compatible usage shape so the new metrics can be exercised end-to-end: prompt_tokens_details: cached_tokens (30-60% of prompt, deterministic from prompt hash) completion_tokens_details: reasoning_tokens (the actual budget allocated by _generate_reasoning_tokens — non-zero only for reasoning models like gpt-oss / qwen) accepted_prediction_tokens (5-20% of completion when present) rejected_prediction_tokens (2-10% of completion when present) audio_tokens intentionally omitted from both details objects — the mock has no audio generation pipeline. All values derived deterministically from the prompt text hash so a given input yields the same usage on every run. End-to-end smoke run (30 requests, mock-model): Usage Prompt Cache Read ~250 avg (range 164-333), Usage Accepted Prediction Tokens ~66 avg (22-138), Usage Rejected Prediction Tokens ~36 avg (9-64), plus their Total* aggregates. Reasoning model (gpt-oss with reasoning_effort=high, max_tokens=400) produces reasoning_tokens=400 (capped budget). Cross-checked every modelled synonym against the actual Python SDK source for: openai-python, anthropic-sdk-python, google-genai, groq-python, together-python, cohere-python (v1+v2), client-python (Mistral), vllm, ai21-python, cerebras-cloud-sdk-python, sambanova- python, dashscope-sdk-python (Bailian), python-aiplatform (Vertex), fw-ai-external/python-sdk (Fireworks), xai-sdk-python. AWS Bedrock and DeepSeek verified via official API docs. Three real bugs found and fixed during verification: Cohere v1 meta.cached_tokens not lifted, Cohere v2 envelope not handled, Mistral prompt_audio_seconds: {} sentinel could crash float(). | Module | Responsibility | |---|---| | usage_metrics.py | Basic + reasoning + audio + prediction (per-record) | | usage_cache_metrics.py | Cache read / write / miss | | usage_extras_metrics.py | Vendor-specific outliers (tool-use, audio seconds) | | usage_total_metrics.py | All DerivedSumMetric totals | | base_usage_record_metric.py | Declarative metric base | Auto-registration via metric_registry types/*.py glob — no plugin- registry edits. - 9120 unit tests passing (267 net new tests over baseline 8861). - Three layers of usage coverage: - Happy-path metric tests in tests/unit/metrics/test_usage_metrics.py. - Usage model tests in tests/unit/endpoints/test_usage_parsing.py. - Adversarial / property-based / direct-access tests in a new tests/unit/common/models/test_usage_models_adversarial.py: 14 verbatim vendor fixtures, envelope edge cases, type pollution, synonym precedence, mutability/copy/pickle round-trips, hypothesis property tests, streaming edge cases, cross-record isolation, end-to-end metric dispatch. - Mock-server tests updated for new usage shape (3 new tests covering determinism, cache-hit proportionality, zero-completion edge case). Microbenchmarked the metric-extraction hot path: ~4.7 μs per record (cold) / ~160 ns per metric (warm); ~470 ms total over a 100K-record benchmark — invisible against per-request HTTP cost. - make check-ergonomics: 0 new violations - make check-ruff-baselined: 0 new violations - pre-commit: clean - LLM-ergonomics review findings addressed - All build-matrix CI green (Ubuntu/macOS × Python 3.10-3.13) - Integration tests passing on Python 3.10-3.13 - docs/metrics-reference.md updated with every new metric and its vendor-source notes. - New docs/reference/vendor-usage-fields.md — comprehensive vendor reference catalogue with direct GitHub links to every cited SDK file, per-vendor verification details, exhaustive unmodelled-extras catalog, an 8-step "adding a new vendor" checklist, and dated change history. - Four-file agent-instructions sync (AGENTS.md, CLAUDE.md, copilot, cursor) for the new console_group convention. Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com> (cherry picked from commit 87c8e00)

Introduces two new MetricConsoleGroup values to give the sweep-line analyzer and derived-latency outputs their own console sections: - EFFECTIVE: full-window time-weighted (effective_*, tokens_in_flight, effective_latency) - ACTIVE: phase-active-only weighted (active_*_throughput variants) Analyzer-injected MetricResults aren't in MetricRegistry, so the cherry-pick's console-grouping logic was hiding them entirely (regression vs. pre-cherry HEAD). Adds an optional `console_group` field on MetricResult (excluded from JSON dumps) so analyzer outputs can carry their group inline; the console exporter falls back to it when the tag isn't registered, defaulting to DEFAULT for unknown plugin tags. Also flips the credit-related metrics to console_group=NONE (credit_to_start_latency analyzer output, CreditDropLatencyMetric) — they're internal diagnostics, not part of the user-facing console table. Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

…lic dumps Two console-grouping followups merged together: - Render the DEFAULT group last in the end-of-run table sequence so the most-important metrics anchor the bottom of the screen (after the EFFECTIVE / ACTIVE / USAGE / CACHE / PREDICTION / AUDIO / REASONING sections). - Make MetricResult.console_group survive the records-manager -> exporter IPC while still excluding it from user-facing JSON / CSV / REST API dumps. The field originally had exclude=True so it stayed out of public dumps, but model_dump() is also what IPC uses to serialize, so the field was silently dropped before reaching the console exporter and analyzer-injected results all collapsed into the DEFAULT table instead of EFFECTIVE/ACTIVE. Switched to a context-aware @model_serializer that drops the field unless context={"include_internal": True} is passed. BaseMessage.to_json_bytes opts in for ZMQ message passing; metrics_json_exporter (via to_json_result), accumulator CSV (via model_dump), and the REST /metrics route (via model_dump) all drop the field. Verified with `aiperf profile` against the in-repo mock server: profile_export_aiperf.json and profile_export.jsonl contain zero console_group occurrences while the EFFECTIVE/ACTIVE console tables still render correctly with DEFAULT last. Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

…-mvp Brings in 4 commits from main: - t-digest aggregator for InterChunkLatencyMetric (ai-dynamo#865) - cyclopts ForwardRef on 3.14, cap requires-python <3.14 (ai-dynamo#878/ai-dynamo#879) - expose count and sum in profile_export_aiperf.json (schema 1.1) (ai-dynamo#881) - server-metrics auto-disable on incompatible endpoint + /prometheus/metrics fallback (ai-dynamo#877) Conflict resolutions: - Took main's `_CLI_GROUP` removal (cyclopts 3.14 fix), preserving HEAD's scenario-validator helpers (`_use_think_time_only_explicitly_set`, `_inter_turn_delay_cap_explicitly_set`) on InputConfig / LoadGeneratorConfig. Auto-merge missed three stale `_CLI_GROUP` refs in prompt_config.py / service_config.py / tokenizer_config.py — inlined. - Kept HEAD's `MetricAggregator = MetricSeriesProtocol` alias in metric_dicts.py (API-compatible with main's new Protocol class definition; avoids redefining). - Kept HEAD's full TDigestListMetricAggregator (extra `add_for_record`, `__len__`, `SUPPORTS_PER_RECORD_REPLAY` are required by our metrics-accumulator pipeline and the ColumnStore list-backend contract). - Kept HEAD's classmethod `DerivedSumMetric._derive_value` returning the scalar directly. Our pipeline stores the running sum scalar in `scalar_dict[tag]` (MetricsAccumulator._collect_scalars_and_arrays at accumulator.py:264/304/315), so main's MetricAggregator-based version always raised "X is not a MetricAggregator" inside `_resolve_derived_metrics` — silently swallowed by the broad `except Exception`, so every Total* metric dropped out of the export. - Confirmed deletion of post_processors/metric_results_processor.py and its test (replaced by the metrics-accumulator pipeline in 9616b77). - Kept HEAD's two new fields under _MetricsSettings (LIST_BACKEND, EXPORT_FLUSH_INTERVAL) and both new doc index entries (Vendor Usage Field + JSON Export Schema). Verified end-to-end against the in-repo mock server: total_isl, total_osl, total_output_tokens, total_token_throughput populate correctly in profile_export_aiperf.json. 11,023 unit tests pass. Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Design doc for ajc/dag5 — a DAG-only branch from origin/main that combines the advanced DAG framework (fan-in, multi-gate, K-delayed-join, pre-session SPAWN, prereqs) from ajc/inferencex-agentx-mvp with the targeted endpoint and behavior refinements from ajc/dag4, deliberately omitting AgentX, Weka, cache-bust, and agentic-replay. Captures branch baseline, in/out-of-scope boundary, five-layer architecture, FORK/SPAWN data flow, --request-count semantics decision (dag4 wins), error handling, tier-subfolder testing strategy, and plugin/docs touchpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Markdown-accuracy audit caught two attribution inversions and several path errors: - TimingManager._on_dataset_configuration_failed + _wait_for_dataset_or_failure is on CURRENT (src/aiperf/timing/manager.py:95/:140), not dag4. - DagJsonlLoader.can_load non-dict guard is on CURRENT (dag_jsonl.py:122-127), not dag4. - BranchStats lives in common/models/branch_stats.py, not branch.py. - ConversationBranchMode lives in common/enums/enums.py, not branch.py. - DagSpawn lives in dataset/loader/dag_jsonl_models.py. - prerequisite.py → prerequisites.py (plural). - request_rate at src/aiperf/timing/strategies/request_rate.py. - TimingManager file is timing/manager.py, not timing_manager.py. - trajectory_source.py at src/aiperf/timing/, not dataset/loader/. - agentic_replay.py at src/aiperf/timing/strategies/. - worker.py in workers/ (plural). - test_branch_orchestrator_* are unit tests on current (tests/unit/timing/), not component_integration; count is 9 (8 named + base). - test_dag_hard_cap.py and test_dag_multi_root_payload_bytes.py live at tests/component_integration/timing/. - Three-file sync per project CLAUDE.md, not four (AGENTS.md noted as optional). - Clarify Conversation.metadata() projection is the differing call (not Turn.metadata, which projects on both branches). - Clarify dispatch_timing default is "post"; "pre" reserved for SPAWN. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Plan 1 of 2 for ajc/dag5. 19 TDD-style tasks porting data models, endpoint refactor, and sister loaders (inputs_json, raw_payload, mooncake_trace payload mode) onto a fresh branch from origin/main. Plan 2 will cover the DAG runtime (BranchOrchestrator, ConversationSource child builders, FORK pin refcount, request-count semantics, dag_jsonl loader). Includes spec-coverage table mapping every in-scope item to a task, plus a "Deferred to Plan 2" section enumerating runtime items intentionally excluded. Notable adaptations from the spec (called out in their tasks): - Task 6 ConversationBranchInfo replaces dag4's is_background+subagent_type shape with the spec-mandated dispatch_timing Literal; SubagentType is the AgentX surface, out of scope. - Task 7 BranchStats uses spec-mandated joins_suppressed counter (dag4 used children_truncated). - Task 9 defers the RecordContext/RequestInfo split to Plan 2; Plan 1 only adds Turn.raw_payload since the loaders need it. - Task 11 bundles build_assistant_turn + build_messages skeleton + extract_payload_inputs; they share imports and the chat/responses refactors depend on all three. - Task 14 adds EndpointType.RAW enum value (required for plugin discovery). - Task 15 adds BaseRawPayloadLoader and get_preferred_sampling_strategy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

…ages Sharpen wording around cache-bust marker placement (system message vs first user turn), weka_trace loader invocation (`--custom-dataset-type` is required), warmup turn sampling, and `--unsafe-override` semantics. Update troubleshooting error strings to match current exception messages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace user-facing `--timing-mode agentic_replay` references with the scenario-driven phrasing ("agentic_replay timing mode, set today by --scenario inferencex-agentx-mvp") across docs, the cache-bust validator error, and the scenario-lock violation message. Also tighten several adjacent docs accuracy fixes: --num-conversations rename + aliases, warmup heads-up callout for trajectory-based warmup, concurrency_burst arrival-pattern note, weka-trace SPAWN/SPAWN_JOIN nuances, marker probe 4-tuple, multimodal cache-bust injection details, and refreshed sample output blocks reflecting current NOTICE phase logging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Pre-canned mode preserves recorded hash_ids byte-for-byte but emits the trace's synthesized assistant text on every turn, which invalidates the server's just-generated KV blocks at each turn boundary and artificially deflates measured cache-hit rate. Set AIPERF_DATASET_WEKA_LIVE_ASSISTANT_RESPONSES=1 to flip the trade-off: the loader emits user-only deltas and the worker threads the server's live assistant response back into the session via DELTAS_WITHOUT_RESPONSES. The synth_buf still tracks assistant segments internally for LCP and truncation correctness; only the wire emission filters them out. Trade-off: hash-id fidelity past turn 0 is no longer preserved (server output length drifts from the trace's recorded output_length, shifting subsequent user-turn block alignment), but cache reuse pattern matches what a real agentic user would experience. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

copy-pr-bot · 2026-05-05T17:10:16Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-05T17:10:33Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@a61553fdc3586f7f0ea9f934ef90f77d8e5e14ae

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@a61553fdc3586f7f0ea9f934ef90f77d8e5e14ae

Last updated for commit: a61553f • Browse code

When a non-final turn returns with a context-length error from the server, recycle the trajectory immediately instead of dispatching subsequent turns. The cumulative prompt only grows as the conversation continues — once turn N has overflowed, every later turn's prompt will too. Continuing to dispatch them just wastes compute and inflates the run's overflow rate. Mirrors kv-cache-tester's "user truncated" semantics: the first context-length error removes the user from the active pool. Implementation: - TimingStrategyProtocol.handle_credit_return now accepts an `error` keyword (default None). Other strategies (request_rate, fixed_schedule, user_centric_rate) accept and ignore it. - AgenticReplayStrategy uses the existing AgentX context-overflow classifier (is_context_overflow_response) to detect the terminal case and routes to the same _spawn_from_recycle_or_id path the final-turn branch uses. - CreditCallbackHandler threads credit_return.error through both strategy dispatch sites. Final-turn overflows already recycled; the new short-circuit only fires on non-final turns. Generic 500-class errors don't trigger it. WARMUP returns remain no-ops at the strategy level. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

The realtime-stats reporter background task crashed every 30s with AttributeError because _render_realtime_block was annotated as taking CreditPhaseStats but the caller (_report_realtime_metrics) actually passes PhaseRecordsStats from create_stats_for_phase. Use the records-side fields that exist on PhaseRecordsStats: - requests_completed -> total_records - requests_elapsed_time -> records_elapsed_time - request_errors -> error_records Drop in_flight_requests from the rendered line since it's a credit-side concept the records side doesn't have access to. Replaced with an "ok=" counter (success_records) so the line still shows where errors came from. Tests updated to construct PhaseRecordsStats instead of CreditPhaseStats. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

for more information, see https://pre-commit.ci

…SL/OSL avg The kv-cache-tester assessment block tracked input AND output throughput plus ISL/OSL means at every poll. aiperf's realtime line was missing both — only output_token_throughput and request rates were shown. Adds: - tput_in=N/s on the headline (alongside tput_out) - new "seq isl_avg=N osl_avg=N" row when MetricResults publish those per-record metrics Both pull from MetricResults the records_manager already aggregates; no new plumbing or inter-process state required. Unchanged behavior when the underlying metric isn't published (renders "-"). Server-side cache-hit and KV-usage live tracking would require the records_manager to subscribe to ServerMetricsRecordMessage events from the server-metrics manager — separate change, not in this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

for more information, see https://pre-commit.ci

The realtime block was pointing at the per-record list metric ``inter_chunk_latency`` (gaps between every streamed chunk), and list-valued metrics don't aggregate into displayable percentiles in the realtime path. Result: the ``itl`` row showed ``-`` for p50/p95/p99 mid-run even when the per-record JSONL had real values. Switch to the scalar ``inter_token_latency`` metric (avg gap across the response, computed per record). Same number that lands in profile_export_aiperf.json's aggregate; just the realtime fast path needs the scalar version. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

aiperf doesn't aggregate input throughput as its own metric, so ``tput_in=-`` rendered dashes mid-run. Switch to ``total_token_throughput`` (input + output tokens / wall-clock) and rename the field to ``tput_total``. Matches the "Total throughput" assessment field kv-cache-tester showed; output throughput stays alongside it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

aiperf had system-level total_token_throughput and output_token_throughput but no input-side analog, so prefill TPS could only be inferred via subtraction. For long-context agentic workloads (this corpus has 80k-token avg ISL) input throughput dominates the bandwidth and tracking it as its own metric matches what most production benchmarks expect. New ``input_token_throughput`` metric mirrors ``output_token_throughput`` exactly: input_token_throughput = TotalInputSequenceLength / BenchmarkDuration Auto-discovered via the standard types/ scan. Realtime renderer flips the dashed ``tput_in=-`` field back to populated (``tput_in=N/s``) since the metric is now collected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

Adds a "srv" row to the realtime stats block showing live server-side metrics scraped from the inference server's /metrics endpoint: - prefix_cache_hit_rate (cumulative across the run) - external_prefix_cache_hit_rate (when CPU offload is active) - kv_cache_usage_pct (latest gauge, max across endpoints) - num_preemptions (cumulative count) Implementation: - New ServerMetricsAccumulator.realtime_snapshot() walks the live hierarchy and returns a flat dict of these metrics. - _render_realtime_block accepts an optional server_snapshot kwarg and renders a "srv" row when keys are present. - _report_realtime_metrics queries the accumulator best-effort (silent fallback when --no-server-metrics or no records yet). Matches the cumulative server-metrics display kv-cache-tester showed in its assessment block. The rest of the metrics needed for full parity (per-interval rates, GPU/CPU offload bandwidth) are still available in server_metrics_export.json for offline analysis; the realtime row focuses on the most actionable signals at glance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

for more information, see https://pre-commit.ci

…group) The metric was being computed but ``filter_display_metrics`` dropped it because I set ``console_group = MetricConsoleGroup.NONE`` while copying the ``total_token_throughput`` template. Realtime rendered ``tput_in=-`` even though the value was available. Removed the override so the metric uses the default console group (DEFAULT) and flows through the realtime display path. Mirrors ``OutputTokenThroughputMetric``, which also relies on the default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

…e tail `CreditCallbackHandler._maybe_signal_dag_completion` (deferred guard at end of `on_credit_return`) and `_credit_will_dispatch_children` (inline defer) only run inside credit-return callbacks. Under `--concurrency >= 2`, the orchestrator's last drain step (`_handle_child_done` decrement, `dispatch_join_turn` returning False under cap, all-children- rolled-back path in `_spawn_children_and_register_gates`) can land between credit returns, leaving `all_credits_returned_event` unset and forcing the phase to wait for either the pre-wait short-circuit on the next entry into `_wait_for_returning_complete` or the drain-timeout backstop. The pre-wait short-circuit catches it most of the time, but when it doesn't (e.g. runner enters `_wait_for_returning_complete` BEFORE all credits return, then awaits the event), the drain-timeout backstop is the only escape — costing a full grace period of wall-clock per occurrence. Fix mirrors the dag5 branch's commit 7cd4180b7: orchestrator publishes `set_drain_observer(callback)`. The three drain points — `_handle_child_done`, `_handle_child_errored_fail_fast`, and the no-children-landed path inside `_spawn_children_and_register_gates` (where every dispatch_first_turn refused at the cap) — call `_notify_drain()` after their state mutations are complete. `CreditCallbackHandler.set_branch_orchestrator` registers `_on_orchestrator_drain`, which iterates active phase handlers and re-runs the deferred check for each. Idempotent (no-ops on already-set event or disagreeing predicate). Observer detached in `cleanup()`. Net effect: the race is closed at the source. The pre-wait short- circuit and drain-timeout backstop become true safety nets rather than the primary completion path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Adds six unit tests for the drain-observer wiring on CreditCallbackHandler.set_branch_orchestrator and the closure it registers via BranchOrchestrator.set_drain_observer: - registration: attach calls set_drain_observer with a callable; detach calls set_drain_observer(None) - happy path: callback fires + counters say all returned + orchestrator predicate clean → event set - defer when has_pending_branch_work=True - defer when check_all_returned_or_cancelled=False - skip phase handlers whose lifecycle.is_complete (event already finalized via normal end-of-phase path) - idempotent: multiple callback invocations after event already set remain a no-op These tests would have failed pre-fix (the observer wasn't registered) and lock in the contract for the closure logic in CreditCallbackHandler._on_orchestrator_drain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

The except block bound ``Exception`` without a name but the lazy log lambda referenced ``exc``. In production (INFO level) the lambda is never invoked so this would silently swallow the underlying error; in debug builds it raises NameError when formatting, masking the actual exception from realtime_snapshot. Bind ``exc`` correctly via ``except Exception as exc`` and capture it in the lambda default-arg so debug builds surface the real failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

Registers ``faulthandler.SIGUSR1`` on every aiperf process (system controller + every subprocess service via bootstrap_and_run_service). ``kill -USR1 <pid>`` from outside writes the full Python traceback for every thread to stderr without disturbing the process otherwise. Crucial for debugging hangs in production-like deployments where py-spy isn't available on the runner host. Procedure for next hang: ssh <node> pgrep -af aiperf # list every aiperf process for pid in $(pgrep -f 'aiperf|system_controller|worker_'); do kill -USR1 $pid done cat /workspace/results/server.log /workspace/results/trace_replay/logs/*.log Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

The post-build 'concurrency > pool' guard is replaced by automatic wrap-fill in TrajectorySource. The error class is no longer reachable; remove it and convert each test that asserted it into a positive assertion that wrap-fill produces the requested concurrency. Empty-pool cases still raise EmptyTracePoolError (separate class). Also fix test_recycle_pass_dict_grows_only_to_pool_size: it seeded the double-recycle guard with a trace_id, but the guard now keys on correlation_id (Task 5). Update the seed to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Component-integration test for pool < concurrency: 1-trace pool, 4-way concurrency. Validates wrap-fill activates, lanes get decorrelated k_i, per-lane cache-bust markers are distinct, and the recycle loop completes without tripping the double-recycle guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

…ation helpers Task 5 re-keyed _in_flight_recycled from trace_id to correlation_id. Five component-integration tests had _make_credit helpers that defaulted x_correlation_id='xcorr' across all credits — that worked under trace-id keying but now correctly trips the guard when the same correlation_id is fed in twice. Fix the helpers to mint a unique correlation_id per credit (uuid4 if no explicit value is passed). Preserves the deliberate "same correlation_id twice" assertions that test the guard itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

A ~3-page external-facing whitepaper for ML perf practitioners explaining the EFFECTIVE and ACTIVE MetricConsoleGroup families and the CO-aware effective_latency metric. Includes a real worked example from a mock-server run showing the ~19x gap between effective_prefill_throughput and active_prefill_throughput when prefill is sparse in the run window. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Register a new --public-dataset tag pointing at the same SemiAnalysisCCTracesWekaLoader class, sourced from semianalysisai/cc-traces-weka-no-subagents-051226 (949 traces, 136.1k requests). Same WekaTrace schema as the 042026 corpus, just with all WekaSubagentEntry blocks stripped, so reconstruction goes through the existing WekaTraceLoader path unchanged. Smoke test (5 rows) confirms all rows validate as WekaTrace, 519 requests with type='s' only, 0 subagent entries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

…26 variant Cut the InferenceX AgentX-MVP scenario over from the 042026 corpus (739 traces with full subagent fan-out) to the 051226 no-subagents variant (949 traces, main-agent linear streams only). The new variant is now the AgentX MVP default everywhere: - Scenario require_loader: ("semianalysis_cc_traces_weka_no_subagents", "weka_trace") instead of ("semianalysis_cc_traces_weka", "weka_trace"). - agentx-mvp.md tutorial: dataset name, trace count (739 -> 949), --num-dataset-entries default, Subagents section rewritten to note the variant strips them, troubleshooting sections updated. - weka-trace.md: HF section now documents both variants with a selection table; no-subagents flagged as AgentX MVP default. - README tutorial index: both HF variants mentioned, no-subagents flagged as AgentX MVP default. - Loader docstring (module + class): generalized to describe both variants registered against the same class. - Scenario validator tests, scenario registry test, public-composer fixture, and loader test fixture all updated to the new tag/HF name. - Loader test gains a separate pin assertion so the 042026 plugin entry stays anchored to its HF name (and the no-subagents entry to its own) under future registry edits. The legacy semianalysis_cc_traces_weka plugin entry remains registered unchanged - it's just no longer the AgentX MVP default. Anyone wanting to replay the with-subagents corpus can still pass --public-dataset semianalysis_cc_traces_weka explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Bump the SemiAnalysis Weka loader's HF dataset target from the 042026 full-subagent corpus (657 MB / 739 traces) to the no-subagents variant (2.77 GB / 949 traces / 136k requests). Subagent entries are stripped: only top-level main-agent turns remain, so each row produces exactly one conversation downstream (no parent/child fan-out). Plugins.yaml, loader docstrings, and two fixture/constant references in the unit tests follow the rename. The registered tag and class are unchanged, so the inferencex-agentx-mvp scenario binding continues to resolve. Signed-off-by: Cam Quilici <cjquilici@gmail.com>

Two concurrent aiperf processes that miss the cache on the same key both pay the expensive tokenize+reconstruct cost today. The atomic os.replace at the end of populate() means only one set of bytes ends up canonical, but both sides do the work. On a shared filesystem cache (Lustre / NFS) with N concurrent agentic jobs, that's N redundant tokenizations. Add a cross-process populate lock that serializes the miss path: - lookup -> HIT -> use it (no lock, fast path) - lookup -> MISS -> acquire flock -> re-lookup -> ... -> HIT (someone populated while we waited) -> use it -> MISS -> do the tokenize + populate -> release Implementation mirrors huggingface_hub's WeakFileLock pattern: - filelock.FileLock with mode=0o664 so multiple users sharing a cache directory contend correctly - SoftFileLock fallback on filesystems without flock support - INFO log every 10s while waiting so a waiter is visible, not silent - thread_local=False so release on a different asyncio worker thread (acquire vs release end up on distinct asyncio.to_thread workers) still actually drops the OS-level lock - asyncio.to_thread for the blocking acquire so the event loop is not blocked Code lives in a new mmap_cache_lock module to keep mmap_cache.py under the 500-line file-size budget; mmap_cache re-exports acquire_cache_lock with cache_dir pre-bound so callers see one entry point. DatasetManager._do_profile_configure now wraps the miss path in the lock, with the double-checked re-lookup factored into a small helper to keep the function under the 80-line size budget. Three new unit tests in test_mmap_cache.py cover: concurrent acquires on the same key serialize; acquires on different keys run in parallel; holder beyond timeout causes the waiter to raise filelock.Timeout. filelock is added as an explicit project dependency. It was already a transitive via huggingface-hub; declared here so a future HF drop doesn't silently break the cache lock. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

The single 'seq isl_avg=... osl_avg=...' line hid the distribution shape that's the main reason to watch sequence lengths mid-run (spotting long-tail agentic prompts or response truncation). Replace it with two percentile rows matching the latency / per-user-throughput row format already used above: isl p50=123,952 p75=245,124 p90=391,085 p99=720,485 (tokens) osl p50=261 p75=664 p90=1,614 p99=7,013 (tokens) Reads p90 off the existing JsonMetricResult schema (already populated by the accumulator); no extra plumbing. Rows are skipped entirely when both ISL/OSL metrics are absent so the renderer stays compact on non-tokenizing endpoints. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Cam Quilici <cjquilici@gmail.com>

The TQDMProgressUI + Textual dashboard already wire @on_warmup_progress to a tqdm bar, but those UIs are disabled under non-TTY execution (the CI/SLURM bench-driver case), so users have no visibility into warmup progress until the phase either completes or hangs. Add a per-return INFO log inside AgenticReplayStrategy.handle_credit_return under CreditPhase.WARMUP that fires once per returning warmup credit (success or error), of the shape: WARMUP 7/40 returned [ok] (lane=6, trace_id=abc123...) This gives operators a line per completion in benchmark.log even when no UI is active. The lambda log form keeps the per-completion work off the hot path when log level filters out INFO. Bump ergonomics file-size baseline: the addition pushes agentic_replay.py from 500 -> 510 lines. Signed-off-by: Cam Quilici <cjquilici@gmail.com>

The tin (prefill_throughput_per_user) and tout (output_token_throughput_per_user) rows in the realtime block are literally just 1/prefill_time and 1/ITL percentiled across requests, which is the same information conveyed more clearly as the "interactivity" metric (1/tpot) familiar from LLM serving literature. Drop the two per-user-throughput rows and emit a single intvty p50=9 p75=10 p95=15 p99=20 (1/tpot tok/s) row using the same backing metric (output_token_throughput_per_user = 1 / inter_token_latency per request). Aggregate tput_in/tput_out on line 1 are unchanged. Same numbers, clearer label, less noise. Signed-off-by: Cam Quilici <cjquilici@gmail.com>

Surfaces the server's own view of input/output token throughput in the realtime srv row, computed as a running average from the vLLM prometheus counters (delta / elapsed): vllm:prompt_tokens_total -> tput_in_srv=N/s vllm:generation_tokens_total -> tput_out_srv=N/s Useful contrast with the client-side aggregate tput_in/tput_out on line 1: server-side counts work in flight that hasn't returned a response yet, so it tracks the engine's actual ingest/emit rate rather than the rate of completions aiperf has parsed. Suppressed entirely when the counters aren't exposed (SGLang, non-vLLM servers, or runs where /metrics isn't scraped) so the row stays clean. Bump ergonomics baseline: file +5 lines (526->531), realtime_snapshot +10 lines (80->90); both fall under existing advisory caps that already accept similar-sized siblings. Signed-off-by: Cam Quilici <cjquilici@gmail.com>

…enarios Context-overflow errors mid-trajectory are already handled by agentic_replay.handle_credit_return via the separate CreditReturn message path: the trajectory is terminated, the conversation is recycled, and a fresh trajectory spawned in its place. Emitting an error MetricRecordsMessage for the same event double-counts it -- the record shows up in failure tallies, dragging down the per-run success rate even though the overflow is an expected and intentionally-tolerated end-of-trajectory signal in this scenario. When the active scenario's timing mode is AGENTIC_REPLAY, drop context-overflow records before they enter the metrics pipeline. The filter is gated on scenario timing mode (cached at __init__) and the RequestRecord.context_overflow flag the parser already sets per InferenceX AgentX RFC §7. The check is a no-op for: - non-scenario runs (user_config.scenario is None) - non-agentic scenarios (any future timing_mode != agentic_replay) - records that aren't context-overflow events (flag stays False) The existing ContextOverflowCountMetric continues to work for diagnostic purposes outside agentic scenarios. Signed-off-by: Cam Quilici <cameron@semianalysis.com>

When the active PROFILING-phase failure rate (error_records / total_records) exceeds the user-supplied threshold after a grace floor of max(concurrency, 10) records, broadcast ProfileCancelCommand on the message bus to terminate the run early. The existing cancel handlers in records_manager, timing_manager, server_metrics manager, and gpu_telemetry manager stop their work cleanly; the run finalizes with cancelled=True and exits non-zero via the standard cancel flow. The grace floor exists so a single early failure can't trip a tiny-N threshold (e.g., 1/1 = 100% > 0.5 would abort instantly without this guard). Pairs naturally with the AGENTIC_REPLAY context-overflow drop in record_processor_service: context overflows aren't counted as errors in agentic scenarios, so the threshold measures real failures only (server 5xx, parse errors, malformed responses) and won't trip on the expected end-of-trajectory overflow signal. Default is None (disabled, matching existing behavior). Valid range [0.0, 1.0]. Idempotent: the abort fires at most once per run; if the ProfileCancelCommand publish fails for any reason, the trigger is reset so the next record re-evaluates and re-attempts. Signed-off-by: Cam Quilici <cameron@semianalysis.com>

…warmup log Two coupled changes for AGENTIC_REPLAY scenarios: 1. Configurable trajectory start range. Previously each trajectory's k_i (start turn) was sampled uniformly from [0, int(0.7 * n)] with the 0.7 hardcoded in TrajectorySource. Now exposed as two CLI flags on the LoadGenerator group: --trajectory-start-min-ratio (default 0.0) --trajectory-start-max-ratio (default 0.7, preserves prior behavior) Sampling becomes uniform on [int(min_ratio * n), int(max_ratio * n)], both clamped to n-2 so the trajectory always retains at least one profiling turn after warmup. Validated cross-field so min <= max. Plumbed: loadgen_config -> timing.config.TimingConfig -> phase_orchestrator -> TrajectorySource constructor. RNG seed still derived from --random-seed via SHA-256 salt with trace_id, so k_i remains deterministic per (seed, trace_id). 2. Per-trajectory warmup completion log. The previous WARMUP info line reported only lane + trace_id. It now also reports start_turn=k_i/N and the percent of the trace that was warmed. Per-request token count (ISL) is intentionally NOT included in this commit -- it would require plumbing prompt_tokens through CreditReturn (or subscribing AgenticReplayStrategy to MetricRecordsMessage for the WARMUP phase). Leaving that as a follow-up. Signed-off-by: Cam Quilici <cameron@semianalysis.com>

Adds a one-block info log emitted by TrajectorySource at the end of __init__, before any dispatch fires. Shows the configured start-range plus the actual per-trajectory (lane, k_i, num_turns, pct) so the operator can sanity-check that the configured --trajectory-start-{min,max}-ratio produced a sensible distribution. Example: TrajectorySource: built 14 trajectories from 949 traces range cfg=[0.25, 0.75] observed pct: min=27% median=51% max=72% lane=00 start_turn= 6/24 (25%) trace_id=abc... lane=01 start_turn= 15/22 (68%) trace_id=def... ... Complements the existing per-trajectory warmup-completion lines. Logged once at build time so the full distribution is visible upfront without correlating across credit-return events. Signed-off-by: Cam Quilici <cameron@semianalysis.com>

The previous "[RecordProcessor] Drop context-overflow records for AGENTIC_REPLAY scenarios" commit returned early from record_processor_service._on_inference_results before pushing the MetricRecordsMessage, which broke the records-side <-> credit-side counter invariant: RecordsTracker.total_records is compared for equality against the credit-side final_requests_completed at end-of-phase, and the drop made the records-side lag the credit-side by one for every overflow event. The completion barrier never converged, hanging the run for the full benchmark_grace_period before timing out + cancelling in-flight credits. Symptom in the log: NOTICE All requests have completed, please wait for the results to be processed (currently 423 of 424 records processed)... ... (30s timeout) ... WARNING Phase profiling timed out, cancelling all credits. Fix: don't drop the record. Add a context_overflow_skip flag to MetricRecordMetadata. RecordProcessor sets it when the record is context-overflow AND scenario is AGENTIC_REPLAY. RecordsManager recognizes the flag and: - Counts the record toward total_records (preserves the invariant) - Classifies as success in RecordsTracker (so error counters stay at 0) - Skips error_tracker.increment_error_count_for_phase - Skips _send_record_to_accumulators (latency/throughput/etc. unaffected) - Skips _maybe_trigger_failed_request_abort (overflow is not a real failure for threshold purposes) Net behavior matches the original intent ("nothing about the context overload is counted towards metrics whatsoever") without breaking the end-of-phase completion barrier. End-of-phase completion now matches credit-side: total_records = success + 0 errors, where success includes the overflow-skip records. Signed-off-by: Cam Quilici <cameron@semianalysis.com>

…o cjq/weka-live-assistant-responses # Conflicts: # src/aiperf/dataset/loader/semianalysis_cc_traces_weka.py

Cancel path was incomplete: PhaseRunner.cancel() cancelled the credit-issuance _execution_task but never set all_credits_sent_event / all_credits_returned_event. The runner's outer _wait_for_sending_complete / _wait_for_returning_complete awaits keep blocking on the unset events until the phase timeout elapses (= --benchmark-duration, default 1800s for profiling). Empirical: a --failed-request-threshold-triggered ProfileCancelCommand at T=170s into a 1800s profiling phase causes the run to hang for the remaining ~1630s before _wait_for_sending_complete finally returns via its own timeout. The graceful "if self._was_cancelled: return" branch is reached, but only 27 min later. Set both events at the top of cancel() so the runner's awaits wake immediately. Mirrors the event-set order already present in the except-Exception recovery path (runner.py:363-373) — same correctness guarantee, just on the external-cancel path too. Signed-off-by: Cam Quilici <cjquilici@gmail.com>

Two log-ergonomics changes to make long agentic-replay runs in captured-stdout contexts (CI, srt-slurm, scripted invocations) readable: 1) EventLoopMonitor "Event loop ... taking too long to run. Overhead: XXms" downgraded from warning to debug. At sustained conc>=32 on agentic workloads this fires dozens of times per minute and is not actionable (inherent async-scheduler overhead under load). 2) callback_handler "Credit return after phase {phase} complete, credit_id=N, worker=W" downgraded from warning to debug. Every cancel-triggered phase shutdown (e.g. --failed-request-threshold trip) emits one such line per in-flight credit — up to concurrency-many = thousands — flooding the log without being actionable; the late return is expected under the cancel race. 3) service_config.validate_ui_type now picks UIType.TQDM (not NONE) when stdout isn't a tty. tqdm's progress bars still render usefully in tailed logs (carriage returns are preserved by typical pipe sinks), so users running aiperf from srt-slurm / gha runners get a visible progress indicator. Explicit --ui-type none still opts back into the previous silence. All three changes are config/severity adjustments only; no behavioral changes to phase orchestration, credit accounting, or UI rendering beyond the visibility surface. Signed-off-by: Cam Quilici <cjquilici@gmail.com>

8aad400 introduced an AttributeError on every aiperf invocation in non-TTY contexts (CI, srt-slurm benchmarks). UIType is an extensible enum with members SIMPLE/DASHBOARD/NONE; "TQDM" was never registered. SIMPLE is the tqdm-backed UI per the field docstring on `ui_type`. Repro: any `aiperf profile ...` call where stdout is captured (e.g. redirected to a file) crashes immediately in ServiceConfig.validate_ui_type with `AttributeError: 'UIType' has no attribute 'TQDM'`. Surfaced via InferenceX R26 1p6d shards where the agentic benchmark wrapper saw aiperf exit 1 within seconds of invocation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

This reverts commit a6812b0.

This reverts commit 8aad400.

TrajectorySource._log_trajectory_summary previously reported each lane's start position only as turn index + percentage. With agentic-replay on weka cc-traces, the useful question is "how many tokens of context does warmup start at?" — and the answer was invisible until you correlated trace_ids back to the raw dataset. Threads input_length (proxy-tokenizer "in" field) end-to-end: WekaNormalRequest.input_length (already in scope at construction) -> Turn.input_length (new optional field) -> TurnMetadata.input_length (new optional field; propagated by Turn.metadata() and Conversation.metadata()) -> TrajectorySource summary log New log shape: TrajectorySource: built 192 trajectories from 949 traces range cfg=[0.25, 0.75] observed pct: min= 0% median= 46% max= 75% observed tokens: min= 0 median= 58,431 max=187,294 lane=00 start_turn= 15/27 ( 56%) start_tokens= 42,580 trace_id=... Backward-compatible: input_length is Optional everywhere, defaults to None. Loaders that don't populate it (synthetic, raw_payload, sharegpt, etc.) keep working unchanged and the log shows "-" for those lanes. Only weka_trace's normal-turn path sets it for now. Split _log_trajectory_summary into _build_trajectory_rows + _format_observed_stats + _median helper to stay under the 80-line function size cap. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Swaps the HF dataset slug from cc-traces-weka-no-subagents-051226 (949 traces) to cc-traces-weka-no-subagents-051826 (98 traces). 051826 is a stricter filter of the same source: - v5-only (drops legacy trace_version=4 rows) - CC ≥ 2.1.139 (drops rows from older CLI versions whose tool-use semantics differ) - ≥20 main-agent turns per trace post-strip - subagent blocks stripped (same as 051226) InferenceX R29 surfaced a delta-encoding edge case where two of the 949 traces (turns 555-918 and 4752-4753) produced empty delta_messages, triggering 99.5% of the 366 HTTP-400 validation rejections. The new 051826 corpus should not contain those two traces. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…mary log" This reverts commit 61a9ed8.

The preemptions counter rarely tells you anything actionable in the live log — it's either 0 (boring) or steadily climbing (which the kv_usage + queue=r/w fields already telegraph more usefully). Removing it tightens the per-tick server-side row to the metrics that actually inform intervention decisions: cache hit rates, KV usage, queue depth, and server-side token throughput. The accumulator still scrapes vllm:num_preemptions and sglang:num_retracted_reqs and exposes num_preemptions on the snapshot dict, so downstream consumers (export jsonl, future analysis) keep working unchanged. Just the log surface is trimmed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ajcasagrande and others added 13 commits May 1, 2026 15:11

cquil11 and others added 15 commits May 5, 2026 15:34

[pre-commit.ci] auto fixes from pre-commit.com hooks

7e6b2e0

for more information, see https://pre-commit.ci

[pre-commit.ci] auto fixes from pre-commit.com hooks

9b2ed6c

for more information, see https://pre-commit.ci

[pre-commit.ci] auto fixes from pre-commit.com hooks

9902357

for more information, see https://pre-commit.ci

ajcasagrande and others added 27 commits May 11, 2026 19:39

Merge remote-tracking branch 'upstream/ajc/inferencex-agentx-mvp' int…

929aa76

…o cjq/weka-live-assistant-responses # Conflicts: # src/aiperf/dataset/loader/semianalysis_cc_traces_weka.py

Revert "fix: UIType.TQDM does not exist — use UIType.SIMPLE"

4eb1a73

This reverts commit a6812b0.

Revert "quiet noisy phase-shutdown warnings + tqdm default in non-tty"

2f30ea8

This reverts commit 8aad400.

Revert "trajectory_source: surface per-lane start-token counts in sum…

90c93ab

…mary log" This reverts commit 61a9ed8.

cquil11 closed this May 19, 2026

cquil11 deleted the cjq/weka-live-assistant-responses branch May 19, 2026 22:45

cquil11 mentioned this pull request May 19, 2026

[WIP]: InferenceX agentic benchmark v0.3 #964

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cjq/weka live assistant responses#886

Cjq/weka live assistant responses#886
cquil11 wants to merge 136 commits into
ai-dynamo:mainfrom
cquil11:cjq/weka-live-assistant-responses

cquil11 commented May 5, 2026

Uh oh!

copy-pr-bot Bot commented May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

cquil11 commented May 5, 2026

Uh oh!

copy-pr-bot Bot commented May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Try out this PR

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

github-actions Bot commented May 5, 2026 •

edited

Loading