mlcommons
diff --git a/.claude/skills/msgspec-patterns/SKILL.md b/.claude/skills/msgspec-patterns/SKILL.md
Lines changed: 412 additions & 0 deletions
diff --git a/.claude/skills/msgspec-struct-gc-check/SKILL.md b/.claude/skills/msgspec-struct-gc-check/SKILL.md
Lines changed: 120 additions & 0 deletions
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
Lines changed: 6 additions & 6 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 6 additions & 1 deletion
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 10 additions & 12 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 2 additions & 0 deletions
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 2 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 38 additions & 20 deletions
diff --git a/docs/CLI_DESIGN.md b/docs/CLI_DESIGN.md
Lines changed: 3 additions & 1 deletion
@@ -0,0 +1,120 @@
+---
+name: msgspec-struct-gc-check
+description: Check whether msgspec.Struct types can safely use gc=False. Use when adding or changing msgspec.Struct definitions, or when reviewing code that uses msgspec structs.
+allowed-tools: Read, Grep, Glob
+---
+
+# msgspec.Struct gc=False Safety Check
+
+## When to use this skill
+
+- Adding or modifying a class that inherits from `msgspec.Struct`
+- Reviewing or refactoring code that defines or uses msgspec structs
+- Deciding whether to add or remove `gc=False` on a Struct
+
+## Why gc=False matters
+
+Setting `gc=False` on a Struct means instances are **never tracked** by Python's garbage collector. This reduces GC pressure and can improve performance when many structs are allocated. The **only** risk: if a **reference cycle** involves only gc=False structs (or objects not tracked by GC), that cycle will **never be collected** (memory leak).
+
+Reference: [msgspec Structs – Disabling Garbage Collection](https://jcristharif.com/msgspec/structs.html#struct-gc).
+
+## Verified safety constraints
+
+Use these constraints to decide if a Struct can use `gc=False`. All must hold.
+
+### 1. No reference cycles
+
+- The struct (and any container it references) must never be part of a reference cycle.
+- **Multiple variables** pointing to the same struct (`x = s; y = x`) are **safe** — that is not a cycle. A cycle is A → B → … → A.
+- **Returning** a struct from a function is **safe**. What matters is whether any reference path leads back to the struct (e.g. the struct's list field contains the struct itself, or something that holds the struct).
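The alias-versus-cycle distinction above can be checked with the standard library alone. The sketch below is only an illustration: `Node` is a hypothetical plain Python class standing in for a struct with a list field, not a `msgspec.Struct`. It shows that an aliased object is freed by reference counting, while a self-referencing one survives until the cycle collector runs. For a `gc=False` struct, that second case is the one the collector would not be able to clean up.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for a struct that holds a mutable list field."""

    def __init__(self) -> None:
        self.children: list = []


def needs_cycle_collector(make_cycle: bool) -> bool:
    """Return True if the object is only freed once gc.collect() runs."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep automatic collections from interfering with the check
    try:
        node = Node()
        if make_cycle:
            node.children.append(node)  # node -> list -> node: a real cycle
        alias = node  # a second name for the same object is not a cycle
        ref = weakref.ref(node)
        del node, alias
        freed_by_refcount = ref() is None
        gc.collect()
        freed_after_collect = ref() is None
        return (not freed_by_refcount) and freed_after_collect
    finally:
        if was_enabled:
            gc.enable()
```

Calling `needs_cycle_collector(False)` exercises the alias-only case and `needs_cycle_collector(True)` the self-referencing case.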
+
+### 2. No mutation that could create cycles
+
+- **Do not mutate** struct fields after construction in a way that could introduce a cycle (e.g. set a field to an object that references the struct, or append the struct to its own list/dict).
+- **Frozen structs** (`frozen=True`) prevent field reassignment; `force_setattr` in `__post_init__` is one-time init only, so that's acceptable.
+- Assigning **scalars** (int, str, bool, float, None) to fields is safe — they cannot form cycles.
+
+### 3. Mutable containers (list, dict, set) on the struct
+
+- If the struct has list/dict/set fields, either:
+  - **Never mutate** those containers after creation (no `.append`, `.update`, `[...] = ...`, etc.), and never store in them any object that references the struct, or
+  - Do not use `gc=False` (conservative).
+- **Reading** from containers (e.g. `x = struct.foobars[i]`) does not create cycles and is allowed.
+
+### 4. Nested structs
+
+- If a struct holds another Struct (or holds containers that hold Structs), the same rules apply to the whole reference graph: no cycles, no mutation that could create cycles. If any nested Struct uses `gc=False`, the whole graph must still be cycle-free.
+
+### 5. Generic / mixins
+
+- With `gc=False`, the type must be compatible with `__slots__` (e.g. if using `Generic`, the mixin must define `__slots__ = ()`). See msgspec issue #631 / PR #635.
+
+## Checklist for "can use gc=False"
+
+- [ ] Struct and everything it references can never participate in a reference cycle.
+- [ ] No mutation of struct fields after construction that could introduce a cycle (frozen or init-only mutation is ok; scalar assignment is ok).
+- [ ] Any list/dict/set fields are never mutated after creation, or we do not use gc=False.
+- [ ] No storing the struct (or anything that references it) inside its own container fields.
+- [ ] If Generic/mixins are used, `__slots__` compatibility is satisfied.
+
+## Checklist for "must NOT use gc=False"
+
+- [ ] Struct is mutated after creation in a way that could create a cycle (e.g. appending self to a list field).
+- [ ] Container fields are mutated after creation and could hold the struct or back-references.
+- [ ] Struct is used in a pattern where it's stored in a container that the struct (or its fields) also references.
+
+## Quick per-struct analysis steps
+
+1. List all fields and their types (scalars vs containers vs nested Structs).
+2. Search the codebase for: assignments to this struct's fields, mutations of its container fields (`.append`, `.update`, etc.), and any place the struct instance is stored (e.g. in a list/dict that might be referenced by the struct).
+3. If the struct has only scalar or immutable fields, or is frozen with no container mutation → likely safe for gc=False.
+4. If mutable containers and they're never mutated (and never made to reference the struct) → likely safe; otherwise → do not use gc=False.
+
+## Risky structs: audit and at-risk comment
+
+A struct is **risky** for gc=False if it has a property that would normally rule out gc=False (e.g. mutable list/dict/set fields), but the unsafe behavior may never occur in practice (e.g. the field is only ever read, never mutated after construction).
+
+### Auditing a risky struct
+
+1. Identify the at-risk condition (e.g. "has `metadata: dict` that could be mutated").
+2. Search the codebase for all uses of that struct and of the at-risk field:
+   - Any assignment to the field: `obj.field = ...`, `obj.field[key] = ...`, `obj.field.append(...)`, `obj.field.update(...)`, etc.
+   - Any code path that could store the struct (or something holding it) inside that container.
+3. If the audit finds **no** such mutation or cycle-creating storage, the condition never arises and gc=False is acceptable **provided** you add the at-risk marker so future changes are re-audited.
+
+### When audit passes
+
+- Set `gc=False` on the struct.
+- Add an **at-risk comment** and docstring note:
+
+  - **Above the class**: a short comment stating why gc=False is used despite the at-risk condition, and when the audit was done (e.g. `# gc=False: audit YYYY-MM: <condition> is only read, never mutated.`).
+  - **In the docstring**: a line that signals to future readers and to this skill that changes touching this struct must be re-audited. Use this format:
+
+    `AT-RISK (gc=False): Has <brief condition>. Any change that <what would violate safety> must be audited; if so, remove gc=False.`
+
+- Example (for a struct with a `metadata` dict that is only ever read):
+
+  ```python
+  # gc=False: audit 2026-03: metadata dict is only ever read, never mutated after construction.
+  class QueryResult(msgspec.Struct, ..., gc=False):
+      """Result of a completed inference query.
+
+      AT-RISK (gc=False): Has mutable container field `metadata`. Any change that
+      mutates `metadata` after construction or stores this struct in a container
+      referenced by this struct must be audited; if so, remove gc=False.
+      ...
+  ```
+
+### When touching an at-risk struct
+
+If you are adding or changing code that uses a struct marked AT-RISK (gc=False):
+
+1. Re-run the audit for that struct (searches above).
+2. If your change mutates the at-risk field(s) or creates a cycle (e.g. stores the struct in its own container), **remove** `gc=False` from the struct and remove the at-risk comment/docstring line.
+3. If your change does not touch the at-risk field or create cycles, the existing gc=False and at-risk comment remain; you may add a short note in the at-risk comment if the audit was re-checked (e.g. update the audit date).
+
+## References
+
+- [msgspec Structs – Disabling Garbage Collection](https://jcristharif.com/msgspec/structs.html#struct-gc)
+- [msgspec Performance Tips – Use gc=False](https://jcristharif.com/msgspec/perf-tips.html#use-gc-false)
+- [msgspec #631 – Generic structs and gc=False](https://github.com/jcrist/msgspec/issues/631)
@@ -30,13 +30,13 @@ jobs:
         run: |
           pytest -xv -m "not slow and not performance" --cov=src --cov-report=xml --cov-report=html
 
-      - name: Upload coverage to Codecov
-        uses: codecov/codecov-action@57e3a136b779b570ffcdbf80b3bdc90e7fab3de2 # v6.0.0
+      - name: Upload coverage report
+        uses: actions/upload-artifact@v4
         with:
-          file: ./coverage.xml
-          flags: unittests
-          name: codecov-umbrella
-          fail_ci_if_error: false
+          name: coverage-report
+          path: |
+            coverage.xml
+            htmlcov/
 
   audit:
     runs-on: ubuntu-latest
 
@@ -189,5 +189,10 @@ outputs/
 # Example vLLM virtualenv
 examples/03_BenchmarkComparison/vllm_venv/
 
-# Cursor artifacts (local development only)
+# Agent artifacts (local development only)
 .cursor_artifacts/
+.claude/agent-memory/
+
+# User-specific local rules (local Docker dev); do not commit
+.cursor/rules/local-docker-dev.mdc
+CLAUDE.local.md
@@ -73,7 +73,7 @@ CLI is auto-generated from `config/schema.py` Pydantic models via cyclopts. Fiel
 
 - **CLI mode** (`offline`/`online`): cyclopts constructs `OfflineBenchmarkConfig`/`OnlineBenchmarkConfig` (subclasses in `config/schema.py`) directly from CLI args. Type locked via `Literal`. `--dataset` is repeatable with TOML-style format `[perf|acc:]<path>[,key=value...]` (e.g. `--dataset data.csv,samples=500,parser.prompt=article`). Full accuracy support via `accuracy_config.eval_method=pass_at_1` etc.
 - **YAML mode** (`from-config`): `BenchmarkConfig.from_yaml_file()` loads YAML, resolves env vars, and auto-selects the right subclass via Pydantic discriminated union. Optional `--timeout`/`--mode` overrides via `config.with_updates()`.
-- **eval**: Not yet implemented (raises `NotImplementedError`)
+- **eval**: Not yet implemented (raises `CLIError` with a tracking issue link)
 
 ### Config Construction & Validation
 
@@ -137,7 +137,11 @@ src/inference_endpoint/
 │   └── utils.py               # Port range helpers
 ├── async_utils/
 │   ├── loop_manager.py        # LoopManager (uvloop + eager_task_factory)
+│   ├── runner.py              # run_async() — uvloop + eager_task_factory entry point for CLI commands
 │   ├── event_publisher.py     # Async event pub/sub
+│   ├── services/
+│   │   ├── event_logger/      # EventLoggerService: writes EventRecords to JSONL/SQLite
+│   │   └── metrics_aggregator/ # MetricsAggregatorService: real-time metrics (TTFT, TPOT, ISL, OSL)
 │   └── transport/             # ZMQ-based IPC transport layer
 │       ├── protocol.py        # Transport protocols + TransportConfig base
 │       ├── record.py          # Transport records
@@ -192,26 +196,20 @@ tests/
 
 ## Development Standards
 
-### Code Style
+### Code Style and Pre-commit Hooks
 
 - **Formatter/Linter**: `ruff` (line-length 88, target Python 3.12)
 - **Type checking**: `mypy` (via pre-commit)
 - **Formatting**: `ruff-format` (double quotes, space indent)
 - **License headers**: Required on all Python files (enforced by pre-commit hook `scripts/add_license_header.py`)
 - **Conventional commits**: `feat:`, `fix:`, `docs:`, `test:`, `chore:`
 
-### Pre-commit Hooks
-
-All of these run automatically on commit:
-
-- trailing-whitespace, end-of-file-fixer, check-yaml, check-merge-conflict, debug-statements
-- `ruff` (lint + autofix) and `ruff-format`
-- `mypy` type checking
-- `prettier` for YAML/JSON/Markdown
-- License header enforcement
+All of these hooks run automatically on commit: trailing-whitespace, end-of-file-fixer, check-yaml, check-merge-conflict, debug-statements, `ruff` (lint + autofix), `ruff-format`, `mypy`, `prettier` (YAML/JSON/Markdown), license header enforcement.
 
 **Always run `pre-commit run --all-files` before committing.**
 
+See [Development Guide](docs/DEVELOPMENT.md) for full setup and workflow details.
+
 ### Data Types & Serialization
 
 - **Core types** (`Query`, `QueryResult`, `StreamChunk`): `msgspec.Struct` with `frozen=True`, `array_like=True`, `gc=False`, `omit_defaults=True`
@@ -291,7 +289,7 @@ Update AGENTS.md as part of any PR that includes a **significant refactor**, mea
 - **Added or removed CLI commands/subcommands** — update CLI Modes and Common Commands
 - **Changed test infrastructure** (new fixtures, changed markers, new test directories) — update Testing section
 - **Added or removed key dependencies** — update Key Dependencies table
-- **Changed build/tooling** (new pre-commit hooks, changed ruff config, new CI steps) — update Code Style and Pre-commit Hooks
+- **Changed build/tooling** (new pre-commit hooks, changed ruff config, new CI steps) — update [docs/DEVELOPMENT.md](docs/DEVELOPMENT.md)
 - **Changed hot-path patterns** (new transport, changed serialization, new performance constraints) — update Performance Guidelines
 
 ### How to Update
 
@@ -2,4 +2,6 @@
 
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 
+Full guidance is maintained in AGENTS.md (shared with all AI coding agents) and is included below:
+
 @AGENTS.md
@@ -7,3 +7,5 @@ Generally we encourage people to become MLCommons members if they wish to contri
 Regardless of whether you are a member, your organization (or you as an individual contributor) needs to sign the MLCommons Contributor License Agreement (CLA). Please submit your GitHub username to the [MLCommons Subscription form](https://mlcommons.org/community/subscribe/) to start that process.
 
 MLCommons project work is tracked with issue trackers and pull requests. Modify the project in your own fork and issue a pull request once you want other developers to take a look at what you have done and discuss the proposed changes. Ensure that cla-bot and other checks pass for your pull requests.
+
+For project-specific development standards (code style, test requirements, pre-commit hooks, commit format), see the [Development Guide](docs/DEVELOPMENT.md).
@@ -66,7 +66,7 @@ inference-endpoint benchmark offline \
 
 ```bash
 # Start local echo server
-python -m inference_endpoint.testing.echo_server --port 8765 &
+python3 -m inference_endpoint.testing.echo_server --port 8765 &
 
 # Test with dummy dataset (included in repo)
 inference-endpoint benchmark offline \
@@ -94,33 +94,51 @@ pytest -m "not performance and not run_explicitly"
 
 ## 📚 Documentation
 
+- [AGENTS.md](AGENTS.md) - Architecture, conventions, and AI agent guidelines
 - [CLI Quick Reference](docs/CLI_QUICK_REFERENCE.md) - Command-line interface guide
 - [Local Testing Guide](docs/LOCAL_TESTING.md) - Test with echo server
 - [Development Guide](docs/DEVELOPMENT.md) - How to contribute and develop
+- [Performance Architecture](docs/PERF_ARCHITECTURE.md) - Hot-path design and tuning
+- [Performance Tuning](docs/CLIENT_PERFORMANCE_TUNING.md) - CPU affinity and client tuning
 - [GitHub Setup Guide](docs/GITHUB_SETUP.md) - GitHub authentication and setup
 
+### Component Design Specs
+
+Each top-level component under `src/inference_endpoint/` has a corresponding spec:
+
+| Component         | Spec                                                             |
+| ----------------- | ---------------------------------------------------------------- |
+| Core types        | [docs/core/DESIGN.md](docs/core/DESIGN.md)                       |
+| Load generator    | [docs/load_generator/DESIGN.md](docs/load_generator/DESIGN.md)   |
+| Endpoint client   | [docs/endpoint_client/DESIGN.md](docs/endpoint_client/DESIGN.md) |
+| Metrics           | [docs/metrics/DESIGN.md](docs/metrics/DESIGN.md)                 |
+| Config            | [docs/config/DESIGN.md](docs/config/DESIGN.md)                   |
+| Async utils       | [docs/async_utils/DESIGN.md](docs/async_utils/DESIGN.md)         |
+| Dataset manager   | [docs/dataset_manager/DESIGN.md](docs/dataset_manager/DESIGN.md) |
+| Commands (CLI)    | [docs/commands/DESIGN.md](docs/commands/DESIGN.md)               |
+| OpenAI adapter    | [docs/openai/DESIGN.md](docs/openai/DESIGN.md)                   |
+| SGLang adapter    | [docs/sglang/DESIGN.md](docs/sglang/DESIGN.md)                   |
+| Evaluation        | [docs/evaluation/DESIGN.md](docs/evaluation/DESIGN.md)           |
+| Testing utilities | [docs/testing/DESIGN.md](docs/testing/DESIGN.md)                 |
+| Profiling         | [docs/profiling/DESIGN.md](docs/profiling/DESIGN.md)             |
+| Plugins           | [docs/plugins/DESIGN.md](docs/plugins/DESIGN.md)                 |
+| Utils             | [docs/utils/DESIGN.md](docs/utils/DESIGN.md)                     |
+
 ## 🎯 Architecture
 
 The system follows a modular, event-driven architecture:
 
 ```
-┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
-│   Dataset       │    │   Load          │    │   Endpoint      │
-│   Manager       │───▶│   Generator     │───▶│   Client        │
-└─────────────────┘    └─────────────────┘    └─────────────────┘
-         │                       │                       │
-         ▼                       ▼                       ▼
-┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
-│   Metrics       │    │   Configuration │    │   Endpoint      │
-│   Collector     │◄───│   Manager       │    │   (External)    │
-└─────────────────┘    └─────────────────┘    └─────────────────┘
+Dataset Manager ──► Load Generator ──► Endpoint Client ──► External Endpoint
+                          │
+                    Metrics Collector
+                 (event logging + reporting)
 ```
 
-- **Load Generator**: Central orchestrator managing query lifecycle
-- **Dataset Manager**: Handles benchmark datasets and preprocessing
-- **Endpoint Client**: Abstract interface for endpoint communication
-- **Metrics Collector**: Performance measurement and analysis
-- **Configuration Manager**: System configuration (TBD)
+- **Dataset Manager**: Loads benchmark datasets and applies transform pipelines
+- **Load Generator**: Central orchestrator — controls timing (scheduler), issues queries, and emits sample events
+- **Endpoint Client**: Multi-process HTTP worker pool communicating over ZMQ IPC
+- **Metrics Collector**: Receives sample events from Load Generator; writes to SQLite (EventRecorder), aggregates after the run (MetricsReporter)
 
 ## Accuracy Evaluation
 
@@ -132,14 +150,13 @@ configuration. Currently, Inference Endpoints provides the following pre-defined
 - LiveCodeBench (default: lite, release_v6)
 
 However, LiveCodeBench will not work out-of-the-box and requires some additional setup. See the
-[LiveCodeBench](src/inference_endpoint/dataset_manager/predefined/livecodebench/README.md) documentation
-for details and explanations.
+[LiveCodeBench](src/inference_endpoint/evaluation/livecodebench/README.md) documentation for
+details and explanations.
 
 ## 🚧 Pending Features
 
 The following features are planned for future releases:
 
-- [ ] **Performance Tuning** - Advanced performance optimization features
 - [ ] **Submission Ruleset Integration** - Full MLPerf submission workflow support
 - [ ] **Documentation Generation and Hosting** - Sphinx-based API documentation with GitHub Pages
 
@@ -166,7 +183,8 @@ We are grateful to these communities for their contributions to LLM benchmarking
 
 ## 📄 License
 
-This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
+This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE.md) file for
+details.
 
 ## 🔗 Links
 
 
@@ -172,9 +172,11 @@ InputValidationError    2           Bad user input, invalid config
 SetupError              3           Dataset load failure, connection error
 ExecutionError          4           Benchmark failed after setup
 CLIError                1           Generic CLI error (base class)
-NotImplementedError     1           Unimplemented command (eval)
 ```
 
+The reserved `eval` command currently raises `CLIError` with a tracking issue link rather than a
+dedicated exception type.
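The mapping in the table can be sketched as a small exception hierarchy. The class names and exit codes come from the table; the `exit_code` class attribute, the `exit_code_for` helper, and the fallback of 1 for unrelated exceptions are assumptions made for illustration and may not match the real implementation.

```python
class CLIError(Exception):
    """Generic CLI error (base class)."""

    exit_code = 1  # assumed attribute name


class InputValidationError(CLIError):
    """Bad user input, invalid config."""

    exit_code = 2


class SetupError(CLIError):
    """Dataset load failure, connection error."""

    exit_code = 3


class ExecutionError(CLIError):
    """Benchmark failed after setup."""

    exit_code = 4


def exit_code_for(exc: BaseException) -> int:
    """Map an exception to a process exit code (hypothetical helper)."""
    return exc.exit_code if isinstance(exc, CLIError) else 1


def run_eval() -> None:
    """Reserved command: raises the base CLIError, so it exits with code 1."""
    raise CLIError("eval is not yet implemented")
```

Because `eval` raises the base class, it needs no row of its own in the table.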
+
 ## Development Guide
 
 ### Adding a CLI flag