Commit 9a7697b

arekay-nv and claude committed
docs: add component design specs and refactor developer documentation
- Add Design.md specs for all 15 top-level components under src/inference_endpoint/
- Restructure AGENTS.md: move code style details to DEVELOPMENT.md, update component table with runner.py and async_utils services
- Update README.md: add Component Design Specs table, use python3 in examples
- Reformat DEVELOPMENT.md: remove emojis, add commit type list, exact-version pinning guidance
- Update CLI_QUICK_REFERENCE.md, LOCAL_TESTING.md, ENDPOINT_CLIENT.md, GITHUB_SETUP.md for consistency
- Fix stale references: pkl→jsonl throughout, CLIError for eval mode, dataset_manager Design.md reflects current supported formats

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent c00652d commit 9a7697b

27 files changed

Lines changed: 1866 additions & 142 deletions

AGENTS.md

Lines changed: 9 additions & 24 deletions
````diff
@@ -9,10 +9,7 @@ High-performance benchmarking tool for LLM inference endpoints targeting 50k+ QPS
 ## Common Commands
 
 ```bash
-# Development setup
-python3.12 -m venv venv && source venv/bin/activate
-pip install -e ".[dev,test]"
-pre-commit install
+# Development setup — see docs/DEVELOPMENT.md for full instructions
 
 # Testing
 pytest                      # All tests (excludes slow/performance)
@@ -73,7 +70,7 @@ CLI is auto-generated from `config/schema.py` Pydantic models via cyclopts. Fiel
 
 - **CLI mode** (`offline`/`online`): cyclopts constructs `OfflineBenchmarkConfig`/`OnlineBenchmarkConfig` (subclasses in `config/schema.py`) directly from CLI args. Type locked via `Literal`. `--dataset` is repeatable with TOML-style format `[perf|acc:]<path>[,key=value...]` (e.g. `--dataset data.csv,samples=500,parser.prompt=article`). Full accuracy support via `accuracy_config.eval_method=pass_at_1` etc.
 - **YAML mode** (`from-config`): `BenchmarkConfig.from_yaml_file()` loads YAML, resolves env vars, and auto-selects the right subclass via Pydantic discriminated union. Optional `--timeout`/`--mode` overrides via `config.with_updates()`.
-- **eval**: Not yet implemented (raises `NotImplementedError`)
+- **eval**: Not yet implemented (raises `CLIError` with a tracking issue link)
 
 ### Config Construction & Validation
 
@@ -137,7 +134,11 @@ src/inference_endpoint/
 │   └── utils.py            # Port range helpers
 ├── async_utils/
 │   ├── loop_manager.py     # LoopManager (uvloop + eager_task_factory)
+│   ├── runner.py           # run_async() — uvloop + eager_task_factory entry point for CLI commands
 │   ├── event_publisher.py  # Async event pub/sub
+│   ├── services/
+│   │   ├── event_logger/        # EventLoggerService: writes EventRecords to JSONL/SQLite
+│   │   └── metrics_aggregator/  # MetricsAggregatorService: real-time metrics (TTFT, TPOT, ISL, OSL)
 │   └── transport/          # ZMQ-based IPC transport layer
 │       ├── protocol.py     # Transport protocols + TransportConfig base
 │       ├── record.py       # Transport records
````
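The `metrics_aggregator` service added above tracks TTFT (time to first token) and TPOT (time per output token). As a rough stdlib-only illustration of how those two metrics are typically derived from per-request timestamps (the field names here are hypothetical stand-ins, not the service's actual records):

```python
from dataclasses import dataclass


@dataclass
class SampleTiming:
    """Hypothetical per-request timing record (times in seconds)."""
    sent_at: float          # request issued
    first_token_at: float   # first streamed token received
    done_at: float          # final token received
    output_tokens: int      # OSL: output sequence length


def ttft(s: SampleTiming) -> float:
    """Time to first token."""
    return s.first_token_at - s.sent_at


def tpot(s: SampleTiming) -> float:
    """Time per output token, excluding the first token."""
    if s.output_tokens <= 1:
        return 0.0
    return (s.done_at - s.first_token_at) / (s.output_tokens - 1)


sample = SampleTiming(sent_at=0.0, first_token_at=0.25, done_at=2.25, output_tokens=101)
print(ttft(sample))  # 0.25
print(tpot(sample))  # 0.02
```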
````diff
@@ -192,25 +193,9 @@ tests/
 
 ## Development Standards
 
-### Code Style
+### Code Style and Pre-commit Hooks
 
-- **Formatter/Linter**: `ruff` (line-length 88, target Python 3.12)
-- **Type checking**: `mypy` (via pre-commit)
-- **Formatting**: `ruff-format` (double quotes, space indent)
-- **License headers**: Required on all Python files (enforced by pre-commit hook `scripts/add_license_header.py`)
-- **Conventional commits**: `feat:`, `fix:`, `docs:`, `test:`, `chore:`
-
-### Pre-commit Hooks
-
-All of these run automatically on commit:
-
-- trailing-whitespace, end-of-file-fixer, check-yaml, check-merge-conflict, debug-statements
-- `ruff` (lint + autofix) and `ruff-format`
-- `mypy` type checking
-- `prettier` for YAML/JSON/Markdown
-- License header enforcement
-
-**Always run `pre-commit run --all-files` before committing.**
+See [Development Guide](docs/DEVELOPMENT.md) for formatting, linting, and pre-commit hook details.
 
 ### Data Types & Serialization
 
@@ -291,7 +276,7 @@ Update AGENTS.md as part of any PR that includes a **significant refactor**, mea
 - **Added or removed CLI commands/subcommands** — update CLI Modes and Common Commands
 - **Changed test infrastructure** (new fixtures, changed markers, new test directories) — update Testing section
 - **Added or removed key dependencies** — update Key Dependencies table
-- **Changed build/tooling** (new pre-commit hooks, changed ruff config, new CI steps) — update Code Style and Pre-commit Hooks
+- **Changed build/tooling** (new pre-commit hooks, changed ruff config, new CI steps) — update [docs/DEVELOPMENT.md](docs/DEVELOPMENT.md)
 - **Changed hot-path patterns** (new transport, changed serialization, new performance constraints) — update Performance Guidelines
 
 ### How to Update
````
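The repeatable `--dataset` flag described in this diff uses the mini-format `[perf|acc:]<path>[,key=value...]`. A minimal self-contained parser sketch for that shape (the helper name and the `perf` default are assumptions for illustration, not the project's actual implementation):

```python
def parse_dataset_spec(spec: str) -> dict:
    """Parse '[perf|acc:]<path>[,key=value...]' into its parts."""
    path_part, *pairs = spec.split(",")
    role = "perf"  # assumed default when no role prefix is given
    if path_part.startswith(("perf:", "acc:")):
        role, path_part = path_part.split(":", 1)
    options = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        options[key] = value
    return {"role": role, "path": path_part, "options": options}


print(parse_dataset_spec("data.csv,samples=500,parser.prompt=article"))
# {'role': 'perf', 'path': 'data.csv', 'options': {'samples': '500', 'parser.prompt': 'article'}}
```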

CONTRIBUTING.md

Lines changed: 2 additions & 0 deletions
````diff
@@ -7,3 +7,5 @@ Generally we encourage people to become MLCommons members if they wish to contri
 Regardless of whether you are a member, your organization (or you as an individual contributor) needs to sign the MLCommons Contributor License Agreement (CLA). Please submit your GitHub username to the [MLCommons Subscription form](https://mlcommons.org/community/subscribe/) to start that process.
 
 MLCommons project work is tracked with issue trackers and pull requests. Modify the project in your own fork and issue a pull request once you want other developers to take a look at what you have done and discuss the proposed changes. Ensure that cla-bot and other checks pass for your pull requests.
+
+For project-specific development standards (code style, test requirements, pre-commit hooks, commit format), see the [Development Guide](docs/DEVELOPMENT.md).
````

README.md

Lines changed: 38 additions & 20 deletions
````diff
@@ -66,7 +66,7 @@ inference-endpoint benchmark offline \
 
 ```bash
 # Start local echo server
-python -m inference_endpoint.testing.echo_server --port 8765 &
+python3 -m inference_endpoint.testing.echo_server --port 8765 &
 
 # Test with dummy dataset (included in repo)
 inference-endpoint benchmark offline \
@@ -94,33 +94,51 @@ pytest -m "not performance and not run_explicitly"
 
 ## 📚 Documentation
 
+- [AGENTS.md](AGENTS.md) - Architecture, conventions, and AI agent guidelines
 - [CLI Quick Reference](docs/CLI_QUICK_REFERENCE.md) - Command-line interface guide
 - [Local Testing Guide](docs/LOCAL_TESTING.md) - Test with echo server
 - [Development Guide](docs/DEVELOPMENT.md) - How to contribute and develop
+- [Performance Architecture](docs/PERF_ARCHITECTURE.md) - Hot-path design and tuning
+- [Performance Tuning](docs/CLIENT_PERFORMANCE_TUNING.md) - CPU affinity and client tuning
 - [GitHub Setup Guide](docs/GITHUB_SETUP.md) - GitHub authentication and setup
 
+### Component Design Specs
+
+Each top-level component under `src/inference_endpoint/` has a corresponding spec:
+
+| Component         | Spec                                                             |
+| ----------------- | ---------------------------------------------------------------- |
+| Core types        | [docs/core/Design.md](docs/core/Design.md)                       |
+| Load generator    | [docs/load_generator/Design.md](docs/load_generator/Design.md)   |
+| Endpoint client   | [docs/endpoint_client/Design.md](docs/endpoint_client/Design.md) |
+| Metrics           | [docs/metrics/Design.md](docs/metrics/Design.md)                 |
+| Config            | [docs/config/Design.md](docs/config/Design.md)                   |
+| Async utils       | [docs/async_utils/Design.md](docs/async_utils/Design.md)         |
+| Dataset manager   | [docs/dataset_manager/Design.md](docs/dataset_manager/Design.md) |
+| Commands (CLI)    | [docs/commands/Design.md](docs/commands/Design.md)               |
+| OpenAI adapter    | [docs/openai/Design.md](docs/openai/Design.md)                   |
+| SGLang adapter    | [docs/sglang/Design.md](docs/sglang/Design.md)                   |
+| Evaluation        | [docs/evaluation/Design.md](docs/evaluation/Design.md)           |
+| Testing utilities | [docs/testing/Design.md](docs/testing/Design.md)                 |
+| Profiling         | [docs/profiling/Design.md](docs/profiling/Design.md)             |
+| Plugins           | [docs/plugins/Design.md](docs/plugins/Design.md)                 |
+| Utils             | [docs/utils/Design.md](docs/utils/Design.md)                     |
+
 ## 🎯 Architecture
 
 The system follows a modular, event-driven architecture:
 
 ```
-┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
-│    Dataset      │    │      Load       │    │    Endpoint     │
-│    Manager      │───▶│    Generator    │───▶│     Client      │
-└─────────────────┘    └─────────────────┘    └─────────────────┘
-         │                      │                      │
-         ▼                      ▼                      ▼
-┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
-│    Metrics      │    │  Configuration  │    │    Endpoint     │
-│   Collector     │◄───│     Manager     │    │   (External)    │
-└─────────────────┘    └─────────────────┘    └─────────────────┘
+Dataset Manager ──► Load Generator ──► Endpoint Client ──► External Endpoint
+
+Metrics Collector
+(EventRecorder + MetricsReporter)
 ```
 
-- **Load Generator**: Central orchestrator managing query lifecycle
-- **Dataset Manager**: Handles benchmark datasets and preprocessing
-- **Endpoint Client**: Abstract interface for endpoint communication
-- **Metrics Collector**: Performance measurement and analysis
-- **Configuration Manager**: System configuration (TBD)
+- **Dataset Manager**: Loads benchmark datasets and applies transform pipelines
+- **Load Generator**: Central orchestrator — controls timing (scheduler), issues queries, and emits sample events
+- **Endpoint Client**: Multi-process HTTP worker pool communicating over ZMQ IPC
+- **Metrics Collector**: Receives sample events from Load Generator; writes to SQLite (EventRecorder), aggregates after the run (MetricsReporter)
 
 ## Accuracy Evaluation
 
````
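The Metrics Collector described in this diff persists per-sample events (EventRecorder writes to SQLite; the event_logger service also supports JSONL). As a generic, stdlib-only illustration of the JSONL side of that pattern (the record shape and helper name are hypothetical):

```python
import json
import os
import tempfile


def append_event(path: str, event: dict) -> None:
    """Append one event as a single JSON line (JSONL)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
append_event(path, {"type": "sample_done", "ttft_ms": 250})
append_event(path, {"type": "sample_done", "ttft_ms": 180})

# Reading the log back is one json.loads per line
with open(path, encoding="utf-8") as f:
    events = [json.loads(line) for line in f]
print(len(events))  # 2
```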
````diff
@@ -132,14 +150,13 @@ configuration. Currently, Inference Endpoints provides the following pre-defined
 - LiveCodeBench (default: lite, release_v6)
 
 However, LiveCodeBench will not work out-of-the-box and requires some additional setup. See the
-[LiveCodeBench](src/inference_endpoint/dataset_manager/predefined/livecodebench/README.md) documentation
-for details and explanations.
+[LiveCodeBench](src/inference_endpoint/evaluation/livecodebench/README.md) documentation for
+details and explanations.
 
 ## 🚧 Pending Features
 
 The following features are planned for future releases:
 
-- [ ] **Performance Tuning** - Advanced performance optimization features
 - [ ] **Submission Ruleset Integration** - Full MLPerf submission workflow support
 - [ ] **Documentation Generation and Hosting** - Sphinx-based API documentation with GitHub Pages
 
@@ -166,7 +183,8 @@ We are grateful to these communities for their contributions to LLM benchmarking
 
 ## 📄 License
 
-This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
+This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE.md) file for
+details.
 
 ## 🔗 Links
 
````

docs/CLI_QUICK_REFERENCE.md

Lines changed: 6 additions & 12 deletions
````diff
@@ -1,13 +1,6 @@
 # CLI Quick Reference
 
-## Architecture
-
-The CLI is auto-generated from Pydantic models in `config/schema.py` using
-cyclopts. schema.py is the single source of truth for both YAML configs and CLI flags.
-
-- **All schema fields** available as CLI flags on each subcommand (dotted kebab-case)
-- **Shorthand aliases** declared via `cyclopts.Parameter(alias="--flag")` on schema fields
-- **`${VAR}` interpolation** in YAML files (with `${VAR:-default}` fallback)
+Command-line reference for all `inference-endpoint` subcommands, flags, load patterns, and usage examples.
 
 ## Commands
 
@@ -109,6 +102,8 @@ Flag names shown as `--full.dotted.path --alias`. Both forms work.
 - `--endpoint-config.api-key --api-key` - API authentication
 - `--endpoint-config.api-type --api-type` - API type: openai/sglang (default: openai)
 - `--report-dir` - Report output directory
+  Note: applies to CLI-driven `benchmark offline` / `benchmark online`; `benchmark from-config`
+  does not expose a CLI override for `report_dir`, so set it in the YAML.
 - `--timeout` - Global timeout in seconds
 - `--enable-cpu-affinity / --no-cpu-affinity` - NUMA-aware CPU pinning (default: true)
 
@@ -169,7 +164,7 @@ Accuracy config is supported in both CLI and YAML:
 inference-endpoint benchmark offline \
   --endpoints URL --model M \
   --dataset perf:perf.jsonl \
-  --dataset acc:eval.jsonl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer \
+  --dataset acc:eval.jsonl,accuracy_config.eval_method=pass_at_1,accuracy_config.ground_truth=answer,accuracy_config.extractor=boxed_math_extractor \
   --mode both
 ```
 
````
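The accuracy example above uses `eval_method=pass_at_1`. For context, pass@k is commonly computed with the unbiased estimator popularized by the HumanEval paper; a self-contained sketch of that standard formula (shown for background, not necessarily the project's exact code):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per task, c correct, probability
    that at least one of k drawn samples is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a k-draw
    return 1.0 - comb(n - c, k) / comb(n, k)


# With one sample per task, pass@1 is simply whether it was correct:
print(pass_at_k(1, 1, 1))  # 1.0
# For k=1 the estimator reduces to c/n:
print(round(pass_at_k(10, 3, 1), 6))  # 0.3
```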
````diff
@@ -242,10 +237,9 @@ inference-endpoint init submission
 # 2. Edit submission_template.yaml (set model, datasets, ruleset, endpoint)
 
-# 3. Run (YAML mode)
+# 3. Run (YAML mode - config-driven; CLI only allows --config, --timeout, and --mode; set report-dir in the YAML)
 inference-endpoint benchmark from-config \
-  --config submission_template.yaml \
-  --report-dir official_results
+  --config submission_template.yaml
 ```
 
 ### Validate First
````
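The `from-config` path resolves `${VAR}` references (with `${VAR:-default}` fallback) in the YAML before validation. A stdlib-only sketch of that substitution step (the regex and function name are illustrative assumptions, not the project's actual resolver):

```python
import os
import re

# Matches ${NAME} and ${NAME:-default}
_ENV_REF = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")


def resolve_env(text: str) -> str:
    """Replace ${VAR} and ${VAR:-default} with environment values."""
    def sub(m: re.Match) -> str:
        name, default = m.group(1), m.group(2)
        value = os.environ.get(name, default)
        if value is None:
            raise KeyError(f"undefined variable with no default: {name}")
        return value
    return _ENV_REF.sub(sub, text)


os.environ["API_KEY"] = "secret"
print(resolve_env("api_key: ${API_KEY}"))          # api_key: secret
print(resolve_env("report_dir: ${OUT_DIR:-results}"))
```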
