docs(agents): add reference-hygiene rules + clean up violations

nv-alicheng · claude · nv-alicheng · commit 85dfac6a7d25 · 2026-05-06T17:16:29.000-07:00
Adds two new sections to AGENTS.md "Development Standards":

1. "Documentation references — no local-only artifacts" — docs and
   comments must not reference paths outside the repo (gitignored
   directories, local scratch dirs, contributor workstation paths).
   A reviewer fetching the PR should be able to follow every cited
   reference.

2. "Comments and docstrings — describe current state, not development
   history" — no comments narrating iteration on the codebase ("we
   tried X first", "an earlier implementation did Y"). Such pointers
   belong in the PR description and git log, not the source tree.
   Especially relevant under AI-assisted development where it's
   tempting to leave a paper trail of design pivots inline.
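
A rough sketch of how rule 1 could be checked mechanically. The helper
name, the regex, and the default prefix list are illustrative
assumptions; nothing like this is added by this commit.

```python
import re

# Hypothetical default: directory prefixes assumed to be gitignored.
IGNORED_DIR_PREFIXES = (".cursor_artifacts/",)

# Contributor-local paths: "~/..." or an absolute home directory.
_LOCAL_PATH_RE = re.compile(r"(?:~|/home/\w+|/Users/\w+)/[\w./-]+")


def find_local_only_refs(text, ignored_prefixes=IGNORED_DIR_PREFIXES):
    """Return (line_number, reference) pairs that look local-only."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # References into directories that are assumed to be gitignored.
        for prefix in ignored_prefixes:
            pattern = re.escape(prefix) + r"[\w./-]*"
            hits.extend((lineno, m.group(0)) for m in re.finditer(pattern, line))
        # References into a contributor's home directory.
        hits.extend((lineno, m.group(0)) for m in _LOCAL_PATH_RE.finditer(line))
    return hits
```

With the default prefixes, a setup instruction such as
`source .venv/bin/activate` is not flagged, which matches the "Allowed"
list in the new AGENTS.md section.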

Sweeps existing violations across both rules:

Production code: drops cites to ``metrics_pubsub_design_v5.md`` from
module/class docstrings (snapshot.py, registry.py, publisher.py) and
inlines self-contained rationale where useful (aggregator.py HDR
bounds, TOTAL_DURATION_NS comment).
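
For readers unfamiliar with the TOTAL_DURATION_NS rule, the update
described in the rewritten comment can be sketched as below. The
function and variable names are hypothetical and this is not the
aggregator's code; it only illustrates "max(current, event_timestamp -
session_start)" kept as a stored counter, so that a reader in another
process never has to interpret a time.monotonic_ns() value itself.

```python
def update_total_duration_ns(current_ns, event_timestamp_ns, session_start_ns):
    """Return the new counter value: the largest elapsed time seen so far."""
    elapsed_ns = event_timestamp_ns - session_start_ns
    return max(current_ns, elapsed_ns)


# Events may be handled slightly out of timestamp order; the counter
# never moves backwards.
session_start_ns = 1_000
total_ns = 0
for event_timestamp_ns in (1_500, 1_900, 1_700):
    total_ns = update_total_duration_ns(
        total_ns, event_timestamp_ns, session_start_ns
    )
# total_ns == 900
```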

Tests: removes "Migrated to ..." / "The legacy tests ..." framing
from rewritten test module docstrings; reframes regression-test
docstrings (test_registry.py, test_publisher.py, test_aggregator.py)
to describe the invariant being protected rather than narrating the
prior bug's discovery.
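
The invariant protected by the test_publisher.py docstring (cancel the
tick task and await it before publishing the final frame) can be
sketched with plain asyncio. The function names and the list standing
in for the wire are assumptions for illustration, not the
MetricsPublisher implementation.

```python
import asyncio


async def stop_ticker_then_publish(tick_task, frames):
    """Cancel the ticker, wait for it to finish, then emit the final frame."""
    tick_task.cancel()  # only requests cancellation at the next await point
    try:
        await tick_task  # resumes once the task has actually stopped
    except asyncio.CancelledError:
        # Expected here; a production version would need to distinguish
        # cancellation of the caller itself.
        pass
    frames.append("COMPLETE")


async def demo():
    frames = []

    async def ticker():
        while True:
            frames.append("LIVE")
            await asyncio.sleep(0)

    task = asyncio.create_task(ticker())
    await asyncio.sleep(0)  # let the ticker emit at least one live frame
    await stop_ticker_then_publish(task, frames)
    await asyncio.sleep(0)  # give any stray tick a chance to run
    return frames, task.cancelled()
```

Because the ticker is awaited after `cancel()`, no "LIVE" entry can be
appended after "COMPLETE", which is the ordering a conflating
subscriber depends on.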

AGENTS.md: removes its own reference to the gitignored design doc,
which violated the new rule.

Behavior: no functional changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/AGENTS.md b/AGENTS.md
@@ -109,7 +109,7 @@ Multi-process, event-loop design optimized for throughput:
 
 ### Metrics Aggregator subprocess (pub/sub)
 
-The aggregator is a separate process (`python -m inference_endpoint.async_utils.services.metrics_aggregator`) that subscribes to events and publishes `MetricsSnapshot` messages. State machine and wire contract are documented in `.cursor_artifacts/metrics_pubsub_design_v5.md` §1; key facts for working in this layer:
+The aggregator is a separate process (`python -m inference_endpoint.async_utils.services.metrics_aggregator`) that subscribes to events and publishes `MetricsSnapshot` messages. Key facts for working in this layer:
 
 - **Series storage**: each `SeriesSampler` keeps three parallel views: O(1) cheap rollups (count/total/min/max/sum_sq, exact), an HDR Histogram (cheap live percentiles), and an in-memory `array.array` of raw values (for exact percentiles in the `COMPLETE` snapshot). Hot path is `registry.record(name, value)` — no allocation, no I/O.
 - **Counter API**: `registry.increment(name, delta=1)` for sample-event counters. `registry.set_counter(name, value)` only for the two duration counters (`total_duration_ns` max-of-elapsed, `tracked_duration_ns` sum-of-blocks).
@@ -343,6 +343,49 @@ These apply especially to code in the hot path (load generator, endpoint client,
 - `src/inference_endpoint/openai/openai_types_gen.py` — auto-generated, excluded from ruff/pre-commit
 - `src/inference_endpoint/openai/openapi.yaml` — OpenAI API spec, excluded from pre-commit
 
+### Documentation references — no local-only artifacts
+
+**Code, comments, docstrings, tests, and committed Markdown MUST NOT reference paths that aren't in the repository.** This includes anything under `.gitignore`d directories (e.g. `.cursor_artifacts/`, design scratch dirs, untracked working notes), absolute paths to a contributor's workstation, build outputs, or unmerged branch artifacts. A reviewer fetching the PR should be able to follow every reference cited in the diff.
+
+**Why:** stale references compound — `See foo.md §3` is meaningless once `foo.md` is gone, renamed, or never existed in the merged tree, and rotting cross-references are how docs stop being trusted. AI agents reading the codebase later treat dangling pointers as ground truth and propagate confusion.
+
+**Allowed:**
+
+- Paths to files committed to the repo (`docs/...`, `src/...`, `tests/...`, `README.md`, etc.).
+- External URLs (issue trackers, PRs, RFCs, vendor docs).
+- Generic references to environment/setup that the reader is expected to create themselves (e.g. `source .venv/bin/activate` in a setup README, where `.venv` is the user's local venv).
+
+**Disallowed examples:**
+
+- `See .cursor_artifacts/foo_design.md §2` — `.cursor_artifacts/` is gitignored.
+- `See ~/work/notes/architecture.txt` — contributor-local.
+- `Tracked in metrics_pubsub_design_v5.md test impact section` — same gitignored doc.
+
+If a design doc is worth referencing from the source tree, commit it to `docs/` or inline the relevant content into the code comment / docstring. For one-off rationale that won't survive the conversation, prefer a self-contained explanation in the comment itself rather than a pointer to ephemera.
+
+### Comments and docstrings — describe current state, not development history
+
+**Don't write comments or docstrings that narrate iteration on the codebase.** Pointers to abandoned approaches, prior implementations, or design pivots belong in the PR description and `git log`, not in the source tree. They rot quickly: the prior implementation is gone, the reader has no way to evaluate the comparison, and the scaffolding accumulates with every iteration. Future readers — humans and AI agents alike — treat the comment as if it describes load-bearing context when it's actually historical clutter.
+
+This applies especially to AI-assisted development, where it's tempting to leave a paper trail of "we tried X first, then switched to Y" inside the source. That paper trail belongs in the PR description.
+
+**Disallowed patterns:**
+
+- `# Originally used X, but switched to Y for ...`
+- `# An earlier implementation did X — this version does Y`
+- `# Removed the foo parameter` / `# Replaced bar with baz`
+- `# Note: this used to be sync but is now async`
+- `# Regression: an earlier shape did X` — even in regression-test docstrings, drop the narrative framing.
+- `# An alternative design considered ... but was rejected because ...` (unless the rejected alternative is a _common_ path a future contributor might re-attempt — in that case, frame it as "**don't** do X because Y", not as developer history).
+
+**Allowed:**
+
+- **Current rationale**: `# Uses dict dispatch — hot path measured at sub-ms` (describes why the current design exists; no history).
+- **Regression context that doesn't narrate the prior bug's discovery**: `# Without this check, value > hdr_high silently corrupts the histogram total` (describes the bug being prevented, framed as a current invariant — not "we used to have a bug here").
+- **Inline TODO/FIXME** pointing at a tracking issue (URL or issue number, not "we plan to do X eventually").
+
+**Rule of thumb:** if the comment helps someone seeing the code for the first time understand its intent, the comment is fine. If the comment only makes sense to someone who saw the prior version, delete it.
+
 ## Keeping AGENTS.md Up to Date
 
 **This file is the source of truth for AI agents working in this repo.** If it is stale or wrong, every AI-assisted session starts from a broken foundation.
diff --git a/src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py b/src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py
@@ -71,13 +71,10 @@ class MetricCounterKey(str, Enum):
     TRACKED_SAMPLES_FAILED = "tracked_samples_failed"
     TRACKED_DURATION_NS = "tracked_duration_ns"
     # Total wall-clock duration since session start. Updated on every event as
-    # max(current, event_timestamp - session_start) to be defensive against
-    # non-monotonic timestamps.
-    #
-    # An alternative design was considered: store session_start_ns once and
-    # compute duration as (now - start) on read. This is infeasible because
-    # time.monotonic_ns() has inconsistent epoch per process — a reader in
-    # another process would get a meaningless value.
+    # max(current, event_timestamp - session_start). Stored as a counter
+    # rather than computed from (now - start) at read time because
+    # time.monotonic_ns() has a process-local epoch — a reader in another
+    # process would get a meaningless value.
     TOTAL_DURATION_NS = "total_duration_ns"
 
 
@@ -91,7 +88,9 @@ class MetricCounterKey(str, Enum):
 )
 
 
-# HDR bounds per series. See metrics_pubsub_design_v5.md §1 for rationale.
+# HDR bounds per series — chosen conservatively so realistic benchmark
+# values cannot fall outside [low, high]. Values outside the range are
+# clamped on insert and a warning is logged once per series.
 _NS_HDR_LOW: Final[int] = 1
 _NS_HDR_HIGH: Final[int] = 3_600_000_000_000  # 1 hour in ns
 _TOKEN_HDR_LOW: Final[int] = 1
diff --git a/src/inference_endpoint/async_utils/services/metrics_aggregator/publisher.py b/src/inference_endpoint/async_utils/services/metrics_aggregator/publisher.py
@@ -13,10 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-"""``MetricsPublisher``: publish ``MetricsSnapshot`` over pub/sub + disk fallback.
-
-See ``metrics_pubsub_design_v5.md`` §5 for the design and failure mode table.
-"""
+"""``MetricsPublisher``: publish ``MetricsSnapshot`` over pub/sub + disk fallback."""
 
 from __future__ import annotations
 
diff --git a/src/inference_endpoint/async_utils/services/metrics_aggregator/registry.py b/src/inference_endpoint/async_utils/services/metrics_aggregator/registry.py
@@ -26,8 +26,6 @@
 1. Cheap exact rollups (count/total/min/max/sum_sq) — O(1), exact.
 2. HDR Histogram — supports cheap live percentiles/histogram.
 3. ``array.array`` of raw values — supports exact final percentiles.
-
-See ``metrics_pubsub_design_v5.md`` §2 for full design.
 """
 
 from __future__ import annotations
diff --git a/src/inference_endpoint/async_utils/services/metrics_aggregator/snapshot.py b/src/inference_endpoint/async_utils/services/metrics_aggregator/snapshot.py
@@ -20,9 +20,6 @@
 ``DRAINING`` between ``ENDED`` and the final publish, ``COMPLETE`` for the
 last snapshot). The snapshot is the only public wire format between the
 aggregator and any consumer (main process, future TUI).
-
-See ``metrics_pubsub_design_v5.md`` §1 for invariants, field reference,
-and HDR bounds.
 """
 
 from __future__ import annotations
@@ -137,9 +134,6 @@ class MetricsSnapshot(
         metrics:          Tagged union of ``CounterStat`` and ``SeriesStat``,
                           ordered counters-first then series, registration
                           order within each.
-
-    See ``metrics_pubsub_design_v5.md`` §1 for the full reference table and
-    the state-machine diagram.
     """
 
     counter: int
diff --git a/tests/integration/commands/test_benchmark_command.py b/tests/integration/commands/test_benchmark_command.py
@@ -207,10 +207,9 @@ def _resolve_template(template_path: Path, server_url: str) -> dict:
     raw = re.sub(r"http://localhost:\d+", server_url, raw)
     data = yaml.safe_load(raw)
 
-    # Swap any gated default model name for a non-gated tokenizer. The
-    # generated templates' "eg: meta-llama/Llama-3.1-8B-Instruct" placeholder
-    # points at a gated repo; substituting gpt2 lets these tests run in CI
-    # without HF_TOKEN.
+    # Swap the placeholder-default model name for a non-gated tokenizer
+    # (see _TEST_MODEL_NAME above) so these tests can run in CI without
+    # HF_TOKEN.
     if "model_params" in data and isinstance(data["model_params"], dict):
         data["model_params"]["name"] = _TEST_MODEL_NAME
 
diff --git a/tests/unit/async_utils/services/metrics_aggregator/conftest.py b/tests/unit/async_utils/services/metrics_aggregator/conftest.py
@@ -15,9 +15,8 @@
 
 """Shared test doubles and factories for metrics aggregator tests.
 
-Migrated for the registry/publisher refactor (metrics_pubsub_design_v5):
-no more ``InMemoryKVStore``. Tests that need to inspect emitted values
-build them directly off a ``MetricsRegistry`` and a ``MetricsSnapshot``.
+Tests that need to inspect emitted values build them directly off a
+``MetricsRegistry`` and a ``MetricsSnapshot``.
 
 The helpers here are intentionally small — most reused-across-tests
 construction lives in ``_make_aggregator`` style fixtures local to each
diff --git a/tests/unit/async_utils/services/metrics_aggregator/test_aggregator.py b/tests/unit/async_utils/services/metrics_aggregator/test_aggregator.py
@@ -15,8 +15,7 @@
 
 """Tests for ``MetricsAggregatorService.process()``.
 
-Migrated to the registry/publisher refactor (metrics_pubsub_design_v5):
-events are injected directly via ``await agg.process([...])``; emitted
+Events are injected directly via ``await agg.process([...])``; emitted
 metrics are inspected by reading the ``MetricsRegistry``'s snapshot
 output. The aggregator is constructed with a real SUB socket (so the
 ``ZmqMessageSubscriber`` base initializes cleanly) and a mocked
@@ -486,12 +485,7 @@ async def test_complete_removes_row(self, tmp_path):
 
     @pytest.mark.asyncio
     async def test_session_ended_calls_publish_final(self, tmp_path):
-        """ENDED triggers ``publish_final`` on the publisher.
-
-        The legacy assertion was on ``store.closed``; with the registry/
-        publisher refactor the ENDED handler invokes ``publish_final``
-        and ``close`` on the (mocked) publisher.
-        """
+        """ENDED triggers ``publish_final`` and ``close`` on the publisher."""
         loop = asyncio.get_event_loop()
         with ManagedZMQContext.scoped(socket_dir=str(tmp_path)) as ctx:
             agg, _, publisher = make_aggregator(ctx, loop, "agg_ended_publish_final")
@@ -1023,5 +1017,4 @@ async def test_shutdown_drains_async_tasks(self, tmp_path):
     # NOTE(agents): Trigger exception handling (logger.exception paths) is not
     # exercised here. Adding a MockTokenizePool that raises on
     # token_count_async would let us assert no metric is emitted, the
-    # aggregator does not crash, and the task set is cleaned up. Tracked as
-    # follow-up; see the same TODO in the pre-refactor file.
+    # aggregator does not crash, and the task set is cleaned up.
diff --git a/tests/unit/async_utils/services/metrics_aggregator/test_aggregator_e2e.py b/tests/unit/async_utils/services/metrics_aggregator/test_aggregator_e2e.py
@@ -15,11 +15,9 @@
 
 """End-to-end pub/sub round-trip tests for the metrics aggregator.
 
-The legacy E2E suite exercised the full ``EventPublisherService`` →
-``MetricsAggregatorService`` → ``InMemoryKVStore`` pipeline. With the
-registry/publisher refactor, the wire surface that matters at this layer
-is the snapshot pub/sub channel: aggregator → ``MetricsPublisher`` →
-ZMQ PUB → ``MetricsSnapshotSubscriber``.
+The wire surface that matters at this layer is the snapshot pub/sub
+channel: aggregator → ``MetricsPublisher`` → ZMQ PUB →
+``MetricsSnapshotSubscriber``.
 
 These tests stand up a real ``MetricsPublisher`` and
 ``MetricsSnapshotSubscriber`` against a single ``ManagedZMQContext.scoped``
@@ -104,11 +102,10 @@ async def test_publish_final_arrives_at_subscriber(
     ):
         """``publish_final`` produces a COMPLETE snapshot reachable over IPC.
 
-        This replaces the legacy single-sample pipeline assertion: the
-        aggregator's ``publish_final`` is what crosses the wire, and the
-        ``MetricsSnapshotSubscriber`` is what the main process uses to
-        observe the run's end. The exact metric values aren't the point
-        here — the round-trip + state field is.
+        The aggregator's ``publish_final`` is what crosses the wire, and
+        the ``MetricsSnapshotSubscriber`` is what the main process uses
+        to observe the run's end. The exact metric values aren't the
+        point here — the round-trip + state field is.
         """
         loop = asyncio.get_event_loop()
         publisher, subscriber = _make_pair(
@@ -143,7 +140,7 @@ async def test_live_tick_then_final(
 
         Tracks the lifecycle the main process sees: subscriber's
         ``latest`` is updated by every live tick, and ``complete`` is
-        only set once. Mirrors the design v5 §1 state machine.
+        only set once (when the COMPLETE-state snapshot arrives).
         """
         loop = asyncio.get_event_loop()
         publisher, subscriber = _make_pair(
diff --git a/tests/unit/async_utils/services/metrics_aggregator/test_aggregator_error_handler.py b/tests/unit/async_utils/services/metrics_aggregator/test_aggregator_error_handler.py
@@ -92,10 +92,10 @@ async def test_error_event_increments_tracked_failed_when_row_exists(tmp_path):
     """ERROR for a tracked, in-flight sample increments BOTH total and
     tracked failure counters.
 
-    Regression for design v5 §3: this only works because session.py emits
-    ERROR before COMPLETE — if the order regresses, the row is removed by
-    set_field(...COMPLETE...) before the ERROR handler runs and
-    ``TRACKED_SAMPLES_FAILED`` silently stays at 0.
+    This only works because session.py emits ERROR before COMPLETE — if
+    the order regresses, the row is removed by set_field(...COMPLETE...)
+    before the ERROR handler runs and ``TRACKED_SAMPLES_FAILED`` silently
+    stays at 0.
     """
     import asyncio
 
diff --git a/tests/unit/async_utils/services/metrics_aggregator/test_metrics_table.py b/tests/unit/async_utils/services/metrics_aggregator/test_metrics_table.py
@@ -15,12 +15,9 @@
 
 """Tests for ``MetricsTable``, ``SampleRow``, and ``TrackedBlock``.
 
-Migrated to the registry-backed table introduced in
-``metrics_pubsub_design_v5.md``: ``MetricsTable(registry)`` instead of
-``MetricsTable(kv_store)``. The table itself is registry-agnostic for
-most flows — these tests pass a fresh ``MetricsRegistry`` per test and
-do not register any triggers, so the registry is only used to satisfy
-the constructor signature.
+The table is registry-agnostic for most flows — these tests pass a
+fresh ``MetricsRegistry`` per test and do not register any triggers,
+so the registry is only used to satisfy the constructor signature.
 """
 
 from __future__ import annotations
diff --git a/tests/unit/async_utils/services/metrics_aggregator/test_publisher.py b/tests/unit/async_utils/services/metrics_aggregator/test_publisher.py
@@ -167,9 +167,11 @@ async def test_publish_final_awaits_tick_task_cancellation(
     ):
         """publish_final MUST NOT return while the tick task could still emit.
 
-        Regression: an earlier shape called ``self._tick_task.cancel()`` but
-        did not await the task. With ``conflate=True`` on the SUB side, a late
-        live tick landing after the final frame would replace it in the queue.
+        ``self._tick_task.cancel()`` only schedules cancellation at the
+        next await point; without ``await``ing the task, a late live tick
+        landing after the COMPLETE frame would replace it in a
+        ``conflate=True`` SUB queue. publish_final must therefore await
+        cancellation before publishing COMPLETE.
         """
         loop = asyncio.get_event_loop()
         publisher = MetricsPublisher(
diff --git a/tests/unit/async_utils/services/metrics_aggregator/test_registry.py b/tests/unit/async_utils/services/metrics_aggregator/test_registry.py
@@ -157,10 +157,11 @@ def test_final_histogram_handles_zero_value(self):
     def test_hdr_histogram_count_matches_total(self):
         """HDR-derived histogram bucket counts must sum to the recorded count.
 
-        Regression: an earlier implementation derived counts via
-        ``get_count_at_value(hi) - get_count_at_value(lo)`` which returns
-        single-bucket counts, not cumulative — total ended up far less than
-        the actual recorded count.
+        Without this invariant, deriving display-bucket counts via the
+        difference of two ``get_count_at_value`` queries would silently
+        under-count: ``get_count_at_value(v)`` returns the count of the
+        single sub-bucket containing ``v``, not a cumulative count, so
+        the subtraction is meaningless.
         """
         s = self._make()
         for v in range(1, 101):
diff --git a/tests/unit/load_generator/test_async_session.py b/tests/unit/load_generator/test_async_session.py
@@ -561,7 +561,7 @@ async def inject_error():
         # ERROR must be emitted BEFORE COMPLETE so the metrics aggregator can
         # observe the in-flight tracked row before set_field(...COMPLETE...)
         # removes it. Reverting this order would silently zero
-        # tracked_samples_failed. See metrics_pubsub_design_v5.md §3.
+        # tracked_samples_failed.
         error_idx = publisher.events.index(error_events[0])
         complete_idx = publisher.events.index(complete_events[0])
         assert error_idx < complete_idx, (
diff --git a/tests/unit/metrics/test_report_builder.py b/tests/unit/metrics/test_report_builder.py
@@ -15,10 +15,8 @@
 
 """Tests for ``Report.from_snapshot`` and display helpers.
 
-Migrated from the ``Report.from_kv_reader`` / ``compute_summary``
-surfaces (both removed in metrics_pubsub_design_v5). Reports are now
-built from a ``MetricsSnapshot`` produced by a populated
-``MetricsRegistry`` — no on-disk KV store is involved.
+Reports are built from a ``MetricsSnapshot`` produced by a populated
+``MetricsRegistry``.
 """
 
 from __future__ import annotations