Skip to content

Commit 85dfac6

Browse files
nv-alichengclaude
andcommitted
docs(agents): add reference-hygiene rules + clean up violations
Adds two new sections to AGENTS.md "Development Standards": 1. "Documentation references — no local-only artifacts" — docs and comments must not reference paths outside the repo (gitignored directories, local scratch dirs, contributor workstation paths). A reviewer fetching the PR should be able to follow every cited reference. 2. "Comments and docstrings — describe current state, not development history" — no comments narrating iteration on the codebase ("we tried X first", "an earlier implementation did Y"). Such pointers belong in the PR description and git log, not the source tree. Especially relevant under AI-assisted development where it's tempting to leave a paper trail of design pivots inline. Sweeps existing violations across both rules: Production code: drops cites to ``metrics_pubsub_design_v5.md`` from module/class docstrings (snapshot.py, registry.py, publisher.py) and inlines self-contained rationale where useful (aggregator.py HDR bounds, TOTAL_DURATION_NS comment). Tests: removes "Migrated to ..." / "The legacy tests ..." framing from rewritten test module docstrings; reframes regression-test docstrings (test_registry.py, test_publisher.py, test_aggregator.py) to describe the invariant being protected rather than narrating the prior bug's discovery. AGENTS.md: removes its own self-violation cite to the gitignored design doc. Behavior: no functional changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9b3f34b commit 85dfac6

15 files changed

Lines changed: 88 additions & 71 deletions

File tree

AGENTS.md

Lines changed: 44 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -109,7 +109,7 @@ Multi-process, event-loop design optimized for throughput:
109109

110110
### Metrics Aggregator subprocess (pub/sub)
111111

112-
The aggregator is a separate process (`python -m inference_endpoint.async_utils.services.metrics_aggregator`) that subscribes to events and publishes `MetricsSnapshot` messages. State machine and wire contract are documented in `.cursor_artifacts/metrics_pubsub_design_v5.md` §1; key facts for working in this layer:
112+
The aggregator is a separate process (`python -m inference_endpoint.async_utils.services.metrics_aggregator`) that subscribes to events and publishes `MetricsSnapshot` messages. Key facts for working in this layer:
113113

114114
- **Series storage**: each `SeriesSampler` keeps three parallel views: O(1) cheap rollups (count/total/min/max/sum_sq, exact), an HDR Histogram (cheap live percentiles), and an in-memory `array.array` of raw values (for exact percentiles in the `COMPLETE` snapshot). Hot path is `registry.record(name, value)` — no allocation, no I/O.
115115
- **Counter API**: `registry.increment(name, delta=1)` for sample-event counters. `registry.set_counter(name, value)` only for the two duration counters (`total_duration_ns` max-of-elapsed, `tracked_duration_ns` sum-of-blocks).
@@ -343,6 +343,49 @@ These apply especially to code in the hot path (load generator, endpoint client,
343343
- `src/inference_endpoint/openai/openai_types_gen.py` — auto-generated, excluded from ruff/pre-commit
344344
- `src/inference_endpoint/openai/openapi.yaml` — OpenAI API spec, excluded from pre-commit
345345

346+
### Documentation references — no local-only artifacts
347+
348+
**Code, comments, docstrings, tests, and committed Markdown MUST NOT reference paths that aren't in the repository.** This includes anything under `.gitignore`d directories (e.g. `.cursor_artifacts/`, design scratch dirs, untracked working notes), absolute paths to a contributor's workstation, build outputs, or unmerged branch artifacts. A reviewer fetching the PR should be able to follow every reference cited in the diff.
349+
350+
**Why:** stale references compound — `See foo.md §3` is meaningless once `foo.md` is gone, renamed, or never existed in the merged tree, and rotting cross-references are how docs stop being trusted. AI agents reading the codebase later treat dangling pointers as ground truth and propagate confusion.
351+
352+
**Allowed:**
353+
354+
- Paths to files committed to the repo (`docs/...`, `src/...`, `tests/...`, `README.md`, etc.).
355+
- External URLs (issue trackers, PRs, RFCs, vendor docs).
356+
- Generic references to environment/setup that the reader is expected to create themselves (e.g. `source .venv/bin/activate` in a setup README, where `.venv` is the user's local venv).
357+
358+
**Disallowed examples:**
359+
360+
- `See .cursor_artifacts/foo_design.md §2``.cursor_artifacts/` is gitignored.
361+
- `See ~/work/notes/architecture.txt` — contributor-local.
362+
- `Tracked in metrics_pubsub_design_v5.md test impact section` — same gitignored doc.
363+
364+
If a design doc is worth referencing from the source tree, commit it to `docs/` or inline the relevant content into the code comment / docstring. For one-off rationale that won't survive the conversation, prefer a self-contained explanation in the comment itself rather than a pointer to ephemera.
365+
366+
### Comments and docstrings — describe current state, not development history
367+
368+
**Don't write comments or docstrings that narrate iteration on the codebase.** Pointers to abandoned approaches, prior implementations, or design pivots belong in the PR description and `git log`, not in the source tree. They rot quickly: the prior implementation is gone, the reader has no way to evaluate the comparison, and the scaffolding accumulates with every iteration. Future readers — humans and AI agents alike — treat the comment as if it describes load-bearing context when it's actually historical clutter.
369+
370+
This applies especially to AI-assisted development, where it's tempting to leave a paper trail of "we tried X first, then switched to Y" inside the source. That paper trail belongs in the PR description.
371+
372+
**Disallowed patterns:**
373+
374+
- `# Originally used X, but switched to Y for ...`
375+
- `# An earlier implementation did X — this version does Y`
376+
- `# Removed the foo parameter` / `# Replaced bar with baz`
377+
- `# Note: this used to be sync but is now async`
378+
- `# Regression: an earlier shape did X` — even in regression-test docstrings, drop the narrative framing.
379+
- `# An alternative design considered ... but was rejected because ...` (unless the rejected alternative is a _common_ path a future contributor might re-attempt — in that case, frame it as "**don't** do X because Y", not as developer history).
380+
381+
**Allowed:**
382+
383+
- **Current rationale**: `# Uses dict dispatch — hot path measured at sub-ms` (describes why the current design exists; no history).
384+
- **Regression context that doesn't narrate the prior bug's discovery**: `# Without this check, value > hdr_high silently corrupts the histogram total` (describes the bug being prevented, framed as a current invariant — not "we used to have a bug here").
385+
- **Inline TODO/FIXME** pointing at a tracking issue (URL or issue number, not "we plan to do X eventually").
386+
387+
**Rule of thumb:** if removing the comment would leave the code's intent unchanged for someone seeing it for the first time, the comment is fine. If the comment only makes sense to someone who saw the prior version, delete it.
388+
346389
## Keeping AGENTS.md Up to Date
347390

348391
**This file is the source of truth for AI agents working in this repo.** If it is stale or wrong, every AI-assisted session starts from a broken foundation.

src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py

Lines changed: 7 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -71,13 +71,10 @@ class MetricCounterKey(str, Enum):
7171
TRACKED_SAMPLES_FAILED = "tracked_samples_failed"
7272
TRACKED_DURATION_NS = "tracked_duration_ns"
7373
# Total wall-clock duration since session start. Updated on every event as
74-
# max(current, event_timestamp - session_start) to be defensive against
75-
# non-monotonic timestamps.
76-
#
77-
# An alternative design was considered: store session_start_ns once and
78-
# compute duration as (now - start) on read. This is infeasible because
79-
# time.monotonic_ns() has inconsistent epoch per process — a reader in
80-
# another process would get a meaningless value.
74+
# max(current, event_timestamp - session_start). Stored as a counter
75+
# rather than computed from (now - start) at read time because
76+
# time.monotonic_ns() has a process-local epoch — a reader in another
77+
# process would get a meaningless value.
8178
TOTAL_DURATION_NS = "total_duration_ns"
8279

8380

@@ -91,7 +88,9 @@ class MetricCounterKey(str, Enum):
9188
)
9289

9390

94-
# HDR bounds per series. See metrics_pubsub_design_v5.md §1 for rationale.
91+
# HDR bounds per series — chosen conservatively so realistic benchmark
92+
# values cannot fall outside [low, high]. Values outside the range are
93+
# clamped on insert and a warning is logged once per series.
9594
_NS_HDR_LOW: Final[int] = 1
9695
_NS_HDR_HIGH: Final[int] = 3_600_000_000_000 # 1 hour in ns
9796
_TOKEN_HDR_LOW: Final[int] = 1

src/inference_endpoint/async_utils/services/metrics_aggregator/publisher.py

Lines changed: 1 addition & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -13,10 +13,7 @@
1313
# See the License for the specific language governing permissions and
1414
# limitations under the License.
1515

16-
"""``MetricsPublisher``: publish ``MetricsSnapshot`` over pub/sub + disk fallback.
17-
18-
See ``metrics_pubsub_design_v5.md`` §5 for the design and failure mode table.
19-
"""
16+
"""``MetricsPublisher``: publish ``MetricsSnapshot`` over pub/sub + disk fallback."""
2017

2118
from __future__ import annotations
2219

src/inference_endpoint/async_utils/services/metrics_aggregator/registry.py

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -26,8 +26,6 @@
2626
1. Cheap exact rollups (count/total/min/max/sum_sq) — O(1), exact.
2727
2. HDR Histogram — supports cheap live percentiles/histogram.
2828
3. ``array.array`` of raw values — supports exact final percentiles.
29-
30-
See ``metrics_pubsub_design_v5.md`` §2 for full design.
3129
"""
3230

3331
from __future__ import annotations

src/inference_endpoint/async_utils/services/metrics_aggregator/snapshot.py

Lines changed: 0 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -20,9 +20,6 @@
2020
``DRAINING`` between ``ENDED`` and the final publish, ``COMPLETE`` for the
2121
last snapshot). The snapshot is the only public wire format between the
2222
aggregator and any consumer (main process, future TUI).
23-
24-
See ``metrics_pubsub_design_v5.md`` §1 for invariants, field reference,
25-
and HDR bounds.
2623
"""
2724

2825
from __future__ import annotations
@@ -137,9 +134,6 @@ class MetricsSnapshot(
137134
metrics: Tagged union of ``CounterStat`` and ``SeriesStat``,
138135
ordered counters-first then series, registration
139136
order within each.
140-
141-
See ``metrics_pubsub_design_v5.md`` §1 for the full reference table and
142-
the state-machine diagram.
143137
"""
144138

145139
counter: int

tests/integration/commands/test_benchmark_command.py

Lines changed: 3 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -207,10 +207,9 @@ def _resolve_template(template_path: Path, server_url: str) -> dict:
207207
raw = re.sub(r"http://localhost:\d+", server_url, raw)
208208
data = yaml.safe_load(raw)
209209

210-
# Swap any gated default model name for a non-gated tokenizer. The
211-
# generated templates' "eg: meta-llama/Llama-3.1-8B-Instruct" placeholder
212-
# points at a gated repo; substituting gpt2 lets these tests run in CI
213-
# without HF_TOKEN.
210+
# Swap the placeholder-default model name for a non-gated tokenizer
211+
# (see _TEST_MODEL_NAME above) so these tests can run in CI without
212+
# HF_TOKEN.
214213
if "model_params" in data and isinstance(data["model_params"], dict):
215214
data["model_params"]["name"] = _TEST_MODEL_NAME
216215

tests/unit/async_utils/services/metrics_aggregator/conftest.py

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -15,9 +15,8 @@
1515

1616
"""Shared test doubles and factories for metrics aggregator tests.
1717
18-
Migrated for the registry/publisher refactor (metrics_pubsub_design_v5):
19-
no more ``InMemoryKVStore``. Tests that need to inspect emitted values
20-
build them directly off a ``MetricsRegistry`` and a ``MetricsSnapshot``.
18+
Tests that need to inspect emitted values build them directly off a
19+
``MetricsRegistry`` and a ``MetricsSnapshot``.
2120
2221
The helpers here are intentionally small — most reused-across-tests
2322
construction lives in ``_make_aggregator`` style fixtures local to each

tests/unit/async_utils/services/metrics_aggregator/test_aggregator.py

Lines changed: 3 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -15,8 +15,7 @@
1515

1616
"""Tests for ``MetricsAggregatorService.process()``.
1717
18-
Migrated to the registry/publisher refactor (metrics_pubsub_design_v5):
19-
events are injected directly via ``await agg.process([...])``; emitted
18+
Events are injected directly via ``await agg.process([...])``; emitted
2019
metrics are inspected by reading the ``MetricsRegistry``'s snapshot
2120
output. The aggregator is constructed with a real SUB socket (so the
2221
``ZmqMessageSubscriber`` base initializes cleanly) and a mocked
@@ -486,12 +485,7 @@ async def test_complete_removes_row(self, tmp_path):
486485

487486
@pytest.mark.asyncio
488487
async def test_session_ended_calls_publish_final(self, tmp_path):
489-
"""ENDED triggers ``publish_final`` on the publisher.
490-
491-
The legacy assertion was on ``store.closed``; with the registry/
492-
publisher refactor the ENDED handler invokes ``publish_final``
493-
and ``close`` on the (mocked) publisher.
494-
"""
488+
"""ENDED triggers ``publish_final`` and ``close`` on the publisher."""
495489
loop = asyncio.get_event_loop()
496490
with ManagedZMQContext.scoped(socket_dir=str(tmp_path)) as ctx:
497491
agg, _, publisher = make_aggregator(ctx, loop, "agg_ended_publish_final")
@@ -1023,5 +1017,4 @@ async def test_shutdown_drains_async_tasks(self, tmp_path):
10231017
# NOTE(agents): Trigger exception handling (logger.exception paths) is not
10241018
# exercised here. Adding a MockTokenizePool that raises on
10251019
# token_count_async would let us assert no metric is emitted, the
1026-
# aggregator does not crash, and the task set is cleaned up. Tracked as
1027-
# follow-up; see the same TODO in the pre-refactor file.
1020+
# aggregator does not crash, and the task set is cleaned up.

tests/unit/async_utils/services/metrics_aggregator/test_aggregator_e2e.py

Lines changed: 8 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -15,11 +15,9 @@
1515

1616
"""End-to-end pub/sub round-trip tests for the metrics aggregator.
1717
18-
The legacy E2E suite exercised the full ``EventPublisherService`` →
19-
``MetricsAggregatorService`` → ``InMemoryKVStore`` pipeline. With the
20-
registry/publisher refactor, the wire surface that matters at this layer
21-
is the snapshot pub/sub channel: aggregator → ``MetricsPublisher`` →
22-
ZMQ PUB → ``MetricsSnapshotSubscriber``.
18+
The wire surface that matters at this layer is the snapshot pub/sub
19+
channel: aggregator → ``MetricsPublisher`` → ZMQ PUB →
20+
``MetricsSnapshotSubscriber``.
2321
2422
These tests stand up a real ``MetricsPublisher`` and
2523
``MetricsSnapshotSubscriber`` against a single ``ManagedZMQContext.scoped``
@@ -104,11 +102,10 @@ async def test_publish_final_arrives_at_subscriber(
104102
):
105103
"""``publish_final`` produces a COMPLETE snapshot reachable over IPC.
106104
107-
This replaces the legacy single-sample pipeline assertion: the
108-
aggregator's ``publish_final`` is what crosses the wire, and the
109-
``MetricsSnapshotSubscriber`` is what the main process uses to
110-
observe the run's end. The exact metric values aren't the point
111-
here — the round-trip + state field is.
105+
The aggregator's ``publish_final`` is what crosses the wire, and
106+
the ``MetricsSnapshotSubscriber`` is what the main process uses
107+
to observe the run's end. The exact metric values aren't the
108+
point here — the round-trip + state field is.
112109
"""
113110
loop = asyncio.get_event_loop()
114111
publisher, subscriber = _make_pair(
@@ -143,7 +140,7 @@ async def test_live_tick_then_final(
143140
144141
Tracks the lifecycle the main process sees: subscriber's
145142
``latest`` is updated by every live tick, and ``complete`` is
146-
only set once. Mirrors the design v5 §1 state machine.
143+
only set once (when the COMPLETE-state snapshot arrives).
147144
"""
148145
loop = asyncio.get_event_loop()
149146
publisher, subscriber = _make_pair(

tests/unit/async_utils/services/metrics_aggregator/test_aggregator_error_handler.py

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -92,10 +92,10 @@ async def test_error_event_increments_tracked_failed_when_row_exists(tmp_path):
9292
"""ERROR for a tracked, in-flight sample increments BOTH total and
9393
tracked failure counters.
9494
95-
Regression for design v5 §3: this only works because session.py emits
96-
ERROR before COMPLETE — if the order regresses, the row is removed by
97-
set_field(...COMPLETE...) before the ERROR handler runs and
98-
``TRACKED_SAMPLES_FAILED`` silently stays at 0.
95+
This only works because session.py emits ERROR before COMPLETE — if
96+
the order regresses, the row is removed by set_field(...COMPLETE...)
97+
before the ERROR handler runs and ``TRACKED_SAMPLES_FAILED`` silently
98+
stays at 0.
9999
"""
100100
import asyncio
101101

0 commit comments

Comments
 (0)