
Commit a99764a

Initial verification of monitor shm message correctness
1 parent b4b60df commit a99764a

26 files changed

Lines changed: 1095 additions & 28 deletions

.cursor/hooks/hook_task_router.py

Lines changed: 24 additions & 0 deletions
@@ -320,6 +320,30 @@ class RouterEntry(TypedDict):
             "monitor-debug-shm.mdc",
         ],
     },
+    {
+        "id": "monitor_shm_message_correctness",
+        "patterns": [
+            "HPCPerfStats/monitor/scripts/emit_build_capabilities.py",
+            "HPCPerfStats/monitor/scripts/build_message_expectations.py",
+            "HPCPerfStats/monitor/scripts/validate_shm_messages.py",
+            "HPCPerfStats/monitor/scripts/lib/*",
+            "HPCPerfStats/monitor/tests/test_shm_message_correctness.sh",
+            "HPCPerfStats/monitor/tests/expected/shm_*",
+            "HPCPerfStats/monitor/tests/expected/capabilities_*",
+            "HPCPerfStats/monitor/tests/expected/expectations_*",
+            "monitor/scripts/emit_build_capabilities.py",
+            "monitor/scripts/build_message_expectations.py",
+            "monitor/scripts/validate_shm_messages.py",
+            "monitor/scripts/lib/*",
+            "monitor/tests/test_shm_message_correctness.sh",
+            "monitor/tests/expected/shm_*",
+            "monitor/tests/expected/capabilities_*",
+            "monitor/tests/expected/expectations_*",
+        ],
+        "rules": [
+            "monitor-shm-message-correctness.mdc",
+        ],
+    },
     {
         "id": "monitor_emit_contract",
         "patterns": [
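Review note: as a rough illustration of which paths the new router entry would select, the sketch below matches file paths against a subset of the added patterns with Python's `fnmatch`. The `matches_entry` helper, the example file name `slug.py`, and the assumption that the hook uses shell-style globbing are all hypothetical; the hook's real matching logic is not part of this diff.

```python
from fnmatch import fnmatchcase

# Subset of the patterns added for "monitor_shm_message_correctness".
PATTERNS = [
    "HPCPerfStats/monitor/scripts/validate_shm_messages.py",
    "HPCPerfStats/monitor/scripts/lib/*",
    "HPCPerfStats/monitor/tests/expected/shm_*",
    "monitor/scripts/lib/*",
    "monitor/tests/expected/shm_*",
]


def matches_entry(path: str, patterns: list[str]) -> bool:
    """Hypothetical helper: True if any shell-style pattern matches the path."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


if __name__ == "__main__":
    # "slug.py" is a made-up file name used only for the demonstration.
    for candidate in ("monitor/scripts/lib/slug.py", "monitor/src/monitor_daemon.c"):
        print(candidate, matches_entry(candidate, PATTERNS))
```

Listing both the `HPCPerfStats/monitor/...` and `monitor/...` prefixes, as the entry does, lets the same rule fire whether paths are reported relative to the workspace root or to the repository root.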

.github/workflows/monitor-verify.yaml

Lines changed: 9 additions & 0 deletions
@@ -36,6 +36,15 @@ jobs:
         working-directory: monitor/.build-static
         run: make check
 
+      - name: Emit build capabilities
+        working-directory: monitor/.build-static
+        run: make capabilities
+
+      - name: Shm message correctness (synthetic fixture; skip 77 if no slug goldens)
+        working-directory: monitor
+        run: |
+          ./tests/test_shm_message_correctness.sh .build-static || test $? -eq 77
+
       - name: distclean static build tree
         working-directory: monitor/.build-static
         run: make distclean
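Review note: the `|| test $? -eq 77` idiom in the new step lets the job succeed on exit status 0 (pass) or 77 (skip) and fail on anything else. A minimal Python sketch of that classification follows; the function names are illustrative and do not come from the repository.

```python
import subprocess
import sys

SKIP_EXIT_CODE = 77  # skip status named in the workflow step


def classify_exit(code: int) -> str:
    """Map a test script's exit status to pass / skip / fail."""
    if code == 0:
        return "pass"
    if code == SKIP_EXIT_CODE:
        return "skip"
    return "fail"


def step_succeeds(code: int) -> bool:
    """Mirror `cmd || test $? -eq 77`: only 0 and 77 let the step succeed."""
    return classify_exit(code) != "fail"


if __name__ == "__main__":
    # Stand-in child process that exits 77, as the test script would on a skip.
    result = subprocess.run([sys.executable, "-c", "import sys; sys.exit(77)"])
    print(classify_exit(result.returncode), step_succeeds(result.returncode))
```

One consequence worth keeping in mind: a skip is indistinguishable from a pass in the job's final status, so the step log is the only place a skipped slug shows up.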

monitor/Makefile.am

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,11 @@
 SUBDIRS = src tests
 
+.PHONY: capabilities
+capabilities:
+	@cd "$(top_builddir)" && python3 "$(abs_top_srcdir)/scripts/emit_build_capabilities.py" \
+		--build-dir "$(top_builddir)" \
+		--tier "$${CAPABILITIES_TIER:-slowtier1}"
+
 # Monitor-specific trees (not created by Autoconf); removed by distclean.
 distclean-local:
 	rm -rf .build-static
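Review note: in the recipe above, `$$` is Make's escape for a literal `$`, so the shell sees `${CAPABILITIES_TIER:-slowtier1}`, which falls back to `slowtier1` when the variable is unset or empty. A minimal Python sketch of the same fallback rule; `resolve_tier` is a hypothetical name, not a function in `emit_build_capabilities.py`.

```python
import os
from typing import Mapping

DEFAULT_TIER = "slowtier1"  # default used by the Makefile target


def resolve_tier(env: Mapping[str, str]) -> str:
    """Return CAPABILITIES_TIER, treating unset and empty alike (like `:-`)."""
    return env.get("CAPABILITIES_TIER") or DEFAULT_TIER


if __name__ == "__main__":
    print(resolve_tier(os.environ))
```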

monitor/README.md

Lines changed: 38 additions & 8 deletions
@@ -121,14 +121,44 @@ See `HPCPerfStats/docs/monitor_variable_rename_map.yaml` for rename-map entries.
 
 ### DEBUG `/dev/shm` mirror (daemon only)
 
-When built with **`--enable-debug`**, the RabbitMQ daemon overwrites the latest
-`@fast` and `@full` sample payloads under **`/dev/shm/hpcperfstatsd-debug/`**
-(files `fast` and `full`). Override the base directory with
-**`HPCPERFSTATS_DEBUG_SHM_DIR`**. Schema/`$` rotation payloads (`write_hdr=1`)
-do not update shm.
-
-Payloads contain job id, hostname, and workload metrics — treat as sensitive on
-shared nodes. Files are created mode `0600` under a `0700` directory.
+When built with **`--enable-debug`**, the RabbitMQ daemon mirrors the latest **full
+outbound payloads** under **`/dev/shm/hpcperfstatsd-debug/`**:
+
+| File | Content |
+|------|---------|
+| `schema` | Complete `$` rotation payload (`write_hdr=1`) |
+| `fast` | Latest `@fast` sample |
+| `full` | Latest `@full` (or legacy full-width) sample |
+
+Override the base directory with **`HPCPERFSTATS_DEBUG_SHM_DIR`**. Payloads
+contain job id, hostname, and workload metrics — treat as sensitive on shared
+nodes. Files are created mode `0600` under a `0700` directory (atomic `*.tmp` +
+`rename`).
+
+### `/dev/shm` message correctness testing
+
+On a data host with a DEBUG build:
+
+```bash
+cd HPCPerfStats/monitor
+./scripts/build_static_bundle.sh --enable-debug # or add to configure args
+make -C .build-static capabilities
+# start hpcperfstatsd; wait for schema + fast + full updates under /dev/shm/...
+python3 scripts/build_message_expectations.py \
+  --capabilities .build-static/monitor-build-capabilities.json
+python3 scripts/validate_shm_messages.py \
+  --capabilities .build-static/monitor-build-capabilities.json \
+  --manifest .build-static/expectations_<capability_slug>.json \
+  --shm-dir /dev/shm/hpcperfstatsd-debug
+```
+
+**Capability slug:** `monitor-build-capabilities.json` includes
+`capability_slug` (compile flags + `slowtier0`/`slowtier1`). Golden fixtures
+and expectations must use the same slug in filenames
+(`tests/expected/shm_*_<slug>.txt`, `expectations_<slug>.json`). CI runs
+`tests/test_shm_message_correctness.sh` (synthetic fixture always; live slug
+goldens when present; **exit 77 skip** otherwise). Local run logs belong under
+**`<workspace-root>/test_runs/`** (see **test-runs-output-directory**).
 
 ## Power telemetry (DCGM, RAPL, and interpretation)
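Review note: the README now states that mirror files are written with mode `0600` under a `0700` directory via an atomic `*.tmp` + `rename`. The Python sketch below illustrates that general pattern only; it is not the daemon's C implementation, and `write_atomic` and `mode_bits` are hypothetical names.

```python
import os
import stat
import tempfile


def mode_bits(path: str) -> int:
    """Return the permission bits of path (for example 0o600)."""
    return stat.S_IMODE(os.stat(path).st_mode)


def write_atomic(directory: str, name: str, payload: bytes) -> str:
    """Write payload to directory/name through name.tmp followed by a rename."""
    os.makedirs(directory, mode=0o700, exist_ok=True)
    final_path = os.path.join(directory, name)
    tmp_path = final_path + ".tmp"
    # O_TRUNC overwrites a stale tmp file left behind by an interrupted writer.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(payload)
    # The rename is atomic on POSIX when both paths are on the same filesystem,
    # so a reader sees either the previous file or the complete new one.
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        target_dir = os.path.join(base, "hpcperfstatsd-debug")
        path = write_atomic(target_dir, "fast", b"example payload\n")
        print(oct(mode_bits(target_dir)), oct(mode_bits(path)))
```

Writing to the temporary name first is what keeps a validator that reads `schema`, `fast`, or `full` from ever observing a half-written payload.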

monitor/cursor-rules/agent-discipline-core.mdc

Lines changed: 1 addition & 0 deletions
@@ -48,6 +48,7 @@ Use the **Read** tool on the listed `HPCPerfStats/monitor/cursor-rules/<file>.md
 | `HPCPerfStats/monitor/src/*ib*.c`, IB sysfs parsers | `monitor-ib-sysfs-parsing.mdc` |
 | `HPCPerfStats/monitor/src/*dcgm*`, `*gpu*` | `monitor-dcgm-integration.mdc` |
 | DEBUG shm mirror paths | `monitor-debug-shm.mdc` |
+| Shm message correctness scripts, slug goldens | `monitor-shm-message-correctness.mdc` |
 
 ### Emit contract, RabbitMQ, schema

monitor/cursor-rules/implementation-workflow-discipline.mdc

Lines changed: 2 additions & 1 deletion
@@ -31,7 +31,8 @@ Use these during **plan**, **implement**, **self-review**, and **verification**
 | [monitor-schema-keys-headers.mdc](monitor-schema-keys-headers.mdc) | Shared KEYS headers |
 | [monitor-consumer-schema-migration.mdc](monitor-consumer-schema-migration.mdc) | Breaking output migrations |
 | [monitor-consumer-side-plan.mdc](monitor-consumer-side-plan.mdc) | Consumer handoff plans |
-| [monitor-debug-shm.mdc](monitor-debug-shm.mdc) | DEBUG `/dev/shm` mirror |
+| [monitor-debug-shm.mdc](monitor-debug-shm.mdc) | DEBUG `/dev/shm` mirror (`schema`/`fast`/`full`) |
+| [monitor-shm-message-correctness.mdc](monitor-shm-message-correctness.mdc) | Capability slug, validators, shm goldens |
 | [monitor-ib-sysfs-parsing.mdc](monitor-ib-sysfs-parsing.mdc) | IB port state parsers |
 | [monitor-ci-workflow.mdc](monitor-ci-workflow.mdc) | GitHub `monitor-verify.yaml` |
 | [monitor-cursor-rules-sync.mdc](monitor-cursor-rules-sync.mdc) | Edit both `.cursor/rules/` and `monitor/cursor-rules/` |
Lines changed: 13 additions & 6 deletions
@@ -1,5 +1,5 @@
 ---
-description: DEBUG /dev/shm sample mirror — security, gating, and tests
+description: DEBUG /dev/shm full-message mirror — security, gating, and tests
 globs:
   - HPCPerfStats/monitor/src/stats_buffer_debug_shm.*
   - HPCPerfStats/monitor/src/monitor_daemon.c
@@ -12,19 +12,26 @@ alwaysApply: false
 ## Scope
 
 - **DEBUG builds only** (`--enable-debug`); release uses inline no-ops in `stats_buffer_debug_shm.h`.
-- Updates **`fast`** and **`full`** under `HPCPERFSTATS_DEBUG_SHM_DIR` (default `/dev/shm/hpcperfstatsd-debug`).
+- Mirrors latest full payloads under `HPCPERFSTATS_DEBUG_SHM_DIR` (default `/dev/shm/hpcperfstatsd-debug`):
+  - **`schema`** — complete `$` rotation payload (`write_hdr=1`)
+  - **`fast`** — latest `@fast` sample
+  - **`full`** — latest `@full` (or legacy full-width) sample
 
 ## Implementation
 
 - **Permissions**: directory `0700`, files `0600` (payloads contain jobid, host, metrics).
 - **Atomic write**: tmp + `rename`; log failures via `monitor_log_debug` (do not `(void)` discard).
 - **Env**: copy `HPCPERFSTATS_DEBUG_SHM_DIR` into owned storage — do not retain `getenv()` pointer.
-- **Gating**: use `stats_buffer_debug_shm_sample_wanted(write_hdr, payload_ok)` — only routine samples (`!write_hdr`), not schema/`$` paths.
+- **Gating**:
+  - `stats_buffer_debug_shm_schema_wanted(write_hdr, payload_ok)` → schema file when `write_hdr && payload_ok`
+  - `stats_buffer_debug_shm_sample_wanted(write_hdr, payload_ok)` → sample files when `!write_hdr && payload_ok`
+- **Write API**: `stats_buffer_debug_shm_write_payload(sf, kind)` with `STATS_BUFFER_DEBUG_SHM_PAYLOAD_{SCHEMA,FAST,FULL}`.
 
 ## Tests
 
-- Unit + golden tests under `tests/test_stats_buffer_debug_shm.c`, `test_debug_shm_emit_*`.
-- Assert **`stats_buffer_debug_shm_sample_wanted`** matches daemon behavior.
+- Unit + golden: `tests/test_stats_buffer_debug_shm.c`, `test_debug_shm_emit_*`, `test_debug_shm_schema_mirror.c`.
+- Assert gating helpers match `monitor_daemon_collect_to_ring` behavior.
+- Slug-named goldens: `tests/expected/shm_{schema,fast,full}_<capability_slug>.txt` (**monitor-shm-message-correctness**).
 - Document DEBUG requirement in **`monitor/README.md`** (**monitor-readme-maintenance**).
 
-Cross-links: **monitor-debug-vs-symbols**, **monitor-readme-maintenance**.
+Cross-links: **monitor-debug-vs-symbols**, **monitor-readme-maintenance**, **monitor-shm-message-correctness**.
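Review note: the two gating conditions in the rule file (`write_hdr && payload_ok` for the schema file, `!write_hdr && payload_ok` for sample files) are mutually exclusive, and neither fires when the payload is not OK. The Python sketch below models only those boolean conditions to make the truth table explicit; the short function names are stand-ins for the C helpers, whose bodies are not shown in this diff.

```python
from itertools import product


def schema_wanted(write_hdr: bool, payload_ok: bool) -> bool:
    """Model of the documented schema gate: write_hdr && payload_ok."""
    return write_hdr and payload_ok


def sample_wanted(write_hdr: bool, payload_ok: bool) -> bool:
    """Model of the documented sample gate: !write_hdr && payload_ok."""
    return (not write_hdr) and payload_ok


if __name__ == "__main__":
    # Print the full truth table for both gates.
    for write_hdr, payload_ok in product((False, True), repeat=2):
        print(
            f"write_hdr={write_hdr!s:5} payload_ok={payload_ok!s:5} "
            f"schema={schema_wanted(write_hdr, payload_ok)!s:5} "
            f"sample={sample_wanted(write_hdr, payload_ok)}"
        )
```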
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+---
+description: /dev/shm full-message correctness harness — capability slug, validators, goldens
+globs:
+  - HPCPerfStats/monitor/scripts/emit_build_capabilities.py
+  - HPCPerfStats/monitor/scripts/build_message_expectations.py
+  - HPCPerfStats/monitor/scripts/validate_shm_messages.py
+  - HPCPerfStats/monitor/scripts/lib/*
+  - HPCPerfStats/monitor/tests/test_shm_message_correctness.sh
+  - HPCPerfStats/monitor/tests/expected/shm_*
+  - HPCPerfStats/monitor/tests/expected/capabilities_*
+  - HPCPerfStats/monitor/tests/expected/expectations_*
+alwaysApply: false
+---
+
+# Monitor: /dev/shm message correctness
+
+## Scope
+
+- Validates **full outbound monitor payloads** mirrored under `HPCPERFSTATS_DEBUG_SHM_DIR` (`schema`, `fast`, `full`).
+- **Out of scope:** listend, archive files, RabbitMQ consumer path.
+- Requires **DEBUG build** (`--enable-debug`) for live shm capture (**monitor-debug-shm**).
+
+## Capability slug
+
+- Emitted by `scripts/emit_build_capabilities.py` → `<builddir>/monitor-build-capabilities.json`.
+- `make capabilities` in configured build tree; `build_static_bundle.sh` invokes unless `SKIP_CAPABILITIES=1`.
+- Slug tokens (omit absent): `arch`, `ver`, `debug`, `hw`, `ib`, `nvgpu`, `amdgpu`, `lustre`, `opa`, `dcgm|likwid|none`, `slowtier0|slowtier1`.
+- **All fixtures** use slug in filename: `expectations_<slug>.json`, `tests/expected/shm_{schema,fast,full}_<slug>.txt`.
+- Validator **layer 0**: manifest slug must match capabilities JSON; exit 2 on mismatch.
+- CI/integration: **exit 77 skip** when no goldens for live slug — never fall back to generic golden.
+
+## Artifacts
+
+| Local (gitignored) | Committed |
+|--------------------|-----------|
+| `test_runs/monitor/validate_<slug>_*.txt` | `tests/expected/shm_*_<slug>.txt` |
+| `test_runs/monitor/shm_snapshot_<host>_<slug>/` | `scripts/lib/*.py` |
+
+## Data-host runbook
+
+1. `build_static_bundle.sh` with `--enable-debug` → `make -C .build-static capabilities`.
+2. Run daemon; read `schema`/`fast`/`full` from shm.
+3. `build_message_expectations.py --capabilities …` → `expectations_<slug>.json`.
+4. `validate_shm_messages.py` (optional `--spot-check-values`).
+
+## Tests
+
+- C: `test_stats_buffer_debug_shm.c`, `test_debug_shm_schema_mirror.c` (**monitor-debug-shm**).
+- Shell: `test_shm_message_correctness.sh` (slug-aware).
+- Python via repo `.venv` (**python-venv-enforcement**).
+
+Cross-links: **monitor-collect-tier-gating**, **monitor-schema-keys-headers**, **monitor-c-testing-standards**, **test-runs-output-directory**.
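Review note: the rule file describes two slug checks, a layer 0 comparison that exits 2 when the manifest slug differs from the capabilities JSON, and a skip with exit 77 when no goldens exist for the live slug. The sketch below shows one way those checks could look. It is not the code in `validate_shm_messages.py`: the function names, the `capability_slug` key in the manifest, and the choice to skip when any of the three goldens is missing are assumptions.

```python
import os
import tempfile

EXIT_OK = 0
EXIT_SLUG_MISMATCH = 2  # layer 0 failure status named in the rule file
EXIT_SKIP = 77          # skip status named in the rule file


def layer0_check(capabilities: dict, manifest: dict) -> int:
    """Return 2 unless both documents carry the same non-empty slug."""
    slug = capabilities.get("capability_slug")
    # Assumption: the manifest stores its slug under the same key.
    if not slug or slug != manifest.get("capability_slug"):
        return EXIT_SLUG_MISMATCH
    return EXIT_OK


def golden_paths(expected_dir: str, slug: str) -> list[str]:
    """Build the slug-named golden paths: shm_{schema,fast,full}_<slug>.txt."""
    return [
        os.path.join(expected_dir, f"shm_{kind}_{slug}.txt")
        for kind in ("schema", "fast", "full")
    ]


def goldens_or_skip(expected_dir: str, slug: str) -> int:
    """Return 77 when a golden for this slug is missing; never use a generic one."""
    missing = [p for p in golden_paths(expected_dir, slug) if not os.path.isfile(p)]
    return EXIT_SKIP if missing else EXIT_OK


if __name__ == "__main__":
    slug = "example-slug-slowtier1"  # made-up slug, not a real token layout
    with tempfile.TemporaryDirectory() as expected_dir:
        print(layer0_check({"capability_slug": slug}, {"capability_slug": slug}))
        print(goldens_or_skip(expected_dir, slug))
```

Treating a missing or empty slug as a mismatch keeps two malformed documents from comparing equal by accident.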
