agent: add membw command for bare-metal DDR bandwidth test (#100)
## Summary
Closes #99.
- New `CMD_MEMBW` (0x0D) / `RSP_MEMBW` (0x87) — runs memset / read-scan
/ memcpy kernels via ARM `ldmia`/`stmia` (r4-r11, 32 B per loop iter)
against a scratch DDR buffer with the MMU cache on, timed with the ARMv7
PMU cycle counter (CCNT).
- New `defib agent membw [--size 4MB] [--iters 8] [--addr 0] [--port
...] [--output human|json]` CLI command.
- Reports `cycles/byte` (CPU-clock-invariant — the metric that actually
isolates DDR fabric from CPU clock variance) plus `MB/s` when the
architectural generic timer's `CNTFRQ` is set by an earlier boot stage.
If `CNTFRQ == 0` the host transparently falls back to cycles/byte.
- Agent v4: bumps `AGENT_VERSION`, advertises `CAP_MEMBW` in INFO so the
host can check support before sending the command.
- ARMv7 (Cortex-A7 V4 / V5 / V6 family) only. ARMv5 (ARM926,
`hi3516cv300`) cleanly rejects with `ACK_FLASH_ERROR` via `#ifdef
CPU_ARM926` — different PMU register layout, out of scope for the
motivating use case.
## Why
From the issue: when investigating an encoder fps gap between OpenIPC
and vendor firmware on identical `gk7205v300` silicon, the key question
— *"is the DDR fabric slow, or is Linux slow on top of it?"* — can't be
cleanly answered from inside Linux. CMA reservations, cache attributes,
libc memcpy variance and scheduler noise all muddy any userspace number.
defib already runs a bare-metal agent in DDR right after SPL brings
memory up. That's the exact moment we want to measure raw DDR
throughput, before any kernel/ISP/VENC traffic. `defib agent membw`
gives a reproducible apples-to-apples bandwidth number per firmware.
## How
### Agent C (`agent/main.c`, `agent/protocol.h`)
Three inline-asm kernels with `ldmia`/`stmia` over r4-r11 (8 words = 32
B per memory operation), so OpenIPC vs vendor builds produce identical
instruction streams. Cache is on (write-back / write-allocate per
`startup.S` page-table fill); the buffer is sized well above L1+L2 so
DDR is the actual bottleneck.
CCNT is calibrated against `CNTPCT` (architectural generic timer, fixed
frequency from `CNTFRQ`) over a 10 ms window. If `CNTFRQ` was never
written by the bootrom — and on the V4 family it isn't — the agent
returns `timer_hz = 0` and the host falls back to the cycles/byte
metric. That number alone already answers the original question because
it normalises for CPU-clock differences across firmwares, which is the
gotcha that bit the reporter in the original investigation.
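The calibration step can be sketched on the host side as plain arithmetic. This is a minimal illustration, not the agent's actual code: the function name `estimate_cpu_hz` and the sample values (a 24 MHz timer, a 900 MHz core) are assumptions chosen only to make the numbers easy to follow.

```python
def estimate_cpu_hz(ccnt_delta: int, cntpct_delta: int, timer_hz: int) -> int:
    """Estimate the CPU clock from one calibration window.

    ccnt_delta   -- PMU cycle-counter ticks elapsed during the window
    cntpct_delta -- generic-timer ticks elapsed during the same window
    timer_hz     -- CNTFRQ value; 0 means it was never programmed

    Returns 0 when the timer frequency is unknown, which is the signal
    to report cycles/byte only.
    """
    if timer_hz == 0 or cntpct_delta == 0:
        return 0
    return ccnt_delta * timer_hz // cntpct_delta


# Hypothetical example: a 24 MHz timer gives 240_000 ticks in 10 ms.
example_timer_hz = 24_000_000
window_ticks = example_timer_hz * 10 // 1000

# 9_000_000 CPU cycles in that window would correspond to 900 MHz.
example_cpu_hz = estimate_cpu_hz(9_000_000, window_ticks, example_timer_hz)
```

With `timer_hz == 0` the same call returns 0, so a caller can branch on a single value instead of handling a separate error path.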
### Agent footprint guard
The default scratch sits at `LOAD_ADDR + 8 MiB` (a new `AGENT_LOAD_ADDR`
macro is passed in via Makefile `CFLAGS`). `handle_membw` rejects any
user-supplied `addr` whose `[addr, addr + 2*size)` range overlaps
`[LOAD_ADDR - 64 KB, LOAD_ADDR + 8 MiB]` — otherwise an 8 MiB memcpy on
the default V4 layout would stomp the running agent's own code. This was
found during real-hardware testing — see the validation section.
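The overlap rule is easier to see as an interval test. The sketch below is written in Python for illustration only (the real check lives in the C agent), and `scratch_is_safe` and the sample `load_addr` are hypothetical. It also assumes the protected window's upper bound is exclusive, so that a scratch buffer starting exactly at `LOAD_ADDR + 8 MiB` is accepted.

```python
KIB = 1024
MIB = 1024 * KIB

GUARD_BELOW = 64 * KIB  # margin kept free below the load address
GUARD_ABOVE = 8 * MIB   # region above the load address reserved for the agent


def scratch_is_safe(addr: int, size: int, load_addr: int) -> bool:
    """Return True if [addr, addr + 2*size) stays clear of the agent.

    The scratch range is 2*size because memcpy needs a source and a
    destination buffer. The upper guard bound is treated as exclusive
    here, which is an assumption of this sketch.
    """
    scratch_lo = addr
    scratch_hi = addr + 2 * size          # exclusive end
    guard_lo = load_addr - GUARD_BELOW
    guard_hi = load_addr + GUARD_ABOVE    # exclusive end (assumed)
    # Two half-open intervals overlap iff each starts before the other ends.
    overlaps = scratch_lo < guard_hi and scratch_hi > guard_lo
    return not overlaps


# Hypothetical load address, used only to exercise the check.
load_addr = 0x4000_0000
default_ok = scratch_is_safe(load_addr + 8 * MIB, 4 * MIB, load_addr)
on_agent = scratch_is_safe(load_addr, 4 * MIB, load_addr)
```

In this model the default scratch passes, while a buffer placed at the load address itself is rejected.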
### Python host (`src/defib/agent/client.py`, `cli/app.py`)
- `MembwResult` dataclass with `cycles_per_byte(ticks, write_amp=1)` and
`mbps(ticks, write_amp=1)` helpers (returns `None` for `mbps` when
`timer_hz == 0`).
- `FlashAgentClient.membw(size_bytes, iters, addr)` async method.
- `defib agent membw` Typer command with `human` and `json` output
modes.
- `agent info` now lists `membw` in the capabilities line when reported
by the agent.
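A rough sketch of how the two helpers could behave is shown below. It is not the dataclass from `client.py`: the class name `MembwResultSketch`, every field except `timer_hz`, the interpretation of `write_amp` as a multiplier on bytes moved, and the choice of 1 MB = 10^6 bytes are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MembwResultSketch:
    size_bytes: int  # buffer size per iteration (assumed field)
    iters: int       # number of iterations (assumed field)
    cpu_hz: int      # CPU clock from calibration, 0 if unknown (assumed field)
    timer_hz: int    # CNTFRQ as reported by the agent, 0 if never set

    def _bytes_moved(self, write_amp: int) -> int:
        # write_amp=2 would count both the read and the write side of memcpy.
        return self.size_bytes * self.iters * write_amp

    def cycles_per_byte(self, ticks: int, write_amp: int = 1) -> float:
        """CPU cycles spent per byte moved."""
        return ticks / self._bytes_moved(write_amp)

    def mbps(self, ticks: int, write_amp: int = 1) -> Optional[float]:
        """Throughput in MB/s, or None when no wall-clock reference exists."""
        if self.timer_hz == 0 or self.cpu_hz == 0 or ticks == 0:
            return None
        seconds = ticks / self.cpu_hz
        return self._bytes_moved(write_amp) / seconds / 1e6


# Hypothetical run: 4 MiB x 8 in 16_777_216 cycles on a 900 MHz core.
result = MembwResultSketch(4 * 1024 * 1024, 8, 900_000_000, 24_000_000)
cpb = result.cycles_per_byte(16_777_216)
rate = result.mbps(16_777_216)
```

Returning `None` rather than 0.0 for `mbps` keeps "unknown" distinct from "measured as zero", which lets the `human` output print `n/a` without a sentinel comparison.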
### Tests
- **Agent C** (`agent/test_agent.c`): round-trip framing tests for the
12 B request and 32 B response packets.
- **Python** (`tests/test_agent_protocol.py::TestMembw`): four tests
using `MockTransport` — field parsing, MB/s + cycles/byte math,
`timer_hz == 0` graceful degradation, ARMv5 (`ACK_FLASH_ERROR`)
rejection path.
## Validation
**Real hardware, 2026-05-14:**
| Test | hi3516ev300 (V4) | gk7205v300 (V4) |
|---|---|---|
| Agent v4 advertises `membw` | ✓ | ✓ |
| memset 4 MiB × 8 | 0.345 cyc/B | 0.345 cyc/B |
| read 4 MiB × 8 | 0.512 cyc/B | 0.513 cyc/B |
| memcpy 4 MiB × 8 (R+W) | 0.446 cyc/B | 0.446 cyc/B |
| 8 MiB × 16 + 16 MiB × 8 | — | flat to 0.2% — past cache |
Both SoCs agree to 0.2% — expected, same V4 silicon family with the same
DDR config. `CNTFRQ == 0` on both, so MB/s shows `n/a` and the
cycles/byte fallback activates automatically.
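The agreement figure can be reproduced from the table with a few lines of arithmetic. The helper name `rel_diff_pct` is hypothetical; the values are the cycles/byte numbers reported above.

```python
def rel_diff_pct(a: float, b: float) -> float:
    """Relative difference between two measurements, in percent."""
    return abs(a - b) / min(a, b) * 100


# (hi3516ev300, gk7205v300) cycles/byte for the 4 MiB x 8 runs.
measurements = {
    "memset": (0.345, 0.345),
    "read": (0.512, 0.513),
    "memcpy": (0.446, 0.446),
}

# Largest disagreement between the two boards across all three kernels.
worst_pct = max(rel_diff_pct(a, b) for a, b in measurements.values())
```

The only kernel that differs is the read scan, and its difference stays just under 0.2%.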
**Tests / lint / cross-build (all green):**
- `make -C agent test HOST_CC=gcc` — 5412/5412 pass (includes 2 new
framing tests)
- `uv run pytest tests/ -x --ignore=tests/fuzz` — 494 pass, 2 skip
(includes 4 new TestMembw tests)
- `uv run ruff check src/ tests/` — clean
- `uv run mypy src/defib/ --ignore-missing-imports` — clean
- Cross-build verified: `gk7205v300`, `hi3516ev300`, `hi3516cv300`
(ARMv5 reject path), `hi3516cv610`; `make all-socs` builds all four
default targets.
## Test plan
- [x] Agent C unit tests pass (`make -C agent test HOST_CC=gcc`)
- [x] Python tests pass (`uv run pytest tests/`)
- [x] Ruff + mypy clean
- [x] Cross-compile every default SoC
- [x] Real-hardware smoke on hi3516ev300 (ARMv7)
- [x] Real-hardware smoke on gk7205v300 (ARMv7, the motivating SoC)
- [x] Real-hardware edge cases: 4 MiB×8, 8 MiB×16, 16 MiB×8 —
cycles/byte stable
- [ ] Real-hardware smoke on hi3516cv300 (ARMv5 reject path — needs an
ARMv5 board)
- [ ] Run from OpenIPC U-Boot and vendor U-Boot on the same `gk7205v300`
silicon, diff `cycles_per_byte` — that's the motivating measurement the
issue was asking for.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Dmitry Ilyin <widgetii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>