burn/install: route through pod-side fastboot when power=rack #88
Merged
`defib burn` and `defib install` drove the HiSilicon SPL upload from
the host even when the transport went through a rack pod's WiFi-bridged
UART, where the per-frame ACK loop (150 ms × dozens of frames) doesn't
survive the round-trip latency. Both commands failed at the very first
PRESTEP0 frame.
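A rough back-of-the-envelope check of why a stop-and-wait upload breaks over the bridge. This is only a sketch: it assumes the 150 ms figure is the per-frame ACK window, and the round-trip times and function names below are hypothetical, not taken from defib.

```python
def ack_arrives_in_time(rtt_ms: float, ack_window_ms: float = 150.0) -> bool:
    """True if an ACK can return before the per-frame window closes."""
    return rtt_ms < ack_window_ms


def min_upload_time_ms(frames: int, rtt_ms: float) -> float:
    """Lower bound for stop-and-wait: every frame costs one full round trip."""
    return frames * rtt_ms


# Hypothetical round-trip times, for illustration only.
LOCAL_UART_RTT_MS = 0.05     # pod talking to its own UART
WIFI_BRIDGE_RTT_MS = 200.0   # host -> WiFi -> pod -> UART and back on a bad link

if __name__ == "__main__":
    print("local ACK in time:", ack_arrives_in_time(LOCAL_UART_RTT_MS))
    print("bridged ACK in time:", ack_arrives_in_time(WIFI_BRIDGE_RTT_MS))
```

With a window this tight, a single slow round trip on the first frame is enough to abort the whole upload, which matches the failure at PRESTEP0.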
Now, when `power_controller` is a `RackController`, the CLI calls the
new `defib.recovery.rack_fastboot.run_rack_fastboot()` helper instead
of `session.run()`. The helper:
1. Loads the SoC profile.
2. Detects the SPL boundary in the firmware (same `_detect_spl_size`
+ `_zero_long_ff_runs` logic the host path uses — both paths stay
byte-identical).
3. Calls `RackController.fastboot(...)`, which POSTs profile + SPL +
agent bytes as a single binary blob to the pod's `/fastboot`
endpoint. The pod runs handshake / DDR step / SPL / U-Boot upload
locally on its UART (microsecond ACK latency).
4. Returns a `RecoveryResult` so the rest of the CLI (terminal mode,
download_process detection, TFTP scripting, etc.) stays unchanged.
The pod takes exclusive UART access during the upload, so the host
transport is opened only after fastboot returns.
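The dispatch described above can be sketched roughly as follows. Every class and signature here is a stand-in written for illustration: the real `RackController.fastboot(...)` arguments, the blob layout, and the `RecoveryResult` fields are not shown in this PR, so the shapes below are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RecoveryResult:           # stand-in for defib's result type
    success: bool
    error: Optional[str] = None


class RackController:           # stand-in; the real class POSTs to the pod
    def fastboot(self, blob: bytes) -> dict:
        # Pretend the pod ran handshake / DDR step / SPL / U-Boot successfully.
        return {"ok": True}


class RouterOSController:       # stand-in for any non-rack power controller
    pass


def run_rack_fastboot(rack: RackController, spl: bytes, agent: bytes) -> RecoveryResult:
    """Hypothetical helper shape: ship ready-to-send bytes, map the reply to a result."""
    reply = rack.fastboot(spl + agent)  # NOT the real wire format
    return RecoveryResult(success=bool(reply.get("ok")), error=reply.get("error"))


def host_session_run(spl: bytes, agent: bytes) -> RecoveryResult:
    """Stand-in for the host-driven session.run() path."""
    return RecoveryResult(success=True)


def upload(power_controller: object, spl: bytes, agent: bytes) -> Tuple[str, RecoveryResult]:
    if isinstance(power_controller, RackController):
        # The pod owns the UART for the whole upload; the host transport stays closed.
        path, result = "pod", run_rack_fastboot(power_controller, spl, agent)
    else:
        path, result = "host", host_session_run(spl, agent)
    # Only after this point would the host open its transport (terminal, TFTP, ...).
    return path, result
```

Because both branches return the same result type, the code that follows the upload does not need to know which path ran.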
End-to-end verification on the prototype at 10.216.128.69:
    $ DEFIB_POWER_TYPE=rack DEFIB_RACK_HOST=10.216.128.69 \
        defib burn -c hi3516ev300 -p tcp://10.216.128.69:9000 \
        --power-cycle --break
    Power: rack pod HTTP API
    Pod-side fastboot in progress…
    rack fastboot: spl=17408 agent=236195 spl_addr=0x4010500
      ddr_addr=0x4013000 uboot_addr=0x41000000
    Done! (25678ms)

    $ # camera halted at the freshly-uploaded U-Boot prompt
    > version
    U-Boot 2016.11-g131d3f2 (May 08 2026 - 11:58:25 +0000) hi3516ev300
    OpenIPC #
Build `g131d3f2` differs from the in-flash `g6d2ed0c-dirty`, which proves
the burn landed and the chip jumped to the new image.
`install`'s Phase 1 (burn-to-RAM) now uses the same fastboot path;
Phase 2 (U-Boot TFTP scripting + sf write) goes over the bridge as
ordinary text commands and already works (TFTP-through-pod-NAPT was
verified during the earlier manual kernel restore).
`restore`'s shape (frame-blast started pre-power-on) doesn't map
cleanly onto fastboot's all-in-one semantics — left out of scope here.
4 new tests for `run_rack_fastboot` cover success / PRESTEP0 failure
attribution / profile-address packing / agent_payload override.
Suite: 461 passed / 2 skipped; ruff + mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
widgetii added a commit that referenced this pull request on May 12, 2026
## Summary

Brings `defib restore` to parity with `defib install` (#88 + #93) for rack-controlled cameras. Three pieces:

### Phase 1: fastboot when `power=rack`

The previous host-side frame-blast race (power-off → open serial → start session → power-on) is RouterOS-only. Rack pods don't expose independent `power_off`/`power_on` and don't need to: the pod's `/fastboot` endpoint does the whole sequence locally with microsecond ACK latency. Drop the hard-coded *"restore needs RouterOSController only"* reject; `RackController` is now an accepted alternative. Vectis stays rejected.

### Phase 5: `--tftp-via=auto|pod|host` (default auto)

Same flag as `install`. Auto → pod when `power=rack`, host otherwise. Pod path stages every partition via `RackController.tftp_put`, sets `serverip=192.168.1.1` (the pod), and unifies the UBI rootfs file-swap through `_replace_in_tftp(name, data)`.

Two robustness improvements:

- **`tftp_clear` BEFORE staging.** A prior aborted run leaves PSRAM occupied; if the next run can't allocate, the 4 MB rootfs OOMs at 256 KB largest-free. Wipe first.
- **`try/finally` around Phase 5 + 6.** A mid-loop write failure skipped `__aexit__` and leaked ~7 MB of pod PSRAM until the next install. The `try/finally` (with the cleanup hooks pre-registered on the `AsyncExitStack`) makes cleanup unconditional.

### Live verification on rack pod `10.216.128.69` (hi3516ev300)

Synthetic dump dir at `/tmp/cam_dump/` (mtd0..3 sized to match the 16 MB NOR layout):

```
$ DEFIB_POWER_TYPE=rack DEFIB_RACK_HOST=10.216.128.69 \
    defib restore -c hi3516ev300 -i /tmp/cam_dump/ \
    -p rack://10.216.128.69 --power-cycle --flash-type nor
Power: rack pod HTTP API
Phase 1: Loading U-Boot to RAM
Pod-side fastboot in progress…
Phase 4: Network setup — Network OK (attempt 1)
Phase 5: Writing flash
Staging 7664 KB in pod PSRAM via POST /tftp/<name>...
Pod TFTP ready on 192.168.1.1:69
mtd1: 64KB → 0x40000 Written (7.5 s)
mtd2: 3072KB → 0x50000 Written (11.7 s)
mtd3: 4272KB → 0x350000 Written (15.7 s)
mtd0: 256KB → 0x0 Written (8.3 s)
Restore complete!
```

Camera reaches `openipc-hi3516ev300 login:` cleanly. `exit=0`.

### Companion rack-firmware change (local-only)

`UART_IDLE_TIMEOUT_S` **60 → 600**. The 60-second idle timer was killing the bridge socket mid-staging: ~50 s of HTTP `/tftp` uploads counts as "idle" to the bridge (no host→pod UART traffic during that window). 600 s comfortably covers full installs and restores.

## Test plan

- [ ] `uv run pytest tests/ -x -v --ignore=tests/fuzz`: 486 passed / 2 skipped (no new unit tests; `_restore_async` is integration-only)
- [ ] `uv run ruff check src/defib/cli/app.py`: clean
- [ ] `uv run mypy src/defib/cli/app.py --ignore-missing-imports`: clean
- [ ] Regression: `defib restore --tftp-via host …` still works on existing RouterOS+host-TFTP setups; the host branch is byte-identical except for being inside the shared `AsyncExitStack`.
- [ ] `--tftp-via pod` without `DEFIB_POWER_TYPE=rack` → clean error message.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Dmitry Ilyin <widgetii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
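The two robustness points in the restore commit (clear before staging, cleanup registered before anything can fail) can be illustrated with a small sketch. `FakePod` and `restore` are hypothetical stand-ins, not the real `RackController` or `_restore_async`; the sketch leans on `async with AsyncExitStack()` for the unconditional cleanup rather than reproducing the commit's exact `try/finally` layout.

```python
import asyncio
from contextlib import AsyncExitStack
from typing import Dict, Optional


class FakePod:
    """Stand-in for the pod's PSRAM-backed TFTP staging area (hypothetical API)."""

    def __init__(self) -> None:
        self.staged: Dict[str, bytes] = {}

    async def tftp_clear(self) -> None:
        self.staged.clear()

    async def tftp_put(self, name: str, data: bytes) -> None:
        self.staged[name] = data


async def restore(pod: FakePod, partitions: Dict[str, bytes],
                  fail_on: Optional[str] = None) -> None:
    async with AsyncExitStack() as stack:
        # Wipe leftovers from an aborted run before allocating anything new.
        await pod.tftp_clear()
        # Register cleanup up front so it runs on success and on any exception.
        stack.push_async_callback(pod.tftp_clear)
        for name, data in partitions.items():
            await pod.tftp_put(name, data)
        for name in partitions:
            if name == fail_on:  # simulate a mid-loop flash write failure
                raise RuntimeError(f"flash write failed for {name}")


if __name__ == "__main__":
    pod = FakePod()
    try:
        asyncio.run(restore(pod, {"mtd1": b"a", "mtd2": b"b"}, fail_on="mtd2"))
    except RuntimeError as exc:
        print("restore failed:", exc, "| staged after cleanup:", pod.staged)
```

Registering the callback before the first `tftp_put` is the important part: a failure at any later point still unwinds through the stack and frees the staged files.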
## Summary

`defib burn` and `defib install` drove the HiSilicon SPL upload from the host even when the transport went through a rack pod's WiFi-bridged UART, where the per-frame ACK loop (150 ms × dozens of frames per upload) doesn't survive the round-trip latency. Both commands failed at the very first PRESTEP0 frame.

Now, when `power_controller` is a `RackController`, the CLI calls the new `defib.recovery.rack_fastboot.run_rack_fastboot()` helper instead of `session.run()`. The helper packages profile + SPL + agent into the binary blob the pod's `POST /fastboot` expects, posts it, and turns the pod's phase-by-phase JSON into a `RecoveryResult` so the rest of the CLI (terminal mode, download_process detection, TFTP scripting) stays unchanged.

The pod takes exclusive UART access during the upload, so the host transport is opened only after fastboot returns.

## Live verification on the prototype

Build `g131d3f2` is distinct from the in-flash build (`g6d2ed0c-dirty`, Mar 2023), which proves the burn landed in RAM and the chip jumped to the new image rather than falling through to flash.

## Install + restore scope

`install`'s Phase 1 (burn-to-RAM) now uses the same fastboot path; Phase 2 (U-Boot `tftp` + `sf write` scripting) goes over the bridge as ordinary text commands and is already known to work: TFTP-through-pod-NAPT was verified during the earlier manual kernel restore at 167 KB/s.

`restore` has its own shape (frame-blast started before power-on, then power-on triggers the catch) that doesn't map cleanly onto fastboot's all-in-one semantics. Left out of scope for this PR; can be a follow-up if needed.

## Architecture note

The SPL-boundary detection (`HiSiliconStandard._detect_spl_size`) and the 0xFF-run zeroing (`_zero_long_ff_runs`) stay on the host. The pod gets ready-to-send bytes. This keeps the pod firmware minimal and ensures the two paths (host-driven and pod-driven) stay byte-identical for any chip we test.

## Test plan

- `uv run pytest tests/ -x -v --ignore=tests/fuzz` (461 passed / 2 skipped)
- `uv run ruff check src/defib/ tests/`
- `uv run mypy src/defib/cli/app.py src/defib/recovery/rack_fastboot.py --ignore-missing-imports`
- `TestRunRackFastboot` cases cover success path, PRESTEP0 failure attribution, profile-address packing, and the `agent_payload` override used by agent-flash.
- `session.run` is still used when the power controller is RouterOS / Vectis / None.

🤖 Generated with Claude Code
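To make the architecture note concrete, here is one plausible reading of what a "0xFF-run zeroing" step does, inferred only from the helper's name. The real `_zero_long_ff_runs` may use a different threshold or different rules; the function name without the underscore, the `min_run` parameter, and its default of 16 are all assumptions made for this sketch.

```python
def zero_long_ff_runs(data: bytes, min_run: int = 16) -> bytes:
    """Replace each run of at least `min_run` consecutive 0xFF bytes with 0x00 bytes.

    Shorter runs are left untouched, and the output has the same length as the input.
    """
    out = bytearray(data)
    i, n = 0, len(out)
    while i < n:
        if out[i] != 0xFF:
            i += 1
            continue
        # Find the end of the current 0xFF run.
        j = i
        while j < n and out[j] == 0xFF:
            j += 1
        if j - i >= min_run:
            out[i:j] = bytes(j - i)  # bytes(k) is k zero bytes
        i = j
    return bytes(out)


if __name__ == "__main__":
    image = b"\x01" + b"\xff" * 20 + b"\x02" + b"\xff" * 3
    cleaned = zero_long_ff_runs(image)
    print(len(image) == len(cleaned), cleaned[:4].hex())
```

Running a transform like this once on the host and handing the result to either upload path is what keeps the host-driven and pod-driven bytes identical.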