burn/install: route through pod-side fastboot when power=rack #88

Merged: widgetii merged 1 commit into master from burn-install-rack-fastboot on May 11, 2026

@widgetii
Member

Summary

defib burn and defib install drove the HiSilicon SPL upload from the host even when the transport went through a rack pod's WiFi-bridged UART. The per-frame ACK loop (150 ms × dozens of frames per upload) doesn't survive that round-trip latency, so both commands failed at the very first PRESTEP0 frame.

Now, when power_controller is a RackController, the CLI calls the new defib.recovery.rack_fastboot.run_rack_fastboot() helper instead of session.run(). The helper packages profile + SPL + agent into the binary blob the pod's POST /fastboot expects, posts it, and turns the pod's phase-by-phase JSON into a RecoveryResult so the rest of the CLI (terminal mode, download_process detection, TFTP scripting) stays unchanged.

The pod takes exclusive UART access during the upload, so the host transport is opened only after fastboot returns.
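
The dispatch and ordering described above can be sketched roughly as follows. This is only an illustration with stand-in classes: the names RackController, run_rack_fastboot, session.run and RecoveryResult come from this PR, while every signature, the RecoveryResult fields, and the open_transport helper are assumptions, not the actual defib code.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryResult:
    # Hypothetical shape; the real RecoveryResult fields are not shown in this PR.
    success: bool
    events: list = field(default_factory=list)


class RackController:
    """Stand-in for the rack pod power controller."""


class RouterOSController:
    """Stand-in for a non-rack power controller."""


def run_rack_fastboot(controller, log):
    # Stand-in: the real helper posts a blob to the pod's /fastboot endpoint.
    log.append("pod fastboot")
    return RecoveryResult(success=True, events=list(log))


def run_host_session(log):
    # Stand-in for session.run(), which needs the host transport open first.
    log.append("host session.run")
    return RecoveryResult(success=True, events=list(log))


def open_transport(log):
    # Hypothetical helper: opens the host-side transport to the camera UART.
    log.append("open transport")


def burn(power_controller, log):
    if isinstance(power_controller, RackController):
        # The pod owns the UART during the upload, so the host transport
        # is opened only after fastboot has returned.
        result = run_rack_fastboot(power_controller, log)
        open_transport(log)
        return result
    # Host-driven path: open the transport first, then upload frame by frame.
    open_transport(log)
    return run_host_session(log)
```

With a RackController the recorded order is fastboot first, transport second; with any other controller (or None) the transport is opened before the host session runs.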

Live verification on the prototype

$ DEFIB_POWER_TYPE=rack DEFIB_RACK_HOST=10.216.128.69 \
  defib burn -c hi3516ev300 -p tcp://10.216.128.69:9000 \
             --power-cycle --break

Power: rack pod HTTP API
Pod-side fastboot in progress…
rack fastboot: spl=17408 agent=236195 spl_addr=0x4010500
               ddr_addr=0x4013000 uboot_addr=0x41000000
Done! (25678ms)
$ # camera halted at the freshly-uploaded U-Boot prompt
> version
U-Boot 2016.11-g131d3f2 (May 08 2026 - 11:58:25 +0000) hi3516ev300
OpenIPC #

Build g131d3f2 is distinct from the in-flash build (g6d2ed0c-dirty, Mar 2023) — proves the burn landed in RAM and the chip jumped to the new image rather than falling through to flash.

Install + restore scope

  • install's Phase 1 (burn-to-RAM) now uses the same fastboot path; Phase 2 (U-Boot tftp + sf write scripting) goes over the bridge as ordinary text commands and is already known to work — TFTP-through-pod-NAPT was verified during the earlier manual kernel restore at 167 KB/s.
  • restore has its own shape (frame-blast started before power-on, then power-on triggers the catch) that doesn't map cleanly onto fastboot's all-in-one semantics. Left out of scope for this PR; can be a follow-up if needed.

Architecture note

The SPL-boundary detection (HiSiliconStandard._detect_spl_size) and the 0xFF-run zeroing (_zero_long_ff_runs) stay on the host. The pod gets ready-to-send bytes. This keeps the pod firmware minimal and ensures the two paths (host-driven and pod-driven) stay byte-identical for any chip we test.
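
For readers unfamiliar with the 0xFF-run step, here is a minimal sketch of what such a host-side transform could look like. The real _zero_long_ff_runs implementation is not shown in this PR; the function name below, the 16-byte threshold, and the choice to substitute zeros of equal length are all assumptions for illustration.

```python
import re


def zero_long_ff_runs(data: bytes, min_run: int = 16) -> bytes:
    """Replace each run of at least `min_run` 0xFF bytes with zeros.

    The output has the same length as the input, so offsets are preserved.
    The default threshold is an assumed value, not taken from defib.
    """
    pattern = re.compile(rb"\xff{%d,}" % min_run)
    return pattern.sub(lambda m: b"\x00" * len(m.group()), data)
```

Because the transform runs on the host in both paths, the pod only ever sees the already-processed bytes, which is what keeps the host-driven and pod-driven uploads byte-identical.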

Test plan

  • uv run pytest tests/ -x -v --ignore=tests/fuzz (461 passed / 2 skipped)
  • uv run ruff check src/defib/ tests/
  • uv run mypy src/defib/cli/app.py src/defib/recovery/rack_fastboot.py --ignore-missing-imports
  • 4 new TestRunRackFastboot cases cover success path, PRESTEP0 failure attribution, profile-address packing, and the agent_payload override used by agent-flash.
  • Regression: existing local-UART burn / install paths unchanged — both still go through session.run when power controller is RouterOS / Vectis / None.

🤖 Generated with Claude Code

`defib burn` and `defib install` drove the HiSilicon SPL upload from
the host even when the transport went through a rack pod's WiFi-bridged
UART, where the per-frame ACK loop (150 ms × dozens of frames) doesn't
survive the round-trip latency. Both commands failed at the very first
PRESTEP0 frame.

Now, when `power_controller` is a `RackController`, the CLI calls the
new `defib.recovery.rack_fastboot.run_rack_fastboot()` helper instead
of `session.run()`. The helper:

  1. Loads the SoC profile.
  2. Detects the SPL boundary in the firmware (same `_detect_spl_size`
     + `_zero_long_ff_runs` logic the host path uses — both paths stay
     byte-identical).
  3. Calls `RackController.fastboot(...)`, which POSTs profile + SPL +
     agent bytes as a single binary blob to the pod's `/fastboot`
     endpoint. The pod runs handshake / DDR step / SPL / U-Boot upload
     locally on its UART (microsecond ACK latency).
  4. Returns a `RecoveryResult` so the rest of the CLI (terminal mode,
     download_process detection, TFTP scripting, etc.) stays unchanged.
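
As an illustration of step 3, packing profile addresses plus SPL and agent bytes into one binary blob might look like the sketch below. The actual wire format of the pod's `/fastboot` endpoint is not documented in this PR; the header layout (five little-endian u32 fields) and both function names are hypothetical.

```python
import struct

# Hypothetical header: spl_addr, ddr_addr, uboot_addr, spl_len, agent_len,
# each a little-endian unsigned 32-bit integer.
HEADER = struct.Struct("<IIIII")


def pack_fastboot_blob(spl_addr: int, ddr_addr: int, uboot_addr: int,
                       spl: bytes, agent: bytes) -> bytes:
    """Concatenate a fixed-size header with the SPL and agent payloads."""
    header = HEADER.pack(spl_addr, ddr_addr, uboot_addr, len(spl), len(agent))
    return header + spl + agent


def unpack_fastboot_blob(blob: bytes):
    """Inverse of pack_fastboot_blob; validates the payload length."""
    spl_addr, ddr_addr, uboot_addr, spl_len, agent_len = HEADER.unpack_from(blob)
    body = blob[HEADER.size:]
    if len(body) != spl_len + agent_len:
        raise ValueError("blob length does not match header")
    return spl_addr, ddr_addr, uboot_addr, body[:spl_len], body[spl_len:]
```

A length-prefixed layout like this lets the receiver split the payloads without scanning for boundaries, which fits the goal of keeping the pod firmware minimal.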

The pod takes exclusive UART access during the upload, so the host
transport is opened only after fastboot returns.

End-to-end verification on the prototype at 10.216.128.69:

    $ DEFIB_POWER_TYPE=rack DEFIB_RACK_HOST=10.216.128.69 \
      defib burn -c hi3516ev300 -p tcp://10.216.128.69:9000 \
                 --power-cycle --break
    Power: rack pod HTTP API
    Pod-side fastboot in progress…
    rack fastboot: spl=17408 agent=236195 spl_addr=0x4010500
                   ddr_addr=0x4013000 uboot_addr=0x41000000
    Done! (25678ms)

    $ # camera halted at the freshly-uploaded U-Boot prompt
    > version
    U-Boot 2016.11-g131d3f2 (May 08 2026 - 11:58:25 +0000) hi3516ev300
    OpenIPC #

Build `g131d3f2` ≠ the in-flash `g6d2ed0c-dirty` — proves the burn
landed and the chip jumped to the new image.

`install`'s Phase 1 (burn-to-RAM) now uses the same fastboot path;
Phase 2 (U-Boot TFTP scripting + sf write) goes over the bridge as
ordinary text commands and already works (TFTP-through-pod-NAPT was
verified during the earlier manual kernel restore).

`restore`'s shape (frame-blast started pre-power-on) doesn't map
cleanly onto fastboot's all-in-one semantics — left out of scope here.

4 new tests for `run_rack_fastboot` cover success / PRESTEP0 failure
attribution / profile-address packing / agent_payload override.
Suite: 461 passed / 2 skipped; ruff + mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@widgetii merged commit deabdc2 into master on May 11, 2026
13 checks passed
@widgetii deleted the burn-install-rack-fastboot branch on May 11, 2026 19:10
widgetii added a commit that referenced this pull request on May 12, 2026

## Summary

Brings `defib restore` to parity with `defib install` (#88 + #93) for
rack-controlled cameras. Three pieces:

### Phase 1 — fastboot when `power=rack`

The previous host-side frame-blast race (power-off → open serial → start
session → power-on) is RouterOS-only. Rack pods don't expose independent
`power_off`/`power_on` and don't need to — the pod's `/fastboot`
endpoint does the whole sequence locally with microsecond ACK latency.
Drop the hard-coded *"restore needs RouterOSController only"* reject —
`RackController` is now an accepted alternative. Vectis stays rejected.

### Phase 5 — `--tftp-via=auto|pod|host` (default auto)

Same flag as `install`. Auto → pod when `power=rack`, host otherwise.
Pod path stages every partition via `RackController.tftp_put`, sets
`serverip=192.168.1.1` (the pod), and unifies the UBI rootfs file-swap
through `_replace_in_tftp(name, data)`.
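
The `auto` resolution rule can be sketched as a small pure function. This is an assumed shape, not the code in `app.py`: the function name, its parameters, and the error wording are hypothetical; only the `auto|pod|host` values and the rack/host mapping come from the description above.

```python
from typing import Optional


def resolve_tftp_via(requested: str, power_type: Optional[str]) -> str:
    """Map --tftp-via (auto|pod|host) to the concrete TFTP source."""
    if requested not in ("auto", "pod", "host"):
        raise ValueError(f"unknown --tftp-via value: {requested!r}")
    is_rack = power_type == "rack"
    if requested == "auto":
        # auto picks the pod only when the power controller is a rack pod.
        return "pod" if is_rack else "host"
    if requested == "pod" and not is_rack:
        # Explicit pod mode without a rack controller is a user error.
        raise ValueError("--tftp-via pod requires DEFIB_POWER_TYPE=rack")
    return requested
```

Resolving the flag up front keeps the later staging code free of power-controller checks.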

Two robustness improvements:

- **`tftp_clear` BEFORE staging.** A prior aborted run leaves PSRAM
occupied, so the next run's 4 MB rootfs upload fails to allocate (OOM
with a 256 KB largest free block). Wipe first.
- **`try/finally` around Phase 5 + 6.** A mid-loop write failure skipped
`__aexit__` and leaked ~7 MB of pod PSRAM until the next install. The
`try/finally` (with the cleanup hooks pre-registered on the
`AsyncExitStack`) makes cleanup unconditional.
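
A minimal sketch of the two improvements together, using a fake pod object. `tftp_clear` and `tftp_put` are named in this PR, but their signatures here are assumed, and `FakePod` and `write_partitions` are hypothetical stand-ins rather than the actual `_restore_async` code.

```python
import asyncio
from contextlib import AsyncExitStack


class FakePod:
    """Stand-in for RackController; records calls instead of using HTTP."""

    def __init__(self):
        self.staged = {}
        self.calls = []

    async def tftp_clear(self):
        self.calls.append("clear")
        self.staged.clear()

    async def tftp_put(self, name, data):
        self.calls.append(f"put {name}")
        self.staged[name] = data


async def write_partitions(pod, partitions, fail_on=None):
    stack = AsyncExitStack()
    try:
        # Wipe leftovers from any prior aborted run before staging.
        await pod.tftp_clear()
        # Register cleanup up front so it runs even if a later step raises.
        stack.push_async_callback(pod.tftp_clear)
        for name, data in partitions.items():
            await pod.tftp_put(name, data)
        for name in partitions:
            if name == fail_on:
                # Simulated mid-loop flash write failure.
                raise RuntimeError(f"write failed: {name}")
    finally:
        # Runs the registered callbacks on both success and failure.
        await stack.aclose()
```

Running `asyncio.run(write_partitions(FakePod(), {...}, fail_on="mtd2"))` raises the simulated error, yet the pod's staged files are still cleared on the way out.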

### Live verification on rack pod `10.216.128.69` (hi3516ev300)

Synthetic dump dir at `/tmp/cam_dump/` (mtd0..3 sized to match the 16 MB
NOR layout):

```
$ DEFIB_POWER_TYPE=rack DEFIB_RACK_HOST=10.216.128.69 \
  defib restore -c hi3516ev300 -i /tmp/cam_dump/ \
                -p rack://10.216.128.69 --power-cycle --flash-type nor

  Power: rack pod HTTP API
Phase 1: Loading U-Boot to RAM
  Pod-side fastboot in progress…
Phase 4: Network setup — Network OK (attempt 1)
Phase 5: Writing flash
  Staging 7664 KB in pod PSRAM via POST /tftp/<name>...
  Pod TFTP ready on 192.168.1.1:69
  mtd1: 64KB    → 0x40000     Written (7.5 s)
  mtd2: 3072KB  → 0x50000     Written (11.7 s)
  mtd3: 4272KB  → 0x350000    Written (15.7 s)
  mtd0: 256KB   → 0x0         Written (8.3 s)
Restore complete!
```

Camera reaches `openipc-hi3516ev300 login:` cleanly. `exit=0`.

### Companion rack-firmware change (local-only)

`UART_IDLE_TIMEOUT_S` **60 → 600**. The 60-second idle timer was killing
the bridge socket mid-staging — ~50 s of HTTP `/tftp` uploads counts as
"idle" to the bridge (no host→pod UART traffic during that window). 600
s comfortably covers full installs and restores.

## Test plan

- [ ] `uv run pytest tests/ -x -v --ignore=tests/fuzz` — 486 passed / 2
skipped (no new unit tests; `_restore_async` is integration-only)
- [ ] `uv run ruff check src/defib/cli/app.py` — clean
- [ ] `uv run mypy src/defib/cli/app.py --ignore-missing-imports` —
clean
- [ ] Regression: `defib restore --tftp-via host …` still works on
existing RouterOS+host-TFTP setups — host branch is byte-identical
except for being inside the shared `AsyncExitStack`.
- [ ] `--tftp-via pod` without `DEFIB_POWER_TYPE=rack` → clean error
message.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Dmitry Ilyin <widgetii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>