Commit 3f7f3b1

widgetii and claude authored
agent+transport: 5.6× flash throughput over rack pod via baud switching (#89)
## Summary

Three coordinated changes that make the flash agent's high-speed-UART mode (`DEFAULT_FAST_BAUD = 921600`) actually work over rack-pod WiFi-bridged links. Previously the host-side baud switch path (`port.baudrate = baud`, via pyserial) silently failed on `SocketTransport`, so `defib agent {read,write,scan}` over `tcp://<pod>:9000` fell back to `FALLBACK_BAUD = 115200` and ran at ~10 KB/s.

## Live result on the prototype

256 KB sustained flash read at 0x14000000 through the agent over `rack://10.216.128.69`:

| Path | Rate | Speedup |
|---|---|---|
| 115200 (fallback) | 11.1 KB/s | 1.0× |
| 921600 (rack baud switch) | **61.9 KB/s** | **5.57×** |

That is ~70 % of the theoretical 8× ceiling; the rest is COBS + windowed-ACK protocol overhead, which is the same as on the on-serial path.

## What changed

### 1. `Transport.set_baudrate(baud)` abstraction

New method on `defib.transport.base.Transport`. The default raises `NotImplementedError`. Overrides:

- **`SerialTransport`**: sets `self._port.baudrate` (was inlined in `FlashAgentClient.set_baud`).
- **`Rfc2217Transport`**: already had `set_baudrate` from PR #64 (Vectis); just exposed through the ABC.
- **New `RackTransport(SocketTransport)`**: captures the pod's HTTP base URL at construction; `set_baudrate` POSTs `{"rate": baud}` to `/uart/baud`.

New `rack://host[:bridge_port][?api=http_port]` URL scheme in `serial_platform.create_transport` (defaults 9000 / 8080).

`FlashAgentClient.set_baud` now calls `await transport.set_baudrate(baud)`. This works across all four transport flavours and cleanly returns `False` when the transport refuses (was: raw `AttributeError`).

### 2. Agent: stop auto-reverting on the post-switch verification window

`handle_set_baud` used to switch the UART, then call `proto_recv(timeout=3000)` to wait for a verification packet from the host, reverting to 115200 if nothing arrived.

The "3000 ms" budget is a **CPU-speed-dependent busy-wait** (`for (volatile int d=25; d>0; d--) {}` × `timeout_ms*100` iterations), and on a fast Cortex-A7 the actual window collapses to **~300 ms**. Over a rack pod the host's `POST /uart/baud` itself takes ~1 s (WiFi RTT + httpd dispatch), so the agent reverted to 115200 long before any verification packet could land. Result: agent at 115200, bridge at 921600, host reading 35 bytes of misclocked `0x80 0x00 …` garbage forever.

Fix: drop the verification window. The agent stays at whatever baud the last `CMD_SET_BAUD` selected. If the new rate doesn't work, the agent is unreachable until the next power-cycle / fastboot, both of which the rack pod and RouterOS trivially provide.

(This also matches the local-UART experience: defib has been using the same `set_baud` against MikroTik+pyserial-attached cameras successfully because pyserial's `port.baudrate=` is microsecond-fast, easily landing within the agent's collapsed ~300 ms window. The bug only surfaces when the host-side switch is on the wrong side of a high-RTT control plane.)

### 3. Pod firmware (rack repo, local-only, `uart-bridge-flush-rx-on-accept` branch)

Defensive UART hygiene around `/uart/baud`: drain the TX FIFO at the old rate before `uart_set_baudrate`, and read back the actual divisor via `uart_get_baudrate`. Belt and braces: even with the agent fix, in-flight bytes left over from the old rate would be clocked out at the new rate and corrupt the agent's RX window.

## Tests

7 new tests in `tests/test_transport_rack.py`:

- `set_baudrate` POSTs correct URL + body
- HTTP / URL errors surface as `TransportError`
- `rack://` URL parsing with default + custom + `?api=` query
- Reject missing host

Suite: **468 passed / 2 skipped**; ruff + mypy clean.

## Test plan

- [ ] `uv run pytest tests/ -x -v --ignore=tests/fuzz`
- [ ] `uv run ruff check src/defib/ tests/`
- [ ] `uv run mypy src/defib/ --ignore-missing-imports`
- [ ] Regression: agent baud switch on local serial (pyserial path). Same protocol changes, just no `RackTransport`. Confirm reading 256 KB at 921600 still works on a USB-serial-attached camera.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Dmitry Ilyin <widgetii@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent deabdc2 commit 3f7f3b1
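The speedup figures quoted in the commit message can be cross-checked with a few lines of arithmetic. The sketch below uses only the two baud rates and the two measured rates from the results table; the variable names are illustrative and not taken from the repository.

```python
# Cross-check of the throughput table, using only figures quoted in the commit message.
FALLBACK_BAUD = 115200
FAST_BAUD = 921600

measured_fallback_kbs = 11.1  # KB/s at 115200, from the table
measured_fast_kbs = 61.9      # KB/s at 921600, from the table

ceiling = FAST_BAUD / FALLBACK_BAUD                  # best-case speedup from the baud ratio
speedup = measured_fast_kbs / measured_fallback_kbs  # speedup from the rounded measurements
efficiency = speedup / ceiling                       # fraction of the ceiling achieved

print(f"ceiling    {ceiling:.1f}x")
print(f"speedup    {speedup:.2f}x")
print(f"efficiency {efficiency:.0%}")
```

Computed from the rounded table values, the speedup comes out marginally above the reported 5.57×, which is consistent with the commit's figure having been derived from unrounded measurements.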

7 files changed

Lines changed: 355 additions & 34 deletions

agent/main.c

Lines changed: 14 additions & 24 deletions
@@ -969,30 +969,20 @@ static void handle_set_baud(const uint8_t *data, uint32_t len) {
     /* Drain any garbage from baud rate transition */
     while (uart_readable()) uart_getc();

-    /* Wait for host to confirm with any valid command within 3 seconds.
-     * If nothing arrives, revert to 115200 — the host may have failed
-     * to switch or the new baud rate doesn't work on this link. */
-    uint8_t pkt[MAX_PAYLOAD + 16];
-    uint32_t pkt_len = 0;
-    uint8_t cmd = proto_recv(pkt, &pkt_len, 3000);
-    if (cmd == 0) {
-        /* No valid command — revert */
-        uart_set_baud(115200);
-        while (uart_readable()) uart_getc();
-        at_default_baud = 1;
-    } else {
-        /* Got a valid command at new baud — confirmed working */
-        at_default_baud = (baud == 115200);
-        switch (cmd) {
-            case CMD_INFO: handle_info(); break;
-            case CMD_READ: handle_read(pkt, pkt_len); break;
-            case CMD_WRITE: handle_write(pkt, pkt_len); break;
-            case CMD_CRC32: handle_crc32_cmd(pkt, pkt_len); break;
-            case CMD_SCAN: handle_scan(pkt, pkt_len); break;
-            case CMD_MARK_BAD: handle_mark_bad(pkt, pkt_len); break;
-            default: proto_send_ack(ACK_OK); break;
-        }
-    }
+    /* Stay at the new baud unconditionally. Earlier versions waited up
+     * to 3 s for a verification packet and reverted to 115200 otherwise,
+     * but proto_recv's "3 s" deadline is a CPU-speed-dependent busy-wait
+     * (≈25-cycle loop × 100·timeout_ms iterations) — on a fast Cortex-A7
+     * the actual window collapses to <300 ms, which is shorter than the
+     * host-side WiFi-RTT for the rack pod's `POST /uart/baud` (≈1 s).
+     * The agent reverted before the host's verification packet could
+     * arrive at the new rate, leaving host/agent permanently mismatched
+     * and reading misclocked garbage.
+     *
+     * Failure mode if the host can't reach us at the new baud: agent is
+     * unrecoverable until the next power-cycle / fastboot, which the
+     * rack pod or RouterOS can both do trivially. */
+    at_default_baud = (baud == 115200);
 }

 int main(void) {
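The collapse of the nominal 3 s window described in the new comment can be illustrated with a simple model. This is a sketch under two stated assumptions that the commit does not confirm: the loop count was tuned so the timeout is accurate at some reference clock, and time per iteration scales inversely with clock speed. The function name and the clock values are hypothetical.

```python
def effective_window_ms(timeout_ms: float, calibrated_hz: float, actual_hz: float) -> float:
    """Wall-clock duration of a fixed-iteration busy-wait.

    Assumes the iteration count is accurate at `calibrated_hz` and that
    the time per iteration scales as 1 / clock frequency.
    """
    return timeout_ms * calibrated_hz / actual_hz


# Hypothetical clocks: a core running 10x faster than the loop was tuned for.
window = effective_window_ms(3000, calibrated_hz=100e6, actual_hz=1000e6)

# Approximate POST /uart/baud latency quoted in the commit message.
host_switch_ms = 1000

print(window)                   # effective verification window in ms
print(window < host_switch_ms)  # True means the agent reverts before the host has switched
```

Under this model, any core fast enough to shrink the window below the host's switch latency reverts before the verification packet can arrive, which is the failure the commit removes.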

src/defib/agent/client.py

Lines changed: 44 additions & 10 deletions
@@ -753,18 +753,20 @@ async def set_baud(self, baud: int) -> bool:

         Protocol: send SET_BAUD command, receive ACK at current baud,
         then both sides switch. Verifies with INFO at new baud.
-        Falls back to original baud on failure.
+        Falls back to ``FALLBACK_BAUD`` on failure.
+
+        Routes through :meth:`Transport.set_baudrate` — pyserial
+        transports update their port; RFC 2217 sends SET-BAUDRATE; the
+        rack pod's :class:`RackTransport` POSTs to ``/uart/baud``.
+        Transports without out-of-band baud signalling raise
+        ``NotImplementedError`` and we abort cleanly so the caller can
+        stay at ``FALLBACK_BAUD``.
         """
         self._clear_rx_buffers()

         import asyncio

-        port = getattr(self._transport, '_port', None)
-        if port is None:
-            logger.error("set_baud requires serial transport with _port")
-            return False
-
-        old_baud = port.baudrate
+        old_baud = self._current_baud
         payload = struct.pack("<I", baud)
         await send_packet(self._transport, CMD_SET_BAUD, payload)

@@ -775,22 +777,54 @@ async def set_baud(self, baud: int) -> bool:

         # Agent has switched — now switch host side
         await asyncio.sleep(0.05)  # Brief pause for agent to complete switch
-        port.baudrate = baud
+        try:
+            await self._transport.set_baudrate(baud)
+        except NotImplementedError:
+            logger.error(
+                "set_baud: transport has no out-of-band baud control; "
+                "cannot sync host side. Wire mismatch — staying at %d.",
+                old_baud,
+            )
+            # Best-effort: nudge the agent back to fallback so we don't
+            # end up with a permanently mismatched link.
+            try:
+                fallback = struct.pack("<I", FALLBACK_BAUD)
+                await send_packet(self._transport, CMD_SET_BAUD, fallback)
+            except Exception:
+                pass
+            return False

-        # Verify communication at new baud
+        # Verify communication at new baud. Drain first — any bytes that
+        # were on the wire DURING the baud transition (e.g. the agent's
+        # post-ACK drain residue, or bridge UART RX bytes clocked at the
+        # wrong rate during the host→pod /uart/baud RTT) would be parsed
+        # as junk at the new rate and corrupt the next packet.
+        await self._transport.flush_input()
+        # Clear the async-leftover buffer the agent protocol parser keeps
+        # so any half-packet bytes left from the previous rate don't
+        # contaminate the verification read.
+        try:
+            from defib.agent.protocol import _async_leftover
+            _async_leftover.pop(id(self._transport), None)
+        except ImportError:
+            pass
         await asyncio.sleep(0.05)
         try:
             await send_packet(self._transport, CMD_INFO)
             cmd, data = await recv_response(self._transport, timeout=3.0)
             if cmd == RSP_INFO:
                 logger.info("Baud rate switched to %d", baud)
+                self._current_baud = baud
                 return True
         except Exception:
             pass

         # Failed — switch back
         logger.warning("Verification at %d baud failed, reverting to %d", baud, old_baud)
-        port.baudrate = old_baud
+        try:
+            await self._transport.set_baudrate(old_baud)
+        except NotImplementedError:
+            pass
         return False

     async def mark_bad_block(self, block: int) -> bool:
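The `CMD_SET_BAUD` payload in this hunk is built with `struct.pack("<I", baud)`, that is, the baud rate as a little-endian unsigned 32-bit integer. A minimal round-trip sketch follows; the helper names `pack_baud` and `unpack_baud` are hypothetical, and the agent-side decoding is not shown in this diff.

```python
import struct


def pack_baud(baud: int) -> bytes:
    """Encode a baud rate as a little-endian unsigned 32-bit payload."""
    return struct.pack("<I", baud)


def unpack_baud(payload: bytes) -> int:
    """Decode a 4-byte little-endian payload back into a baud rate."""
    (baud,) = struct.unpack("<I", payload)
    return baud


fast = pack_baud(921600)      # 921600 == 0x000E1000
fallback = pack_baud(115200)  # 115200 == 0x0001C200

print(fast.hex())             # least significant byte comes first
print(fallback.hex())
print(unpack_baud(fast))
```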

src/defib/transport/base.py

Lines changed: 14 additions & 0 deletions
@@ -133,5 +133,19 @@ async def unread(self, data: bytes) -> None:
         """
         raise NotImplementedError("This transport does not support unread()")

+    async def set_baudrate(self, baud: int) -> None:
+        """Change the UART baud rate on both ends of the link.
+
+        Real serial transports set their pyserial ``baudrate`` property.
+        RFC 2217 sends a SET-BAUDRATE sub-option to the remote bridge.
+        Bridges that expose an out-of-band control channel (e.g. the
+        rack pod's ``POST /uart/baud``) call into it.
+
+        Plain TCP-bridged UARTs that have no signalling for baud rate
+        changes raise ``NotImplementedError`` and the caller must keep
+        the wire at ``115200``.
+        """
+        raise NotImplementedError("This transport does not support set_baudrate()")
+
     async def close(self) -> None:
         """Close the transport. Default implementation does nothing."""

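The refuse-by-default pattern introduced in `base.py` can be sketched in isolation. The classes below are stand-ins, not the real `defib` types: `FakeSerialTransport` and `try_switch` are hypothetical names, and the sketch shows only how a caller can turn `NotImplementedError` into a clean `False`.

```python
import asyncio


class Transport:
    """Stand-in for the base class: refuses baud changes by default."""

    async def set_baudrate(self, baud: int) -> None:
        raise NotImplementedError("This transport does not support set_baudrate()")


class FakeSerialTransport(Transport):
    """Hypothetical transport that just records the requested rate."""

    def __init__(self) -> None:
        self.baud = 115200

    async def set_baudrate(self, baud: int) -> None:
        self.baud = baud


async def try_switch(transport: Transport, baud: int) -> bool:
    """Return True if the host side switched, False if the transport refused."""
    try:
        await transport.set_baudrate(baud)
    except NotImplementedError:
        return False
    return True


serial_like = FakeSerialTransport()
switched = asyncio.run(try_switch(serial_like, 921600))
refused = asyncio.run(try_switch(Transport(), 921600))
print(switched, serial_like.baud)
print(refused)
```

Keeping the refusal in the base class means a caller needs no `getattr` probing for a private `_port` attribute, which is what the removed client code did.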
src/defib/transport/rack.py

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
+"""TCP transport for rack pods, with out-of-band baud rate control.
+
+A rack pod's TCP UART bridge passes bytes verbatim — no in-band signal
+for the bridge to change its UART baud rate. The pod exposes a separate
+HTTP control plane (``POST /uart/baud``) for that, so callers like
+:class:`~defib.agent.client.FlashAgentClient.set_baud` can sync both
+ends of the link when the on-device agent jumps to a faster rate.
+
+``RackTransport`` extends :class:`~defib.transport.socket.SocketTransport`
+with the HTTP base URL of the controlling pod and an
+:meth:`set_baudrate` override that POSTs the new rate. URL scheme:
+
+    ``rack://host[:bridge_port][?api=http_port]``
+
+defaults: ``bridge_port=9000``, ``http_port=8080``.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import logging
+import socket as sock_mod
+import urllib.error
+import urllib.request
+
+from defib.transport.base import TransportError
+from defib.transport.socket import SocketTransport
+
+logger = logging.getLogger(__name__)
+
+
+class RackTransport(SocketTransport):
+    """SocketTransport + HTTP control channel for the pod's /uart/baud."""
+
+    def __init__(self, conn: sock_mod.socket, http_base: str) -> None:
+        super().__init__(conn)
+        self._http_base = http_base.rstrip("/")
+
+    @classmethod
+    async def create_rack(
+        cls,
+        host: str,
+        bridge_port: int = 9000,
+        http_port: int = 8080,
+    ) -> RackTransport:
+        try:
+            s = sock_mod.socket(sock_mod.AF_INET, sock_mod.SOCK_STREAM)
+            s.setblocking(False)
+            s.setsockopt(sock_mod.IPPROTO_TCP, sock_mod.TCP_NODELAY, 1)
+            loop = asyncio.get_event_loop()
+            await loop.sock_connect(s, (host, bridge_port))
+        except OSError as e:
+            raise TransportError(
+                f"Failed to connect to rack pod {host}:{bridge_port}: {e}"
+            ) from e
+        http_base = f"http://{host}:{http_port}"
+        logger.info(
+            "Connected to rack pod: tcp://%s:%d (control %s)",
+            host, bridge_port, http_base,
+        )
+        return cls(s, http_base)
+
+    async def set_baudrate(self, baud: int) -> None:
+        """Sync the pod's UART side to ``baud`` via POST /uart/baud.
+
+        The on-device agent flips to ``baud`` after its own CMD_SET_BAUD
+        handler; we POST here to bring the bridge's UART side in line.
+        Without this, the host writes at host-imagined ``baud`` but the
+        bridge keeps clocking at 115200 — every byte gets mangled.
+        """
+        url = f"{self._http_base}/uart/baud"
+        body = json.dumps({"rate": int(baud)}).encode("ascii")
+        logger.info("rack POST %s rate=%d", url, baud)
+        await asyncio.to_thread(self._post_baud_sync, url, body)
+
+    @staticmethod
+    def _post_baud_sync(url: str, body: bytes) -> None:
+        req = urllib.request.Request(
+            url, data=body, method="POST",
+            headers={"Content-Type": "application/json"},
+        )
+        try:
+            with urllib.request.urlopen(req, timeout=5.0) as resp:
+                resp.read()
+        except urllib.error.HTTPError as e:
+            detail = e.read().decode("utf-8", "replace")[:200]
+            raise TransportError(
+                f"rack HTTP {e.code} on {url}: {detail}"
+            ) from e
+        except (urllib.error.URLError, TimeoutError, OSError) as e:
+            raise TransportError(
+                f"rack unreachable at {url}: {e}"
+            ) from e
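What `set_baudrate` puts on the wire can be inspected without a pod by building the same `urllib.request.Request` and not sending it. The sketch below mirrors the URL and body construction from the diff; `build_baud_request` and the `pod.example` host are hypothetical, introduced only for illustration.

```python
import json
import urllib.request


def build_baud_request(http_base: str, baud: int) -> urllib.request.Request:
    """Build (but do not send) the POST /uart/baud request for a given rate."""
    url = f"{http_base.rstrip('/')}/uart/baud"
    body = json.dumps({"rate": int(baud)}).encode("ascii")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical pod address; nothing is sent over the network here.
req = build_baud_request("http://pod.example:8080/", 921600)

print(req.get_method(), req.full_url)
print(req.data)
```

Separating request construction from `urlopen` is one way to unit-test the URL and body without mocking the network; whether the repository's tests do it this way is not shown in this diff.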

src/defib/transport/serial.py

Lines changed: 3 additions & 0 deletions
@@ -108,6 +108,9 @@ async def flush_input(self) -> None:
     async def flush_output(self) -> None:
         self._port.reset_output_buffer()

+    async def set_baudrate(self, baud: int) -> None:
+        self._port.baudrate = baud
+
     async def bytes_waiting(self) -> int:
         return int(self._port.in_waiting)


src/defib/transport/serial_platform.py

Lines changed: 38 additions & 0 deletions
@@ -164,6 +164,44 @@ async def create_transport(
         logger.info("Using RFC 2217 transport: %s", device)
         return await Rfc2217Transport.create(device, baudrate=baudrate)

+    # Rack pod: TCP UART bridge + HTTP control plane for baud sync.
+    # URL form: rack://host[:bridge_port][?api=http_port]. Defaults
+    # are 9000 / 8080. Differs from tcp:// only in that set_baudrate()
+    # POSTs to /uart/baud, so the on-device agent's set_baud rendezvous
+    # actually syncs both ends.
+    if device.startswith("rack://"):
+        from defib.transport.rack import RackTransport
+        endpoint = device[len("rack://"):]
+        # Optional ?api=NNN suffix
+        api_port = 8080
+        if "?" in endpoint:
+            endpoint, _, query = endpoint.partition("?")
+            for kv in query.split("&"):
+                if kv.startswith("api="):
+                    try:
+                        api_port = int(kv[len("api="):])
+                    except ValueError as e:
+                        raise TransportError(
+                            f"rack:// api port is not a number: {kv!r}"
+                        ) from e
+        if ":" in endpoint:
+            host, _, bp = endpoint.partition(":")
+            try:
+                bridge_port = int(bp)
+            except ValueError as e:
+                raise TransportError(
+                    f"rack:// bridge port is not a number: {bp!r}"
+                ) from e
+        else:
+            host = endpoint
+            bridge_port = 9000
+        if not host:
+            raise TransportError(f"rack:// transport needs a host (got '{device}')")
+        logger.info(
+            "Using RackTransport: %s:%d (api :%d)", host, bridge_port, api_port,
+        )
+        return await RackTransport.create_rack(host, bridge_port, api_port)
+
     platform = force_platform or sys.platform

     if platform == "darwin":
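The `rack://` parsing above can be exercised as a standalone pure function. This sketch follows the same `partition`-based approach and the same defaults (9000 / 8080), but it is not the repository's code: `parse_rack_url` is a hypothetical name, and it raises `ValueError` where the commit raises `TransportError`.

```python
def parse_rack_url(device: str) -> tuple[str, int, int]:
    """Split rack://host[:bridge_port][?api=http_port] into (host, bridge_port, api_port)."""
    prefix = "rack://"
    if not device.startswith(prefix):
        raise ValueError(f"not a rack:// URL: {device!r}")
    endpoint, _, query = device[len(prefix):].partition("?")

    api_port = 8080  # default HTTP control port
    for kv in query.split("&"):
        if kv.startswith("api="):
            api_port = int(kv[len("api="):])  # raises ValueError if not numeric

    host, sep, bp = endpoint.partition(":")
    bridge_port = int(bp) if sep else 9000  # default bridge port
    if not host:
        raise ValueError(f"rack:// URL needs a host: {device!r}")
    return host, bridge_port, api_port


print(parse_rack_url("rack://10.216.128.69"))
print(parse_rack_url("rack://pod:9001?api=8081"))
```

A URL with no host, such as `rack://` or `rack://?api=8081`, is rejected, matching the "reject missing host" case listed under Tests.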
