Skip to content

transport/common/hci_h4: propagate OOM instead of asserting#2233

Open
gmarull wants to merge 1 commit into
apache:masterfrom
teslabs:oom-no-assert
Open

transport/common/hci_h4: propagate OOM instead of asserting#2233
gmarull wants to merge 1 commit into
apache:masterfrom
teslabs:oom-no-assert

Conversation

@gmarull
Copy link
Copy Markdown
Contributor

@gmarull gmarull commented May 13, 2026

When hci_h4_sm_w4_header() or hci_h4_sm_w4_payload() returns -1 because the underlying allocator returned NULL (transport ACL pool exhausted), hci_h4_sm_rx() currently asserts via two assert(rc >= 0) checks. This turns a recoverable pool-pressure event into a fatal crash.

The condition is reachable in practice on host-side transports such as SiFli SF32LB52: the H4 RX task allocates ACL mbufs from mpool_acl and queues them on ble_hs_rx_q for the host event loop to drain. If the host is parked (e.g. blocked while serving GATT notifications into a downstream buffer that drains slowly) the queue grows until the pool is empty; the next allocation returns NULL and the assert fires.

Remove both assert(rc >= 0) so -1 propagates through hci_h4_sm_rx() to the transport. Callers can then back off (e.g. block until a buffer is returned to the pool via the put callback) and retry with the same input. The existing assert(rc < 0) at the bottom of hci_h4_sm_rx() still enforces the legitimate "consumed nothing => rc indicates error" invariant.

Note that when the input source is destructive, callers must retry with the same buffer rather than drop bytes on -1, otherwise the H4 framing desyncs. The state machine already preserves partial state across calls so retry is safe.

When hci_h4_sm_w4_header() or hci_h4_sm_w4_payload() returns -1 because
the underlying allocator returned NULL (transport ACL pool exhausted),
hci_h4_sm_rx() currently asserts via two `assert(rc >= 0)` checks. This
turns a recoverable pool-pressure event into a fatal crash.

The condition is reachable in practice on host-side transports such as
SiFli SF32LB52: the H4 RX task allocates ACL mbufs from mpool_acl and
queues them on ble_hs_rx_q for the host event loop to drain. If the host
is parked (e.g. blocked while serving GATT notifications into a
downstream buffer that drains slowly) the queue grows until the pool is
empty; the next allocation returns NULL and the assert fires.

Remove both `assert(rc >= 0)` so -1 propagates through hci_h4_sm_rx()
to the transport. Callers can then back off (e.g. block until a buffer
is returned to the pool via the put callback) and retry with the same
input. The existing `assert(rc < 0)` at the bottom of hci_h4_sm_rx()
still enforces the legitimate "consumed nothing => rc indicates error"
invariant.

Note that when the input source is destructive, callers must retry with
the same buffer rather than drop bytes on -1, otherwise the H4 framing
desyncs. The state machine already preserves partial state across calls
so retry is safe.

Signed-off-by: Gerard Marull-Paretas <gerard@teslabs.com>
@gmarull
Copy link
Copy Markdown
Contributor Author

gmarull commented May 13, 2026

ref. coredevices/PebbleOS#1277

@sjanc
Copy link
Copy Markdown
Contributor

sjanc commented May 13, 2026

H4 has no way of recovery if data is dropped and this was added on purpose [1]
However I'm fine with removing assert() if you are able to handle this

@andrzej-kaczmarek thoughts?

[1] 2a0379d

@gmarull
Copy link
Copy Markdown
Contributor Author

gmarull commented May 13, 2026

H4 has no way of recovery if data is dropped and this was added on purpose [1] However I'm fine with removing assert() if you are able to handle this

@andrzej-kaczmarek thoughts?

[1] 2a0379d

We're still analyzing this in more detail; we just raised this to get some initial feedback (see the other side here coredevices/PebbleOS#1277). Not all users are experiencing this problem, but some exhaust the ACL pool easily (despite having increased it a few times).

@sjanc
Copy link
Copy Markdown
Contributor

sjanc commented May 13, 2026

With H4 you either have infinite memory or some sort of flow control (ie for RS232 Bluetooth HCI UART transport layer requires HWFC). No sure how this is on SiFli but if IPC/mailbox is used with H4 ontop than those should have flow control in transport

Also, Controller to Host data flow control should probably be used (BLE_HS_FLOW_CTRL)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XS Extra small PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants