|
| 1 | +# AsyncWebSocketClient Safe Usage Rules for `_client` Pointer |
| 2 | + |
| 3 | +## Current Ownership Model |
| 4 | +- `AsyncClient* _client` is acquired in constructor via `AsyncWebServerRequest::clientRelease()` |
| 5 | +- Ownership is transferred to AsyncWebSocketClient |
| 6 | +- Destruction is triggered by AsyncClient's onDisconnect callback: `delete c;` in lambda at src/AsyncWebSocket.cpp:270 |
| 7 | +- After destruction signal, `_client` is nulled in `_onDisconnect()` at src/AsyncWebSocket.cpp:561 |
| 8 | +- **Key**: AsyncWebSocketClient is NOT the sole owner; the AsyncClient destructor is externally managed |
| 9 | + |
| 10 | +## Current Lock Architecture |
| 11 | +- `_queue_lock` (recursive_mutex): Guards **only** `_controlQueue` and `_messageQueue` |
| 12 | +- **NOT** intended to guard `_client` pointer itself (per comment in locking branch) |
| 13 | +- Each execution context (onData, onPoll, onAck, onTimeout, user API calls) may run on different threads |
| 14 | + |
| 15 | +## Safe Access Rules |
| 16 | + |
| 17 | +### 1. Construction Phase (Constructor) |
| 18 | +**Context**: Called from AsyncWebSocket::_newClient under `_ws_clients_lock` |
| 19 | +**Rule**: Safe to set up all callback handlers on `_client` without synchronization, as no other thread can yet access this object |
| 20 | +```cpp |
| 21 | +// src/AsyncWebSocket.cpp:248-295 |
| 22 | +_client->setRxTimeout(0); |
| 23 | +_client->onError(...); |
| 24 | +_client->onAck(...); |
| 25 | +_client->onDisconnect(...); // <-- This lambda will call delete c; at line 270 |
| 26 | +_client->onTimeout(...); |
| 27 | +_client->onData(...); |
| 28 | +_client->onPoll(...); |
| 29 | +``` |
| 30 | +
|
| 31 | +### 2. AsyncClient Callback Context (_onData, _onPoll, _onAck, _onTimeout, _onError) |
| 32 | +**Context**: Runs from AsyncClient's internal task (typically AsyncTCP/same core) |
| 33 | +**Rule**: |
| 34 | +- These methods are called FROM AsyncClient; they are NOT called from application/user context |
| 35 | +- Only AsyncClient callbacks can change `_client` state to null (via onDisconnect callback) |
| 36 | +- **Within these methods**, if you read `_client`, it won't become null mid-execution (single-threaded from client perspective) |
| 37 | +- **BUT**: If you release `_queue_lock` and then dereference `_client` again, or call outward to user callbacks while holding the lock, risk of re-entrancy or callback-triggered disconnect increases |
| 38 | +
|
| 39 | +**Current violations in PR #424**: |
| 40 | +- `_onDisconnect()` at line 556-562: Holds `_queue_lock`, calls `_server->_handleDisconnect(this)` which invokes user event callback |
| 41 | + - If user callback calls back into WebSocket API, lock-order inversion or re-entrancy can occur |
| 42 | + - **Fix needed**: Release `_queue_lock` before invoking user callback |
| 43 | +
|
| 44 | +- `_onData()` at line 565+: Captures `const AwsClientStatus client_status = status()` at line 569 |
| 45 | + - `status()` method **currently unguarded** (returns `_status` directly without lock) |
| 46 | + - Later at line 691-700 acquires lock and uses `_client` |
| 47 | + - **Fix needed**: Ensure `_client` null-checks are done under the same lock that prevents concurrent null-ing |
| 48 | +
|
| 49 | +### 3. User-Facing Methods (ping, text, binary, close) |
| 50 | +**Context**: Called from application code (arbitrary thread) |
| 51 | +**Rule**: |
| 52 | +- **MUST NOT** directly dereference `_client` without synchronization |
| 53 | +- **MUST** not read `_client` once, release lock, then use it |
| 54 | +- **Pattern**: Only through `find_connected_client_locked()` helper which is called under `_ws_clients_lock` |
| 55 | +
|
| 56 | +**Current violations in PR #424**: |
| 57 | +- `close()` at src/AsyncWebSocket.cpp:500-536: |
| 58 | + - Line 502-507: Acquires `_queue_lock`, checks and sets `_status` |
| 59 | + - Line 532: **Direct dereference `_client->abort()` OUTSIDE any lock after releasing `_queue_lock`** |
| 60 | + - Risk: `_onDisconnect()` could run concurrently and null `_client`, causing UAF |
| 61 | + - **Fix needed**: Either re-lock before `_client->abort()`, or capture `_client` pointer under lock and null-check outside |
| 62 | +
|
| 63 | +### 4. Queue Operations (_queueControl, _queueMessage, _runQueue) |
| 64 | +**Context**: Called from both AsyncClient callbacks and user API |
| 65 | +**Rule**: |
| 66 | +- `_queueControl()` at line 448: Acquires `_queue_lock` |
| 67 | + - Line 457: Reads `_client && _client->canSend()` |
| 68 | + - **Risk**: `_client` can be nulled by concurrent `_onDisconnect()` |
| 69 | + - **Fix needed**: `_client` null check must be atomic with state update, or re-check after lock re-acquire |
| 70 | +
|
| 71 | +- `_runQueue()` at line 404+: Called from within locked context |
| 72 | + - Line 366 and 457: Dereferences `_client->canSend()` |
| 73 | + - Assumes caller holds `_queue_lock` |
| 74 | + - **Safe IF** caller ensures no unlock between lock acquire and `_runQueue()` call |
| 75 | +
|
| 76 | +### 5. Disconnect Path (_onDisconnect, _handleDisconnect) |
| 77 | +**Context**: Called from AsyncClient::onDisconnect callback (AsyncTCP task context) |
| 78 | +**Rule**: |
| 79 | +- `_onDisconnect()` at line 556-562: |
| 80 | + - Acquires `_queue_lock` |
| 81 | + - Sets `_status = WS_DISCONNECTED` (should be under lock; currently unguarded until explicit `status()` lock added) |
| 82 | + - Nulls `_client` |
| 83 | + - Calls `_server->_handleDisconnect(this)` which fires user event under the lock |
| 84 | + - **Problem**: User callback can call back into WebSocket APIs, causing re-entrancy and potential deadlock |
| 85 | + - **Fix needed**: Unlock before invoking `_handleDisconnect()` |
| 86 | +
|
| 87 | +### 6. Cleanup Context (cleanupClients) |
| 88 | +**Context**: Called from application event loop (user's cleanup cadence) |
| 89 | +**Rule**: |
| 90 | +- Acquires `_ws_clients_lock` (server-level lock) |
| 91 | +- Calls `close()` on connected clients at src/AsyncWebSocket.cpp:1073 |
| 92 | +- Splices deleted clients into temp list to destroy after releasing lock (good pattern) |
| 93 | +- But `close()` method has the UAF issue at line 532 |
| 94 | +
|
| 95 | +## Correct Safe Pattern for _client Dereference |
| 96 | +
|
| 97 | +```cpp |
| 98 | +// Pattern 1: Short-lived operation under lock (AsyncClient callback context) |
| 99 | +void AsyncWebSocketClient::_onData_SAFE(...) { |
| 100 | + // We are already running from AsyncClient callback; no concurrent nulling from _onData path |
| 101 | + // But be careful: _onDisconnect can still run if network tears down |
| 102 | + asyncsrv::lock_guard_type lock(_queue_lock); |
| 103 | + if (_client) { |
| 104 | + // Safe to dereference once check passes, because we're in callback and null only happens from _onDisconnect |
| 105 | + _client->send_something(); |
| 106 | + } |
| 107 | +} |
| 108 | +
|
| 109 | +// Pattern 2: Capture under lock, use outside lock (user API context) |
| 110 | +bool AsyncWebSocketClient::ping_SAFE(const uint8_t *data, size_t len) { |
| 111 | + AsyncClient* client_snapshot = nullptr; |
| 112 | + { |
| 113 | + asyncsrv::lock_guard_type lock(_queue_lock); |
| 114 | + // Check both pointer and state under lock |
| 115 | + if (_client && _status == WS_CONNECTED) { |
| 116 | + client_snapshot = _client; |
| 117 | + } |
| 118 | + } |
| 119 | + // Now, outside lock, use snapshot |
| 120 | + // Risk: snapshot could become invalid, but at worst we call a dead pointer |
| 121 | + if (client_snapshot) { |
| 122 | + return _queuePing(...); // safer to queue it |
| 123 | + } |
| 124 | + return false; |
| 125 | +} |
| 126 | +
|
| 127 | +// Pattern 3: Lock around entire operation including callback (safest for short ops) |
| 128 | +void AsyncWebSocketClient::close_SAFE(uint16_t code, const char *message) { |
| 129 | + AsyncClient* client_to_close = nullptr; |
| 130 | + { |
| 131 | + asyncsrv::lock_guard_type lock(_queue_lock); |
| 132 | + if (_status != WS_CONNECTED) { |
| 133 | + return; |
| 134 | + } |
| 135 | + _status = WS_DISCONNECTING; |
| 136 | + client_to_close = _client; |
| 137 | + // Do NOT null _client here; let _onDisconnect do it |
| 138 | + } |
| 139 | + |
| 140 | + // Now outside lock, we can safely call client_to_close because: |
| 141 | + // 1. We captured it under lock |
| 142 | + // 2. It may be freed by the onDisconnect callback, but that's AsyncClient's responsibility |
| 143 | + // 3. We do NOT dereference unless we re-check _client under lock |
| 144 | + |
| 145 | + async_ws_log_w("[%s][%" PRIu32 "] CLOSE", _server->url(), _clientId); |
| 146 | + |
| 147 | + if (code) { |
| 148 | + // ... queue close frame ... |
| 149 | + } |
| 150 | +} |
| 151 | +``` |
| 152 | + |
| 153 | +## Summary of Current PR #424 Issues |
| 154 | + |
| 155 | +| Issue | Location | Risk | Fix | |
| 156 | +|-------|----------|------|-----| |
| 157 | +| _client deref outside lock | close() L532 | UAF if _onDisconnect happens | Capture under lock or re-check | |
| 158 | +| User callback in _onDisconnect under lock | L561->_handleDisconnect | Re-entrancy/deadlock | Unlock before callback | |
| 159 | +| status() call in _onData | L569 | Unguarded read of _status | Guard with per-client lock or atomic | |
| 160 | +| _client reads in _queueControl | L457 | Concurrent null from _onDisconnect | Add null-check pattern or atomic | |
| 161 | + |
| 162 | +## Recommended Next Steps |
| 163 | + |
| 164 | +1. **Separate concerns**: Use `_queue_lock` ONLY for queues, not for `_client` state |
| 165 | +2. **Add per-client coordination**: Either: |
| 166 | + - A second mutex for `_client` and `_status` safety, or |
| 167 | + - Make `_status` atomic<AwsClientStatus> if available |
| 168 | +3. **Unlock critical sections before user callbacks**: Always release locks before invoking `_handleEvent()` and user-registered callbacks |
| 169 | +4. **Document ownership explicitly**: Add comments stating `_client` is borrowed, not owned, and lifecycle assumptions |
| 170 | +5. **Consider async-safe patterns**: Queue disconnect events and process them later outside callback context, rather than invoking user code from within stack of AsyncClient callbacks |
0 commit comments