Improve device controls responsiveness and WebSocket connection resilience#6678
Improve device controls responsiveness and WebSocket connection resilience#6678bobaoapae wants to merge 17 commits into
Conversation
- Keep entity subscriptions alive permanently so controls data is always up to date in background, not just when panel is open - Cache area/device/entity registries with 5min TTL to avoid redundant WebSocket round-trips on each panel open - Send cached controls synchronously in request() to eliminate loading state on subsequent opens - Pre-fetch registries and camera thumbnails in WebsocketManager when WebSocket connects - Background sync of control entities via WebsocketManager so cache stays fresh between panel opens - Re-send all entities on any state change to prevent Android from marking unchanged controls as stale - Add dedicated OkHttpClient for WebSocket with pingInterval(15s) and connectTimeout(5s) for faster dead connection detection - Reduce DELAY_BEFORE_RECONNECT from 10s to 1s for quicker recovery - Add WiFi RSSI monitoring to proactively switch to external URL when signal drops below -80dBm - Cache camera thumbnails with interval-based refresh (10s local, 10min external) and force refresh on panel open - Add retryOnConnectionFailure to OkHttpClient - Disable FailFast crash handler in debug builds (Oppo StrictMode)
There was a problem hiding this comment.
Hi @bobaoapae
It seems you haven't yet signed a CLA. Please do so here.
Once you do that we will be able to review and accept this pull request.
Thanks!
|
Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍 |
There was a problem hiding this comment.
Pull request overview
This PR targets faster, more reliable Android Device Controls by caching registry/state/thumbnail data and tightening WebSocket reconnection/connection health behavior, including smarter internal-vs-external URL selection under weak Wi‑Fi.
Changes:
- Add in-memory caching for registry data and keep select WebSocket subscriptions alive longer to reduce repeated fetch/subscription churn
- Keep control entity state synced in the background and render cached controls immediately on panel open (plus background camera thumbnail prefetch)
- Improve connection resilience (OkHttp WS pings/timeouts, faster reconnect delay) and refine “internal network” detection using Wi‑Fi RSSI
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/websocket/WebSocketCore.kt | Use a dedicated OkHttpClient configuration for WebSocket connections (ping/connect/read timeouts) |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/websocket/impl/WebSocketCoreImpl.kt | Reduce reconnect delay to improve recovery time |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/websocket/impl/WebSocketRepositoryImpl.kt | Add registry caching + “infinite” keepalive timeouts for selected subscriptions |
| common/src/test/kotlin/io/homeassistant/companion/android/common/data/websocket/impl/WebSocketRepositoryImplTest.kt | Add tests intended to verify registry cache behavior |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/servers/ServerConnectionStateProviderImpl.kt | Treat very weak Wi‑Fi as external to avoid degraded internal connections |
| common/src/test/kotlin/io/homeassistant/companion/android/common/data/servers/ServerConnectionStateProviderImplTest.kt | Add coverage for weak Wi‑Fi behavior |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/network/WifiHelper.kt | Add Wi‑Fi RSSI API + weak-signal threshold constant |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/network/WifiHelperImpl.kt | Implement RSSI retrieval |
| common/src/main/kotlin/io/homeassistant/companion/android/common/data/HomeAssistantApis.kt | Enable OkHttp retry-on-connection-failure for API client |
| app/src/main/kotlin/io/homeassistant/companion/android/controls/HaControlsProviderService.kt | Immediate cached control emission, background camera refresh, and entity resend behavior changes |
| app/src/main/kotlin/io/homeassistant/companion/android/controls/CameraControl.kt | Add in-memory thumbnail cache + prefetch path |
| app/src/test/kotlin/io/homeassistant/companion/android/controls/CameraControlTest.kt | Add tests intended for thumbnail caching/prefetch |
| app/src/main/kotlin/io/homeassistant/companion/android/websocket/WebsocketManager.kt | Background sync for control entities + background registry/thumbnail prefetch |
| app/src/main/kotlin/io/homeassistant/companion/android/util/IgnoreViolationRules.kt | Add StrictMode ignore rule for Oppo/OnePlus DiskReadViolation stack traces |
- Map HA 'auto' hvac mode to Android's MODE_HEAT_COOL so climate entities with auto/dry/fan_only modes are presented as thermostats instead of falling back to a plain slider - Relax entityShouldBePresentedAsThermostat to require at least one mappable mode instead of all modes, and use safe Number cast for supported_features - Cycle only through Android-mappable modes on toggle action - Replace per-class OEM StrictMode ignore rules with a generic rule that ignores DiskRead/DiskWrite violations with no app frames in the stack trace, covering all OEM components (Oppo, OnePlus, etc.)
…rashes" This reverts commit 63301df.
Ignore DiskReadViolation and DiskWriteViolation that have no app frames in the stack trace, as these originate from OEM system components during binder transactions. Replaces the Oppo-specific rule with a generic approach that covers all OEM vendors.
- Use ConcurrentHashMap for all shared caches (thumbnails, entities, control IDs, base URLs) to prevent concurrent modification - Scope entity cache by serverId to avoid multi-server conflicts - Rethrow CancellationException in prefetchRegistries to preserve cooperative cancellation - Add connect/read timeout (5s) to camera thumbnail downloads - Support legacy control IDs without server prefix in cached send - Change SUBSCRIPTION_KEEPALIVE from INFINITE to 30 minutes - Fix registry cache tests to provide proper mock responses - Remove CameraControlTest (impractical to mock URL/BitmapFactory) - Fix KDoc on refreshCameraThumbnails
- Persist lastControlEntityIds to SharedPreferences so WebsocketManager can resume background entity sync after process restart - Resolve baseUrl dynamically in prefetchCameraThumbnails when not cached - Add SystemWebViewGoogle to Chromium IncorrectContextUseViolation ignore rule (existing rule only covered TrichromeWebViewGoogle variant) - Add logging for cached controls send diagnostics
The panel would hang on loading spinners because the subscribe_entities
initial state event was silently dropped by the subscription pipeline.
createSubscriptionFlow used callbackFlow{...}.shareIn(WhileSubscribed,
replay=0), which only starts the upstream on first subscriber. The
initial event arrived between sendMessage returning and the caller's
.collect — with no subscriber yet and replay=0, the event was lost and
.collect waited forever. Replaced with MutableSharedFlow(replay=1) and
a relay job started before sendMessage so events buffered in the replay
cache reach late subscribers.
Additional hardening in HaControlsProviderService:
- Send placeholder controls with STATUS_UNKNOWN when the in-memory
cache is empty, so the panel never shows infinite spinners.
- SupervisorJob on ioScope and webSocketScope so one child failure
does not cancel siblings.
- try-catch around the async work path and the compressed-state
collect to surface exceptions via Timber instead of dying silently.
- Hoist serverManager.servers() out of the groupBy lambda to avoid
O(controlIds) Room queries per panel open.
- Fix entities.remove("ha_failed.\$it") which was using Map.Entry's
toString representation instead of the key.
HaFailedControl: map "loading" state to STATUS_UNKNOWN.
WebSocketCoreImpl.onMessage: truncate payload preview to 200 chars.
Registry payloads can be megabytes and logcat splits them into
thousands of chunks, blocking the WebSocket thread for tens of seconds.
CameraControl.provideControlFeatures: remove the runBlocking + URL.openStream() fallback for cache misses. provideControlFeatures can execute on the main thread via sendCachedControlsImmediately, and the 2s runBlocking would trip StrictMode + CrashFailFastHandler. Thumbnails are populated asynchronously by prefetchThumbnail (WebsocketManager background sync and refreshCameraThumbnails on panel open), so cache misses now show the placeholder icon until the prefetch completes. IgnoreViolationRules.IgnoreChromiumTrichomeWrongContextUsage: wrap the multi-line OR expression on its own indent level to satisfy ktlint's paren/newline rules.
sendCachedControlsImmediately rendered cached entity state as live whenever WebSocketState was Active, but OkHttp only transitions the socket to Closed after a ping/read timeout (~45s) once the link drops. During that window the panel showed hours-old values as if they were current. - Add isEntityStateTrustworthy(WebSocketState, hasActiveNetwork) — the cache is only safe to display when both signals agree we are online: the WebSocket reports Active AND ConnectivityManager reports an active network. - HaControlsProviderService injects NetworkHelper for a synchronous main-thread ConnectivityManager check, and caches the per-server WebSocketRepository in webSocketRepositoryCache so the main thread can read getConnectionState() without suspending. The cache is primed from suspend contexts (subscribeToEntitiesForServer and WebsocketManager.syncControlEntities), so the first cold-start panel open after process restart already has both signals available. - When either signal says we are offline, emit STATUS_UNKNOWN placeholders instead of cached values; the async path replaces them once the subscription delivers events. - WebSocketCoreImpl: subscribe to NetworkChangeObserver and force-close the live connection when ConnectivityManager reports no network. This short-circuits the OkHttp ping timeout and unblocks any pendingConnectDeferred waiting on the 30s auth timeout when the user re-enables the network. Tests: - New IsEntityStateTrustworthyTest covers the truth table for the helper, including the WebSocket-Active-but-no-network bug scenario. - WebSocketCoreImplTest gains three cases for the network listener: force-close on loss, no-op when nothing is connected, and no-op on a capabilities-only emission while still online.
# Conflicts: # app/src/main/kotlin/io/homeassistant/companion/android/util/IgnoreViolationRules.kt
The Frontend connection-error screen surfaced even for transient WebView failures (slow page load, expired token mid-load, momentary SSL glitch in Chromium). The background WebSocket pipeline that powers the system controls was unaffected — the user could see device controls work fine while the app greeted them with "Failed to connect" and a Retry button on cold start. When `FrontendViewModel.onError` receives an `UnreachableError` and both ConnectivityManager and the per-server WebSocket report Active, we now schedule a silent retry instead of transitioning to the Error state. The view stays on `LoadingScreen` (which already renders for both LoadServer and Loading) for the duration of the backoff window, so the error screen never flashes when the retry succeeds. - shouldAutoRetry: synchronous gate based on hasActiveNetwork() and a retry budget (3 attempts). Suspended ServerManager.webSocketRepository(...) lookups happen inside the retry coroutine after the backoff delay. - scheduleAutoRetry: launches in viewModelScope, delays with exponential backoff (1s -> 2s -> 4s, capped at 5s) plus 0..500ms jitter, re-checks the gate strictly via isLiveConnectionTrustworthy at fire time, and bails to the Error state if the gate fails. - AuthenticationError, UnknownError, and UnrecoverableError do not auto-retry. Their root cause will not be solved by retrying and surfacing the error immediately is the better UX. - onRetry, switchServer, and the Connected -> Content transition all reset the budget and cancel any pending retry job. The retry job lives in viewModelScope so onCleared cancels it automatically. Promoted the existing `isEntityStateTrustworthy` helper (introduced in the prior commit on this branch for the controls layer) to a shared module-level `isLiveConnectionTrustworthy` in `common/.../websocket/WebSocketHealth.kt`. The helper now gates two distinct decisions — "render cached entity state as live" and "auto-retry instead of erroring" — both of which require the same "WebSocket Active AND ConnectivityManager reports network" predicate. The test moved with it. Tests in `FrontendViewModelTest.AutoRetry`: - UnreachableError + healthy gate -> state stays out of Error, retry fires. - UnreachableError + WebSocket Closed -> state becomes Error after one backoff (synchronous bail by hasActiveNetwork is not enough; the strict gate inside scheduleAutoRetry catches the WS state). - UnreachableError + no network -> Error immediately, no suspend WS lookup. - AuthenticationError -> Error immediately, no auto-retry scheduled. - 4 consecutive UnreachableErrors -> 4th surfaces Error. - onRetry while retry pending -> cancels job, fresh load fires. - switchServer cancels pending retry on the previous server. - Manual onRetry after exhaustion resets the budget for a fresh round.
- WebSocketHealth: rewrite KDoc in neutral voice, no first-person plural. - FrontendViewModel: drop the outer Throwable catch around the auto-retry body. The only realistically-thrown exception (IllegalStateException from webSocketRepository(serverId) when the server is unknown) is already handled inline; anything else is unexpected and should bubble up to viewModelScope's CoroutineExceptionHandler instead of being swallowed into emitErrorState. - FrontendViewModelTest: trim the conversational phrasing on the auto- retry test comments to match the rest of the file.
…fetch refreshCameraThumbnails captured the cached entity at the top of each iteration and reused it after prefetchThumbnail returned. The prefetch runs a network round-trip (up to 5s on a stuck camera), and during that window the parallel subscribeToEntitiesForServer flow can write a fresh entity state into cachedEntities. Sending the original snapshot to the subscriber after the prefetch then races against — and visibly overwrites — the live control that subscribeToEntitiesForServer just delivered. Re-read cachedEntities after the prefetch, falling back to the original snapshot only when the cache was somehow cleared in the meantime.
# Conflicts: # app/src/main/kotlin/io/homeassistant/companion/android/frontend/FrontendViewModel.kt # app/src/main/kotlin/io/homeassistant/companion/android/util/IgnoreViolationRules.kt # app/src/test/kotlin/io/homeassistant/companion/android/frontend/FrontendViewModelTest.kt
| override fun onMessage(webSocket: WebSocket, text: String) { | ||
| Timber.d("Websocket: onMessage (text: ${sensitive(text)})") | ||
| // Logging huge payloads (registries can be megabytes) blocks this thread for tens of seconds | ||
| // because logcat splits the message into thousands of small chunks. Truncate aggressively. | ||
| Timber.d("Websocket: onMessage (size=${text.length} preview=${sensitive(text.take(200))})") |
There was a problem hiding this comment.
fun sensitive(sensitive: String): String = sensitive.takeIf { BuildConfig.DEBUG } ?: "HIDDEN"
It only impacts debug builds and it is on purpose.
TimoPtr
left a comment
There was a problem hiding this comment.
I won't review this PR in this state. I invite you to open a PR for each of the point you've mentioned in the description. Let's go one by one since some are going to trigger some discussion.
Thanks for the time you've spent on this changes, but it is simply not possible to review everything all together. You're touching critical parts of the application that could have huge impacts on the user we need to be extract careful with any on the changes made.
|
Hello, thanks for the feedback. I will try to separate things, I don't know if I will be capable of that, a lot of things are correlated. But I will keep working on that and try send pull requests with feew charges by time. Otherwise this stay here as a entry point for someone else to take a look and split if interested. I'm working on that for one or two months using the app on my own device and the usage became very good. But I sure comprehend the implications, a lot of core code changed. |
Summary
I've been experiencing slow device controls loading, especially when away from my home network. Opening the controls panel would show "Loading..." for several seconds before any data appeared, and certain scenarios (rapid open/close cycles, or opening the panel after force-stop) would cause controls to get stuck permanently. These are problems that have been reported by others as well (#3396, #5195, #5952, #5472).
After investigating the root causes, I identified several areas where the controls experience could be improved:
Registry data (area/device/entity) is fetched fresh via WebSocket every time the controls panel opens, even though this data rarely changes. Added an in-memory cache with a 5-minute TTL that also serves stale data when the WebSocket is disconnected, so controls don't have to wait for a reconnection to display.
Entity states are only available while the controls panel is open. Moved entity state subscription to
WebsocketManagerso states are continuously synced in background. When the panel opens, cached states are sent synchronously inrequest()— controls appear instantly without any "Loading..." state on subsequent opens. On a cold start with no in-memory cache (e.g. right after force-stop), placeholder controls withSTATUS_UNKNOWNare emitted immediately so the panel never shows an infinite loading spinner; the async path replaces them with real data a moment later.Camera thumbnails were downloaded synchronously in
createControl()viarunBlocking, blocking delivery of every control below the camera in the list and — worse — trippingStrictMode+CrashFailFastHandlerwhenprovideControlFeaturesruns on the main thread via the synchronous cached-control path. Removed therunBlockingfallback entirely. Thumbnails are populated by a dedicated cache with interval-based background refresh (10s on local network, 10min on external) and force-refresh on panel open. Cache misses render the placeholder icon until the background prefetch completes.When one entity changes state (e.g. a cover moving 1% at a time), only that entity was re-sent to the subscriber. Android's
ControlsProviderServicecan mark other controls as stale when they stop being delivered. Changed to re-send all entities on any state change.WebSocket dead connection detection relies on a manual 30s ping cycle in
WebsocketManager. Added a dedicatedOkHttpClientfor WebSocket connections withpingInterval(15s)andconnectTimeout(5s)for faster detection and failover. Also reducedDELAY_BEFORE_RECONNECTfrom 10s to 1s so the WebSocket recovers quickly from transient failures.WiFi signal strength is not considered when determining if the device is on the home network. Added
getWifiSignalStrength()toWifiHelperand a check inisInternal()— when WiFi RSSI drops below -80 dBm, the app proactively switches to the external/cloud URL before the connection becomes unusable.Fix a race in
WebSocketCoreImpl.createSubscriptionFlowthat causedsubscribe_entitiesinitial state events to be silently dropped. The previous implementation usedcallbackFlow { ... }.shareIn(WhileSubscribed, replay = 0), which only starts the upstream on first subscriber. The initial state event arrived betweensendMessagereturning and the caller's.collect— with no subscriber yet andreplay = 0, the event was lost and.collectwaited forever. This reproducibly left device controls stuck on placeholders after the app process was killed. Replaced with an explicitMutableSharedFlow(replay = 1, extraBufferCapacity = 64)plus a relay job started beforesendMessage, so the initial event is buffered in the replay cache and delivered to late subscribers. Registry update subscriptions also keep a longSUBSCRIPTION_KEEPALIVE = 30.minutesso background sync and subsequent panel opens reuse the same subscription without re-subscribing on every cycle.Added a generic
IgnoreSystemDiskIoStrictMode rule that filtersDiskReadViolationandDiskWriteViolationwhose stack trace contains no application frames. This replaces an earlier OEM-specific rule and covers Oppo/OnePlus (OplusUIFirstManager,OplusHansManager,OplusBinderProxy), Samsung, MIUI, and any future OEM disk I/O happening in binder transactions outside application control.WebSocketCoreImpl.onMessagelogged the entire raw payload viaTimber.d, and in debug buildssensitive()returns the text unchanged. With coalesced registry responses reaching several MB, logcat splits the call into thousands of 4 KB chunks and the WebSocket thread blocks on logging for tens of seconds — which by itself caused post-force-stop panel opens to appear hung. Truncated the preview to 200 characters (keeping the size prefix) so large payloads log in under a millisecond and the message dispatcher doesn't starve.After running the previous nine changes in production for a few weeks, two further problems surfaced — both share the same diagnostic insight: WebSocket health is the right signal to gate several decisions that today rely only on local heuristics.
Cached controls were rendered as live as long as
WebSocketState == Active. OkHttp only transitions the socket toClosedafter a ping/read timeout (up to ~45s after the physical link drops). During that window, panels could display values that were hours old as if they were current — a user could tap a "light off" tile from the cache and believe the command succeeded when nothing actually got sent. AddedisLiveConnectionTrustworthy(WebSocketState, hasActiveNetwork)incommon/.../websocket/WebSocketHealth.kt. The cache is now only displayed as live when both the WebSocket reportsActiveANDConnectivityManagerreports an active network; otherwisesendCachedControlsImmediatelyemitsSTATUS_UNKNOWNplaceholders that the async path replaces once the real subscription delivers events. To shorten the OkHttp detection window further,WebSocketCoreImplnow subscribes toNetworkChangeObserverand force-closes the live connection the instantConnectivityManagerreports no network — short-circuiting the ~45s ping timeout and immediately unblocking any in-flightconnect()call that would otherwise wait out the 30s auth timeout when the user re-enables the network.Opening the app sometimes shows the Frontend connection-error screen even when the background WebSocket is perfectly healthy. The user could see device controls work fine in the system controls panel while the app greeted them with "Failed to connect" and a Retry button on cold start. Likely causes for the WebView's failure path: the 10s
CONNECTION_TIMEOUTfiring on slow page loads, briefly expired tokens caught mid-load, or momentary SSL glitches in Chromium. WhenFrontendViewModel.onErrorreceives anUnreachableError, the synchronous gate (network available + retry budget) plus a strict re-check at fire time (WebSocketState.Active) now decides whether to schedule a silent retry instead of transitioning to theErrorstate. The view stays onLoadingScreenfor the duration of the backoff window, so the error screen never flashes when the retry succeeds. Up to 3 attempts with exponential backoff (1s → 2s → 4s, capped at 5s) plus 0..500ms jitter; the strict gate is re-evaluated after the delay so a WebSocket that flips toClosedmid-backoff bails to the error state instead of looping.AuthenticationError,UnknownError, andUnrecoverableErrordo not auto-retry — surfacing them immediately gives the user faster feedback.onRetry,switchServer, and theConnected → Contenttransition all reset the budget and cancel any pending retry job; the retry job lives inviewModelScopesoonClearedcancels it automatically.Related issues
Checklist
Screenshots
N/A — no user-facing UI changes, only behavioral improvements.
Link to pull request in documentation repositories
N/A
Any other notes
pingInterval(15s)on the WebSocket client may have battery implications. It's configured only on the WebSocket-specificOkHttpClient(not the shared REST client) and only generates traffic when the WebSocket is already connected. Happy to discuss tradeoffs or make the interval configurable.SUBSCRIPTION_KEEPALIVE = 30.minutesis long enough that brief panel close/reopen cycles (and the background entity sync inWebsocketManager) reuse the same subscription, but short enough that the server does unsubscribe on real idle. This replaced an earlierDuration.INFINITEthat reviewers flagged as a potential leak.isLiveConnectionTrustworthyhelper is shared between two callers (HaControlsProviderServicefor [PLAYSTORE] release to playstore #10 andFrontendViewModelfor [CI] Travis with Firebase App Distribution #11) and has a dedicated unit test (IsLiveConnectionTrustworthyTest). The two callers consume the same per-serverWebSocketRepositorysingleton viaServerManager, so a single source of truth gates both.