Skip to content

Improve device controls responsiveness and WebSocket connection resilience#6678

Draft
bobaoapae wants to merge 17 commits into
home-assistant:mainfrom
bobaoapae:main
Draft

Improve device controls responsiveness and WebSocket connection resilience#6678
bobaoapae wants to merge 17 commits into
home-assistant:mainfrom
bobaoapae:main

Conversation

@bobaoapae
Copy link
Copy Markdown

@bobaoapae bobaoapae commented Apr 7, 2026

Summary

I've been experiencing slow device controls loading, especially when away from my home network. Opening the controls panel would show "Loading..." for several seconds before any data appeared, and certain scenarios (rapid open/close cycles, or opening the panel after force-stop) would cause controls to get stuck permanently. These are problems that have been reported by others as well (#3396, #5195, #5952, #5472).

After investigating the root causes, I identified several areas where the controls experience could be improved:

  1. Registry data (area/device/entity) is fetched fresh via WebSocket every time the controls panel opens, even though this data rarely changes. Added an in-memory cache with a 5-minute TTL that also serves stale data when the WebSocket is disconnected, so controls don't have to wait for a reconnection to display.

  2. Entity states are only available while the controls panel is open. Moved entity state subscription to WebsocketManager so states are continuously synced in background. When the panel opens, cached states are sent synchronously in request() — controls appear instantly without any "Loading..." state on subsequent opens. On a cold start with no in-memory cache (e.g. right after force-stop), placeholder controls with STATUS_UNKNOWN are emitted immediately so the panel never shows an infinite loading spinner; the async path replaces them with real data a moment later.

  3. Camera thumbnails were downloaded synchronously in createControl() via runBlocking, blocking delivery of every control below the camera in the list and — worse — tripping StrictMode + CrashFailFastHandler when provideControlFeatures runs on the main thread via the synchronous cached-control path. Removed the runBlocking fallback entirely. Thumbnails are populated by a dedicated cache with interval-based background refresh (10s on local network, 10min on external) and force-refresh on panel open. Cache misses render the placeholder icon until the background prefetch completes.

  4. When one entity changes state (e.g. a cover moving 1% at a time), only that entity was re-sent to the subscriber. Android's ControlsProviderService can mark other controls as stale when they stop being delivered. Changed to re-send all entities on any state change.

  5. WebSocket dead connection detection relies on a manual 30s ping cycle in WebsocketManager. Added a dedicated OkHttpClient for WebSocket connections with pingInterval(15s) and connectTimeout(5s) for faster detection and failover. Also reduced DELAY_BEFORE_RECONNECT from 10s to 1s so the WebSocket recovers quickly from transient failures.

  6. WiFi signal strength is not considered when determining if the device is on the home network. Added getWifiSignalStrength() to WifiHelper and a check in isInternal() — when WiFi RSSI drops below -80 dBm, the app proactively switches to the external/cloud URL before the connection becomes unusable.

  7. Fix a race in WebSocketCoreImpl.createSubscriptionFlow that caused subscribe_entities initial state events to be silently dropped. The previous implementation used callbackFlow { ... }.shareIn(WhileSubscribed, replay = 0), which only starts the upstream on first subscriber. The initial state event arrived between sendMessage returning and the caller's .collect — with no subscriber yet and replay = 0, the event was lost and .collect waited forever. This reproducibly left device controls stuck on placeholders after the app process was killed. Replaced with an explicit MutableSharedFlow(replay = 1, extraBufferCapacity = 64) plus a relay job started before sendMessage, so the initial event is buffered in the replay cache and delivered to late subscribers. Registry update subscriptions also keep a long SUBSCRIPTION_KEEPALIVE = 30.minutes so background sync and subsequent panel opens reuse the same subscription without re-subscribing on every cycle.

  8. Added a generic IgnoreSystemDiskIo StrictMode rule that filters DiskReadViolation and DiskWriteViolation whose stack trace contains no application frames. This replaces an earlier OEM-specific rule and covers Oppo/OnePlus (OplusUIFirstManager, OplusHansManager, OplusBinderProxy), Samsung, MIUI, and any future OEM disk I/O happening in binder transactions outside application control.

  9. WebSocketCoreImpl.onMessage logged the entire raw payload via Timber.d, and in debug builds sensitive() returns the text unchanged. With coalesced registry responses reaching several MB, logcat splits the call into thousands of 4 KB chunks and the WebSocket thread blocks on logging for tens of seconds — which by itself caused post-force-stop panel opens to appear hung. Truncated the preview to 200 characters (keeping the size prefix) so large payloads log in under a millisecond and the message dispatcher doesn't starve.

After running the previous nine changes in production for a few weeks, two further problems surfaced — both share the same diagnostic insight: WebSocket health is the right signal to gate several decisions that today rely only on local heuristics.

  1. Cached controls were rendered as live as long as WebSocketState == Active. OkHttp only transitions the socket to Closed after a ping/read timeout (up to ~45s after the physical link drops). During that window, panels could display values that were hours old as if they were current — a user could tap a "light off" tile from the cache and believe the command succeeded when nothing actually got sent. Added isLiveConnectionTrustworthy(WebSocketState, hasActiveNetwork) in common/.../websocket/WebSocketHealth.kt. The cache is now only displayed as live when both the WebSocket reports Active AND ConnectivityManager reports an active network; otherwise sendCachedControlsImmediately emits STATUS_UNKNOWN placeholders that the async path replaces once the real subscription delivers events. To shorten the OkHttp detection window further, WebSocketCoreImpl now subscribes to NetworkChangeObserver and force-closes the live connection the instant ConnectivityManager reports no network — short-circuiting the ~45s ping timeout and immediately unblocking any in-flight connect() call that would otherwise wait out the 30s auth timeout when the user re-enables the network.

  2. Opening the app sometimes shows the Frontend connection-error screen even when the background WebSocket is perfectly healthy. The user could see device controls work fine in the system controls panel while the app greeted them with "Failed to connect" and a Retry button on cold start. Likely causes for the WebView's failure path: the 10s CONNECTION_TIMEOUT firing on slow page loads, briefly expired tokens caught mid-load, or momentary SSL glitches in Chromium. When FrontendViewModel.onError receives an UnreachableError, the synchronous gate (network available + retry budget) plus a strict re-check at fire time (WebSocketState.Active) now decides whether to schedule a silent retry instead of transitioning to the Error state. The view stays on LoadingScreen for the duration of the backoff window, so the error screen never flashes when the retry succeeds. Up to 3 attempts with exponential backoff (1s → 2s → 4s, capped at 5s) plus 0..500ms jitter; the strict gate is re-evaluated after the delay so a WebSocket that flips to Closed mid-backoff bails to the error state instead of looping. AuthenticationError, UnknownError, and UnrecoverableError do not auto-retry — surfacing them immediately gives the user faster feedback. onRetry, switchServer, and the Connected → Content transition all reset the budget and cancel any pending retry job; the retry job lives in viewModelScope so onCleared cancels it automatically.

Related issues

Checklist

  • New or updated tests have been added following the testing guidelines.
  • The code follows the project's code style and best practices.
  • The changes have been thoroughly tested; edge cases considered.
  • Changes are backward compatible whenever feasible.

Screenshots

N/A — no user-facing UI changes, only behavioral improvements.

Link to pull request in documentation repositories

N/A

Any other notes

  • The pingInterval(15s) on the WebSocket client may have battery implications. It's configured only on the WebSocket-specific OkHttpClient (not the shared REST client) and only generates traffic when the WebSocket is already connected. Happy to discuss tradeoffs or make the interval configurable.
  • SUBSCRIPTION_KEEPALIVE = 30.minutes is long enough that brief panel close/reopen cycles (and the background entity sync in WebsocketManager) reuse the same subscription, but short enough that the server does unsubscribe on real idle. This replaced an earlier Duration.INFINITE that reviewers flagged as a potential leak.
  • The new isLiveConnectionTrustworthy helper is shared between two callers (HaControlsProviderService for [PLAYSTORE] release to playstore #10 and FrontendViewModel for [CI] Travis with Firebase App Distribution #11) and has a dedicated unit test (IsLiveConnectionTrustworthyTest). The two callers consume the same per-server WebSocketRepository singleton via ServerManager, so a single source of truth gates both.
  • Tested on OnePlus 15 (Android 16) on both local network and 4G/cloud connections with the persistent WebSocket setting enabled.

- Keep entity subscriptions alive permanently so controls data is
  always up to date in background, not just when panel is open
- Cache area/device/entity registries with 5min TTL to avoid
  redundant WebSocket round-trips on each panel open
- Send cached controls synchronously in request() to eliminate
  loading state on subsequent opens
- Pre-fetch registries and camera thumbnails in WebsocketManager
  when WebSocket connects
- Background sync of control entities via WebsocketManager so
  cache stays fresh between panel opens
- Re-send all entities on any state change to prevent Android
  from marking unchanged controls as stale
- Add dedicated OkHttpClient for WebSocket with pingInterval(15s)
  and connectTimeout(5s) for faster dead connection detection
- Reduce DELAY_BEFORE_RECONNECT from 10s to 1s for quicker recovery
- Add WiFi RSSI monitoring to proactively switch to external URL
  when signal drops below -80dBm
- Cache camera thumbnails with interval-based refresh (10s local,
  10min external) and force refresh on panel open
- Add retryOnConnectionFailure to OkHttpClient
- Disable FailFast crash handler in debug builds (Oppo StrictMode)
Copilot AI review requested due to automatic review settings April 7, 2026 01:27
Copy link
Copy Markdown

@home-assistant home-assistant Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bobaoapae

It seems you haven't yet signed a CLA. Please do so here.

Once you do that we will be able to review and accept this pull request.

Thanks!

@home-assistant home-assistant Bot marked this pull request as draft April 7, 2026 01:27
@home-assistant
Copy link
Copy Markdown

home-assistant Bot commented Apr 7, 2026

Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍

Learn more about our pull request process.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets faster, more reliable Android Device Controls by caching registry/state/thumbnail data and tightening WebSocket reconnection/connection health behavior, including smarter internal-vs-external URL selection under weak Wi‑Fi.

Changes:

  • Add in-memory caching for registry data and keep select WebSocket subscriptions alive longer to reduce repeated fetch/subscription churn
  • Keep control entity state synced in the background and render cached controls immediately on panel open (plus background camera thumbnail prefetch)
  • Improve connection resilience (OkHttp WS pings/timeouts, faster reconnect delay) and refine “internal network” detection using Wi‑Fi RSSI

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
common/src/main/kotlin/io/homeassistant/companion/android/common/data/websocket/WebSocketCore.kt Use a dedicated OkHttpClient configuration for WebSocket connections (ping/connect/read timeouts)
common/src/main/kotlin/io/homeassistant/companion/android/common/data/websocket/impl/WebSocketCoreImpl.kt Reduce reconnect delay to improve recovery time
common/src/main/kotlin/io/homeassistant/companion/android/common/data/websocket/impl/WebSocketRepositoryImpl.kt Add registry caching + “infinite” keepalive timeouts for selected subscriptions
common/src/test/kotlin/io/homeassistant/companion/android/common/data/websocket/impl/WebSocketRepositoryImplTest.kt Add tests intended to verify registry cache behavior
common/src/main/kotlin/io/homeassistant/companion/android/common/data/servers/ServerConnectionStateProviderImpl.kt Treat very weak Wi‑Fi as external to avoid degraded internal connections
common/src/test/kotlin/io/homeassistant/companion/android/common/data/servers/ServerConnectionStateProviderImplTest.kt Add coverage for weak Wi‑Fi behavior
common/src/main/kotlin/io/homeassistant/companion/android/common/data/network/WifiHelper.kt Add Wi‑Fi RSSI API + weak-signal threshold constant
common/src/main/kotlin/io/homeassistant/companion/android/common/data/network/WifiHelperImpl.kt Implement RSSI retrieval
common/src/main/kotlin/io/homeassistant/companion/android/common/data/HomeAssistantApis.kt Enable OkHttp retry-on-connection-failure for API client
app/src/main/kotlin/io/homeassistant/companion/android/controls/HaControlsProviderService.kt Immediate cached control emission, background camera refresh, and entity resend behavior changes
app/src/main/kotlin/io/homeassistant/companion/android/controls/CameraControl.kt Add in-memory thumbnail cache + prefetch path
app/src/test/kotlin/io/homeassistant/companion/android/controls/CameraControlTest.kt Add tests intended for thumbnail caching/prefetch
app/src/main/kotlin/io/homeassistant/companion/android/websocket/WebsocketManager.kt Background sync for control entities + background registry/thumbnail prefetch
app/src/main/kotlin/io/homeassistant/companion/android/util/IgnoreViolationRules.kt Add StrictMode ignore rule for Oppo/OnePlus DiskReadViolation stack traces

- Map HA 'auto' hvac mode to Android's MODE_HEAT_COOL so climate
  entities with auto/dry/fan_only modes are presented as thermostats
  instead of falling back to a plain slider
- Relax entityShouldBePresentedAsThermostat to require at least one
  mappable mode instead of all modes, and use safe Number cast for
  supported_features
- Cycle only through Android-mappable modes on toggle action
- Replace per-class OEM StrictMode ignore rules with a generic rule
  that ignores DiskRead/DiskWrite violations with no app frames in
  the stack trace, covering all OEM components (Oppo, OnePlus, etc.)
Ignore DiskReadViolation and DiskWriteViolation that have no app frames
in the stack trace, as these originate from OEM system components during
binder transactions. Replaces the Oppo-specific rule with a generic
approach that covers all OEM vendors.
- Use ConcurrentHashMap for all shared caches (thumbnails, entities,
  control IDs, base URLs) to prevent concurrent modification
- Scope entity cache by serverId to avoid multi-server conflicts
- Rethrow CancellationException in prefetchRegistries to preserve
  cooperative cancellation
- Add connect/read timeout (5s) to camera thumbnail downloads
- Support legacy control IDs without server prefix in cached send
- Change SUBSCRIPTION_KEEPALIVE from INFINITE to 30 minutes
- Fix registry cache tests to provide proper mock responses
- Remove CameraControlTest (impractical to mock URL/BitmapFactory)
- Fix KDoc on refreshCameraThumbnails
- Persist lastControlEntityIds to SharedPreferences so WebsocketManager
  can resume background entity sync after process restart
- Resolve baseUrl dynamically in prefetchCameraThumbnails when not cached
- Add SystemWebViewGoogle to Chromium IncorrectContextUseViolation ignore
  rule (existing rule only covered TrichromeWebViewGoogle variant)
- Add logging for cached controls send diagnostics
bobaoapae and others added 9 commits April 8, 2026 03:56
The panel would hang on loading spinners because the subscribe_entities
initial state event was silently dropped by the subscription pipeline.

createSubscriptionFlow used callbackFlow{...}.shareIn(WhileSubscribed,
replay=0), which only starts the upstream on first subscriber. The
initial event arrived between sendMessage returning and the caller's
.collect — with no subscriber yet and replay=0, the event was lost and
.collect waited forever. Replaced with MutableSharedFlow(replay=1) and
a relay job started before sendMessage so events buffered in the replay
cache reach late subscribers.

Additional hardening in HaControlsProviderService:
- Send placeholder controls with STATUS_UNKNOWN when the in-memory
  cache is empty, so the panel never shows infinite spinners.
- SupervisorJob on ioScope and webSocketScope so one child failure
  does not cancel siblings.
- try-catch around the async work path and the compressed-state
  collect to surface exceptions via Timber instead of dying silently.
- Hoist serverManager.servers() out of the groupBy lambda to avoid
  O(controlIds) Room queries per panel open.
- Fix entities.remove("ha_failed.\$it") which was using Map.Entry's
  toString representation instead of the key.

HaFailedControl: map "loading" state to STATUS_UNKNOWN.

WebSocketCoreImpl.onMessage: truncate payload preview to 200 chars.
Registry payloads can be megabytes and logcat splits them into
thousands of chunks, blocking the WebSocket thread for tens of seconds.
CameraControl.provideControlFeatures: remove the runBlocking +
URL.openStream() fallback for cache misses. provideControlFeatures
can execute on the main thread via sendCachedControlsImmediately,
and the 2s runBlocking would trip StrictMode + CrashFailFastHandler.
Thumbnails are populated asynchronously by prefetchThumbnail
(WebsocketManager background sync and refreshCameraThumbnails on
panel open), so cache misses now show the placeholder icon until
the prefetch completes.

IgnoreViolationRules.IgnoreChromiumTrichomeWrongContextUsage: wrap
the multi-line OR expression on its own indent level to satisfy
ktlint's paren/newline rules.
sendCachedControlsImmediately rendered cached entity state as live whenever
WebSocketState was Active, but OkHttp only transitions the socket to Closed
after a ping/read timeout (~45s) once the link drops. During that window the
panel showed hours-old values as if they were current.

- Add isEntityStateTrustworthy(WebSocketState, hasActiveNetwork) — the cache
  is only safe to display when both signals agree we are online: the WebSocket
  reports Active AND ConnectivityManager reports an active network.
- HaControlsProviderService injects NetworkHelper for a synchronous main-thread
  ConnectivityManager check, and caches the per-server WebSocketRepository in
  webSocketRepositoryCache so the main thread can read getConnectionState()
  without suspending. The cache is primed from suspend contexts
  (subscribeToEntitiesForServer and WebsocketManager.syncControlEntities), so
  the first cold-start panel open after process restart already has both
  signals available.
- When either signal says we are offline, emit STATUS_UNKNOWN placeholders
  instead of cached values; the async path replaces them once the subscription
  delivers events.
- WebSocketCoreImpl: subscribe to NetworkChangeObserver and force-close the
  live connection when ConnectivityManager reports no network. This
  short-circuits the OkHttp ping timeout and unblocks any pendingConnectDeferred
  waiting on the 30s auth timeout when the user re-enables the network.

Tests:
- New IsEntityStateTrustworthyTest covers the truth table for the helper,
  including the WebSocket-Active-but-no-network bug scenario.
- WebSocketCoreImplTest gains three cases for the network listener:
  force-close on loss, no-op when nothing is connected, and no-op on a
  capabilities-only emission while still online.
# Conflicts:
#	app/src/main/kotlin/io/homeassistant/companion/android/util/IgnoreViolationRules.kt
The Frontend connection-error screen surfaced even for transient WebView
failures (slow page load, expired token mid-load, momentary SSL glitch in
Chromium). The background WebSocket pipeline that powers the system controls
was unaffected — the user could see device controls work fine while the app
greeted them with "Failed to connect" and a Retry button on cold start.

When `FrontendViewModel.onError` receives an `UnreachableError` and both
ConnectivityManager and the per-server WebSocket report Active, we now
schedule a silent retry instead of transitioning to the Error state. The
view stays on `LoadingScreen` (which already renders for both LoadServer
and Loading) for the duration of the backoff window, so the error screen
never flashes when the retry succeeds.

- shouldAutoRetry: synchronous gate based on hasActiveNetwork() and a retry
  budget (3 attempts). Suspended ServerManager.webSocketRepository(...)
  lookups happen inside the retry coroutine after the backoff delay.
- scheduleAutoRetry: launches in viewModelScope, delays with exponential
  backoff (1s -> 2s -> 4s, capped at 5s) plus 0..500ms jitter, re-checks
  the gate strictly via isLiveConnectionTrustworthy at fire time, and bails
  to the Error state if the gate fails.
- AuthenticationError, UnknownError, and UnrecoverableError do not auto-retry.
  Their root cause will not be solved by retrying and surfacing the error
  immediately is the better UX.
- onRetry, switchServer, and the Connected -> Content transition all reset
  the budget and cancel any pending retry job. The retry job lives in
  viewModelScope so onCleared cancels it automatically.

Promoted the existing `isEntityStateTrustworthy` helper (introduced in the
prior commit on this branch for the controls layer) to a shared module-level
`isLiveConnectionTrustworthy` in `common/.../websocket/WebSocketHealth.kt`.
The helper now gates two distinct decisions — "render cached entity state as
live" and "auto-retry instead of erroring" — both of which require the same
"WebSocket Active AND ConnectivityManager reports network" predicate. The
test moved with it.

Tests in `FrontendViewModelTest.AutoRetry`:
- UnreachableError + healthy gate -> state stays out of Error, retry fires.
- UnreachableError + WebSocket Closed -> state becomes Error after one
  backoff (synchronous bail by hasActiveNetwork is not enough; the strict
  gate inside scheduleAutoRetry catches the WS state).
- UnreachableError + no network -> Error immediately, no suspend WS lookup.
- AuthenticationError -> Error immediately, no auto-retry scheduled.
- 4 consecutive UnreachableErrors -> 4th surfaces Error.
- onRetry while retry pending -> cancels job, fresh load fires.
- switchServer cancels pending retry on the previous server.
- Manual onRetry after exhaustion resets the budget for a fresh round.
- WebSocketHealth: rewrite KDoc in neutral voice, no first-person plural.
- FrontendViewModel: drop the outer Throwable catch around the auto-retry
  body. The only realistically-thrown exception (IllegalStateException
  from webSocketRepository(serverId) when the server is unknown) is
  already handled inline; anything else is unexpected and should bubble
  up to viewModelScope's CoroutineExceptionHandler instead of being
  swallowed into emitErrorState.
- FrontendViewModelTest: trim the conversational phrasing on the auto-
  retry test comments to match the rest of the file.
…fetch

refreshCameraThumbnails captured the cached entity at the top of each
iteration and reused it after prefetchThumbnail returned. The prefetch
runs a network round-trip (up to 5s on a stuck camera), and during that
window the parallel subscribeToEntitiesForServer flow can write a fresh
entity state into cachedEntities. Sending the original snapshot to the
subscriber after the prefetch then races against — and visibly overwrites —
the live control that subscribeToEntitiesForServer just delivered.

Re-read cachedEntities after the prefetch, falling back to the original
snapshot only when the cache was somehow cleared in the meantime.
# Conflicts:
#	app/src/main/kotlin/io/homeassistant/companion/android/frontend/FrontendViewModel.kt
#	app/src/main/kotlin/io/homeassistant/companion/android/util/IgnoreViolationRules.kt
#	app/src/test/kotlin/io/homeassistant/companion/android/frontend/FrontendViewModelTest.kt
@bobaoapae bobaoapae marked this pull request as ready for review April 30, 2026 17:41
Comment on lines 696 to +699
override fun onMessage(webSocket: WebSocket, text: String) {
Timber.d("Websocket: onMessage (text: ${sensitive(text)})")
// Logging huge payloads (registries can be megabytes) blocks this thread for tens of seconds
// because logcat splits the message into thousands of small chunks. Truncate aggressively.
Timber.d("Websocket: onMessage (size=${text.length} preview=${sensitive(text.take(200))})")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fun sensitive(sensitive: String): String = sensitive.takeIf { BuildConfig.DEBUG } ?: "HIDDEN"

It only impacts debug builds and it is on purpose.

Copy link
Copy Markdown
Member

@TimoPtr TimoPtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't review this PR in this state. I invite you to open a PR for each of the point you've mentioned in the description. Let's go one by one since some are going to trigger some discussion.

Thanks for the time you've spent on this changes, but it is simply not possible to review everything all together. You're touching critical parts of the application that could have huge impacts on the user we need to be extract careful with any on the changes made.

@home-assistant home-assistant Bot marked this pull request as draft May 5, 2026 09:31
@bobaoapae
Copy link
Copy Markdown
Author

Hello, thanks for the feedback. I will try to separate things, I don't know if I will be capable of that, a lot of things are correlated. But I will keep working on that and try send pull requests with feew charges by time. Otherwise this stay here as a entry point for someone else to take a look and split if interested.

I'm working on that for one or two months using the app on my own device and the usage became very good. But I sure comprehend the implications, a lot of core code changed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants