Commit b7fbc39

uts: stop the base-station feed from silently dying (#74)
* uts: stop the base-station feed from silently dying

  Two fixes for the "telemetry stops after a few random minutes" failure on the MacBook base station.

  websocket_bridge.redis_listener: wrap the pub/sub loop in a reconnect loop with health_check_interval. Previously a single Redis connection blip (idle timeout, transient Docker-bridge hiccup, Redis restart) made the listener coroutine return for good while the WebSocket server kept running: PECAN stayed connected but never received another frame, with no error surfaced. ws_relay already reconnects this way; redis_listener now matches.

  main.py: the child-process monitor only logged "Process X died!" once per second forever and never recovered. Because the parent stayed alive, neither Docker's `restart: unless-stopped` nor systemd's `Restart=always` ever saw the failure. Now a dead child tears down the surviving children and exits non-zero so the supervisor restarts the whole stack cleanly.

* Install slicks from PyPI; treat critical processes
1 parent 3f45ed6 commit b7fbc39

5 files changed

Lines changed: 104 additions & 55 deletions

server/installer/docker-compose.yml

Lines changed: 2 additions & 11 deletions
@@ -269,12 +269,8 @@ services:
     build:
       context: ./sandbox
       dockerfile: Dockerfile.sandbox
-      # slicks lives outside the build context (sibling repo at
-      # /home/ubuntu/projects/slicks), so we pass it in as an additional
-      # build context. The Dockerfile uses `COPY --from=slicks ...` to
-      # pull files from it. Override SLICKS_HOST_PATH to relocate.
-      additional_contexts:
-        - slicks=${SLICKS_HOST_PATH:-/home/ubuntu/projects/slicks}
+      # slicks is installed from PyPI (pinned in sandbox/requirements-docker.txt),
+      # so no external build context is needed.
     container_name: sandbox
     restart: unless-stopped
     environment:
@@ -289,11 +285,6 @@ services:
       TIMESCALE_TABLE: "${TIMESCALE_TABLE:-${POSTGRES_TABLE:-wfr26}}"
       TIMESCALE_SEASON: "${TIMESCALE_SEASON:-${POSTGRES_TABLE:-wfr26}}"
       POSTGRES_TABLE: "${POSTGRES_TABLE:-wfr26}"
-    volumes:
-      # slicks source (TimescaleDB-migration branch). Editable-installed at
-      # image build time; the bind mount below lets live source edits show
-      # up on the next container recreate without an image rebuild.
-      - ${SLICKS_HOST_PATH:-/home/ubuntu/projects/slicks}:/slicks_src:rw
     depends_on:
       timescaledb:
         condition: service_healthy

server/installer/sandbox/Dockerfile.sandbox

Lines changed: 2 additions & 12 deletions
@@ -33,22 +33,12 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
 # Install uv (replaces pip for dependency management)
 RUN pip install --no-cache-dir uv
 
-# Install runtime deps via uv. slicks itself is installed editable below.
+# Install runtime deps (including slicks, pinned in requirements-docker.txt)
+# from PyPI via uv.
 COPY requirements-docker.txt /tmp/requirements-docker.txt
 RUN uv pip install --system --no-cache -r /tmp/requirements-docker.txt \
     && rm -rf /root/.cache/uv
 
-# Install slicks editable. The `slicks` build context is passed in by
-# docker-compose.yml's `additional_contexts:` (lives outside this repo's
-# build context). At runtime, the docker-compose `volumes:` entry
-# bind-mounts the same host path over /slicks_src so live source edits
-# show up on the next container recreate (no image rebuild needed for
-# code-only changes; pyproject changes still need a rebuild).
-COPY --from=slicks pyproject.toml /slicks_src/pyproject.toml
-COPY --from=slicks README.md /slicks_src/README.md
-COPY --from=slicks src /slicks_src/src
-RUN uv pip install --system --no-cache -e /slicks_src
-
 # Tell Kaleido where Chromium lives
 ENV CHROME_PATH=/usr/bin/chromium
 

server/installer/sandbox/requirements-docker.txt

Lines changed: 5 additions & 5 deletions
@@ -1,10 +1,10 @@
 # Docker container requirements for sandbox execution environment
 # These are the dependencies needed inside the Docker container
-#
-# NOTE: `slicks` is NOT pinned here — it is installed editable from
-# /slicks (a host bind mount) in Dockerfile.sandbox so source edits
-# land in the running container without a pip rebuild. The published
-# 0.2.x line on PyPI is the InfluxDB backend, which is not what we want.
+
+# WFR data pipeline, installed straight from PyPI. Unpinned so the sandbox
+# always picks up the newest release (the TimescaleDB backend; the retired
+# 0.2.x InfluxDB line will never be "newest" again).
+slicks
 
 # SQL access for TimescaleDB (slicks depends on these; listing explicitly
 # in case slicks changes its extras and to keep the layer order stable)
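Because `slicks` is listed without a version, the release that ends up in the image depends on when the image was built. A minimal sketch of how a startup script could record the resolved version, assuming only the standard library; the helper name `installed_version` is hypothetical and is not part of this commit:

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper: useful for logging which release an unpinned
    requirement resolved to at image build time.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints a version string inside the sandbox image, or None elsewhere.
    print("slicks version:", installed_version("slicks"))
```

Logging this once at container start would make it possible to tell, after the fact, which release a given sandbox run was using.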

universal-telemetry-software/main.py

Lines changed: 46 additions & 5 deletions
@@ -1,4 +1,5 @@
 import os
+import sys
 import time
 import uuid
 import multiprocessing
@@ -18,6 +19,16 @@
 logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 logger = logging.getLogger("Main")
 
+# Processes that carry the live telemetry feed. If one of these dies the stack
+# is genuinely broken, so we tear everything down and exit non-zero to let the
+# supervisor (Docker `restart:`/systemd `Restart=always`) restart cleanly.
+# Auxiliary processes (TimescaleBridge, TX bridge, status server, link
+# diagnostics, video, audio, LEDs, PoE) are best-effort: if e.g. the optional
+# Timescale logging DB is unreachable, that must NOT take the live feed down —
+# restarting the whole stack on its death would only crash-loop the feed we are
+# trying to keep alive. Those are logged loudly but tolerated.
+CRITICAL_PROCESSES = {"Telemetry", "CarServices", "WebSocket"}
+
 
 def _timescale_dsn_reachable() -> bool:
     """Return True when the configured Timescale/Postgres DSN accepts a connection."""
@@ -284,11 +295,41 @@ def start_timescale_bridge():
     try:
         while True:
             time.sleep(1)
-            # Monitor children
-            for p in processes:
-                if not p.is_alive():
-                    logger.error(f"Process {p.name} died!")
-                    # Optional: Restart logic
+            # Monitor children. A dead child means the pipeline is degraded. The
+            # parent stays alive in this loop, so neither Docker's
+            # `restart: unless-stopped` nor systemd's `Restart=always` ever sees
+            # the failure and the stack silently keeps running half-dead.
+            #
+            # For a critical process (the live-feed path — telemetry / WS bridge)
+            # fail fast: tear down the surviving children and exit non-zero so the
+            # supervisor restarts the whole stack cleanly. For an auxiliary
+            # process, log loudly and stop tracking it, but keep the live feed
+            # running — nuking the stack because, say, the optional Timescale DB
+            # is down would only crash-loop the very feed we are protecting.
+            dead = [p for p in processes if not p.is_alive()]
+            if dead:
+                dead_critical = [p for p in dead if p.name in CRITICAL_PROCESSES]
+                for p in dead:
+                    level = logger.error if p in dead_critical else logger.warning
+                    fate = (
+                        "Shutting down for supervisor restart."
+                        if p in dead_critical
+                        else "Auxiliary process — live feed kept running."
+                    )
+                    level(f"Process {p.name} died (exitcode={p.exitcode}). {fate}")
+
+                if dead_critical:
+                    for p in processes:
+                        if p.is_alive():
+                            p.terminate()
+                    for p in processes:
+                        p.join(timeout=5)
+                    sys.exit(1)
+
+                # Stop tracking dead auxiliary processes so we don't re-log them
+                # every second.
+                for p in dead:
+                    processes.remove(p)
     except KeyboardInterrupt:
         logger.info("Shutting down...")
         for p in processes:
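
The critical-versus-auxiliary decision in the monitor loop above can be exercised without spawning real processes. A minimal sketch, assuming a stand-in object that exposes only `name`, `is_alive()` and `exitcode`; `ChildStub`, `triage_dead` and `monitor_step` are hypothetical names for illustration, not code from this commit:

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass
class ChildStub:
    """Hypothetical stand-in exposing the attributes the monitor reads."""
    name: str
    alive: bool
    exitcode: Optional[int] = None

    def is_alive(self) -> bool:
        return self.alive


def triage_dead(
    children: Iterable[ChildStub], critical: Set[str]
) -> Tuple[List[ChildStub], List[ChildStub]]:
    """Split dead children into (critical, auxiliary) by process name."""
    dead = [c for c in children if not c.is_alive()]
    dead_critical = [c for c in dead if c.name in critical]
    dead_auxiliary = [c for c in dead if c.name not in critical]
    return dead_critical, dead_auxiliary


def monitor_step(children: List[ChildStub], critical: Set[str]) -> Optional[int]:
    """Run one monitor tick.

    Returns an exit code to hand to the supervisor when a critical child is
    dead, or None to keep running. Dead auxiliary children are removed from
    `children` so they are reported only once.
    """
    dead_critical, dead_auxiliary = triage_dead(children, critical)
    if dead_critical:
        return 1
    for child in dead_auxiliary:
        children.remove(child)
    return None
```

With `critical = {"Telemetry", "CarServices", "WebSocket"}`, a dead `TimescaleBridge` stub is dropped from the list and `monitor_step` returns `None`, while a dead `WebSocket` stub makes it return `1`. Separating the decision from the `terminate()`/`sys.exit()` side effects is one way to make this path unit-testable.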

universal-telemetry-software/src/websocket_bridge.py

Lines changed: 49 additions & 22 deletions
@@ -1,4 +1,5 @@
 import asyncio
+import contextlib
 import redis.asyncio as redis
 import websockets
 import os
@@ -277,31 +278,57 @@ async def direct_queue_listener(queue: asyncio.Queue):
 
 
 async def redis_listener():
-    """Listens to Redis and broadcasts to all WS clients."""
-    try:
-        r = redis.from_url(REDIS_URL)
-        pubsub = r.pubsub()
-        await pubsub.subscribe(REDIS_CHANNEL, REDIS_STATS_CHANNEL, REDIS_DIAG_CHANNEL)
-        logger.info(f"Subscribed to Redis channels: {REDIS_CHANNEL}, {REDIS_STATS_CHANNEL}, {REDIS_DIAG_CHANNEL}")
+    """Listens to Redis and broadcasts to all WS clients.
+
+    Wrapped in a reconnect loop. A dropped Redis pub/sub connection — an idle
+    timeout, a transient blip on the Docker bridge network, or a Redis restart —
+    must not silently kill the data feed. Without this loop the coroutine would
+    exit on the first ConnectionError while the WebSocket server kept running:
+    PECAN stays connected but never receives another frame, so the dashboard
+    goes dead with no error visible anywhere. `health_check_interval` lets
+    redis-py detect a half-open connection instead of blocking forever in
+    listen().
+    """
+    backoff_min, backoff_max = 0.5, 10.0
+    delay = backoff_min
 
-        async for message in pubsub.listen():
+    while not shutdown_event.is_set():
+        r = None
+        try:
+            r = redis.from_url(REDIS_URL, health_check_interval=30)
+            pubsub = r.pubsub()
+            await pubsub.subscribe(REDIS_CHANNEL, REDIS_STATS_CHANNEL, REDIS_DIAG_CHANNEL)
+            logger.info(f"Subscribed to Redis channels: {REDIS_CHANNEL}, {REDIS_STATS_CHANNEL}, {REDIS_DIAG_CHANNEL}")
+            delay = backoff_min  # reset backoff once a subscribe succeeds
+
+            async for message in pubsub.listen():
+                if shutdown_event.is_set():
+                    break
+
+                if message['type'] == 'message':
+                    data = redis_utils.decode_message(message['data'])
+
+                    # Broadcast to all connected clients
+                    if connected_clients:
+                        # Create tasks for sending to each client to avoid blocking
+                        await asyncio.gather(
+                            *[client.send(data) for client in connected_clients],
+                            return_exceptions=True
+                        )
+        except asyncio.CancelledError:
+            raise
+        except Exception as e:
             if shutdown_event.is_set():
                 break
-
-            if message['type'] == 'message':
-                data = redis_utils.decode_message(message['data'])
-
-                # Broadcast to all connected clients
-                if connected_clients:
-                    # Create tasks for sending to each client to avoid blocking
-                    await asyncio.gather(
-                        *[client.send(data) for client in connected_clients],
-                        return_exceptions=True
-                    )
-    except Exception as e:
-        logger.error(f"Redis error: {e}")
-    finally:
-        logger.info("Redis listener stopping...")
+            logger.error(f"Redis listener error: {e} — reconnecting in {delay:.1f}s")
+            await asyncio.sleep(delay)
+            delay = min(delay * 2, backoff_max)
+        finally:
+            if r is not None:
+                with contextlib.suppress(Exception):
+                    await r.aclose()
+
+    logger.info("Redis listener stopping...")
 
 
 async def _handle_client_message(websocket, raw: str, redis_client):
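
The backoff schedule used by the reconnect loop above (start at 0.5 s, double after each failure, cap at 10 s) can be checked in isolation. A minimal sketch with no Redis dependency, assuming a generic `session` coroutine in place of the subscribe-and-listen body; `run_with_reconnect`, `demo` and the injectable `sleep` parameter are hypothetical and not part of this commit. One simplification to note: here the delay resets after a session ends cleanly, whereas the commit resets it as soon as the subscribe succeeds.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_with_reconnect(
    session: Callable[[], Awaitable[None]],
    stop: asyncio.Event,
    backoff_min: float = 0.5,
    backoff_max: float = 10.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> List[float]:
    """Re-run `session` until `stop` is set; return the delays that were slept."""
    slept: List[float] = []
    delay = backoff_min
    while not stop.is_set():
        try:
            await session()
            delay = backoff_min  # a session that ended cleanly resets the backoff
        except asyncio.CancelledError:
            raise  # never swallow task cancellation
        except Exception:
            if stop.is_set():
                break
            slept.append(delay)
            await sleep(delay)
            delay = min(delay * 2, backoff_max)  # double, capped at backoff_max
    return slept


async def demo() -> List[float]:
    """Fail six times in a row, then request shutdown on the seventh attempt."""
    stop = asyncio.Event()
    attempts = 0

    async def flaky_session() -> None:
        nonlocal attempts
        attempts += 1
        if attempts <= 6:
            raise ConnectionError("simulated Redis drop")
        stop.set()

    async def no_sleep(_seconds: float) -> None:
        return None  # skip real waiting so the demo finishes instantly

    return await run_with_reconnect(flaky_session, stop, sleep=no_sleep)


if __name__ == "__main__":
    print(asyncio.run(demo()))  # [0.5, 1.0, 2.0, 4.0, 8.0, 10.0]
```

Injecting `sleep` keeps the schedule observable in a test without waiting through roughly 25 seconds of real delays.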
