This fork (rocketmark/libsurvive) tracks collabora/libsurvive with patches for the Stagehand project — a PoE-powered Raspberry Pi 4B that connects Vive Trackers over USB/IP to a Windows SteamVR PC for virtual production.
These patches fix bugs discovered while running libsurvive headless on a Pi with a single Vive Tracker 3 over USB (no wireless dongle, no HMD).
Patches #1–#4 and #7–#11 are applied. Patches #5 and #6 (GSS off-by-one and flag mapping) remain reverted; the wrapper's `--globalscenesolver 3` mitigates #6.
Feb 2026 diagnostic confirmation: Instrumented agent (imu_age logging) confirmed that the dropout is USB-level: IMU stops first (~4 min), poses stop ~30s later (Kalman coasting). Patch #7 was re-applied first (NaN crash on cold start), then #2-4 (endpoint abandonment on timeout). Cold start must be re-verified after each deploy.
Status: Applied in fork via clear_halt.patch.
On Linux, the kernel usbhid driver grabs tracker interfaces on plug-in. libsurvive auto-detaches it with libusb_set_auto_detach_kernel_driver(), but the IMU endpoint (0x81) can be left in a STALL state after detach. Without IMU data the Kalman filter never produces poses.
```c
// In AttachInterface(), before libusb_submit_transfer():
libusb_clear_halt(devh, endpoint_num);
```

When `handle_transfer()` receives `LIBUSB_TRANSFER_TIMED_OUT`, upstream returns without resubmitting the transfer. The endpoint permanently stops receiving data, with no warning and no recovery. This is the primary cause of tracking dying after 1-6 minutes over USB/IP.
Planned fix: Always resubmit after timeout. Declare "device turned off" only after 10 consecutive timeouts (10 seconds of silence).
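The resubmit policy can be modeled without libusb as a small counter. The sketch below is illustrative only: `endpoint_state`, `on_transfer_done()` and the action enum are hypothetical names, not libsurvive identifiers, and it assumes the counter resets whenever a transfer delivers data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names: this models the policy only, it is not libsurvive code. */
enum { MAX_CONSECUTIVE_TIMEOUTS = 10 };

typedef enum { ACTION_RESUBMIT, ACTION_DEVICE_OFF } timeout_action;

typedef struct {
    int consecutive_timeouts; /* reset whenever a transfer delivers data */
} endpoint_state;

/* Call once per completed transfer; returns what the handler should do next. */
timeout_action on_transfer_done(endpoint_state *ep, bool timed_out) {
    assert(ep != NULL);
    if (!timed_out) {
        ep->consecutive_timeouts = 0; /* data arrived, so the streak is broken */
        return ACTION_RESUBMIT;
    }
    ep->consecutive_timeouts++;
    return ep->consecutive_timeouts >= MAX_CONSECUTIVE_TIMEOUTS
               ? ACTION_DEVICE_OFF
               : ACTION_RESUBMIT;
}
```

In a real handler, `ACTION_RESUBMIT` would correspond to calling `libusb_submit_transfer()` again, and `ACTION_DEVICE_OFF` to the existing "device turned off" path.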
The upstream error path has two bugs:
- `error_count++` appears twice on the same path (`error_count++` then `if (error_count++ < 10)`)
- After a successful `libusb_submit_transfer()` retry, the code falls through to `goto disconnect` instead of returning
Planned fix: Single increment, return after successful resubmit, goto disconnect only after 10 consecutive errors.
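To see why the double increment matters, the following sketch counts how many retries each variant permits when every transfer keeps failing. It isolates the first bug only (the fall-through to `disconnect` is not modeled), and both function names are hypothetical.

```c
#include <assert.h>

/* Upstream pattern as described above: two increments per error. */
int retries_with_double_increment(int limit) {
    assert(limit > 0);
    int error_count = 0, retries = 0;
    for (;;) {
        error_count++;
        if (error_count++ < limit) {
            retries++; /* handler would resubmit here */
            continue;
        }
        return retries; /* handler would disconnect here */
    }
}

/* Planned pattern: one increment per error, disconnect at the limit. */
int retries_with_single_increment(int limit) {
    assert(limit > 0);
    int error_count = 0, retries = 0;
    for (;;) {
        error_count++;
        if (error_count < limit) {
            retries++;
            continue;
        }
        return retries;
    }
}
```

With a limit of 10, the doubled increment exhausts the budget after 5 retries, while the single increment allows 9 retries and disconnects on the 10th consecutive error.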
When a transfer completes with LIBUSB_TRANSFER_STALL, upstream retries without clearing the halt condition — so the retry also stalls.
Planned fix: Call libusb_clear_halt() before retrying when transfer->status == LIBUSB_TRANSFER_STALL.
run_optimization() and check_object() use solve_counts > solve_count_max which allows N+1 solves instead of N. The 2nd solve (at ~7 minutes) incorporated a bad scene, causing the MPFIT error to jump from 68 to 4661 (68x), corrupting lighthouse positions and killing tracking permanently.
Planned fix: Change to solve_counts >= solve_count_max in both locations.
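The effect of the comparison operator can be checked with a small counting model. This is a sketch, not the libsurvive code: it assumes the counter starts at 0 and is incremented after each solve, it ignores the `-1` (unlimited) case, and `solves_performed()` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how many solves get through the gate over a number of attempts. */
int solves_performed(int solve_count_max, bool use_patched_check, int attempts) {
    assert(solve_count_max >= 0 && attempts >= 0);
    int solve_counts = 0;
    for (int i = 0; i < attempts; i++) {
        bool limit_hit = use_patched_check
                             ? (solve_counts >= solve_count_max)  /* planned fix */
                             : (solve_counts > solve_count_max);  /* upstream */
        if (limit_hit)
            continue; /* gate rejects this attempt */
        solve_counts++;
    }
    return solve_counts;
}
```

With a limit of 1, the `>` check lets a second solve through, whereas `>=` stops after the first.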
The upstream flag mapping flag > 1 ? flag : -1 makes --globalscenesolver 1 set solve_count_max = -1 (unlimited). Only values >= 2 are respected as actual limits.
Planned fix: Change to flag > 0 ? flag : -1 so --globalscenesolver 1 means exactly 1 solve (initial calibration only).
Note: With upstream code, --globalscenesolver 3 passes through correctly (flag > 1 → flag = 3). The wrapper now uses --globalscenesolver 3 to avoid this issue.
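The two mappings can be compared side by side. The expressions are the ones quoted above; the wrapper function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* A result of -1 means "unlimited", per the description above. */
int solve_count_max_upstream(int flag) { return flag > 1 ? flag : -1; }
int solve_count_max_patched(int flag)  { return flag > 0 ? flag : -1; }

/* Small self-check covering the cases discussed in the text. */
void check_flag_mapping(void) {
    assert(solve_count_max_upstream(1) == -1); /* 1 is treated as unlimited */
    assert(solve_count_max_patched(1) == 1);   /* 1 means exactly one solve */
    assert(solve_count_max_upstream(3) == 3);  /* 3 passes through either way */
    assert(solve_count_max_patched(3) == 3);
}
```

Both variants agree for every value of 2 or more, which is why passing `--globalscenesolver 3` sidesteps the difference.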
The jerk-model process noise scales as t^7. libsurvive warns at dt > 500ms but does not cap dt. Over USB/IP, IMU timestamp gaps cause catastrophic P matrix growth:
| dt | t^7 | Effect |
|---|---|---|
| 1ms (normal) | 1e-21 | fine |
| 50ms | 8e-10 | fine |
| 350ms | 0.0006 | variance gate triggers |
| 500ms | 0.008 | light gate triggers |
| 1s | 1.0 | NaN/Inf in filter |
Confirmed: NaN assertion crash observed on cold start: linmath.c:658: quatrotateabout: Assertion '!isnan(qout[i])' failed. This is the highest priority patch to re-apply.
Planned fix: Cap t to 50ms at the top of survive_kalman_tracker_process_noise(). State prediction still uses the real dt; only uncertainty growth (Q matrix) is bounded.
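A minimal sketch of the cap, assuming only the t^7 term matters for the illustration. `process_noise_time_factor()` and `PROCESS_NOISE_DT_CAP` are hypothetical names; the real change lives inside `survive_kalman_tracker_process_noise()`, which is not reproduced here.

```c
#include <assert.h>

#define PROCESS_NOISE_DT_CAP 0.05 /* seconds (50 ms), per the planned fix */

/* t^7 by repeated multiplication, so the sketch does not need libm. */
static double pow7(double t) {
    double r = 1.0;
    for (int i = 0; i < 7; i++)
        r *= t;
    return r;
}

/* Time-dependent factor of the process noise, with dt capped for Q only. */
double process_noise_time_factor(double dt) {
    assert(dt >= 0.0);
    double t = dt > PROCESS_NOISE_DT_CAP ? PROCESS_NOISE_DT_CAP : dt;
    return pow7(t);
}
```

A 1-second gap then contributes the same factor as a 50 ms gap (about 8e-10) instead of 1.0, while the state prediction elsewhere can keep using the real `dt`.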
Commit: 1d34e73
Moved measured_dev and cnt variable declarations to point of use to fix -Werror=missing-field-initializers style warnings. No behavioral change.
config_save() previously used fopen(path, "w") which truncates the destination file immediately, then wrote with multiple fprintf() calls. A power loss during any write left a truncated or empty config.json. On the next start libsurvive either failed to parse it or silently discarded it, forcing a full recalibration from a corrupted baseline.
Fix: Write to config.json.tmp first, then rename() into place. rename() is atomic on Linux — either the old file survives intact or the new file is fully committed, never a partial.
```c
char tmp_path[FILENAME_MAX];
snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", path);
FILE *f = fopen(tmp_path, "w");
// ...write...
fclose(f);
rename(tmp_path, path);
```

After initial calibration is established, lighthouse wake events deliver fresh OOTX data, which calls `ootx_recv()` → `set_needs_solve()` and schedules a new GSS solve. Mid-session re-solves use scene data captured during the lighthouse transition (noisy, partial) and can produce a bad calibration that corrupts tracking for the remainder of the session.
Fix: Add an early return in set_needs_solve() guarded by flushed_blind_scenes (set by patch #9's gss_flush_blind_scenes logic once the first good pose is produced). Once tracking is established, no further re-solves are triggered regardless of lighthouse events. A restart is the correct response to genuine scene changes (lighthouse moved or replaced).
```c
if (gss->flushed_blind_scenes)
    return;
```

Three source changes plus a bonus bug fix reject reflection-contaminated sensor readings and poses before they corrupt the Kalman state. Reflections (LED walls, truss, shiny floors) cause libsurvive to accept ghost poses that are geometrically consistent but physically wrong.
- Back-facing normal filter (`survive_sensor_activations.c`, `survive.h`): rejects sensor hits where the sensor surface normal points away from the lighthouse. Configurable via `--filter-normal-facingness` (default 0.0) and `--filter-normal-min-confidence` (default 0.1).
- Pose angular rate gate (`survive_kalman_tracker.c`, `survive_kalman_tracker.h`): suppresses pose emission when the implied angular rate exceeds a threshold. Disabled by default (`--kalman-max-pose-angular-rate -1`); requires calibration from `reflect_test.cap` data before enabling (expected value: 5–10 rad/s).
- Configurable sync cluster window (`survive_sensor_activations.c`, `survive.h`): makes the 0.5s Chauvenet cluster window configurable via `--sync-cluster-window` (default 0.5; tighten to 0.15 for more reactive outlier detection).
- quatdist bug fix (`redist/linmath.c`): pre-existing bug where swapped min/max clamp args caused `quatdist()` to always return 0. Fixed: `linmath_max(1., linmath_min(-1, rtn))` → `linmath_min(1., linmath_max(-1., rtn))`. This made the angular rate gate silently inert before the fix was applied. Candidate for upstream PR to `collabora/libsurvive`.
Full details: docs/reflection-rejection.md
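The quatdist clamp bug is easy to reproduce in isolation. In this sketch `min_d()` and `max_d()` are local stand-ins for the `linmath_min` / `linmath_max` helpers, and the function names are hypothetical. It assumes the clamped value is then passed to an arccosine, in which case a constant 1 yields a distance of 0.

```c
#include <assert.h>

/* Local stand-ins for the linmath_min / linmath_max helpers named above. */
static double min_d(double a, double b) { return a < b ? a : b; }
static double max_d(double a, double b) { return a > b ? a : b; }

/* Swapped form quoted above: min(-1, x) is at most -1, so max(1, ...) is always 1. */
double clamp_swapped(double x) { return max_d(1.0, min_d(-1.0, x)); }

/* Corrected form: limits x to the interval [-1, 1]. */
double clamp_fixed(double x) { return min_d(1.0, max_d(-1.0, x)); }

/* Small self-check showing the difference for an in-range input. */
void check_clamps(void) {
    assert(clamp_swapped(0.3) == 1.0); /* input is discarded */
    assert(clamp_fixed(0.3) == 0.3);   /* input is preserved */
}
```

Because the swapped form discards its input, any gate built on the resulting distance would see no rotation between poses, which matches the inert behavior described above.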
Ten property test suites added in src/test_cases/:
| File | What it tests |
|---|---|
| `quat_props.c` | Quaternion math (normalization, rotation, slerp, quatdist) |
| `kabsch_props.c` | Kabsch algorithm (rigid body alignment) |
| `kalman_props.c` | Kalman filter properties (covariance, prediction) |
| `numeric_props.c` | Numerical utilities (matrix ops, SVD, sync cluster window) |
| `reproject_props.c` | Lighthouse reprojection model |
| `reproject_residual_props.c` | Reprojection residual calculations |
| `event_queue_props.c` | Event queue data structure |
| `residual_cascade_props.c` | Light error threshold cascade (currently disabled path) |
| `variance_gate_props.c` | Variance gate behavior, IMU gap sensitivity |
| `normal_filter_props.c` | Back-facing normal filter geometry (reflection rejection) |
CI workflow: .github/workflows/ci-property-tests.yml
Documentation: docs/property-tests.md, docs/reflection-rejection.md
Upstream CI workflows (cmake, docker, nuget, wheels, publish-source) were removed and replaced with ci-property-tests.yml for the property test suite.
| # | Patch | Applied? | Confirmed? | Re-apply priority |
|---|---|---|---|---|
| 1 | Clear halt on attach | YES | YES | — |
| 2 | Timeout resubmit | APPLIED | YES (1-6 min dropout) | — |
| 3 | Error handler fix | APPLIED | YES (code review) | — |
| 4 | STALL clear_halt | APPLIED | YES (code review) | — |
| 5 | GSS off-by-one | REVERTED | Confirmed corruption 60→15198→963755/meas; patch broke tracking, needs investigation | investigate |
| 6 | GSS flag mapping | REVERTED | flag=1 broke tracking (OOTX stuck); needs investigation | investigate |
| 7 | Process noise dt cap | APPLIED | YES (NaN crash on cold start) | — |
| 8 | Compiler warning | YES | YES | — |
| 9 | Reflection artifact rejection | APPLIED | YES (field confirmed) | — |
| 10 | Atomic config write | APPLIED | Pending deploy | — |
| 11 | Lock GSS after first tracking | APPLIED | Pending deploy | — |
Patches should be re-applied one at a time. After each patch, deploy to Pi and verify cold start calibration still succeeds (both lighthouses detected, OOTX decoded, GSS solves, tracking goes green).
- Patch #7 (dt cap): prevents the NaN crash on cold start, confirmed by assertion failure
- Patches #2–4 (USB transfer handler): prevent the 1-6 minute dropout over USB/IP
- Patches #5–6 (GSS): prevent the 7-minute re-solve corruption
- Cold start calibration is fast (~20s) when it works: OOTX decode takes ~18s, GSS solves in ~1s after that. The stuck-yellow failures were caused by the agent's 120s init timeout and the NaN crash, not slow calibration.
- Test patches in isolation. When multiple patches interact (GSS flag mapping + agent init timeout + NaN crash), failures are hard to attribute.
```sh
mkdir build && cd build
cmake .. -DUSE_HIDAPI=OFF
make -j4
```

For property tests:

```sh
cmake .. -DUSE_HIDAPI=OFF -DENABLE_TESTS=ON
make -j4
ctest --output-on-failure
```

The stagehand agent wrapper passes `--globalscenesolver 3` by default. With upstream GSS code (patches #5–6 reverted), this correctly limits to 3 solves. With patches #5–6 applied, any value >= 1 works as expected.