Stagehand Patches to libsurvive

This fork (rocketmark/libsurvive) tracks collabora/libsurvive with patches for the Stagehand project — a PoE-powered Raspberry Pi 4B that connects Vive Trackers over USB/IP to a Windows SteamVR PC for virtual production.

These patches fix bugs discovered while running libsurvive headless on a Pi with a single Vive Tracker 3 over USB (no wireless dongle, no HMD).

Current State

Patches #1, #7, #2, #3, #4, and #8 are applied. Patches #5 and #6 (GSS off-by-one and flag mapping) remain reverted — the wrapper's --globalscenesolver 3 mitigates #6.

Feb 2026 diagnostic confirmation: Instrumented agent (imu_age logging) confirmed that the dropout is USB-level: IMU stops first (~4 min), poses stop ~30s later (Kalman coasting). Patch #7 was re-applied first (NaN crash on cold start), then #2-4 (endpoint abandonment on timeout). Cold start must be re-verified after each deploy.

Bug Fix Patches

1. Clear stalled endpoints on attach (`driver_vive.c`) — APPLIED

Status: Applied in fork via clear_halt.patch.

On Linux, the kernel usbhid driver grabs tracker interfaces on plug-in. libsurvive auto-detaches it with libusb_set_auto_detach_kernel_driver(), but the IMU endpoint (0x81) can be left in a STALL state after detach. Without IMU data the Kalman filter never produces poses.

// In AttachInterface(), before libusb_submit_transfer():
libusb_clear_halt(devh, endpoint_num);

2. Transfer timeout handler silently abandons endpoint (`driver_vive.libusb.h`) — REVERTED

When handle_transfer() receives LIBUSB_TRANSFER_TIMED_OUT, upstream returns without resubmitting the transfer. The endpoint permanently stops receiving data — no warning, no recovery. This is the primary cause of tracking dying after 1-6 minutes over USB/IP.

Planned fix: Always resubmit after timeout. Declare "device turned off" only after 10 consecutive timeouts (10 seconds of silence).

3. Transfer error handler double-increments and falls through (`driver_vive.libusb.h`) — REVERTED

The upstream error path has two bugs:

error_count++ appears twice on the same path (error_count++ then if (error_count++ < 10))
After a successful libusb_submit_transfer() retry, the code falls through to goto disconnect instead of returning

Planned fix: Single increment, return after successful resubmit, goto disconnect only after 10 consecutive errors.

4. No `libusb_clear_halt()` on STALL errors (`driver_vive.libusb.h`) — REVERTED

When a transfer completes with LIBUSB_TRANSFER_STALL, upstream retries without clearing the halt condition — so the retry also stalls.

Planned fix: Call libusb_clear_halt() before retrying when transfer->status == LIBUSB_TRANSFER_STALL.

5. Global scene solver off-by-one allows extra solve (`driver_global_scene_solver.c`) — REVERTED

run_optimization() and check_object() use solve_counts > solve_count_max which allows N+1 solves instead of N. The 2nd solve (at ~7 minutes) incorporated a bad scene, causing the MPFIT error to jump from 68 to 4661 (68x), corrupting lighthouse positions and killing tracking permanently.

Planned fix: Change to solve_counts >= solve_count_max in both locations.

6. Global scene solver flag=1 means unlimited (`driver_global_scene_solver.c`) — REVERTED

The upstream flag mapping flag > 1 ? flag : -1 makes --globalscenesolver 1 set solve_count_max = -1 (unlimited). Only values >= 2 are respected as actual limits.

Planned fix: Change to flag > 0 ? flag : -1 so --globalscenesolver 1 means exactly 1 solve (initial calibration only).

Note: With upstream code, --globalscenesolver 3 passes through correctly (flag > 1 → flag = 3). The wrapper now uses --globalscenesolver 3 to avoid this issue.

7. Process noise t^7 explosion on IMU timing gaps (`survive_kalman_tracker.c`) — REVERTED

The jerk-model process noise scales as t^7. libsurvive warns at dt > 500ms but does not cap dt. Over USB/IP, IMU timestamp gaps cause catastrophic P matrix growth:

dt	t^7	Effect
1ms (normal)	1e-21	fine
50ms	8e-10	fine
350ms	0.006	variance gate triggers
500ms	0.008	light gate triggers
1s	1.0	NaN/Inf in filter

Confirmed: NaN assertion crash observed on cold start: linmath.c:658: quatrotateabout: Assertion '!isnan(qout[i])' failed. This is the highest priority patch to re-apply.

Planned fix: Cap t to 50ms at the top of survive_kalman_tracker_process_noise(). State prediction still uses the real dt; only uncertainty growth (Q matrix) is bounded.

8. Compiler warning fix (`survive_sensor_activations.c`) — APPLIED

Commit: 1d34e73

Moved measured_dev and cnt variable declarations to point of use to fix -Werror=missing-field-initializers style warnings. No behavioral change.

10. Atomic config write (`survive_config.c`) — APPLIED

config_save() previously used fopen(path, "w") which truncates the destination file immediately, then wrote with multiple fprintf() calls. A power loss during any write left a truncated or empty config.json. On the next start libsurvive either failed to parse it or silently discarded it, forcing a full recalibration from a corrupted baseline.

Fix: Write to config.json.tmp first, then rename() into place. rename() is atomic on Linux — either the old file survives intact or the new file is fully committed, never a partial.

char tmp_path[FILENAME_MAX];
snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", path);
FILE *f = fopen(tmp_path, "w");
// ...write...
fclose(f);
rename(tmp_path, path);

11. Lock GSS solver after first tracking (`driver_global_scene_solver.c`) — APPLIED

After initial calibration is established, lighthouse wake events deliver fresh OOTX data which calls ootx_recv() → set_needs_solve() → schedules a new GSS solve. Mid-session re-solves use scene data captured during the lighthouse transition (noisy, partial) and can produce a bad calibration that corrupts tracking for the remainder of the session.

Fix: Add an early return in set_needs_solve() guarded by flushed_blind_scenes (set by patch #9's gss_flush_blind_scenes logic once the first good pose is produced). Once tracking is established, no further re-solves are triggered regardless of lighthouse events. A restart is the correct response to genuine scene changes (lighthouse moved or replaced).

if (gss->flushed_blind_scenes)
    return;

9. Reflection artifact rejection — APPLIED

Three source changes plus a bonus bug fix to reject reflection-contaminated sensor readings and poses before they corrupt the Kalman state. Reflections (LED walls, truss, shiny floors) cause libsurvive to accept ghost poses that are geometrically consistent but physically wrong.

Back-facing normal filter (survive_sensor_activations.c, survive.h): rejects sensor hits where the sensor surface normal points away from the lighthouse. Configurable via --filter-normal-facingness (default 0.0) and --filter-normal-min-confidence (default 0.1).
Pose angular rate gate (survive_kalman_tracker.c, survive_kalman_tracker.h): suppresses pose emission when the implied angular rate exceeds a threshold. Disabled by default (--kalman-max-pose-angular-rate -1); requires calibration from reflect_test.cap data before enabling (expected value: 5–10 rad/s).
Configurable sync cluster window (survive_sensor_activations.c, survive.h): makes the 0.5s Chauvenet cluster window configurable via --sync-cluster-window (default 0.5; tighten to 0.15 for more reactive outlier detection).
quatdist bug fix (redist/linmath.c): pre-existing bug where swapped min/max clamp args caused quatdist() to always return 0. Fixed: linmath_max(1., linmath_min(-1, rtn)) → linmath_min(1., linmath_max(-1., rtn)). This made the angular rate gate silently inert before the fix was applied. Candidate for upstream PR to collabora/libsurvive.

Full details: docs/reflection-rejection.md

Property Tests (new files)

Ten property test suites added in src/test_cases/:

File	What it tests
`quat_props.c`	Quaternion math (normalization, rotation, slerp, quatdist)
`kabsch_props.c`	Kabsch algorithm (rigid body alignment)
`kalman_props.c`	Kalman filter properties (covariance, prediction)
`numeric_props.c`	Numerical utilities (matrix ops, SVD, sync cluster window)
`reproject_props.c`	Lighthouse reprojection model
`reproject_residual_props.c`	Reprojection residual calculations
`event_queue_props.c`	Event queue data structure
`residual_cascade_props.c`	Light error threshold cascade (currently disabled path)
`variance_gate_props.c`	Variance gate behavior, IMU gap sensitivity
`normal_filter_props.c`	Back-facing normal filter geometry (reflection rejection)

CI workflow: .github/workflows/ci-property-tests.yml Documentation: docs/property-tests.md, docs/reflection-rejection.md

CI Changes

Upstream CI workflows (cmake, docker, nuget, wheels, publish-source) were removed and replaced with ci-property-tests.yml for the property test suite.

Patch Status

#	Patch	Applied?	Confirmed?	Re-apply priority
1	Clear halt on attach	YES	YES	—
2	Timeout resubmit	APPLIED	YES (1-6 min dropout)	—
3	Error handler fix	APPLIED	YES (code review)	—
4	STALL clear_halt	APPLIED	YES (code review)	—
5	GSS off-by-one	REVERTED	Confirmed corruption 60→15198→963755/meas; patch broke tracking, needs investigation	investigate
6	GSS flag mapping	REVERTED	flag=1 broke tracking (OOTX stuck); needs investigation	investigate
7	Process noise dt cap	APPLIED	YES (NaN crash on cold start)	—
8	Compiler warning	YES	YES	—
9	Reflection artifact rejection	APPLIED	YES (field confirmed)	—
10	Atomic config write	APPLIED	Pending deploy	—
11	Lock GSS after first tracking	APPLIED	Pending deploy	—

Re-apply Order

Patches should be re-applied one at a time. After each patch, deploy to Pi and verify cold start calibration still succeeds (both lighthouses detected, OOTX decoded, GSS solves, tracking goes green).

Patch #7 (dt cap) — prevents NaN crash on cold start, confirmed by assertion failure
Patches #2–4 (USB transfer handler) — prevents 1-6 minute dropout over USB/IP
Patches #5–6 (GSS) — prevents 7-minute re-solve corruption

Lessons Learned

Cold start calibration is fast (~20s) when it works: OOTX decode takes ~18s, GSS solves in ~1s after that. The stuck-yellow failures were caused by the agent's 120s init timeout and the NaN crash, not slow calibration.
Test patches in isolation. When multiple patches interact (GSS flag mapping + agent init timeout + NaN crash), failures are hard to attribute.

Build

mkdir build && cd build
cmake .. -DUSE_HIDAPI=OFF
make -j4

For property tests:

cmake .. -DUSE_HIDAPI=OFF -DENABLE_TESTS=ON
make -j4
ctest --output-on-failure

Runtime

The stagehand agent wrapper passes --globalscenesolver 3 by default. With upstream GSS code (patches #5–6 reverted), this correctly limits to 3 solves. With patches #5–6 applied, any value >= 1 works as expected.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stagehand Patches to libsurvive

Current State

Bug Fix Patches

1. Clear stalled endpoints on attach (`driver_vive.c`) — APPLIED

2. Transfer timeout handler silently abandons endpoint (`driver_vive.libusb.h`) — REVERTED

3. Transfer error handler double-increments and falls through (`driver_vive.libusb.h`) — REVERTED

4. No `libusb_clear_halt()` on STALL errors (`driver_vive.libusb.h`) — REVERTED

5. Global scene solver off-by-one allows extra solve (`driver_global_scene_solver.c`) — REVERTED

6. Global scene solver flag=1 means unlimited (`driver_global_scene_solver.c`) — REVERTED

7. Process noise t^7 explosion on IMU timing gaps (`survive_kalman_tracker.c`) — REVERTED

8. Compiler warning fix (`survive_sensor_activations.c`) — APPLIED

10. Atomic config write (`survive_config.c`) — APPLIED

11. Lock GSS solver after first tracking (`driver_global_scene_solver.c`) — APPLIED

9. Reflection artifact rejection — APPLIED

Property Tests (new files)

CI Changes

Patch Status

Re-apply Order

Lessons Learned

Build

Runtime

FilesExpand file tree

STAGEHAND_PATCHES.md

Latest commit

History

STAGEHAND_PATCHES.md

File metadata and controls

Stagehand Patches to libsurvive

Current State

Bug Fix Patches

1. Clear stalled endpoints on attach (driver_vive.c) — APPLIED

2. Transfer timeout handler silently abandons endpoint (driver_vive.libusb.h) — REVERTED

3. Transfer error handler double-increments and falls through (driver_vive.libusb.h) — REVERTED

4. No libusb_clear_halt() on STALL errors (driver_vive.libusb.h) — REVERTED

5. Global scene solver off-by-one allows extra solve (driver_global_scene_solver.c) — REVERTED

6. Global scene solver flag=1 means unlimited (driver_global_scene_solver.c) — REVERTED

7. Process noise t^7 explosion on IMU timing gaps (survive_kalman_tracker.c) — REVERTED

8. Compiler warning fix (survive_sensor_activations.c) — APPLIED

10. Atomic config write (survive_config.c) — APPLIED

11. Lock GSS solver after first tracking (driver_global_scene_solver.c) — APPLIED

9. Reflection artifact rejection — APPLIED

Property Tests (new files)

CI Changes

Patch Status

Re-apply Order

Lessons Learned

Build

Runtime

1. Clear stalled endpoints on attach (`driver_vive.c`) — APPLIED

2. Transfer timeout handler silently abandons endpoint (`driver_vive.libusb.h`) — REVERTED

3. Transfer error handler double-increments and falls through (`driver_vive.libusb.h`) — REVERTED

4. No `libusb_clear_halt()` on STALL errors (`driver_vive.libusb.h`) — REVERTED

5. Global scene solver off-by-one allows extra solve (`driver_global_scene_solver.c`) — REVERTED

6. Global scene solver flag=1 means unlimited (`driver_global_scene_solver.c`) — REVERTED

7. Process noise t^7 explosion on IMU timing gaps (`survive_kalman_tracker.c`) — REVERTED

8. Compiler warning fix (`survive_sensor_activations.c`) — APPLIED

10. Atomic config write (`survive_config.c`) — APPLIED

11. Lock GSS solver after first tracking (`driver_global_scene_solver.c`) — APPLIED