chore(depot): fault injection tests#4868
chore(depot): fault injection tests#4868NathanFlurry wants to merge 18 commits into04-29-feat_sqlite_pitr_forkingfrom
Conversation
|
🚅 Deployed to the rivet-pr-4868 environment in rivet-frontend
|
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
Code Review: chore(depot): fault injection testsOverview: This is a spec-only PR adding a detailed design document for a fault injection test system targeting the SQLite depot/VFS stack. No implementation code is included — purely The spec is impressively thorough: phased rollout, fault point enumeration per subsystem, invariant lists, anti-bullshit harness requirements, and a native SQLite oracle comparison pattern. The architecture is sound. Spec QualityStrengths:
Issues and Concerns1. The spec proposes: #[cfg(feature = "test-faults")]
#[serde(default)]
pub disable_planning_timers: bool,
2. Open questions left unresolved (low — but flag for impl) Two open questions are noted at the end:
These don't block the spec, but question #2 has a direct impact on Phase 1 scope: if several mock-backed tests cover real VFS behavior that isn't otherwise tested, deleting them without rewriting creates coverage regression. It would be worth making a conservative default explicit: "rewrite unless there is no real-behavior coverage loss." 3. Production-leak verification is underspecified for CI (low) The spec says:
This is the right check, but "should include" is weak. To actually enforce the invariant, this check should be wired into CI (e.g., as a step in 4. Step 5 of
But steps 3–4 reopen through a fresh 5. The spec says resync is allowed only after a canonical dump "exactly matches either the old or fully-new oracle state." However, the oracle resync path is called after reload — at which point the Rivet VFS state is the freshly reloaded depot state. If there is a bug where Rivet SQLite ends up in a state that is neither old nor new (partial write), the verification should classify it as a hard failure, not fall through to resync. The spec implies this but doesn't make it a hard assertion rule. Worth stating explicitly: "any dump that does not exactly match old or new is a test failure, not a resync trigger." Minor Notes
SummaryThe spec is well-designed and production-of-this-work is clearly well-understood. The main flag before implementation starts is item #1 (wire-format safety of |

Description
Please include a summary of the changes and the related issue. Please also include relevant motivation and context.
Type of change
How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
Checklist: