Skip to content

chore(depot): fault injection tests#4868

Draft
NathanFlurry wants to merge 18 commits into04-29-feat_sqlite_pitr_forkingfrom
05-01-chore_depot_fault_injection_tests
Draft

chore(depot): fault injection tests#4868
NathanFlurry wants to merge 18 commits into04-29-feat_sqlite_pitr_forkingfrom
05-01-chore_depot_fault_injection_tests

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented May 2, 2026

🚅 Deployed to the rivet-pr-4868 environment in rivet-frontend

Service Status Web Updated (UTC)
kitchen-sink ❌ Build Failed (View Logs) Web May 2, 2026 at 7:14 am
frontend-cloud 😴 Sleeping (View Logs) Web May 2, 2026 at 3:52 am
website 😴 Sleeping (View Logs) Web May 2, 2026 at 3:52 am
ladle ✅ Success (View Logs) Web May 2, 2026 at 3:44 am
mcp-hub ✅ Success (View Logs) Web May 2, 2026 at 3:42 am
frontend-inspector ❌ Build Failed (View Logs) Web May 2, 2026 at 3:42 am

Copy link
Copy Markdown
Member Author

NathanFlurry commented May 2, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude Bot commented May 2, 2026

Code Review: chore(depot): fault injection tests

Overview: This is a spec-only PR adding a detailed design document for a fault injection test system targeting the SQLite depot/VFS stack. No implementation code is included — purely .agent/specs/, a depot/CLAUDE.md update, and a new fault/CLAUDE.md in the sqlite test directory. The draft state is appropriate.

The spec is impressively thorough: phased rollout, fault point enumeration per subsystem, invariant lists, anti-bullshit harness requirements, and a native SQLite oracle comparison pattern. The architecture is sound.


Spec Quality

Strengths:

  • The FaultBoundary classification (PreDurableCommit, AmbiguousAfterDurableCommit, PostDurableNonData, ReadOnly, WorkflowOnly) is exactly the right framing for verifying ambiguous post-durable-commit semantics.
  • The "Anti-Bullshit Requirements" section is excellent — requiring unfired faults to fail tests and poisoned-mirror smoke tests prevents the harness from silently validating nothing.
  • Requiring every ambiguous fault point to declare ambiguity in the enum (not inferred from test failure) is a critical correctness constraint.
  • Phased rollout is sensible: cleanup first (Phase 1), then workflow harness, then fault injection, then verification, then coverage expansion.
  • chaos.rs being feature-gated or #[ignore]-d by default is correct for CI.

Issues and Concerns

1. DbManagerInput.disable_planning_timers and wire-format stability (medium)

The spec proposes:

#[cfg(feature = "test-faults")]
#[serde(default)]
pub disable_planning_timers: bool,

DbManagerInput is a Gasoline workflow input, meaning it is serialized and stored durably. A #[cfg(feature = "test-faults")] field that compiles away in production creates an implicit wire-format divergence: a test build encodes the field; a production build drops it. #[serde(default)] handles deserializing the field in production (it will be ignored), but CLAUDE.md mandates vbare for persisted/wire-format encoding, which does not support optional fields via #[serde(default)] — BARE is positional, not self-describing. If DbManagerInput uses vbare, removing the field in production breaks the encoding. The spec should explicitly address this — either: (a) use the JSON Gasoline input path (which is self-describing and handles missing keys), or (b) gate the test-faults manager on a separate DbManagerInput type that is never persisted in production.

2. Open questions left unresolved (low — but flag for impl)

Two open questions are noted at the end:

  • Whether Db::commit should reject pgno > db_size_pages dirty pages.
  • Which mock-backed VFS tests are real-behavior tests to rewrite vs delete.

These don't block the spec, but question #2 has a direct impact on Phase 1 scope: if several mock-backed tests cover real VFS behavior that isn't otherwise tested, deleting them without rewriting creates coverage regression. It would be worth making a conservative default explicit: "rewrite unless there is no real-behavior coverage loss."

3. Production-leak verification is underspecified for CI (low)

The spec says:

verification should include cargo check -p depot --release without test-faults

This is the right check, but "should include" is weak. To actually enforce the invariant, this check should be wired into CI (e.g., as a step in publish.yaml or a pre-submit check). The spec doesn't say where or how it gets enforced. Worth adding to Phase 4 as an explicit CI gate rather than a developer responsibility.

4. reload_database cache eviction ordering (low)

Step 5 of reload_database says:

Evict the actor's cached depot Db from DirectStorage.

But steps 3–4 reopen through a fresh NativeDatabase and SqliteTransport::from_direct before evicting the old Db cache. There's a window where the fresh transport could reuse the stale cached Db. The ordering should be: evict first (step 4), then reopen (step 5), to avoid any chance of the fresh transport inheriting the old in-memory state.

5. ctx.verify_against_native_oracle() oracle resync boundary (low)

The spec says resync is allowed only after a canonical dump "exactly matches either the old or fully-new oracle state." However, the oracle resync path is called after reload — at which point the Rivet VFS state is the freshly reloaded depot state. If there is a bug where Rivet SQLite ends up in a state that is neither old nor new (partial write), the verification should classify it as a hard failure, not fall through to resync. The spec implies this but doesn't make it a hard assertion rule. Worth stating explicitly: "any dump that does not exactly match old or new is a test failure, not a resync trigger."


Minor Notes

  • The spec correctly places design documents in .agent/specs/ per CLAUDE.md. ✓
  • engine/packages/depot/CLAUDE.md addition is a single bullet point per convention. ✓
  • rivetkit-rust/packages/rivetkit-sqlite/tests/inline/fault/CLAUDE.md entries are concise one-liners. ✓
  • The Rust code examples use hard tabs consistent with rustfmt.toml. ✓
  • The spec correctly avoids doc comments on what the code does vs. why. ✓
  • Phase ordering is correct: cleaning up mock paths before adding new fault paths prevents the fault harness from accidentally exercising mock code. ✓

Summary

The spec is well-designed and production-of-this-work is clearly well-understood. The main flag before implementation starts is item #1 (wire-format safety of disable_planning_timers with vbare). Items #2–5 are low-severity notes to consider during phase execution. Recommend resolving the vbare/serde question in the spec before Phase 2 implementation begins.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant