Draft
27 commits
9f16fec
chore(depot): fault injection tests
NathanFlurry May 2, 2026
e12410f
feat: US-001 - Remove mock VFS transport path
NathanFlurry May 2, 2026
9d11cad
feat: US-002 - Add strict DirectStorage test mode
NathanFlurry May 2, 2026
6d8690c
feat: US-003 - Add depot test-faults feature shell
NathanFlurry May 2, 2026
ccb26e4
feat: US-004 - Add depot fault controller API
NathanFlurry May 2, 2026
81c2fc9
feat: US-005 - Add forced compaction test driver
NathanFlurry May 2, 2026
b893198
feat: US-006 - Add SQLite fault scenario harness
NathanFlurry May 2, 2026
72bb9a6
feat: US-007 - Add native SQLite oracle verification
NathanFlurry May 2, 2026
806f9e3
feat: US-008 - Add depot invariant scanner
NathanFlurry May 2, 2026
bd3c4e3
feat: US-009 - Add commit fault hooks
NathanFlurry May 2, 2026
ac6a5c0
feat: US-010 - Add read and cold-tier fault hooks
NathanFlurry May 2, 2026
4ae46ad
feat: US-011 - Add compaction and reclaim fault hooks
NathanFlurry May 2, 2026
b77eaa0
feat: US-012 - Add simple SQLite depot fault tests
NathanFlurry May 2, 2026
11c2a09
feat: US-013 - Add chaos fault test suite
NathanFlurry May 2, 2026
e5351b0
feat: US-014 - Add production fault-leak checks
NathanFlurry May 2, 2026
db0fe23
feat: US-015 - Fail VFS reload on real depot read errors
NathanFlurry May 2, 2026
76ff17d
feat: US-016 - Start fault workloads in strict DirectStorage mode
NathanFlurry May 2, 2026
f0f0390
feat: US-017 - Wire workflow cold tier to fault controller
NathanFlurry May 2, 2026
27cb20a
feat: US-018 - Separate verifier reads from workload fault accounting
NathanFlurry May 2, 2026
233a938
feat: US-019 - Remove handcrafted cold refs from end-to-end coverage
NathanFlurry May 2, 2026
24f251e
feat: US-020 - Classify ambiguous post-commit outcomes
NathanFlurry May 2, 2026
1e54a0d
feat: US-021 - Verify cold refs contain referenced pages
NathanFlurry May 2, 2026
eb463f6
feat: US-022 - Fix strict cold-tier read evidence
NathanFlurry May 2, 2026
8daa771
feat: US-023 - Add table-driven high-risk fault matrix
NathanFlurry May 2, 2026
82c91d8
feat: US-024 - Add heavier SQLite VFS fault workloads
NathanFlurry May 2, 2026
0adde66
feat: US-025 - Upgrade chaos suite beyond smoke coverage
NathanFlurry May 2, 2026
f9ba239
feat: US-026 - Rerun full fault suite and capture remaining issues
NathanFlurry May 2, 2026
828 changes: 828 additions & 0 deletions .agent/specs/sqlite-depot-fault-injection.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions CLAUDE.md
@@ -106,6 +106,7 @@ docker-compose up -d

- RivetKit SQLite is native-only: VFS and query execution live in `rivetkit-rust/packages/rivetkit-sqlite/`, core owns lifecycle, and NAPI only marshals JS types.
- SQLite VFS direct tests should record workflow compaction wakes through `CompactionSignaler`, not call legacy `compact_default_batch`.
- SQLite VFS correctness tests should use `DirectStorage`; do not reintroduce mock or envoy transport variants for those tests.
- SQLite VFS `xSync` durability depends on depot's `sqlite_commit` reply waiting for the FDB transaction commit.
- SQLite VFS process-global registrations must be owned by a Drop guard so panics unwind through `sqlite3_vfs_unregister`.
- `NativeDatabase::Drop` must bound dirty-page flushes with a short timeout and return after logging if the commit future never resolves.
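The Drop-guard rule for process-global VFS registrations can be sketched with the standard library alone. This is a minimal illustration, not the rivetkit-sqlite implementation: `REGISTERED`, `VfsRegistration`, `is_registered`, and `registered_after_panic` are hypothetical names, and the static list stands in for SQLite's global VFS table, where a real guard would call `sqlite3_vfs_unregister`.

```rust
use std::panic;
use std::sync::Mutex;

// Hypothetical stand-in for SQLite's process-global VFS list.
static REGISTERED: Mutex<Vec<String>> = Mutex::new(Vec::new());

/// Owns one registration; dropping the guard removes the name again.
struct VfsRegistration {
    name: String,
}

impl VfsRegistration {
    fn register(name: &str) -> Self {
        REGISTERED
            .lock()
            .unwrap_or_else(|e| e.into_inner())
            .push(name.to_string());
        VfsRegistration {
            name: name.to_string(),
        }
    }
}

impl Drop for VfsRegistration {
    fn drop(&mut self) {
        // Runs on normal scope exit and while a panic unwinds the stack.
        // A real guard would call sqlite3_vfs_unregister here.
        let mut registered = REGISTERED.lock().unwrap_or_else(|e| e.into_inner());
        registered.retain(|n| n != &self.name);
    }
}

fn is_registered(name: &str) -> bool {
    REGISTERED
        .lock()
        .unwrap_or_else(|e| e.into_inner())
        .iter()
        .any(|n| n == name)
}

/// Registers a VFS, panics while it is live, and reports whether the
/// name is still registered once the panic has been caught.
fn registered_after_panic(name: &str) -> bool {
    let owned = name.to_string();
    let result = panic::catch_unwind(move || {
        let _guard = VfsRegistration::register(&owned);
        panic!("simulated failure while the VFS is registered");
    });
    assert!(result.is_err());
    is_registered(name)
}
```

Because cleanup lives in `Drop`, the caller never needs a separate unregister call on the error path; the unwinding panic releases the registration by itself.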
6 changes: 6 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

11 changes: 10 additions & 1 deletion engine/packages/depot/CLAUDE.md
@@ -44,6 +44,7 @@ These come from `r2-prior-art/.agent/research/sqlite/requirements.md` and supers
- **Workflow compaction read fallback preserves hot-tier order.** PIDX/DELTA wins, branch SHARD fallback is next, and `CMP/cold_shard` refs are considered only before legacy cold manifest layers.
- **Conveyer read domains live behind the `conveyer/read.rs` facade.** Keep read planning, PIDX/cache helpers, SHARD fallback, cold-tier reads, and transaction scan helpers under `conveyer/read/*.rs`.
- **Sparse in-range reads zero-fill only when no source exists.** Corrupted or broken source blobs must return an explicit read error, not a zero page.
- **PIDX-owned DELTA gaps are broken source coverage.** Missing delta chunks may fall back only to valid SHARD or cold coverage; otherwise return `ShardCoverageMissing` or a decode error.
- **Compaction cold-shard reads must revalidate live refs.** After fetching cold bytes, re-read the `CMP/cold_shard` ref under `Serializable`; missing or changed refs return `ShardCoverageMissing`, not sparse zero-fill.
- **Read-path tests should seed branch-owned storage.** Use `Db::commit` or `BR/{branch}` keys, not pre-PITR database-scoped `META`/`PIDX`/`DELTA`/`SHARD` keys.
- **Conveyer branch domains live behind the `conveyer/branch.rs` facade.** Keep branch resolution, bucket catalog/list/delete, fork/derive, lifecycle rollback, and shared branch helpers under `conveyer/branch/*.rs`.
@@ -55,6 +56,7 @@ These come from `r2-prior-art/.agent/research/sqlite/requirements.md` and supers
- **Shard-cache fill idle waits pre-arm `Notify` before checking `outstanding`.** `notify_waiters()` does not store permits.
- **Shard-cache metrics use fixed outcome labels only.** Keep branch, database, shard, object, restore point, and bucket identifiers out of metric labels.
- **Workflow compaction integration tests configure filesystem cold storage through `TestCtx`.** Do not use runtime global cold-tier overrides for integration coverage.
- **Fault scenarios install workflow cold-tier test overrides by branch id.** Use `test_hooks::install_workflow_cold_tier_for_test` only when the workflow must share a fault-controller-backed tier with VFS reads.
- **Truncate cleanup prunes the boundary SHARD instead of blindly deleting it.** `shard_id = pgno / SHARD_SIZE`, so page 64 and page 65 both map to shard 1.
- **Debug historical reads cannot trust PIDX.** PIDX is the current owner map, so `debug::read_at` scans DELTA history up to the target txid before falling through to SHARD/cold layers.
- **Fresh fork branches use `/META/head_at_fork` until first commit.** The first local commit treats it as the previous `DBHead`, writes `/META/head`, and clears `/META/head_at_fork` in the same transaction.
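The truncate-cleanup bullet above relies on integer division mapping adjacent pages to the same shard. A small sketch of that arithmetic follows, assuming `SHARD_SIZE = 64` (a value chosen because it is consistent with pages 64 and 65 both landing in shard 1, not one confirmed by the diff); `boundary_shard_needs_prune` is a hypothetical helper name.

```rust
// Assumed value for illustration only; the real constant lives in depot.
const SHARD_SIZE: u32 = 64;

/// Shard that owns a page, following `shard_id = pgno / SHARD_SIZE`.
fn shard_id(pgno: u32) -> u32 {
    pgno / SHARD_SIZE
}

/// Returns true when the shard holding the last kept page also holds
/// pages past the truncation point. Such a boundary shard has to be
/// pruned page by page; deleting it outright would lose live pages.
fn boundary_shard_needs_prune(last_kept_pgno: u32) -> bool {
    shard_id(last_kept_pgno) == shard_id(last_kept_pgno + 1)
}
```

Truncating to page 64 keeps page 64 and drops page 65, yet both live in shard 1, so that shard is pruned. Truncating to page 63 puts the cut exactly on a shard edge, and shard 1 holds nothing that survives.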
@@ -78,7 +80,8 @@ These come from `r2-prior-art/.agent/research/sqlite/requirements.md` and supers
- **Workflow compaction joined signals expose `database_branch_id()`.** Manager and companion loops should filter unrelated branch signals once before variant-specific handling.
- **Workflow companion signal handlers are kind-specific.** Hot, cold, and reclaim loops choose typed signal handlers first; storage work stays inside activities.
- **Workflow force-compaction state lives in `ForceCompactionTracker`.** Manager code should use tracker methods for request, result, attempted-job, and completed-job bookkeeping.
- **Workflow force-compaction tests use the manager `ForceCompaction` signal.** Wait on durable `force_compactions.recent_results`, not planner deadlines or arbitrary timing thresholds.
- **Workflow force-compaction tests use `DepotCompactionTestDriver` under `depot/test-faults`.** Wait on durable `force_compactions.recent_results`, not planner deadlines or arbitrary timing thresholds.
- **Workflow fault-hook tests should assert durable forced results.** Manager retries can surface a terminal error and still settle later state in the same forced cycle.
- **Workflow hot compaction writes only staged LTX blobs under `CMP/stage/{job_id}/hot_shard`.** Successful staged results stay as the manager's active hot job until the manager install path publishes or rejects them.
- **Workflow hot install runs in the DB manager.** It accepts only the active job fingerprint, copies staged hot shards to reader-visible `SHARD`, advances `CMP/root`, and clears matching PIDX rows with `COMPARE_AND_CLEAR`.
- **Workflow hot planning treats `DB_PIN` records as exact coverage targets.** Preserve pinned txids in `HotJobInputRange.coverage_txids`; do not round them up to the latest head.
@@ -132,6 +135,7 @@ These come from `r2-prior-art/.agent/research/sqlite/requirements.md` and supers
- **ColdTier object keys are relative S3-style keys.** Reject empty object keys, absolute paths, and `..`; use `FilesystemColdTier` for local tests and `FaultyColdTier` for injected latency or failures.
- **ColdTier implementations live under `src/cold_tier/`.** Keep `mod.rs` as the trait/shared-type facade and put disabled, filesystem, S3, config, and fault-injection behavior in focused files.
- **FaultyColdTier requires an explicit node id.** Use a real node id or test-specific label so injected-failure metrics never fall back to `unknown`.
- **FaultyColdTier controller faults are test-only.** `DropArtifact` on GET returns a missing object; on PUT it writes first and then drops the acknowledgement with an error.
- **S3 missing-object handling uses typed SDK service errors.** Check `GetObjectError::is_no_such_key`; do not match `Display` strings.
- **Cold read fall-through keeps ColdTier GETs outside UDB transactions.** `Db::new_with_cold_tier` supplies the backend and read-side manifests are cached per connection.

@@ -201,6 +205,11 @@ We explicitly do **not** import:
- Lease-expiry and time-window tests use `tokio::time::pause()` + `advance()` for determinism.
- Use a nil bucket `Db` to exercise branch-scoped hot compaction through public `compact_default_batch`.
- Latency tests that depend on `UDB_SIMULATED_LATENCY_MS` must live in a dedicated integration test binary because UDB caches the env var once per process via `OnceLock`.
- SQLite VFS integration and fault-injection tests live in `rivetkit-rust/packages/rivetkit-sqlite/` so they exercise the full VFS.
- Depot fault-injection APIs live only behind `depot/test-faults`; enable the feature from dev/test dependencies only.
- Production fault-leak checks run through `engine/packages/depot/scripts/check-production-fault-leaks.sh`.
- Depot fault-controller tests live in `tests/fault_controller.rs` and run with `cargo test -p depot --features test-faults --test fault_controller`.
- Commit fault-hook tests should search the `anyhow` error chain because UDB transaction failures wrap injected fault errors.
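The last bullet above says injected faults arrive wrapped inside transaction errors, so a test that inspects only the outermost error misses them. The sketch below shows the chain walk with `std::error::Error::source`. `InjectedFault`, `TransactionFailed`, `wrapped_fault`, and `chain_contains_injected_fault` are hypothetical names for illustration, not depot or UDB types. With `anyhow`, iterating `Error::chain()` and calling `downcast_ref` on each cause should express the same search.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical injected-fault error raised at a fault point.
#[derive(Debug)]
struct InjectedFault {
    point: &'static str,
}

impl fmt::Display for InjectedFault {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "injected fault at {}", self.point)
    }
}

impl Error for InjectedFault {}

/// Hypothetical outer error that hides the cause behind `source()`.
#[derive(Debug)]
struct TransactionFailed {
    cause: Box<dyn Error + 'static>,
}

impl fmt::Display for TransactionFailed {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "transaction failed")
    }
}

impl Error for TransactionFailed {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.cause)
    }
}

/// Walks the whole cause chain looking for an `InjectedFault`.
fn chain_contains_injected_fault(err: &(dyn Error + 'static)) -> bool {
    let mut current = Some(err);
    while let Some(e) = current {
        if e.downcast_ref::<InjectedFault>().is_some() {
            return true;
        }
        current = e.source();
    }
    false
}

/// Builds a fault wrapped one level deep, as a failed commit might.
fn wrapped_fault() -> TransactionFailed {
    TransactionFailed {
        cause: Box::new(InjectedFault { point: "commit" }),
    }
}
```

The outer `Display` text never mentions the fault, which is why string matching on the top-level message is not enough and the typed chain search is the safer assertion.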

## Metrics

4 changes: 4 additions & 0 deletions engine/packages/depot/Cargo.toml
@@ -5,6 +5,10 @@ authors.workspace = true
license.workspace = true
edition.workspace = true

[features]
default = []
test-faults = []

[dependencies]
anyhow.workspace = true
async-channel.workspace = true
109 changes: 109 additions & 0 deletions engine/packages/depot/scripts/check-production-fault-leaks.sh
@@ -0,0 +1,109 @@
#!/usr/bin/env bash
set -euo pipefail

repo_root="$(git rev-parse --show-toplevel)"
cd "$repo_root"

tmp_dir="$(mktemp -d)"
trap 'rm -rf "$tmp_dir"' EXIT

echo "checking depot release build without depot/test-faults"
cargo check -p depot --release

echo "checking release IR for fault-only symbols"
cargo rustc -p depot --release --target-dir "$tmp_dir/release-target" --lib -- --emit=llvm-ir
shopt -s nullglob
ir_files=("$tmp_dir"/release-target/release/deps/depot-*.ll)
if [[ ${#ir_files[@]} -eq 0 ]]; then
	echo "expected depot LLVM IR output, found none" >&2
	exit 1
fi

if grep -E 'DepotFault(Action|Controller|Point)|DropArtifact|MAX_FAULT_DELAY|disable_planning_timers' "${ir_files[@]}" >/dev/null; then
	echo "fault-injection symbol leaked into normal release IR" >&2
	grep -n -E 'DepotFault(Action|Controller|Point)|DropArtifact|MAX_FAULT_DELAY|disable_planning_timers' "${ir_files[@]}" >&2
	exit 1
fi

echo "checking normal dependency surface rejects fault APIs"
probe_dir="$tmp_dir/no-feature-probe"
mkdir -p "$probe_dir"

cat >"$probe_dir/main.rs" <<'EOF'
use std::time::Duration;

fn main() {
	let _controller = depot::fault::DepotFaultController::new();
	let _pause = depot::fault::DepotFaultAction::Pause {
		checkpoint: String::new(),
	};
	let _delay = depot::fault::DepotFaultAction::Delay {
		duration: Duration::from_millis(1),
	};
	let _drop = depot::fault::DepotFaultAction::DropArtifact;
	let _input = depot::workflows::compaction::DbManagerInput {
		database_branch_id: depot::conveyer::types::DatabaseBranchId::nil(),
		actor_id: None,
		disable_planning_timers: true,
	};
}
EOF

rlibs=("$tmp_dir"/release-target/release/deps/libdepot-*.rlib)
if [[ ${#rlibs[@]} -ne 1 ]]; then
	echo "expected exactly one no-feature libdepot rlib, found ${#rlibs[@]}" >&2
	exit 1
fi

if rustc \
	--edition=2024 \
	"$probe_dir/main.rs" \
	--emit=metadata \
	-L "dependency=$tmp_dir/release-target/release/deps" \
	--extern "depot=${rlibs[0]}" \
	>"$tmp_dir/probe.out" 2>"$tmp_dir/probe.err"; then
	echo "normal dependency unexpectedly compiled with fault-only APIs" >&2
	exit 1
fi

if ! grep -Eq 'could not find `fault` in `depot`|could not find .*fault.*depot' "$tmp_dir/probe.err"; then
	echo "normal dependency probe failed, but not because depot::fault was hidden" >&2
	cat "$tmp_dir/probe.err" >&2
	exit 1
fi

if ! grep -q 'disable_planning_timers' "$tmp_dir/probe.err"; then
	echo "normal dependency probe did not prove disable_planning_timers was hidden" >&2
	cat "$tmp_dir/probe.err" >&2
	exit 1
fi

echo "checking depot/test-faults is only enabled from dev dependencies"
cargo metadata --format-version 1 --no-deps >"$tmp_dir/metadata.json"
python3 - "$tmp_dir/metadata.json" <<'PY'
import json
import sys

metadata_path = sys.argv[1]
with open(metadata_path, "r", encoding="utf-8") as f:
    metadata = json.load(f)

leaks = []
for package in metadata["packages"]:
    for dep in package.get("dependencies", []):
        if dep.get("name") != "depot":
            continue
        if "test-faults" not in dep.get("features", []):
            continue
        if dep.get("kind") == "dev":
            continue
        leaks.append(f'{package["name"]} depends on depot/test-faults as {dep.get("kind") or "normal"}')

if leaks:
    print("non-dev depot/test-faults dependency leaks found:", file=sys.stderr)
    for leak in leaks:
        print(f"- {leak}", file=sys.stderr)
    sys.exit(1)
PY

echo "production fault-leak checks passed"
84 changes: 84 additions & 0 deletions engine/packages/depot/src/cold_tier/faulty.rs
@@ -3,7 +3,12 @@
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, AtomicU64, AtomicUsize, Ordering};
use std::time::Duration;

#[cfg(feature = "test-faults")]
use crate::fault::{
	ColdTierFaultPoint, DepotFaultAction, DepotFaultContext, DepotFaultController,
	DepotFaultPoint,
};
use crate::metrics;

use super::{ColdTier, ColdTierObjectMetadata};
@@ -21,6 +26,8 @@
	inner: T,
	node_id: String,
	state: Arc<FaultyColdTierState>,
	#[cfg(feature = "test-faults")]
	fault_controller: Option<DepotFaultController>,
}

#[derive(Debug, Default)]
@@ -39,6 +46,22 @@
			inner,
			node_id: node_id.into(),
			state: Arc::new(FaultyColdTierState::default()),
			#[cfg(feature = "test-faults")]
			fault_controller: None,
		}
	}

	#[cfg(feature = "test-faults")]
	pub fn new_with_fault_controller_for_test(
		inner: T,
		node_id: impl Into<String>,
		fault_controller: DepotFaultController,
	) -> Self {
		FaultyColdTier {
			inner,
			node_id: node_id.into(),
			state: Arc::new(FaultyColdTierState::default()),
			fault_controller: Some(fault_controller),
		}
	}

@@ -94,6 +117,27 @@

		Ok(())
	}

	#[cfg(feature = "test-faults")]
	async fn maybe_fire_controller_fault(
		&self,
		point: ColdTierFaultPoint,
	) -> Result<Option<DepotFaultAction>> {
		let Some(controller) = &self.fault_controller else {
			return Ok(None);
		};
		let Some(fired) = controller
			.maybe_fire(
				DepotFaultPoint::ColdTier(point),
				DepotFaultContext::new(),
			)
			.await?
		else {
			return Ok(None);
		};

		Ok(Some(fired.action))
	}
}

impl ColdTierOperation {
@@ -116,6 +160,8 @@
			inner: self.inner.clone(),
			node_id: self.node_id.clone(),
			state: self.state.clone(),
			#[cfg(feature = "test-faults")]
			fault_controller: self.fault_controller.clone(),
		}
	}
}
@@ -127,21 +173,59 @@
{
	async fn put_object(&self, key: &str, bytes: &[u8]) -> Result<()> {
		self.maybe_fail(ColdTierOperation::Put).await?;
		#[cfg(feature = "test-faults")]
		let drop_ack = matches!(
			self.maybe_fire_controller_fault(ColdTierFaultPoint::PutObject)
				.await?,
			Some(DepotFaultAction::DropArtifact)
		);
		self.inner.put_object(key, bytes).await
			.and_then(|()| {
				#[cfg(feature = "test-faults")]
				if drop_ack {
					anyhow::bail!("injected cold-tier put acknowledgement drop for {key}");
				}

				Ok(())
			})
	}

	async fn get_object(&self, key: &str) -> Result<Option<Vec<u8>>> {
		self.maybe_fail(ColdTierOperation::Get).await?;
		#[cfg(feature = "test-faults")]
		if matches!(
			self.maybe_fire_controller_fault(ColdTierFaultPoint::GetObject)
				.await?,
			Some(DepotFaultAction::DropArtifact)
		) {
			return Ok(None);
		}
		self.inner.get_object(key).await
	}

	async fn delete_objects(&self, keys: &[String]) -> Result<()> {
		self.maybe_fail(ColdTierOperation::Delete).await?;
		#[cfg(feature = "test-faults")]
		if matches!(
			self.maybe_fire_controller_fault(ColdTierFaultPoint::DeleteObjects)
				.await?,
			Some(DepotFaultAction::DropArtifact)
		) {
			anyhow::bail!("cold-tier DropArtifact is not supported for delete_objects");
		}
		self.inner.delete_objects(keys).await
	}

	async fn list_prefix(&self, prefix: &str) -> Result<Vec<ColdTierObjectMetadata>> {
		self.maybe_fail(ColdTierOperation::List).await?;
		#[cfg(feature = "test-faults")]
		if matches!(
			self.maybe_fire_controller_fault(ColdTierFaultPoint::ListPrefix)
				.await?,
			Some(DepotFaultAction::DropArtifact)
		) {
			anyhow::bail!("cold-tier DropArtifact is not supported for list_prefix");
		}
		self.inner.list_prefix(prefix).await
	}
}
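The `DropArtifact` behavior in this file is asymmetric: a PUT still writes the object and then reports failure, while a GET pretends the object is missing. The synchronous sketch below models only that asymmetry. `SketchTier`, `Fault`, and the one-shot `next_fault` field are hypothetical and assume a fault fires once and then clears; the real `DepotFaultController` is async and its matching rules are not shown in this diff.

```rust
use std::collections::HashMap;

/// Hypothetical fault selector; the real controller is richer.
#[derive(Clone, Copy, PartialEq, Eq)]
enum Fault {
    None,
    DropArtifact,
}

/// In-memory stand-in for a cold tier wrapped by a fault layer.
struct SketchTier {
    objects: HashMap<String, Vec<u8>>,
    // Assumption: a fault applies to the next operation only.
    next_fault: Fault,
}

impl SketchTier {
    fn new() -> Self {
        SketchTier {
            objects: HashMap::new(),
            next_fault: Fault::None,
        }
    }

    /// Takes the pending fault, leaving `Fault::None` behind.
    fn take_fault(&mut self) -> Fault {
        std::mem::replace(&mut self.next_fault, Fault::None)
    }

    /// PUT writes first, then drops the acknowledgement with an error.
    fn put(&mut self, key: &str, bytes: &[u8]) -> Result<(), String> {
        let drop_ack = self.take_fault() == Fault::DropArtifact;
        self.objects.insert(key.to_string(), bytes.to_vec());
        if drop_ack {
            return Err(format!(
                "injected cold-tier put acknowledgement drop for {key}"
            ));
        }
        Ok(())
    }

    /// GET reports a missing object without touching stored bytes.
    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        if self.take_fault() == Fault::DropArtifact {
            return None;
        }
        self.objects.get(key).cloned()
    }
}
```

A caller that sees the PUT error cannot tell whether the bytes landed, which is the ambiguous post-commit outcome the fault scenarios in this PR are meant to classify.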
15 changes: 15 additions & 0 deletions engine/packages/depot/src/compaction/companion.rs
@@ -186,6 +186,21 @@ async fn run_hot_compaction_job(
	})
	.await?;
	test_hooks::maybe_pause_after_hot_stage(database_branch_id).await;
	#[cfg(feature = "test-faults")]
	let output = match test_hooks::maybe_fire_hot_compaction_fault(
		database_branch_id,
		crate::fault::HotCompactionFaultPoint::AfterStageBeforeFinishSignal,
	)
	.await
	{
		Ok(Some(_)) | Ok(None) => output,
		Err(err) => StageHotJobOutput {
			status: CompactionJobStatus::Failed {
				error: err.to_string(),
			},
			output_refs: Vec::new(),
		},
	};

	let tag_value = database_branch_tag_value(database_branch_id);
	ctx.signal(HotJobFinished {
2 changes: 2 additions & 0 deletions engine/packages/depot/src/compaction/mod.rs
@@ -1,5 +1,7 @@
pub(crate) mod companion;
pub(crate) mod shared;
#[cfg(feature = "test-faults")]
pub mod test_driver;
pub(crate) mod types;

#[cfg(debug_assertions)]