chore: sqlite v1 to v2 data migration#4692
chore: sqlite v1 to v2 data migration#4692NathanFlurry wants to merge 1 commit intographite-base/4692from
Conversation
|
🚅 Deployed to the rivet-pr-4692 environment in rivet-frontend
|
|
SQLite v1-to-v2 Migration - Code Review Overview This PR implements the v1-to-v2 SQLite data migration path for actor-side storage, runs the migration eagerly at actor boot (on envoys that support v2), and consolidates the VFS layer so only v2 is available going forward. The migration reads the v1 KV snapshot, rolls back any dangling journal, validates integrity via rusqlite, and commits the recovered pages into the v2 staged-commit protocol. Net change: -7725/+6125 lines, largely from retiring the legacy v2/vfs.rs (5077 deletions) and the EnvoyKv / KvVfs v1 path from database.rs (159 deletions). Issues Indentation bug in prepare_v1_migration (takeover.rs) The body of the if let block is not indented - let existing_head and ensure! appear at the same indent level as the if let condition. cargo fmt would catch this. Worth fixing before merge so reviewers do not have to wonder whether ensure! runs unconditionally. split_bytes(bytes, 0) returns one oversized chunk If meta.max_delta_bytes is somehow 0 (misconfiguration, stale engine), this silently promotes the entire database into a single chunk and lets the commit path reject it with an opaque error. A guard before the split_bytes call (ensure!(max_chunk_bytes > 0)) would surface a cleaner failure. Memory and Performance Concerns Double-materialise for large databases The migration holds two full copies of the database in memory at once: snapshot.main.bytes (v1 chunks assembled) and recovered.bytes (post-journal-recovery). For a 128 MiB database that is ~256 MiB peak. Dropping snapshot before reading the recovered file would halve that; page-streaming from the tempfile would eliminate it entirely. SQLITE_MIGRATION_LOCKS map never shrinks Entries are added once per actor_id and never removed. After a successful migration the early-return on Native/MigratedFromV1 means the lock is never acquired again, but the allocation lives for the process lifetime. For a long-running envoy that migrates many unique actors this leaks one Arc<Mutex<()>> per actor. Consider evicting the entry after maybe_migrate_v1_to_v2 returns Ok(true). Metrics SQLITE_MIGRATION_PAGES uses latency buckets for page counts The shared BUCKETS are latency-shaped (0.001 to 100 s). Page counts for a 4 KiB-page 128 MiB database run up to ~32768 - nearly everything will land in the infinity bucket. Custom buckets like [1.0, 10.0, 100.0, 500.0, 1000.0, 5000.0, 10000.0, 32768.0] would make the metric useful. Minor / Nit test_db() leaks its temp directory. tempdir()?.keep() prevents auto-cleanup. tempdir()? without .keep() would be cleaner unless RocksDB requires the directory to outlive the handle. Last assertion in reads_v1_files_beyond_the_default_kv_list_limit checks a coincidental value. seed_v1_sparse_chunks stores (chunk_idx as u8).wrapping_add(1). For chunk 16384 that is 1. chunk_count as u8 (= 16385 as u8) is also 1. Both happen to equal 1, so the assertion passes without distinguishing the last chunk from chunk 0. Spelling out 1_u8 literally or mirroring the seeder formula would make the intent clear. What Is Working Well
Open Question After a successful migration the v1 KV chunks remain in the store - the migration tags the v2 head as MigratedFromV1 but does not delete the old data. Is a cleanup pass planned, or is v1 data expected to be tombstoned lazily? If out of scope for this PR, a follow-up ticket would help track the storage overhead during the transition window. Generated with Claude Code (https://claude.com/claude-code) |
c2fbb00 to
daf7aa8
Compare
710f2df to
5b8e173
Compare
daf7aa8 to
f6c7523
Compare
5b8e173 to
02b0b04
Compare
02b0b04 to
b2af041
Compare
1418052 to
b6d4f6f
Compare
b2af041 to
94992cd
Compare
b6d4f6f to
59858bf
Compare
|
Landed in main via stack-merge fast-forward push. Commits are in main; closing to match. |

Description
Please include a summary of the changes and the related issue. Please also include relevant motivation and context.
Type of change
How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
Checklist: