feat(l1): add automatic DB migration framework #6519
Conversation
Instead of forcing a full resync on every schema version bump, the storage layer now runs migration functions automatically. The MIGRATIONS array in migrations.rs maps each version upgrade (v1→v2 is index 0, v2→v3 is index 1, etc.), with a compile-time assert ensuring the array length stays in sync with STORE_SCHEMA_VERSION. Store::new detects the on-disk version and runs pending migrations before proceeding, writing metadata with fsync after each step for crash safety. has_valid_db now accepts any migratable version (1..=STORE_SCHEMA_VERSION). The MIGRATIONS array is currently empty (the version stays at 1); this PR ships the framework only.
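The table-driven shape described above can be sketched roughly as follows. This is a minimal illustration, not the crate's real code: the names `MigrationFn`, `MIGRATIONS`, and `STORE_SCHEMA_VERSION` follow the PR description, but the error type is a stand-in and the real migration functions take a storage backend argument.

```rust
// Sketch of the migration table + compile-time length guard (assumed shape).
type MigrationFn = fn() -> Result<(), String>;

const STORE_SCHEMA_VERSION: u64 = 1;

// Index 0 migrates v1 -> v2, index 1 migrates v2 -> v3, and so on.
// Empty while the schema version stays at 1.
const MIGRATIONS: [MigrationFn; (STORE_SCHEMA_VERSION - 1) as usize] = [];

// Compile-time guard: adding a migration without bumping the version
// (or vice versa) fails the build.
const _: () = assert!(MIGRATIONS.len() == (STORE_SCHEMA_VERSION - 1) as usize);

fn run_pending_migrations(mut version: u64) -> Result<u64, String> {
    while version < STORE_SCHEMA_VERSION {
        MIGRATIONS[(version - 1) as usize]()?; // v1 -> v2 lives at index 0
        version += 1; // the real store fsyncs the new version here
    }
    Ok(version)
}
```

With the array empty, `run_pending_migrations(1)` is a no-op that returns the current version unchanged.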
🤖 Claude Code Review

I have enough from the diff to write a thorough review. Here is the full analysis:

Code Review: PR #6519 — Automatic DB Migration Framework

**Overview**

This PR introduces a lightweight migration framework for `ethrex-storage`.

**Critical Issues**

**1. Metadata write is NOT crash-safe despite the claim**

A truly atomic write requires the write-to-tempfile-then-rename pattern:

```rust
let tmp_path = metadata_path.with_extension("tmp");
let mut file = std::fs::File::create(&tmp_path)?;
file.write_all(serialized.as_bytes())?;
file.sync_all()?;
std::fs::rename(&tmp_path, &metadata_path)?;
// Optionally: sync the parent directory fd for durability on Linux
```

The claim in the PR description ("writing metadata with fsync after each step for crash safety") is currently not accurate.

**2. No rollback on partial migration**

If a migration function fails after partially mutating the backend, there is no rollback; at minimum the idempotency expectations for migration functions should be documented.

**Design Issues**

**3. Migration silently falls back to an error when rocksdb is disabled**

The migration arm is gated on the `rocksdb` feature; the `#[cfg(not(feature = "rocksdb"))]` fallback returns `IncompatibleDBVersion` and is in practice unreachable dead code.

**4–5.** Internals exposed publicly that don't need to be (see Summary).

**Minor Issues**

**6. Redundant runtime assert**

The runtime assert on `MIGRATIONS.len()` duplicates the compile-time const assert just above it.

**7. Test isolation: shared temp directory paths**

Both tests in the migrations module use fixed paths under `std::env::temp_dir()`, risking stale state and races under parallel test runners.

**8.** The test confirms

**9.** The

**What Looks Good**

**Summary**

The most important fix before merging is Item 1 (atomic metadata write) — without it the "crash safety" guarantee doesn't hold and a power failure at the wrong moment can permanently brick the database. Item 2 (no rollback / idempotency contract) should at minimum be documented. The remaining items are lower-severity, but Items 4–5 are worth cleaning up to avoid exposing internals that don't need to be public.

Automated review by Claude (Anthropic) · sonnet · custom prompt
🤖 Kimi Code Review

Review of the database migration framework PR. Overall this is a solid implementation with proper crash-safety considerations (fsync after each migration step) and good error handling. Below are specific findings:

**crates/storage/migrations.rs**

**crates/storage/store.rs**

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt
Greptile Summary

This PR introduces an automatic DB migration framework for `ethrex-storage`.

**Confidence Score: 4/5**

Safe to merge after fixing the non-atomic metadata write, which can make a successfully migrated DB unrecoverable on crash. One P1 finding: `write_metadata_version` is not crash-atomic, directly undermining the crash-safety guarantee advertised by this PR. The remaining findings are P2 (dead code, test temp-path hygiene). The score is 4 because the P1 is straightforward to fix with the write-tmp+rename pattern.

crates/storage/migrations.rs — `write_metadata_version` needs atomic write (temp file + rename)
| Filename | Overview |
|---|---|
| crates/storage/migrations.rs | New migration framework — contains a P1 crash-safety bug: write_metadata_version uses File::create (truncate-in-place) rather than the atomic write-tmp+rename pattern, risking a corrupt metadata file on crash mid-write. Also uses fixed temp paths in tests. |
| crates/storage/store.rs | Store::new restructured to detect on-disk version and dispatch to run_pending_migrations; has_valid_db broadened to accept any version in 1..=STORE_SCHEMA_VERSION. The #[cfg(not(feature = "rocksdb"))] migration arm is unreachable dead code. |
| crates/storage/error.rs | Adds MigrationFailed { from, to, reason } variant to StoreError — clean addition. |
| crates/storage/lib.rs | Exposes pub mod migrations and updates the STORE_SCHEMA_VERSION docstring to guide future contributors. |
| cmd/ethrex/initializers.rs | Adds MigrationFailed arm to the Store::new error handler with a user-facing message directing operators to removedb. |
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Store::new called"] --> B{"engine_type == InMemory?"}
    B -- Yes --> Z["Open InMemoryBackend"]
    B -- No --> C["read_store_schema_version"]
    C --> D{"metadata.json exists?"}
    D -- "No, dir non-empty" --> E["Err: NotFoundDBVersion<br/>pre-metadata DB"]
    D -- "No, dir empty/missing" --> F["init_metadata_file<br/>write v=STORE_SCHEMA_VERSION"]
    F --> Z2["Open RocksDBBackend normally"]
    D -- Yes --> G{"version value?"}
    G -- "v < 1" --> H["Err: IncompatibleDBVersion"]
    G -- "v > STORE_SCHEMA_VERSION" --> H
    G -- "v < STORE_SCHEMA_VERSION & rocksdb on" --> I["Open RocksDBBackend<br/>run_pending_migrations"]
    I --> J{"migration loop<br/>v..STORE_SCHEMA_VERSION"}
    J -- "per step" --> K["MIGRATIONS idx apply fn"]
    K --> L["write_metadata_version<br/>File::create + write_all + sync_all<br/>⚠️ NOT atomic"]
    L --> J
    J -- done --> M["from_backend return Store"]
    G -- "v == STORE_SCHEMA_VERSION" --> Z2
```
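The version dispatch in the flowchart can be summarized in a small decision function. This is a hedged sketch, not the crate's API: `Decision` is an invented stand-in for the real control flow, and the two metadata-missing cases are collapsed into one for brevity.

```rust
// Illustrative version-dispatch logic (assumed shape, simplified).
const STORE_SCHEMA_VERSION: u64 = 3; // example value

#[derive(Debug, PartialEq)]
enum Decision {
    InitFresh,                  // no metadata: write current version, open normally
    OpenNormally,               // version matches: nothing to do
    Migrate { from: u64 },      // older migratable version: run pending migrations
    Incompatible { found: u64 } // v < 1 or v newer than the client supports
}

fn dispatch(on_disk: Option<u64>) -> Decision {
    match on_disk {
        None => Decision::InitFresh,
        Some(v) if v < 1 || v > STORE_SCHEMA_VERSION => Decision::Incompatible { found: v },
        Some(v) if v < STORE_SCHEMA_VERSION => Decision::Migrate { from: v },
        Some(_) => Decision::OpenNormally,
    }
}
```

Note how the guards mirror the flowchart: out-of-range versions error out, older in-range versions migrate, and an exact match opens the backend directly.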
Reviews (1): Last reviewed commit: "style(l1): fix formatting in store.rs"
🤖 Codex Code Review
Aside from those durability issues, the

I couldn't run the storage tests here because

Automated review by OpenAI Codex · gpt-5.4 · custom prompt
```rust
/// Atomically writes the schema version to metadata.json with fsync for crash safety.
fn write_metadata_version(db_path: &Path, version: u64) -> Result<(), StoreError> {
    let metadata_path = db_path.join(STORE_METADATA_FILENAME);
    let metadata = StoreMetadata::new(version);
    let serialized = serde_json::to_string_pretty(&metadata)?;

    let mut file = std::fs::File::create(&metadata_path)?;
    file.write_all(serialized.as_bytes())?;
    file.sync_all()?;

    Ok(())
}
```
**Non-atomic metadata write contradicts crash-safety claim**

`write_metadata_version` is documented as "Atomically writes" but it is not atomic. `File::create` truncates the existing file immediately; if the process crashes after the truncation but before `write_all` completes, the metadata file will be empty or partial. `serde_json::from_str` will then fail on the next startup, making the database inaccessible even though the underlying RocksDB migration completed successfully.

The safe pattern for POSIX filesystems is: write to a sibling temp file → `sync_all()` → `rename()` (which is atomic).
```rust
fn write_metadata_version(db_path: &Path, version: u64) -> Result<(), StoreError> {
    let metadata_path = db_path.join(STORE_METADATA_FILENAME);
    let tmp_path = db_path.join(format!("{}.tmp", STORE_METADATA_FILENAME));
    let metadata = StoreMetadata::new(version);
    let serialized = serde_json::to_string_pretty(&metadata)?;
    let mut file = std::fs::File::create(&tmp_path)?;
    file.write_all(serialized.as_bytes())?;
    file.sync_all()?;
    std::fs::rename(&tmp_path, &metadata_path)?;
    Ok(())
}
```
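The temp-file-then-rename pattern can be demonstrated in isolation with the standard library alone. This is a generic sketch (the function name and paths are illustrative, not the crate's): a replacement either fully succeeds or leaves the previous file contents intact, because the rename is atomic on POSIX filesystems.

```rust
use std::io::Write;
use std::path::Path;

// Write `contents` to `path` atomically: stage in a sibling temp file,
// flush it to disk, then rename over the target.
fn atomic_write(path: &Path, contents: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut file = std::fs::File::create(&tmp)?;
    file.write_all(contents)?;
    file.sync_all()?; // push data to disk before the rename makes it visible
    std::fs::rename(&tmp, path) // atomic on POSIX filesystems
}
```

A crash before the rename leaves the old file untouched; a crash after it leaves the new file fully written. Only the window inside `rename` itself is handled by the filesystem's atomicity guarantee.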
```rust
#[cfg(not(feature = "rocksdb"))]
Some(v) if v < STORE_SCHEMA_VERSION => {
    return Err(StoreError::IncompatibleDBVersion {
        found: v,
        expected: STORE_SCHEMA_VERSION,
    });
}
```
**Unreachable dead code in `#[cfg(not(feature = "rocksdb"))]` branch**

`EngineType` only has two variants: `InMemory` and `#[cfg(feature = "rocksdb")] RocksDB`. The outer guard is `if engine_type != EngineType::InMemory`, so when the `rocksdb` feature is disabled, this block is never entered (there is no non-`InMemory` engine type to select). This arm can therefore never be reached and will silently become dead code, which may confuse future contributors writing their first migration.
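The reviewer's point can be shown with a self-contained miniature. This is an illustration mirroring the crate's shape, not its real definitions: when the `rocksdb` feature is off, the enum has a single variant, so a `!= InMemory` guard can never be true and any code behind it is unreachable.

```rust
// Hypothetical mirror of the crate's engine enum with a cfg-gated variant.
#[derive(Debug, PartialEq)]
enum EngineType {
    InMemory,
    #[cfg(feature = "rocksdb")]
    RocksDB,
}

// With the feature disabled, every possible value is InMemory, so this
// guard is always false — exactly why the fallback arm is dead code.
fn takes_migration_path(engine_type: &EngineType) -> bool {
    *engine_type != EngineType::InMemory
}
```

Compiled without the `rocksdb` feature, `takes_migration_path` can only ever return `false`.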
```rust
let temp_dir = std::env::temp_dir().join("ethrex_migration_test_noop");
std::fs::create_dir_all(&temp_dir).ok();
```
**Fixed temp paths risk test flakiness under parallel test runners**

Both tests use a fixed path under `std::env::temp_dir()` (e.g. `ethrex_migration_test_noop`). If `cargo test` is run with parallelism across multiple invocations, or if a previous run left the directory behind from a failed cleanup, tests can observe stale state or race with one another. Consider using `tempfile::TempDir` (or `std::env::temp_dir().join(format!(".._{}", std::process::id()))`) to guarantee unique, isolated directories.
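A standard-library-only version of the suggested fix might look like the sketch below (the `tempfile` crate's `TempDir` is the more ergonomic option in practice, since it also cleans up on drop). The helper name and prefix scheme here are invented for illustration: process id plus a per-process counter guarantees unique directories across parallel invocations.

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};

// Monotonic counter so repeated calls within one process never collide.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

// Create a fresh, uniquely named directory under the system temp dir.
fn unique_test_dir(prefix: &str) -> std::io::Result<PathBuf> {
    let id = NEXT_ID.fetch_add(1, Ordering::Relaxed);
    let dir = std::env::temp_dir().join(format!("{}_{}_{}", prefix, std::process::id(), id));
    std::fs::create_dir_all(&dir)?;
    Ok(dir)
}
```

Each call yields a distinct directory, so two tests (or two `cargo test` invocations) can never observe each other's state.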
- Use write-to-temp-then-rename pattern in `write_metadata_version` for crash-safe metadata updates (atomic on POSIX filesystems)
- Remove unreachable `#[cfg(not(feature = "rocksdb"))]` match arm in `Store::new` — without rocksdb the only engine is InMemory, so the outer guard already excludes this path
- Use `tempfile::tempdir()` in tests instead of fixed paths to avoid flakiness under parallel test runners
```rust
    })?;

    // Persist the new version to metadata.json after each migration step
    write_metadata_version(db_path, target)?;
```
**We should also wrap inside MigrationFailed**

If `write_metadata_version` fails during a migration, the error is now wrapped in `MigrationFailed` with the version context, matching the handling for the migration function itself.
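The wrapping the thread describes amounts to a `map_err` that attaches version context to the underlying failure. Below is a hedged, self-contained sketch: `StoreError` and `persist_version` are stand-ins for the crate's real types, with the I/O failure simulated by a flag.

```rust
// Stand-in for the crate's error enum; only the relevant variant is shown.
#[derive(Debug)]
enum StoreError {
    MigrationFailed { from: u64, to: u64, reason: String },
}

// Simulate persisting the metadata version; `fail` stands in for a real
// I/O error from write_metadata_version.
fn persist_version(fail: bool, from: u64, to: u64) -> Result<(), StoreError> {
    let write_result: Result<(), std::io::Error> = if fail {
        Err(std::io::Error::new(std::io::ErrorKind::Other, "disk full"))
    } else {
        Ok(())
    };
    // Wrap the low-level error with the migration step that was in flight.
    write_result.map_err(|e| StoreError::MigrationFailed {
        from,
        to,
        reason: format!("failed to persist metadata: {e}"),
    })
}
```

The caller then sees which version step failed rather than a bare I/O error, which is exactly the context `MigrationFailed { from, to, reason }` carries.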
**iovoid** left a comment:

It probably doesn't matter, but updates of the DB and the metadata file aren't atomic.
**Arkenan** left a comment:

LGTM with some nit considerations. We can merge as is, but if we do, let's open a follow-up PR fixing these issues.
```rust
return Err(StoreError::IncompatibleDBVersion {
    found: v,
    expected: STORE_SCHEMA_VERSION,
```

Shouldn't this be a specific migration error? i.e., something like:

```rust
MigrationFailed {
    from: v,
    to: STORE_SCHEMA_VERSION,
    reason: "DB version {from} is previous to the existence of migrations.",
}
```

```rust
Some(v) if v > STORE_SCHEMA_VERSION => {
    return Err(StoreError::IncompatibleDBVersion {
        found: v,
        expected: STORE_SCHEMA_VERSION,
    });
}
```
Similarly, the message could be something like:

```rust
MigrationFailed {
    from: v,
    to: STORE_SCHEMA_VERSION,
    reason: "DB version is more recent than the one the client expects. Migrating would involve rolling back, which is not supported.",
}
```

```rust
return Err(StoreError::NotFoundDBVersion {
    expected: STORE_SCHEMA_VERSION,
});
```
Is it possible to make this more explicit somehow? It's not that we expect STORE_SCHEMA_VERSION specifically; we can't migrate this at all because there's no version data in the DB. We should output something like:

> Cannot migrate the database. Its version is unavailable, which means it predates versioning and migrations.

We may just remove the "expected" part of the message entirely, as we are not really expecting that specific version; we're expecting any migratable version.
```rust
assert!(
    MIGRATIONS.len() == (STORE_SCHEMA_VERSION - 1) as usize,
    "MIGRATIONS length must equal STORE_SCHEMA_VERSION - 1"
);
```

Isn't this already done as a compile-time check just above? I don't think we need to repeat it at runtime.
```rust
);

for version in current_version..STORE_SCHEMA_VERSION {
    let idx = (version - 1) as usize; // v1→v2 is index 0
```

nit: let's have a `migration_for_version(i)` function that returns the correct migration for that version. That way we don't have to think about the indexing tricks in here.
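The suggested helper could look like the sketch below. It is a hypothetical shape, not the PR's code: the version-to-index arithmetic lives in one place, and out-of-range versions return `None` instead of panicking on a bad index. The no-op migrations and array length here exist only to make the sketch self-contained.

```rust
// Stand-in types matching the PR's MigrationFn/MIGRATIONS idea.
type MigrationFn = fn() -> Result<(), String>;

fn noop() -> Result<(), String> { Ok(()) }

const MIGRATIONS: [MigrationFn; 2] = [noop, noop]; // v1->v2, v2->v3

/// Returns the migration that upgrades `version` to `version + 1`, if any.
fn migration_for_version(version: u64) -> Option<MigrationFn> {
    version
        .checked_sub(1) // v1 -> index 0; version 0 has no migration
        .and_then(|idx| MIGRATIONS.get(idx as usize).copied())
}
```

The migration loop then calls `migration_for_version(v)` and treats `None` as an error, with no `- 1` scattered through the loop body.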
```rust
})?;

// Persist the new version to metadata.json after each migration step
write_metadata_version(db_path, target).map_err(|e| StoreError::MigrationFailed {
```

Shouldn't this be part of a storage abstraction instead? Having to pass `db_path` around is a bit weird.
- Make `NotFoundDBVersion` fieldless with a descriptive error message
- Use `MigrationFailed` for invalid version (v < 1) and future version (v > STORE_SCHEMA_VERSION) instead of `IncompatibleDBVersion`
- Remove redundant runtime assert (compile-time const assert suffices)
- Extract `migration_for_version()` helper for clarity
- Add TODO for moving metadata persistence into `StorageBackend`
Summary
- Adds a migration framework to `ethrex-storage` so schema version bumps run upgrade functions automatically instead of forcing a full resync
- `MIGRATIONS` array in `migrations.rs` maps each version step (index 0 = v1→v2, index 1 = v2→v3, etc.) with a compile-time assert to keep it in sync with `STORE_SCHEMA_VERSION`
- `Store::new` detects the on-disk version and runs pending migrations before proceeding, writing metadata with fsync after each step for crash safety
- `has_valid_db` now accepts any migratable version (`1..=STORE_SCHEMA_VERSION`)

Files changed
- `crates/storage/migrations.rs` — `MigrationFn` type, `MIGRATIONS` array, `run_pending_migrations()`, unit tests
- `crates/storage/lib.rs` — `pub mod migrations`, update docstring
- `crates/storage/store.rs` — `validate_store_schema_version` → `read_store_schema_version`, restructure `Store::new` with migration support, update `has_valid_db`
- `crates/storage/error.rs` — `MigrationFailed` variant
- `cmd/ethrex/initializers.rs` — `MigrationFailed` at startup

How to add a future migration
1. Add `fn migrate_1_to_2(backend: &dyn StorageBackend) -> Result<(), StoreError>` in `migrations.rs`
2. Register it in the `MIGRATIONS` array (index 0)
3. Bump `STORE_SCHEMA_VERSION` to `2` in `lib.rs`

Test plan
- `cargo check -p ethrex-storage` — clean
- `cargo check -p ethrex` — clean
- `cargo test -p ethrex-storage` — 3 unit tests pass (array length, no-op migration, metadata correctness)
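The steps under "How to add a future migration" can be sketched as follows. Everything here is hypothetical scaffolding: `StorageBackend` is a stand-in trait (the crate's real trait isn't shown in this excerpt), and `rewrite_legacy_entries` is an invented example of what migration work might look like.

```rust
// Stand-in for the crate's storage backend trait.
trait StorageBackend {
    fn rewrite_legacy_entries(&self) -> Result<u64, String>;
}

// Step 1: the migration function for the v1 -> v2 bump. The framework's
// migration loop persists the new version after this returns Ok.
fn migrate_1_to_2(backend: &dyn StorageBackend) -> Result<(), String> {
    let changed = backend.rewrite_legacy_entries()?;
    let _ = changed; // e.g. log how many entries were rewritten
    Ok(())
}

// A no-op backend so the skeleton can be exercised in isolation.
struct NoopBackend;

impl StorageBackend for NoopBackend {
    fn rewrite_legacy_entries(&self) -> Result<u64, String> {
        Ok(0)
    }
}
```

Steps 2 and 3 are then one-line changes: append `migrate_1_to_2` at index 0 of `MIGRATIONS` and bump `STORE_SCHEMA_VERSION` to `2`; the compile-time assert fails the build if either is done without the other.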