feat(consensus): persist standalone WAL with crash-safe replay#3763
feat(consensus): persist standalone WAL with crash-safe replay#3763ongyimeng wants to merge 6 commits into
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3763 +/- ##
==========================================
+ Coverage 75.26% 75.36% +0.09%
==========================================
Files 433 438 +5
Lines 38558 39464 +906
==========================================
+ Hits 29022 29742 +720
- Misses 7536 7643 +107
- Partials 2000 2079 +79 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
e8c340a to
0383582
Compare
|
Claude finished @ongyimeng's task in 6m 35s —— View job PR Review: Standalone consensus WAL
Reviewed the full change set with a focus on consensus-recovery correctness, durability ordering, concurrency, and resource handling. This is a well-engineered PR — the durability invariants are sound and the failure-injection test coverage is unusually thorough. I have no blockers. BlockerNone. ImportantNone. A few things I verified that are correct and worth recording:
Nit
Test quality is good: channel-based synchronization (no |
|
Claude finished @ongyimeng's task in 2m 1s —— View job Re-review: standalone consensus WAL (synchronize)Re-reviewed the three commits pushed since the last review. They directly address the prior nits, and the changed packages build cleanly.
VerdictAll three changes are correct. No blockers, no new issues.
New decode-path tests ( Nit (non-blocking, optional)
Overall: the PR remains in good shape — durability invariants intact, the prior review's concrete items are now closed. |
| } | ||
|
|
||
| func appendUint64Array[T ~[4]uint64](payload []byte, value T) []byte { | ||
| array := [4]uint64(value) |
There was a problem hiding this comment.
This performs a copy of the value, maybe you can make it a pointer, by receiving *T
There was a problem hiding this comment.
Switching to *T makes the API less clear because it introduces nil handling, even though the helper doesn’t mutate or store the value.
If the concern is performance, I checked escape analysis with the compiler output and passing value T does not introduce a heap allocation here.
There was a problem hiding this comment.
Because this is internal API you can trust yourself and know that you will never call it knowing it is nil, by passing a reference, you saved from copying 4 words to just copying 1 word per call
| for i, record := range records { | ||
| // Each batchrepr entry starts with the kind byte and the encoded key. | ||
| encodedBatch = append(encodedBatch, byte(pebble.InternalKeyKindSet), keyBytes) | ||
| encodedBatch = binary.BigEndian.AppendUint32(encodedBatch, uint32(i)) | ||
|
|
||
| // Reserve space for the value length prefix and backfill it after writing | ||
| // the record payload. | ||
| valueLenIndex := len(encodedBatch) | ||
| encodedBatch = append(encodedBatch, 0, 0, 0, 0, 0) | ||
| valueIndex := len(encodedBatch) | ||
|
|
There was a problem hiding this comment.
Performance nitpick: Append is generally slower than indexing and adding values. Since you already know the max size then it is better to do index work and it will be slighter faster.
Appends, even if not exceeding max capacity, it adds an overhead to it
There was a problem hiding this comment.
I measured this with a microbenchmark using the common encodeBatch sizes from the WAL flow which is 1-2 records.
With -count=10, indexing was basically tied for the 2-record case and only ~0.7 ns faster for the 1-record case, with 0 allocs either way. That is negligible compared to the real WAL flush path, which is ~20 ms/op.
This feels mostly like a readability preference, and I find the append version clearer than the manual-offset version. Let me know if you still prefer indexing.
There was a problem hiding this comment.
We can leave it as is. I personally don't think it is more readable since it is not clear enough what you want to do. For example:
encodedBatch = append(encodedBatch, 0, 0, 0, 0, 0)Looks very weird, whilst the intention is reserving extra 5 bytes.
Additionally, if tomorrow the capacity requirements change we will end up with append silently allocating and not realizing it.
None of this are big deal, I do think the indexing one is preferable and not leave it up to the compiler to optimize it for you.
Leave if you think is good 👍
| payload []byte, | ||
| record walRecordEnvelope[V, H, A], | ||
| ) ([]byte, error) { | ||
| payload = append(payload, byte(record.Kind)) |
There was a problem hiding this comment.
Similar comment with the append. If you pass an index and treat that as your 0 index it will improve performance
| payload = appendMessageHeader(payload, proposalMessage.MessageHeader) | ||
| payload = appendInt64(payload, int64(proposalMessage.ValidRound)) |
There was a problem hiding this comment.
append is so expensive 🫠, it is being used way more than I thought
| } | ||
| if decoder.remaining() != 0 { | ||
| return walRecordEnvelope[V, H, A]{}, fmt.Errorf( | ||
| "trailing WAL record bytes: %d", |
There was a problem hiding this comment.
The error doesn't make clear what the problem is? Perhaps you want to put something like "unexpected remaining WAL record bytes: %d" ?
There was a problem hiding this comment.
Codec holds both the coding and decoding logic wouldn't it makes sense having a codec package, with the coding and decoding logic properly separated, or having a script per coding / decoding step paired together?
There was a problem hiding this comment.
Not relevant to this file but the whole folder.
Why is the package called db when it is clearly designed to be a WAL log, shouldn't it be renamed wal or waldb?
There was a problem hiding this comment.
What's the rationale behind this changes?
|
Claude finished @ongyimeng's task in 2m 27s —— View job PR Re-review: standalone consensus WAL (synchronize)Re-reviewed the commits pushed since the last review:
These directly address @rodrodros's review nits. Affected packages ( VerdictAll changes are correct. No blockers, no new issues.
The error-message wording flagged earlier ( Overall: the PR remains in good shape. The durability invariants verified in the earlier reviews are untouched, and the refactors are behavior-preserving. |
Summary
Replaces the consensus WAL, previously stored as generic CBOR-encoded values in the shared Pebble KV store, with a dedicated append-only WAL owned by consensus. The new WAL uses typed records, segment files, pruning, and crash-safe replay.
Why
Consensus relies on WAL replay to recover after a crash mid-height. As generic KV values, that recovery path was harder to inspect and debug, and was coupled to the shared KV store. A standalone WAL gives consensus an explicit durability boundary with clearer replay and failure handling.
What changed
Startentry replay. It already existed as an entry type, but the old generic CBOR decoding couldn't reconstruct it on replay.Performance
Headline comparison is
BenchmarkTendermintWALStoreSustainedHeightLatency, which is closest to the consensus write/delete flow: 1000 heights, real cleanup frequency,count=5.Medians:
Same latency at p50/max, better p95/p99, fewer allocations.
B/opandallocs/opare per 1000-height run, since Go counts the loop as one op under-benchtime=1x.Note: This benchmark uses 1,000 heights, which fits within Pebble’s in-memory write path. It does not exercise the old KV-backed WAL’s eventual flush, compaction, or tombstone-reclamation costs. There is no steady-state latency regression at this scale, but it is not representative of recurring background costs in a long-running node which is opaque and state dependent.
Testing
The remaining coverage gaps are mostly rare internal error paths. The main WAL recovery flows are already covered, so extra tests would add little value.