Skip to content

feat(consensus): persist standalone WAL with crash-safe replay#3763

Open
ongyimeng wants to merge 6 commits into
mainfrom
fix/tendermint-use-pebble-wal
Open

feat(consensus): persist standalone WAL with crash-safe replay#3763
ongyimeng wants to merge 6 commits into
mainfrom
fix/tendermint-use-pebble-wal

Conversation

@ongyimeng

@ongyimeng ongyimeng commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Replaces the consensus WAL, previously stored as generic CBOR-encoded values in the shared Pebble KV store, with a dedicated append-only WAL owned by consensus. The new WAL uses typed records, segment files, pruning, and crash-safe replay.

Why

Consensus relies on WAL replay to recover after a crash mid-height. As generic KV values, that recovery path was harder to inspect and debug, and was coupled to the shared KV store. A standalone WAL gives consensus an explicit durability boundary with clearer replay and failure handling.

What changed

  • New consensus WAL:
    • typed record codec
    • append-only segments with rotation and tail repair
    • persisted prune watermarks
    • live-entry indexes
  • Replaced the Pebble KV-backed WAL and replay path.
  • Fixed Start entry replay. It already existed as an entry type, but the old generic CBOR decoding couldn't reconstruct it on replay.
  • Improved error handling across WAL writes/flushes, commit persistence, and replay.

Performance

Headline comparison is BenchmarkTendermintWALStoreSustainedHeightLatency, which is closest to the consensus write/delete flow: 1000 heights, real cleanup frequency, count=5.

Medians:

metric old WAL (KV) new WAL result
p50 155ms 153ms tie
p95 192ms 177ms ~7.8% faster
p99 239ms 203ms ~15.1% faster
max 298ms 299ms tie
B/op 1.77 MB 1.47 MB ~17% less
allocs/op 41.1k 28.2k ~31% fewer

Same latency at p50/max, better p95/p99, fewer allocations.

B/op and allocs/op are per 1000-height run, since Go counts the loop as one op under -benchtime=1x.

Note: This benchmark uses 1,000 heights, which fits within Pebble’s in-memory write path. It does not exercise the old KV-backed WAL’s eventual flush, compaction, or tombstone-reclamation costs. There is no steady-state latency regression at this scale, but it is not representative of recurring background costs in a long-running node which is opaque and state dependent.

Testing

  • Codec, replay, WAL store/writer, pruning, and failure-injection tests.
  • Restart test: node recovers via WAL replay with peers stopped and no committed block.
  • Benchmarks for entry durability, full-height writes, cleanup, replay, and sustained latency.

The remaining coverage gaps are mostly rare internal error paths. The main WAL recovery flows are already covered, so extra tests would add little value.

Comment thread consensus/db/wal_writer.go Fixed
Comment thread consensus/db/wal_writer.go Fixed
@codecov

codecov Bot commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 74.64481% with 232 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.36%. Comparing base (82780a4) to head (1ad21e6).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
consensus/db/codec.go 75.49% 42 Missing and 33 partials ⚠️
consensus/db/wal_store.go 76.54% 21 Missing and 17 partials ⚠️
consensus/db/replay.go 50.76% 21 Missing and 11 partials ⚠️
consensus/db/wal_writer.go 78.87% 17 Missing and 13 partials ⚠️
consensus/db/prune_watermark.go 58.33% 13 Missing and 7 partials ⚠️
consensus/db/record.go 82.45% 6 Missing and 4 partials ⚠️
consensus/db/wal_index.go 84.00% 4 Missing and 4 partials ⚠️
consensus/driver/driver.go 60.00% 5 Missing and 1 partial ⚠️
db/memory/db.go 60.00% 3 Missing and 3 partials ⚠️
consensus/driver/commit_listener.go 78.57% 3 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3763      +/-   ##
==========================================
+ Coverage   75.26%   75.36%   +0.09%     
==========================================
  Files         433      438       +5     
  Lines       38558    39464     +906     
==========================================
+ Hits        29022    29742     +720     
- Misses       7536     7643     +107     
- Partials     2000     2079      +79     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ongyimeng ongyimeng force-pushed the fix/tendermint-use-pebble-wal branch from e8c340a to 0383582 Compare June 25, 2026 04:39
@ongyimeng ongyimeng marked this pull request as ready for review June 25, 2026 05:29
@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Claude finished @ongyimeng's task in 6m 35s —— View job


PR Review: Standalone consensus WAL

  • Read core WAL files (codec, record, wal_writer, wal_store, wal_index, prune_watermark, replay)
  • Review consensus/driver integration & db layer changes
  • Analyze for correctness, concurrency, resource leaks, error handling
  • Post inline comments + summary

Reviewed the full change set with a focus on consensus-recovery correctness, durability ordering, concurrency, and resource handling. This is a well-engineered PR — the durability invariants are sound and the failure-injection test coverage is unusually thorough. I have no blockers.

Blocker

None.

Important

None. A few things I verified that are correct and worth recording:

  • WAL tail repair uses physical offsets consistently. appendSync stores the offset returned by Writer.WriteRecord and recoverLatestWALTail uses offset.Physical / PhysicalFile; the fake writer in tests mirrors record.Writer physical offsets, so truncation targets line up. (wal_writer.go:82, :230)
  • appendSync won't deadlock on the sync WaitGroup. On a WriteRecord error it returns before waitGroup.Wait(), so a never-signalled Done can't hang the flush. (wal_writer.go:67-76)
  • Prune durability ordering is crash-safe. Watermark is fsync'd before WAL files are deleted (wal_store.go:210-217), and a stale watermark is harmless because the surviving WAL prune records re-drive pruneLiveEntriesUpTo on replay (replay.go:113). Recovery at any crash point stays consistent.
  • Persisted chan error (buffer 1) is leak-free. Each sync.go path sends exactly once, and the buffer means a producer never blocks even when OnCommit returns early on ctx.Done() without draining. (sync.go:324/349/359/372, commit_listener.go:71)
  • putFixedUvarint32 round-trips correctly for the full uint32 range as a fixed 5-byte uvarint (verified the bit layout incl. the terminating buf[4] = value>>28). (codec.go:135)
  • WAL store is fully mutex-guarded; LoadAllEntries snapshots under lock; replay (isReplaying) correctly suppresses re-writing WAL entries, avoiding duplication.

Nit

  1. db/memory/db.go:113Path() panics on MkdirTemp failure and silently makes the "persistent" WAL ephemeral under an in-memory DB. Details in the inline comment. Fix this →
  2. wal_index.go:51panic(err) in updateIndexesFromCommittedRecords. These records were just built via setEntry, so entry() can't realistically fail, but a panic on the post-commit consensus path is a sharp edge. Consider surfacing it as a returned error from flushLocked instead.
  3. consensus.go:93-103 — WAL store leak on a future Init error. No path between NewTendermintWALStore and the return currently fails, but the store opens file handles / a Pebble WAL manager and is only closed via Driver.Run. If a fallible step is ever added below it, it'd leak; a deferred-close-on-error guard would harden this.
  4. wal_store.go:24 — WAL retention. With cleanupPruneRecordInterval = 256 and ~one prune/height, up to ~256 heights of sealed WAL accumulate between cleanups. Intentional and documented (amortizes fsyncs), just flagging the disk-footprint tradeoff for long-running nodes.

Test quality is good: channel-based synchronization (no time.Sleep), meaningful assertions, and the failure-injection suite covers the rare error paths the codecov gaps point at.
• Branch: fix/tendermint-use-pebble-wal

Comment thread db/memory/db.go
@ongyimeng ongyimeng requested a review from rodrodros June 25, 2026 05:56
@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Claude finished @ongyimeng's task in 2m 1s —— View job


Re-review: standalone consensus WAL (synchronize)

Re-reviewed the three commits pushed since the last review. They directly address the prior nits, and the changed packages build cleanly.

  • fix: handle WAL file close errors (a188a81)
  • fix(db/memory): avoid panic on temp path creation (ef5ed8f)
  • test: cover WAL record / prune watermark decode (0383582)
  • Check for regressions / new issues

Verdict

All three changes are correct. No blockers, no new issues.

repairWALTail close handling (wal_writer.go:181-209) — correctly converted to a named return (err error) with a deferred errors.Join on file.Close(). The earlier-return error paths (Stat/Truncate/Sync) are preserved and the close error is now joined rather than dropped. This resolves the CodeQL "writable file handle closed without error handling" finding. ✅

Path() no longer panics (db/memory/db.go:112-115) — returns "" on MkdirTemp failure instead of panic. Callers already treat "" as "no local path" (NewTendermintWALStore returns a clean error), so a transient TMPDIR failure now degrades gracefully instead of crashing the node from an infallible accessor. ✅

New decode-path tests (prune_watermark_test.go, codec_test.go, export_test.go) are meaningful — round-trip plus corruption cases (wrong size, bad header) assert real error behavior, using t.TempDir() with no flaky sleeps. ✅

Nit (non-blocking, optional)

  • db/memory/db.go Path() — the prior review also suggested a doc comment noting that under an in-memory backing DB the consensus WAL lives in an ephemeral temp dir that is RemoveAll'd on Close(), so there is no crash recovery in that configuration. The panic fix is the important part and is done; a one-line contract comment on Path() would still help future readers understand the silent ephemeral-WAL behavior. Add the comment →

Overall: the PR remains in good shape — durability invariants intact, the prior review's concrete items are now closed.

@rodrodros rodrodros left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments

Comment thread consensus/db/codec.go
Comment thread consensus/db/codec.go Outdated
Comment thread consensus/db/codec.go
}

func appendUint64Array[T ~[4]uint64](payload []byte, value T) []byte {
array := [4]uint64(value)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This performs a copy of the value, maybe you can make it a pointer, by receiving *T

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching to *T makes the API less clear because it introduces nil handling, even though the helper doesn’t mutate or store the value.

If the concern is performance, I checked escape analysis with the compiler output and passing value T does not introduce a heap allocation here.

@rodrodros rodrodros Jun 30, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this is internal API you can trust yourself and know that you will never call it knowing it is nil, by passing a reference, you saved from copying 4 words to just copying 1 word per call

Comment thread consensus/db/codec.go
Comment thread consensus/db/codec.go Outdated
Comment on lines +48 to +58
for i, record := range records {
// Each batchrepr entry starts with the kind byte and the encoded key.
encodedBatch = append(encodedBatch, byte(pebble.InternalKeyKindSet), keyBytes)
encodedBatch = binary.BigEndian.AppendUint32(encodedBatch, uint32(i))

// Reserve space for the value length prefix and backfill it after writing
// the record payload.
valueLenIndex := len(encodedBatch)
encodedBatch = append(encodedBatch, 0, 0, 0, 0, 0)
valueIndex := len(encodedBatch)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance nitpick: Append is generally slower than indexing and adding values. Since you already know the max size then it is better to do index work and it will be slighter faster.

Appends, even if not exceeding max capacity, it adds an overhead to it

@ongyimeng ongyimeng Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I measured this with a microbenchmark using the common encodeBatch sizes from the WAL flow which is 1-2 records.

With -count=10, indexing was basically tied for the 2-record case and only ~0.7 ns faster for the 1-record case, with 0 allocs either way. That is negligible compared to the real WAL flush path, which is ~20 ms/op.

This feels mostly like a readability preference, and I find the append version clearer than the manual-offset version. Let me know if you still prefer indexing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can leave it as is. I personally don't think it is more readable since it is not clear enough what you want to do. For example:

	encodedBatch = append(encodedBatch, 0, 0, 0, 0, 0)

Looks very weird, whilst the intention is reserving extra 5 bytes.

Additionally, if tomorrow the capacity requirements change we will end up with append silently allocating and not realizing it.

None of this are big deal, I do think the indexing one is preferable and not leave it up to the compiler to optimize it for you.

Leave if you think is good 👍

@rodrodros rodrodros left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review

Comment thread consensus/db/codec.go Outdated
Comment thread consensus/db/codec.go
Comment thread consensus/db/codec.go
payload []byte,
record walRecordEnvelope[V, H, A],
) ([]byte, error) {
payload = append(payload, byte(record.Kind))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment with the append. If you pass an index and treat that as your 0 index it will improve performance

Comment thread consensus/db/codec.go
Comment thread consensus/db/codec.go
Comment on lines +94 to +95
payload = appendMessageHeader(payload, proposalMessage.MessageHeader)
payload = appendInt64(payload, int64(proposalMessage.ValidRound))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

append is so expensive 🫠, it is being used way more than I thought

Comment thread consensus/db/codec.go
}
if decoder.remaining() != 0 {
return walRecordEnvelope[V, H, A]{}, fmt.Errorf(
"trailing WAL record bytes: %d",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error doesn't make clear what the problem is? Perhaps you want to put something like "unexpected remaining WAL record bytes: %d" ?

Comment thread consensus/db/codec.go

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codec holds both the coding and decoding logic wouldn't it makes sense having a codec package, with the coding and decoding logic properly separated, or having a script per coding / decoding step paired together?

Comment thread consensus/db/codec.go

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not relevant to this file but the whole folder.

Why is the package called db when it is clearly designed to be a WAL log, shouldn't it be renamed wal or waldb?

Comment thread consensus/db/init_test.go
Comment thread sync/data_source.go

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale behind this changes?

@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Claude finished @ongyimeng's task in 2m 27s —— View job


PR Re-review: standalone consensus WAL (synchronize)

Re-reviewed the commits pushed since the last review:

  • test: cover WAL record and prune watermark decode paths (0383582)
  • refactor: use constant for estimatedRecordBytes (daddcff)
  • refactor: pass WalRecordEnvelope by reference (1ad21e6)
  • Re-check core WAL files for correctness/concurrency
  • Verify the refactors introduce no aliasing/regression
  • Build affected packages

These directly address @rodrodros's review nits. Affected packages (consensus/db, db/memory) build cleanly.

Verdict

All changes are correct. No blockers, no new issues.

encodeBatch index-based header write (consensus/db/codec.go:52-78) — the previous append-based header construction was replaced with slices.Grow + explicit offset writes. I verified the byte layout is byte-for-byte identical to the original:

offset bytes field
0 (headerIndex) 1 InternalKeyKindSet
1 (keyLenOffset) 1 key length (4)
2 (keyOffset) 4 record key = uint32(i) (big-endian)
6 (valueLenOffset) 5 value length (fixed uvarint, backfilled)
11 (recordHeaderBytes) payload

slices.Grow(encodedBatch, recordHeaderBytes) guarantees cap ≥ headerIndex+11, so encodedBatch[:valueIndex] is safe, and all 11 reserved header bytes are unconditionally overwritten (kind/keyLen by assignment, key by PutUint32, value-length by putFixedUvarint32), so stale bytes from reused capacity can't leak. ✅

appendWALRecordPayload(payload, *walRecordEnvelope) (codec.go:83-85) — now takes the envelope by pointer to avoid copying the large struct. The function only reads record.* fields, never mutates or retains them, and &records[i] in the loop is a read-only borrow, so there's no aliasing hazard. The single test caller (export_test.go:31) was updated to pass &envelope. ✅

estimatedRecordBytes const (codec.go:40) — folding it into the const block with derived offset constants is purely mechanical and computes the same value (11 + 128 = 139). ✅

Path() returns "" on MkdirTemp failure (db/memory/db.go:114) — the fmt import was dropped along with the panic; matches the graceful-degradation behavior agreed in the prior review. ✅

The error-message wording flagged earlier (codec.go:176) is also now the clearer "trailing WAL record bytes: %d".

Overall: the PR remains in good shape. The durability invariants verified in the earlier reviews are untouched, and the refactors are behavior-preserving.
• Branch: fix/tendermint-use-pebble-wal

@ongyimeng ongyimeng deployed to Development June 30, 2026 07:57 — with GitHub Actions Active
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants