feat(consensus): persist standalone WAL with crash-safe replay by ongyimeng · Pull Request #3763 · NethermindEth/juno

ongyimeng · 2026-06-24T09:04:24Z

Summary

Replaces the consensus WAL, previously stored as generic CBOR-encoded values in the shared Pebble KV store, with a dedicated append-only WAL owned by consensus. The new WAL uses typed records, segment files, pruning, and crash-safe replay.

Why

Consensus relies on WAL replay to recover after a crash mid-height. As generic KV values, that recovery path was harder to inspect and debug, and was coupled to the shared KV store. A standalone WAL gives consensus an explicit durability boundary with clearer replay and failure handling.

What changed

New consensus WAL:
- typed record codec
- append-only segments with rotation and tail repair
- persisted prune watermarks
- live-entry indexes
Replaced the Pebble KV-backed WAL and replay path.
Fixed Start entry replay. It already existed as an entry type, but the old generic CBOR decoding couldn't reconstruct it on replay.
Improved error handling across WAL writes/flushes, commit persistence, and replay.

Performance

Headline comparison is BenchmarkTendermintWALStoreSustainedHeightLatency, which is closest to the consensus write/delete flow: 1000 heights, real cleanup frequency, count=5.

Medians:

metric	old WAL (KV)	new WAL	result
p50	155ms	153ms	tie
p95	192ms	177ms	~7.8% faster
p99	239ms	203ms	~15.1% faster
max	298ms	299ms	tie
B/op	1.77 MB	1.47 MB	~17% less
allocs/op	41.1k	28.2k	~31% fewer

Same latency at p50/max, better p95/p99, fewer allocations.

B/op and allocs/op are per 1000-height run, since Go counts the loop as one op under -benchtime=1x.

Note: This benchmark uses 1,000 heights, which fits within Pebble’s in-memory write path. It does not exercise the old KV-backed WAL’s eventual flush, compaction, or tombstone-reclamation costs. There is no steady-state latency regression at this scale, but it is not representative of recurring background costs in a long-running node which is opaque and state dependent.

Testing

Codec, replay, WAL store/writer, pruning, and failure-injection tests.
Restart test: node recovers via WAL replay with peers stopped and no committed block.
Benchmarks for entry durability, full-height writes, cleanup, replay, and sustained latency.

The remaining coverage gaps are mostly rare internal error paths. The main WAL recovery flows are already covered, so extra tests would add little value.

codecov · 2026-06-24T09:16:50Z

Codecov Report

❌ Patch coverage is 74.64481% with 232 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.36%. Comparing base (82780a4) to head (1ad21e6).
⚠️ Report is 6 commits behind head on main.

Files with missing lines	Patch %	Lines
consensus/db/codec.go	75.49%	42 Missing and 33 partials ⚠️
consensus/db/wal_store.go	76.54%	21 Missing and 17 partials ⚠️
consensus/db/replay.go	50.76%	21 Missing and 11 partials ⚠️
consensus/db/wal_writer.go	78.87%	17 Missing and 13 partials ⚠️
consensus/db/prune_watermark.go	58.33%	13 Missing and 7 partials ⚠️
consensus/db/record.go	82.45%	6 Missing and 4 partials ⚠️
consensus/db/wal_index.go	84.00%	4 Missing and 4 partials ⚠️
consensus/driver/driver.go	60.00%	5 Missing and 1 partial ⚠️
db/memory/db.go	60.00%	3 Missing and 3 partials ⚠️
consensus/driver/commit_listener.go	78.57%	3 Missing ⚠️
... and 2 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3763      +/-   ##
==========================================
+ Coverage   75.26%   75.36%   +0.09%     
==========================================
  Files         433      438       +5     
  Lines       38558    39464     +906     
==========================================
+ Hits        29022    29742     +720     
- Misses       7536     7643     +107     
- Partials     2000     2079      +79

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2026-06-25T05:30:46Z

Claude finished @ongyimeng's task in 6m 35s —— View job

PR Review: Standalone consensus WAL

Read core WAL files (codec, record, wal_writer, wal_store, wal_index, prune_watermark, replay)
Review consensus/driver integration & db layer changes
Analyze for correctness, concurrency, resource leaks, error handling
Post inline comments + summary

Reviewed the full change set with a focus on consensus-recovery correctness, durability ordering, concurrency, and resource handling. This is a well-engineered PR — the durability invariants are sound and the failure-injection test coverage is unusually thorough. I have no blockers.

Blocker

None.

Important

None. A few things I verified that are correct and worth recording:

WAL tail repair uses physical offsets consistently. appendSync stores the offset returned by Writer.WriteRecord and recoverLatestWALTail uses offset.Physical / PhysicalFile; the fake writer in tests mirrors record.Writer physical offsets, so truncation targets line up. (wal_writer.go:82, :230)
appendSync won't deadlock on the sync WaitGroup. On a WriteRecord error it returns before waitGroup.Wait(), so a never-signalled Done can't hang the flush. (wal_writer.go:67-76)
Prune durability ordering is crash-safe. Watermark is fsync'd before WAL files are deleted (wal_store.go:210-217), and a stale watermark is harmless because the surviving WAL prune records re-drive pruneLiveEntriesUpTo on replay (replay.go:113). Recovery at any crash point stays consistent.
Persisted chan error (buffer 1) is leak-free. Each sync.go path sends exactly once, and the buffer means a producer never blocks even when OnCommit returns early on ctx.Done() without draining. (sync.go:324/349/359/372, commit_listener.go:71)
putFixedUvarint32 round-trips correctly for the full uint32 range as a fixed 5-byte uvarint (verified the bit layout incl. the terminating buf[4] = value>>28). (codec.go:135)
WAL store is fully mutex-guarded; LoadAllEntries snapshots under lock; replay (isReplaying) correctly suppresses re-writing WAL entries, avoiding duplication.

Nit

db/memory/db.go:113 — Path() panics on MkdirTemp failure and silently makes the "persistent" WAL ephemeral under an in-memory DB. Details in the inline comment. Fix this →
wal_index.go:51 — panic(err) in updateIndexesFromCommittedRecords. These records were just built via setEntry, so entry() can't realistically fail, but a panic on the post-commit consensus path is a sharp edge. Consider surfacing it as a returned error from flushLocked instead.
consensus.go:93-103 — WAL store leak on a future Init error. No path between NewTendermintWALStore and the return currently fails, but the store opens file handles / a Pebble WAL manager and is only closed via Driver.Run. If a fallible step is ever added below it, it'd leak; a deferred-close-on-error guard would harden this.
wal_store.go:24 — WAL retention. With cleanupPruneRecordInterval = 256 and ~one prune/height, up to ~256 heights of sealed WAL accumulate between cleanups. Intentional and documented (amortizes fsyncs), just flagging the disk-footprint tradeoff for long-running nodes.

Test quality is good: channel-based synchronization (no time.Sleep), meaningful assertions, and the failure-injection suite covers the rare error paths the codecov gaps point at.
• Branch: fix/tendermint-use-pebble-wal

github-actions · 2026-06-25T05:57:16Z

Claude finished @ongyimeng's task in 2m 1s —— View job

Re-review: standalone consensus WAL (synchronize)

Re-reviewed the three commits pushed since the last review. They directly address the prior nits, and the changed packages build cleanly.

fix: handle WAL file close errors (a188a81)
fix(db/memory): avoid panic on temp path creation (ef5ed8f)
test: cover WAL record / prune watermark decode (0383582)
Check for regressions / new issues

Verdict

All three changes are correct. No blockers, no new issues.

repairWALTail close handling (wal_writer.go:181-209) — correctly converted to a named return (err error) with a deferred errors.Join on file.Close(). The earlier-return error paths (Stat/Truncate/Sync) are preserved and the close error is now joined rather than dropped. This resolves the CodeQL "writable file handle closed without error handling" finding. ✅

Path() no longer panics (db/memory/db.go:112-115) — returns "" on MkdirTemp failure instead of panic. Callers already treat "" as "no local path" (NewTendermintWALStore returns a clean error), so a transient TMPDIR failure now degrades gracefully instead of crashing the node from an infallible accessor. ✅

New decode-path tests (prune_watermark_test.go, codec_test.go, export_test.go) are meaningful — round-trip plus corruption cases (wrong size, bad header) assert real error behavior, using t.TempDir() with no flaky sleeps. ✅

Nit (non-blocking, optional)

db/memory/db.go Path() — the prior review also suggested a doc comment noting that under an in-memory backing DB the consensus WAL lives in an ephemeral temp dir that is RemoveAll'd on Close(), so there is no crash recovery in that configuration. The panic fix is the important part and is done; a one-line contract comment on Path() would still help future readers understand the silent ephemeral-WAL behavior. Add the comment →

Overall: the PR remains in good shape — durability invariants intact, the prior review's concrete items are now closed.

rodrodros

Some initial comments

rodrodros · 2026-06-29T10:59:46Z

+}
+
+func appendUint64Array[T ~[4]uint64](payload []byte, value T) []byte {
+	array := [4]uint64(value)


This performs a copy of the value, maybe you can make it a pointer, by receiving *T

Switching to *T makes the API less clear because it introduces nil handling, even though the helper doesn’t mutate or store the value.

If the concern is performance, I checked escape analysis with the compiler output and passing value T does not introduce a heap allocation here.

Because this is internal API you can trust yourself and know that you will never call it knowing it is nil, by passing a reference, you saved from copying 4 words to just copying 1 word per call

rodrodros · 2026-06-29T11:07:48Z

+	for i, record := range records {
+		// Each batchrepr entry starts with the kind byte and the encoded key.
+		encodedBatch = append(encodedBatch, byte(pebble.InternalKeyKindSet), keyBytes)
+		encodedBatch = binary.BigEndian.AppendUint32(encodedBatch, uint32(i))
+
+		// Reserve space for the value length prefix and backfill it after writing
+		// the record payload.
+		valueLenIndex := len(encodedBatch)
+		encodedBatch = append(encodedBatch, 0, 0, 0, 0, 0)
+		valueIndex := len(encodedBatch)
+


Performance nitpick: Append is generally slower than indexing and adding values. Since you already know the max size then it is better to do index work and it will be slighter faster.

Appends, even if not exceeding max capacity, it adds an overhead to it

I measured this with a microbenchmark using the common encodeBatch sizes from the WAL flow which is 1-2 records.

With -count=10, indexing was basically tied for the 2-record case and only ~0.7 ns faster for the 1-record case, with 0 allocs either way. That is negligible compared to the real WAL flush path, which is ~20 ms/op.

This feels mostly like a readability preference, and I find the append version clearer than the manual-offset version. Let me know if you still prefer indexing.

We can leave it as is. I personally don't think it is more readable since it is not clear enough what you want to do. For example:

encodedBatch = append(encodedBatch, 0, 0, 0, 0, 0)

Looks very weird, whilst the intention is reserving extra 5 bytes.

Additionally, if tomorrow the capacity requirements change we will end up with append silently allocating and not realizing it.

None of this are big deal, I do think the indexing one is preferable and not leave it up to the compiler to optimize it for you.

Leave if you think is good 👍

rodrodros

Initial review

rodrodros · 2026-06-29T12:12:37Z

+	payload []byte,
+	record walRecordEnvelope[V, H, A],
+) ([]byte, error) {
+	payload = append(payload, byte(record.Kind))


Similar comment with the append. If you pass an index and treat that as your 0 index it will improve performance

rodrodros · 2026-06-29T12:16:34Z

+			payload = appendMessageHeader(payload, proposalMessage.MessageHeader)
+			payload = appendInt64(payload, int64(proposalMessage.ValidRound))


append is so expensive 🫠, it is being used way more than I thought

rodrodros · 2026-06-29T12:27:45Z

+	}
+	if decoder.remaining() != 0 {
+		return walRecordEnvelope[V, H, A]{}, fmt.Errorf(
+			"trailing WAL record bytes: %d",


The error doesn't make clear what the problem is? Perhaps you want to put something like "unexpected remaining WAL record bytes: %d" ?

rodrodros · 2026-06-29T12:29:32Z

Codec holds both the coding and decoding logic wouldn't it makes sense having a codec package, with the coding and decoding logic properly separated, or having a script per coding / decoding step paired together?

rodrodros · 2026-06-29T12:31:51Z

Not relevant to this file but the whole folder.

Why is the package called db when it is clearly designed to be a WAL log, shouldn't it be renamed wal or waldb?

rodrodros · 2026-06-29T13:18:23Z

What's the rationale behind this changes?

github-actions · 2026-06-30T07:53:34Z

Claude finished @ongyimeng's task in 2m 27s —— View job

PR Re-review: standalone consensus WAL (synchronize)

Re-reviewed the commits pushed since the last review:

test: cover WAL record and prune watermark decode paths (0383582)
refactor: use constant for estimatedRecordBytes (daddcff)
refactor: pass WalRecordEnvelope by reference (1ad21e6)
Re-check core WAL files for correctness/concurrency
Verify the refactors introduce no aliasing/regression
Build affected packages

These directly address @rodrodros's review nits. Affected packages (consensus/db, db/memory) build cleanly.

Verdict

All changes are correct. No blockers, no new issues.

encodeBatch index-based header write (consensus/db/codec.go:52-78) — the previous append-based header construction was replaced with slices.Grow + explicit offset writes. I verified the byte layout is byte-for-byte identical to the original:

offset	bytes	field
`0` (`headerIndex`)	1	`InternalKeyKindSet`
`1` (`keyLenOffset`)	1	key length (`4`)
`2` (`keyOffset`)	4	record key = `uint32(i)` (big-endian)
`6` (`valueLenOffset`)	5	value length (fixed uvarint, backfilled)
`11` (`recordHeaderBytes`)	…	payload

slices.Grow(encodedBatch, recordHeaderBytes) guarantees cap ≥ headerIndex+11, so encodedBatch[:valueIndex] is safe, and all 11 reserved header bytes are unconditionally overwritten (kind/keyLen by assignment, key by PutUint32, value-length by putFixedUvarint32), so stale bytes from reused capacity can't leak. ✅

appendWALRecordPayload(payload, *walRecordEnvelope) (codec.go:83-85) — now takes the envelope by pointer to avoid copying the large struct. The function only reads record.* fields, never mutates or retains them, and &records[i] in the loop is a read-only borrow, so there's no aliasing hazard. The single test caller (export_test.go:31) was updated to pass &envelope. ✅

estimatedRecordBytes const (codec.go:40) — folding it into the const block with derived offset constants is purely mechanical and computes the same value (11 + 128 = 139). ✅

Path() returns "" on MkdirTemp failure (db/memory/db.go:114) — the fmt import was dropped along with the panic; matches the graceful-degradation behavior agreed in the prior review. ✅

The error-message wording flagged earlier (codec.go:176) is also now the clearer "trailing WAL record bytes: %d".

Overall: the PR remains in good shape. The durability invariants verified in the earlier reviews are untouched, and the refactors are behavior-preserving.
• Branch: fix/tendermint-use-pebble-wal

feat(consensus): persist standalone WAL with crash-safe replay

e24b269

github-advanced-security AI found potential problems Jun 24, 2026

View reviewed changes

Comment thread consensus/db/wal_writer.go Fixed

github-code-quality Bot found potential problems Jun 24, 2026

View reviewed changes

Comment thread consensus/db/wal_writer.go Fixed

ongyimeng temporarily deployed to Development June 24, 2026 09:22 — with GitHub Actions Inactive

fix: handle WAL file close errors

a188a81

ongyimeng temporarily deployed to Development June 24, 2026 10:04 — with GitHub Actions Inactive

test: cover WAL record and prune watermark decode paths

0383582

ongyimeng force-pushed the fix/tendermint-use-pebble-wal branch from e8c340a to 0383582 Compare June 25, 2026 04:39

ongyimeng temporarily deployed to Development June 25, 2026 04:54 — with GitHub Actions Inactive

ongyimeng marked this pull request as ready for review June 25, 2026 05:29

github-actions Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread db/memory/db.go

fix(db/memory): avoid panic when temp path creation fails

ef5ed8f

ongyimeng requested a review from rodrodros June 25, 2026 05:56

ongyimeng temporarily deployed to Development June 25, 2026 06:02 — with GitHub Actions Inactive

rodrodros reviewed Jun 29, 2026

View reviewed changes

ongyimeng added 2 commits June 30, 2026 14:31

refactor: use constant for estimatedRecordBytes

daddcff

refactor: pass WalRecordEnvelope by reference

1ad21e6

ongyimeng deployed to Development June 30, 2026 07:57 — with GitHub Actions Active

		payload = appendMessageHeader(payload, proposalMessage.MessageHeader)
		payload = appendInt64(payload, int64(proposalMessage.ValidRound))

Uh oh!

Conversation

ongyimeng commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

What changed

Performance

Testing

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Standalone consensus WAL

Blocker

Important

Nit

Uh oh!

Uh oh!

github-actions Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review: standalone consensus WAL (synchronize)

Verdict

Nit (non-blocking, optional)

Uh oh!

rodrodros left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rodrodros Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ongyimeng Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rodrodros left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Re-review: standalone consensus WAL (synchronize)

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

ongyimeng commented Jun 24, 2026 •

edited

Loading

codecov Bot commented Jun 24, 2026 •

edited

Loading

github-actions Bot commented Jun 25, 2026 •

edited

Loading

github-actions Bot commented Jun 25, 2026 •

edited

Loading

rodrodros Jun 30, 2026 •

edited

Loading

ongyimeng Jun 30, 2026 •

edited

Loading

github-actions Bot commented Jun 30, 2026 •

edited

Loading