diff --git a/docs/launch-arbitrum-chain/operate/batch-poster-troubleshooting.mdx b/docs/launch-arbitrum-chain/operate/batch-poster-troubleshooting.mdx
index aa1fb20a93..6fd266c645 100644
--- a/docs/launch-arbitrum-chain/operate/batch-poster-troubleshooting.mdx
+++ b/docs/launch-arbitrum-chain/operate/batch-poster-troubleshooting.mdx
@@ -63,6 +63,32 @@ The exact mempool weight limits depend on the parent chain node’s configuratio
+#### `ErrExceedsMaxMempoolSize` → “error posting batch”
+
+- [DEBUG→WARN→ERROR over 5 min]—The next batch's nonce would exceed `unconfirmed nonce + max-mempool-transactions`, so it's held back.
+- [CAUSE]—L1 is confirming the poster's transactions slowly, and the in-flight window (default: 18) is filled.
+- [MEANING]—Normal back-pressure short-term; sustained means posts are stuck.
+- [ACTION]—If persistent, check L1 confirmation/fees, consider raising `max-mempool-transactions`.
+
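The window check described above can be sketched as follows. This is a minimal illustration, not Nitro's code: the function name `exceedsMempoolWindow` is hypothetical, and the `>=` boundary is an assumption, since the text only says the nonce "would exceed" the sum.

```go
package main

import "fmt"

// exceedsMempoolWindow reports whether a transaction with nextNonce would fall
// outside the in-flight window that starts at unconfirmedNonce.
// Assumption: the boundary comparison is ">="; the real check may differ.
func exceedsMempoolWindow(nextNonce, unconfirmedNonce, maxMempoolTxs uint64) bool {
	return nextNonce >= unconfirmedNonce+maxMempoolTxs
}

func main() {
	const window = 18 // default in-flight window quoted above
	unconfirmed := uint64(100)
	for _, next := range []uint64{110, 117, 118} {
		fmt.Printf("next=%d held=%v\n", next, exceedsMempoolWindow(next, unconfirmed, window))
	}
}
```

Raising `max-mempool-transactions` widens the window, but it does not help if L1 confirmations are the bottleneck.
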
+#### `lack of L1 balance prevents posting transaction with desired fee cap`
+
+- [WARN]—The poster's wallet can't cover the target max cost, so the bid is capped at the available balance.
+- [CAUSE]—Low parent-chain balance and/or high L1 fees.
+- [ACTION]—Fund the L1 wallet (`send-l1`); watch `arb/batchposter/wallet/eth`.
+
+#### `a large batch posting backlog exists`
+
+- [INFO/WARN/ERROR]—Unposted messages are piling up; `INFO` if the poster recently hit L1 bounds, `WARN` at backlog > 10, `ERROR` at > 30.
+- [CAUSE]—Posts not keeping up (fees, mempool cap, L1 congestion, or the poster is behind).
+- [ACTION]—Investigate why posting is throttled; `ERROR` is an alerting threshold.
+
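For alerting rules, the thresholds above can be mirrored in a small classifier. This is a sketch under stated assumptions: `backlogLogLevel` is a hypothetical helper, and the precedence (the recently-hit-L1-bounds case downgrading to `INFO` before the `ERROR` threshold is considered) is assumed rather than confirmed by the text.

```go
package main

import "fmt"

// backlogLogLevel maps a backlog size to the log level described above.
// Assumption: nothing is logged at or below a backlog of 10, and a recent
// L1-bounds hit downgrades the message to INFO regardless of backlog size.
func backlogLogLevel(backlog uint64, recentlyHitL1Bounds bool) string {
	switch {
	case backlog <= 10:
		return "" // below the WARN threshold
	case recentlyHitL1Bounds:
		return "INFO"
	case backlog > 30:
		return "ERROR"
	default:
		return "WARN"
	}
}

func main() {
	fmt.Println(backlogLogLevel(11, false))
	fmt.Println(backlogLogLevel(31, false))
	fmt.Println(backlogLogLevel(31, true))
}
```
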
+#### `error fetching batch poster wallet balance` / `...gas refunder balance`
+
+- [WARN]—Balance gauge update failed.
+- [CAUSE]—L1 RPC hiccup.
+- [MEANING]—Monitoring gap only, not a posting failure.
+- [ACTION]—Ignore if transient.
+
## Batch posting backlog diagnosis
A batch posting backlog occurs when the sequencer continues ordering transactions, but the batch poster can’t submit them to the parent chain at the same rate. This results in an accumulation of unposted messages.
@@ -122,6 +148,179 @@ The exact phrasing and context of retry errors can vary across Nitro versions. T
3. **Check parent chain connectivity**: Ensure the batch poster can reliably reach the parent chain RPC. Intermittent connectivity causes retry failures.
4. **Inspect RBF fee escalation**: If the error mentions “replacement transaction underpriced”, the fee increase between attempts may be insufficient. The parent chain typically requires at least a 10% fee increase for RBF. Review the `blob-tx-replacement-times` schedule—shorter intervals may not allow enough price movement between attempts.
+## Reverts and halting
+
+#### `Large gap between last seen and current block number, skipping check for reverts`
+
+- [WARN]—`pollForReverts` fell > 100 blocks behind the chain, so it skipped scanning the intervening blocks for reverted poster transactions and fast-forwarded.
+- [CAUSE]—Node lag, a long pause, or a header-subscription stall.
+- [MEANING]—Reverts in that skipped window won't be detected.
+- [ACTION]—Usually transient; if frequent, investigate node sync/L1 RPC health.
+
+#### `Transaction from batch poster reverted`
+
+- [WARN→ERROR]—A batch posting transaction got a failed receipt on L1.
+- [ERROR (not warn)]—When using persistent storage, it sets `batchReverted` and **halts** all further posting.
+- [CAUSE]—Contract rejected the batch (e.g., message-count mismatch, bad sequence number, gas/blob issue).
+- [ACTION]—Investigate `txErr`; restart clears the halt flag; a count mismatch after force-inclusion may need `allow-posting-first-batch-when-sequencer-message-count-mismatch`.
+
+#### `Error checking batch reverts`
+
+- [DEBUG/WARN]—Revert check failed.
+- [DEBUG]—If the error contains "not found" (a benign parent-node inconsistency where one node served a header that another lacks); otherwise `WARN`.
+- [ACTION]—Ignore the DEBUG case; investigate persistent WARNs.
+
+## Nonce/sync
+
+#### `failed to update nonce with queue empty; falling back to using a recent block`
+
+- [WARN]—Couldn't get the finalized nonce, so it used a recent block instead. Safe because the queue is empty.
+- [CAUSE]—Finality data unavailable/RPC issue.
+- [ACTION]—Benign one-off; investigate if constant.
+
+#### `Failed to get current nonce`
+
+- [WARN]—Nonce refresh failed, but a previous nonce exists, so it's non-fatal.
+- [CAUSE]—L1 RPC.
+- [ACTION]—Ignore if transient.
+
+#### `Failed to get latest nonce`
+
+- [WARN]—Couldn't fetch the unconfirmed nonce this loop; the iteration backs off 10s (`minWait`).
+- [ACTION]—Transient RPC.
+
+#### `failed to update tx poster balance` / `failed to update tx poster nonce`
+
+- [WARN]—Periodic state refresh in the data-poster loop failed; on balance failure, the loop backs off 10s.
+- [ACTION]—Transient RPC.
+
+#### `DataPoster failed to send transaction`
+
+- [WARN]—The RPC `SendTransaction` call errored.
+- [CAUSE]—Mempool rejection, underpriced, RPC issue.
+- [MEANING]—The transaction stays queued and is retried.
+- [ACTION]—Watch for repetition.
+
+#### `maybeLogError` family—`failed to replace-by-fee transaction` / `failed to re-send transaction`
+
+- [DEBUG/INFO → WARN/ERROR after 20 consecutive]—Per-nonce send/RBF errors that escalate if they persist. `storage.ErrStorageRace` starts `DEBUG`; `ErrFutureReplacePending`/`ErrNonceTooHigh` start `INFO`; anything else is immediate `ERROR`.
+- [ACTION]—The escalated `WARN/ERROR` is the real signal—investigate then.
+
+## Fee/pricing
+
+#### `can't meet data poster fee cap obligations with current target max cost` / `can't meet current parent chain fees with current target max cost`
+
+- [INFO]—The computed fee cap can't cover the current L1 base fee/required cost.
+- [CAUSE]—L1 fees spiked above the poster's escalation target, or the bid is balance-capped.
+- [ACTION]—If posts stall, tune the fee formula (`target-price-gwei`, `urgency-gwei`, `max-fee-bid-multiple-bips`) or fund the wallet.
+
+#### `submitting transaction with GasFeeCap less than latest basefee` / `...BlobGasFeeCap less than latest blobfee`
+
+- [INFO]—Posting anyway with a cap below the current fee, expecting it to confirm as fees drop.
+- [ACTION]—Normal during fee volatility; concerning only if posts never confirm.
+
+#### `unable to fetch suggestedTipCap from l1 client to update arb/batchposter/suggestedtipcap metric`
+
+- [WARN]—Couldn't get a tip suggestion for the metric.
+- [MEANING]—Metric gap only.
+- [ACTION]—Ignore if transient.
+
+## L1 bounds/reorg
+
+#### `Disabling batch posting due to batch being within reorg resistance margin from layer 1 minimum block or timestamp bounds`
+
+- [ERROR]—The batch's first message is within `reorg-resistance-margin` of the L1 minimum bound, so posting is refused this round.
+- [CAUSE]—The margin guard (default 10m) protects against reorgs near the lower bound.
+- [ACTION]—Expected safety behavior; set margin to `0` only if you accept the reorg risk.
+
+#### `disabling L1 bound as batch posting message is close to the maximum delay`
+
+- [ERROR]—Overriding the L1 block bound because messages are near `max-delay`; the `l1-block-bound-bypass` margin kicked in to avoid stalling.
+- [ACTION]—Informational; means it chose to post over respecting the bounds.
+
+#### `not posting more messages because block number or timestamp exceed L1 bounds`
+
+- [INFO]—Stopped adding messages that fall outside the current L1 bound window.
+- [ACTION]—Normal bounding behavior.
+
+#### `error getting max time variation on L1 bound block; falling back on latest block` / `unknown L2 block bound config value; falling back on using finalized`
+
+- [WARN/ERROR]—Bound resolution issues; the former is logged at `WARN`, while the latter is logged at `ERROR` and means a bad `l1-block-bound` config value.
+- [ACTION]—Fix the config value for the `ERROR`.
+
+#### `DataPoster is avoiding creating a mempool nonce gap`
+
+- [INFO]—Held a transaction back rather than create a nonce gap that a reorg could expose (predecessor not yet reorg-resistant).
+- [ACTION]—Normal reorg-safety; the transaction is retried.
+
+## DA/fallback (AnyTrust/AltDA)
+
+#### `DA writer failed, operator action required`
+
+- [ERROR]—A non-fallback DA writer error; posting stops.
+- [ACTION]—**Investigate immediately**—this is an explicit operator-action alert.
+
+#### `DA writer explicitly requested fallback`
+
+- [WARN]—A DA backend requested a fallback; the poster moves to the next writer/EthDA.
+- [ACTION]—Check DA provider health; sustained fallback means degraded DA.
+
+Related DA fallback messages:
+
+- `DA writer reports message too large, will rebuild batch`
+- `DA writers exhausted, will rebuild for EthDA`
+- `EthDA fallback period complete, will retry AltDA`
+
+These pair with the `da_success`/`da_failure`/`da_last_success` metrics.
+
+## Lock/coordination and gas estimation
+
+#### `Error checking if we could acquire redis lock`
+
+- [WARN]—Redis lock check failed; it optimistically tries anyway.
+- [CAUSE]—Redis connectivity.
+- [ACTION]—Check Redis if high availability matters.
+
+#### `Not posting batches right now because another batch poster has the lock or this node is behind`
+
+- [DEBUG]—Normal on backup posters/when behind.
+- [ACTION]—Expected; not an error.
+
+#### `Failed to estimate gas for EIP-7623 check 1/2`
+
+- [WARN]—An EIP-7623 calldata-cost estimation probe failed.
+- [ACTION]—Usually transient; relevant only on EIP-7623 parent chains.
+
+#### `error estimating gas for batch`
+
+- [escalates via ephemeral handler]—Gas estimation failed (`ErrNormalGasEstimationFailed`); DEBUG→WARN→ERROR over 5 min.
+- [CAUSE]—L1 state lag, reverting estimation, inbox not caught up.
+- [ACTION]—Investigate if it reaches `ERROR`.
+
+## Config/startup
+
+#### `max-size is deprecated; use max-calldata-batch-size...`
+
+- [ERROR]—Deprecated flag in use.
+- [ACTION]—Migrate to `max-calldata-batch-size`.
+
+#### `Disabling data poster storage, as parent chain appears to be an Arbitrum chain without a mempool`
+
+- [INFO]—Auto-switched to no-op storage on an Arbitrum parent.
+- [ACTION]—Expected for L3s.
+
+#### `messagesPerbatch is somehow zero`
+
+- [WARN]—Defensive guard against a should-be-impossible state; defaults to `1`.
+- [ACTION]—Benign unless recurring.
+
+## The escalation rule to remember
+
+Many of these (mempool size, storage race, gas estimation, nonce-too-high, accumulator-not-found) are logged quietly at first and only escalate to `WARN/ERROR` if they persist past ~1–5 minutes, and `batchPosterFailureCounter` increments **only** at `ERROR` level. So a single `WARN` is usually noise; a _sustained_ one that reaches `ERROR` is the real signal. Pair these with the metrics: `estimated_batch_backlog`, `wallet/eth`, the `dataposter/nonce/*` gap, and `da_failure`.
+
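The escalation rule can be pictured as a timer that starts at the first consecutive failure and resets on success. The sketch below illustrates that idea only; the `escalator` type and the 60-second and 300-second thresholds are assumptions for the example, not Nitro's ephemeral error handler or its real durations.

```go
package main

import "fmt"

// escalator tracks how long an error has been occurring without interruption.
type escalator struct {
	failing    bool
	firstSeen  int64 // unix seconds of the first consecutive failure
	warnAfter  int64 // seconds of persistence before WARN (assumed value)
	errorAfter int64 // seconds of persistence before ERROR (assumed value)
}

// observe records one loop iteration and returns the level to log at,
// or "" when the iteration succeeded.
func (e *escalator) observe(now int64, failed bool) string {
	if !failed {
		e.failing = false // success resets the escalation clock
		return ""
	}
	if !e.failing {
		e.failing = true
		e.firstSeen = now
	}
	elapsed := now - e.firstSeen
	switch {
	case elapsed >= e.errorAfter:
		return "ERROR"
	case elapsed >= e.warnAfter:
		return "WARN"
	default:
		return "DEBUG"
	}
}

func main() {
	e := &escalator{warnAfter: 60, errorAfter: 300}
	for _, now := range []int64{0, 30, 90, 400} {
		fmt.Printf("t=%ds level=%s\n", now, e.observe(now, true))
	}
}
```

An alert keyed on the `ERROR` level therefore fires only for failures that have persisted, which matches the guidance to treat a single `WARN` as noise.
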
## Quick reference
| Symptom | Likely cause | Resolution |
diff --git a/docs/launch-arbitrum-chain/operate/bp-recovery.mdx b/docs/launch-arbitrum-chain/operate/bp-recovery.mdx
new file mode 100644
index 0000000000..c9ae5d5bce
--- /dev/null
+++ b/docs/launch-arbitrum-chain/operate/bp-recovery.mdx
@@ -0,0 +1,129 @@
+---
+title: 'Batch poster recovery'
+sidebar_label: 'Batch poster recovery'
+description: 'Learn how the batch poster recovers its state and which recovery mechanisms are available.'
+author: pete-vielhaber
+sme: Jason-W123
+user_story:
+content_type: reference
+---
+
+This section covers how the batch poster recovers state after a crash, restart, or bad start—including DB restore from a batch-poster checkpoint, storage backends, revert and halt recovery, Redis failover, nonce sync, and reorg handling.
+
+## The batch poster is mostly stateless—its checkpoint lives on L1
+
+The batch poster does **not** primarily checkpoint its position in its own database. On every posting attempt it reconstructs where to resume from authoritative sources: the **SequencerInbox contract on L1** plus the local **inbox tracker DB**. Its own data-poster DB holds only the **in-flight transaction queue** (transactions sent but not yet confirmed), for replace-by-fee (RBF). This is why recovery is robust: lose the data-poster DB and the poster still knows exactly where to resume.
+
+## Recovery state machine flow
+
+
+![Recovery state machine flow](/img/bp-recovery.png)
+
+
+### The position "checkpoint"
+
+`getBatchPosterPosition` is wired as the data poster's `MetadataRetriever`. So the "checkpoint" (`batchPosterPosition`: message count, delayed count, next sequence number) is RLP-encoded into each transaction's metadata, but the **source of truth** is L1 + the inbox tracker, not a snapshot.
+
+```go
+func (b *BatchPoster) getBatchPosterPosition(ctx context.Context, blockNum *big.Int) ([]byte, error) {
+ bigInboxBatchCount, err := b.seqInbox.BatchCount(...) // <-- read from L1 contract
+ ...
+ prevBatchMeta, err = b.batchMetaFetcher.GetBatchMetadata(inboxBatchCount - 1) // <-- inbox tracker DB
+ return rlp.EncodeToBytes(batchPosterPosition{
+ MessageCount: prevBatchMeta.MessageCount,
+ DelayedMessageCount: prevBatchMeta.DelayedMessageCount,
+ NextSeqNum: inboxBatchCount,
+ })
+}
+```
+
+### How resume actually works on restart
+
+1. `FetchLast()` the data-poster queue. **If non-empty** (transactions survived restart), resume from `lastQueueItem.Nonce()+1` with its stored metadata—continues exactly where it left off, RBF-ing pending transactions.
+2. **If empty**, call `updateNonce` to sync nonce from L1, then fetch position metadata via `getBatchPosterPosition`. If `updateNonce` fails and the queue isn't persistent (or can't wait for finality), it **falls back to a recent block** (the `"failed to update nonce with queue empty; falling back to using a recent block"` warning)—safe precisely because nothing is queued.
+
+In short: _queue intact → resume in-flight; queue gone → rebuild cleanly from L1_.
+
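The two restart paths can be condensed into one decision. The following is a simplified model, not the data poster's implementation: `queuedTx` and `resumePoint` are hypothetical, metadata is a plain string instead of the RLP-encoded position, and the L1-derived values are passed in rather than fetched.

```go
package main

import "fmt"

// queuedTx is a hypothetical stand-in for an entry in the data-poster queue.
type queuedTx struct {
	Nonce uint64
	Meta  string // stands in for the RLP-encoded batchPosterPosition
}

// resumePoint picks the next nonce and position metadata after a restart.
// l1Nonce and l1Meta represent values freshly derived from L1 and the inbox
// tracker; source reports which path was taken.
func resumePoint(queue []queuedTx, l1Nonce uint64, l1Meta string) (nonce uint64, meta string, source string) {
	if len(queue) > 0 {
		last := queue[len(queue)-1]
		return last.Nonce + 1, last.Meta, "queue" // queue intact: resume in-flight
	}
	return l1Nonce, l1Meta, "l1" // queue gone: rebuild from L1
}

func main() {
	queue := []queuedTx{{Nonce: 7, Meta: "pos@7"}, {Nonce: 8, Meta: "pos@8"}}
	fmt.Println(resumePoint(queue, 5, "pos@l1"))
	fmt.Println(resumePoint(nil, 5, "pos@l1"))
}
```
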
+## Storage backends and what survives a restart
+
+| Backend | File | Persists? | Recovery behavior |
+| ---------------- | ---------------------------------- | --------------------------------------------- | ---------------------------------------------------------------------------------- |
+| **dbstorage** | `dataposter/dbstorage/storage.go` | Yes (consensus DB, `BatchPosterPrefix` table) | Queue rehydrated from keyed entries; `FetchContents`/`FetchLast` reload on startup |
+| **redisstorage** | `dataposter/redis/redisstorage.go` | Yes (shared) | Sorted-set keyed by nonce; **HMAC-signed** entries; enables failover |
+| **slice** | `dataposter/slice/slicestorage.go` | No (in-memory) | Lost on restart; used when parent chain is Arbitrum (no mempool) |
+| **noop** | `dataposter/noop/storage.go` | No | Stores nothing; post-and-forget |
+
+Notably, when the parent chain itself is an Arbitrum chain, the data poster **forces no-op storage**—there's no L1 mempool to RBF into, so there's nothing to persist or recover.
+
+## Recovery mechanism 1: `dangerous.clear-dbstorage`
+
+When the persisted queue gets into a bad state, set `--node.batch-poster.data-poster.dangerous.clear-dbstorage`. At construction, with DB storage active, it calls `PruneAll` before starting:
+
+```go
+func (s *Storage) PruneAll(ctx context.Context) error {
+ idx, err := s.lastItemIdx(ctx) // dbstorage/storage.go:94
+ ...
+ return s.Prune(ctx, until+1) // delete every entry through the last
+}
+```
+
+`Prune` iterates and batch-deletes all keys below the bound and rewrites the count. After clearing, the empty-queue path above rebuilds nonce and position from L1. **Use once, then unset**—it discards in-flight transaction tracking, so clearing while transactions are genuinely pending risks nonce conflicts or double-posting. It's a no-op unless `use-db-storage` is the active backend.
+
+## Recovery mechanism 2: revert → halt, and force-inclusion recovery
+
+**Halt on revert**
+`pollForReverts` watches L1; `checkReverts` finds a failed receipt from the poster's sender and returns `shouldHalt := !UsingNoOpStorage()`. On a confirmed revert it sets `b.batchReverted.Store(true)`, and `MaybePostSequencerBatch` then refuses to post: `"batch was reverted, not posting any more batches"`. This is deliberate: a revert means something is wrong, so the poster stops rather than burn funds. Recovery requires operator investigation and a restart (which clears the in-memory `batchReverted` flag).
+
+**Force-inclusion mismatch**
+`dangerous.allow-posting-first-batch-when-sequencer-message-count-mismatch` handles the case where the poster's DB message count drifts from the chain's `sequencerReportedSubMessageCount`. The scenario: the poster is down for more than 24 hours and someone force-includes a delayed message via the parent contract (which doesn't bump `sequencerReportedSubMessageCount`), so on restart the inbox reader's count diverges. The fix:
+
+```go
+prevMessageCount := batchPosition.MessageCount
+if b.config().Dangerous.AllowPostingFirstBatchWhenSequencerMessageCountMismatch && !b.postedFirstBatch {
+ ...
+ prevMessageCount = 0 // contract skips the prevMessageCount equality check when it's 0
+}
+```
+
+Setting `prevMessageCount = 0` tells the SequencerInbox to skip the equality check, so the first post goes through and re-aligns the on-chain count. It applies **only to the first batch after startup** (`!b.postedFirstBatch`); once posted, the mismatch resolves itself.
+
+## Recovery mechanism 3: Redis failover (high availability)
+
+Multiple posters coordinate via `redislock`, default `Enable: true`, `LockoutDuration: 1m`, `RefreshDuration: 10s`:
+
+1. Primary holds the lock: the loop checks `CouldAcquireLock` and logs `"Not posting batches right now because another batch poster has the lock or this node is behind"` on backups.
+2. If the primary crashes, the lock **expires after `LockoutDuration`**; a backup with `background-lock` acquires it.
+3. The backup recovers shared state from **Redis queue storage**, which is **HMAC-signed** so a tampered queue is rejected. Nonce continuity comes from `updateNonce` against L1.
+
+So failover state-sharing rides on persistent, signed Redis storage plus L1-derived nonce and position.
+
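To show why a signed queue lets a backup reject tampered entries, here is a generic HMAC-SHA256 sign-and-verify sketch using Go's standard library. The entry layout, key handling, and function names (`signEntry`, `verifyEntry`) are assumptions for illustration; they do not describe the actual Redis storage format.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// signEntry computes an HMAC-SHA256 tag over a serialized queue entry.
func signEntry(key, payload []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(payload)
	return mac.Sum(nil)
}

// verifyEntry recomputes the tag and compares it in constant time.
func verifyEntry(key, payload, sig []byte) bool {
	return hmac.Equal(signEntry(key, payload), sig)
}

func main() {
	key := []byte("shared-signing-key") // hypothetical key shared by all posters
	entry := []byte("nonce=42;tx=example")
	sig := signEntry(key, entry)

	fmt.Println("original accepted:", verifyEntry(key, entry, sig))
	fmt.Println("tampered accepted:", verifyEntry(key, []byte("nonce=43;tx=example"), sig))
}
```

Every poster that may take over the lock needs the same signing key, otherwise a backup would reject the primary's entries.
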
+## Recovery mechanism 4: nonce/sync on restart
+
+`updateNonce` queries the finalized (or latest) L1 nonce; when it advances past `s.Nonce` it logs `"Data poster transactions confirmed"` and **prunes confirmed txs** from the queue (`Prune(ctx, nonce-1)`). On a failed fetch with a prior nonce it's non-fatal (`"Failed to get current nonce"` warning). `wait-for-l1-finality` (default true) governs whether it tracks finalized vs latest — trading confirmation latency for reorg safety.
+
+## Reorg handling
+
+- **Parent-chain reorg, revert polling**: if the chain went backward (`nextRevertCheckBlock > blockNum`) it resets to re-check; the >100-block gap warning fast-forwards and skips.
+- **Mempool nonce-gap avoidance**: before sending a transaction of a different type than its predecessor (or whose predecessor isn't yet reorg-resistant), it reads the confirmed nonce `NonceRbfSoftConfs` blocks deep and, if the transaction's nonce exceeds that reorg-resistant count, leaves the transaction queued (`"DataPoster is avoiding creating a mempool nonce gap"`) rather than risk a gap that a reorg would expose.
+- **`l1-block-bound`** (`safe`/`finalized`/`latest`/`ignore`) and **`reorg-resistance-margin`** (default 10m) keep batches from referencing L1 blocks that could reorg out (see `posting-cadence-and-lifecycle.md` and `config-flags.md`).
+
+## Where state physically lives
+
+The data-poster DB is a table within the consensus DB: `rawdb.NewTable(consensusDB, storage.BatchPosterPrefix)`. The inbox tracker independently persists `SequencerBatchCountKey`/`DelayedMessageCountKey` (initialized to `0` if absent). A full DB restore therefore restores both the in-flight queue **and** the inbox-tracker counts—but even a wiped data-poster table self-heals from L1.
+
+## Operator recovery cheat-sheet
+
+- _Corrupt or stuck queue_ → restart with `data-poster.dangerous.clear-dbstorage` (once), then
+ remove it.
+- _Batch reverted or poster halted_ → investigate the revert, then restart to clear
+ `batchReverted`.
+- _Message-count mismatch after force-inclusion or long downtime_ → restart with
+ `dangerous.allow-posting-first-batch-when-sequencer-message-count-mismatch` (first batch
+ only).
+- _Primary died, HA configured_ → automatic failover after `lockout-duration`; backup
+ resumes from signed Redis state + L1 nonce.
+- _Lost data-poster DB entirely_ → no manual action needed; position rebuilds from L1 +
+ inbox tracker, queue starts empty.
diff --git a/sidebars.js b/sidebars.js
index 52c67e1afb..f1962b117b 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -481,6 +481,11 @@ const sidebars = {
id: 'launch-arbitrum-chain/operate/arbos-upgrade',
label: 'ArbOS upgrade',
},
+ {
+ type: 'doc',
+ id: 'launch-arbitrum-chain/operate/bp-recovery',
+ label: 'Batch poster recovery',
+ },
{
type: 'doc',
id: 'launch-arbitrum-chain/operate/batch-poster-troubleshooting',
diff --git a/static/img/bp-recovery.png b/static/img/bp-recovery.png
new file mode 100644
index 0000000000..e663afbc50
Binary files /dev/null and b/static/img/bp-recovery.png differ