
Commit a9c1cdc

dr: adds shadowing docs (#1381)

Committed by paulohtb6, micheleRP, michael-redpanda, Feediver1, trevpanda

Co-authored-by: Michele Cyran <michele@redpanda.com>
Co-authored-by: Mike Boquard <michael@redpanda.com>
Co-authored-by: Joyce Fee <102751339+Feediver1@users.noreply.github.com>
Co-authored-by: Trevor Blackford <trevor.blackford@redpanda.com>

1 parent cd4541f

25 files changed: 1198 additions & 38 deletions

antora.yml (5 additions, 4 deletions)

@@ -1,14 +1,15 @@
 name: ROOT
 title: Self-Managed
 version: 25.3
+display_version: '25.3 Beta'
 start_page: home:index.adoc
 prerelease: true
 nav:
 - modules/ROOT/nav.adoc
 asciidoc:
   attributes:
     # Date of release in the format YYYY-MM-DD
-    page-release-date: 2025-07-31
+    page-release-date: 2025-10-31
     # Only used in the main branch (latest version)
     page-header-data:
       order: 2
@@ -18,16 +19,16 @@ asciidoc:
     # Fallback versions
     # We try to fetch the latest versions from GitHub at build time
     # --
-    full-version: 25.2.1
+    full-version: 25.3.1-rc2
     latest-redpanda-tag: 'v25.2.1'
     latest-console-tag: ''
     latest-release-commit: ''
     latest-operator-version: ''
     operator-beta-tag: ''
     helm-beta-tag: ''
     latest-redpanda-helm-chart-version: ''
-    redpanda-beta-version: '25.3.1-rc1'
-    redpanda-beta-tag: 'v25.3.1-rc1'
+    redpanda-beta-version: '25.3.1-rc2'
+    redpanda-beta-tag: 'v25.3.1-rc2'
     console-beta-version: ''
     console-beta-tag: ''
     # --

local-antora-playbook.yml (1 addition, 1 deletion)

@@ -1,6 +1,6 @@
 site:
   title: Redpanda Docs
-  start_page: 25.2@ROOT:get-started:intro-to-events.adoc
+  start_page: 25.3@ROOT:get-started:intro-to-events.adoc
   url: http://localhost:5002
   robots: disallow
 keys:

modules/ROOT/nav.adoc (16 additions, 9 deletions)

@@ -84,7 +84,6 @@
 ***** xref:deploy:redpanda/manual/production/production-deployment-automation.adoc[]
 ***** xref:deploy:redpanda/manual/production/production-deployment.adoc[]
 ***** xref:deploy:redpanda/manual/production/production-readiness.adoc[]
-**** xref:deploy:redpanda/manual/high-availability.adoc[High Availability]
 **** xref:deploy:redpanda/manual/sizing-use-cases.adoc[Sizing Use Cases]
 **** xref:deploy:redpanda/manual/sizing.adoc[Sizing Guidelines]
 **** xref:deploy:redpanda/manual/linux-system-tuning.adoc[System Tuning]
@@ -177,9 +176,6 @@
 *** xref:manage:tiered-storage.adoc[]
 *** xref:manage:fast-commission-decommission.adoc[]
 *** xref:manage:mountable-topics.adoc[]
-*** xref:manage:remote-read-replicas.adoc[Remote Read Replicas]
-*** xref:manage:topic-recovery.adoc[Topic Recovery]
-*** xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore]
 ** xref:manage:iceberg/index.adoc[Iceberg]
 *** xref:manage:iceberg/about-iceberg-topics.adoc[About Iceberg Topics]
 *** xref:manage:iceberg/specify-iceberg-schema.adoc[Specify Iceberg Schema]
@@ -197,6 +193,21 @@
 *** xref:manage:schema-reg/schema-reg-authorization.adoc[Schema Registry Authorization]
 *** xref:manage:schema-reg/schema-id-validation.adoc[]
 *** xref:console:ui/schema-reg.adoc[Manage in Redpanda Console]
+** xref:deploy:redpanda/manual/high-availability.adoc[High Availability]
+** xref:deploy:redpanda/manual/disaster-recovery/index.adoc[Disaster Recovery]
+*** xref:deploy:redpanda/manual/disaster-recovery/shadowing/index.adoc[Shadowing]
+**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/overview.adoc[Overview]
+**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/setup.adoc[Configure Shadowing]
+**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/monitor.adoc[Monitor Shadowing]
+**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/failover.adoc[Configure Failover]
+**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/failover-runbook.adoc[Failover Runbook]
+*** xref:deploy:redpanda/manual/disaster-recovery/whole-cluster-restore.adoc[Whole Cluster Restore]
+*** xref:deploy:redpanda/manual/disaster-recovery/topic-recovery.adoc[Topic Recovery]
+** xref:deploy:redpanda/manual/remote-read-replicas.adoc[Remote Read Replicas]
+** xref:manage:recovery-mode.adoc[Recovery Mode]
+** xref:manage:rack-awareness.adoc[Rack Awareness]
+** xref:manage:raft-group-reconfiguration.adoc[Raft Group Reconfiguration]
+** xref:manage:io-optimization.adoc[]
 ** xref:manage:console/index.adoc[Redpanda Console]
 *** xref:console:config/configure-console.adoc[Configure Redpanda Console]
 *** xref:console:config/enterprise-license.adoc[Add an Enterprise License]
@@ -210,12 +221,8 @@
 *** xref:console:config/topic-documentation.adoc[Topic Documentation]
 *** xref:console:config/analytics.adoc[Telemetry]
 *** xref:console:config/kafka-connect.adoc[Kafka Connect]
-** xref:manage:recovery-mode.adoc[Recovery Mode]
-** xref:manage:rack-awareness.adoc[Rack Awareness]
-** xref:manage:monitoring.adoc[]
-** xref:manage:io-optimization.adoc[]
-** xref:manage:raft-group-reconfiguration.adoc[Raft Group Reconfiguration]
 ** xref:manage:use-admin-api.adoc[Use the Admin API]
+** xref:manage:monitoring.adoc[]
 * xref:upgrade:index.adoc[Upgrade]
 ** xref:upgrade:rolling-upgrade.adoc[Upgrade Redpanda in Linux]
 ** xref:upgrade:k-rolling-upgrade.adoc[Upgrade Redpanda in Kubernetes]

modules/deploy/pages/console/linux/deploy.adoc (1 addition, 1 deletion)

@@ -7,7 +7,7 @@ This page shows you how to deploy Redpanda Console on Linux using Docker or the

 == Prerequisites

-* You must have a running Redpanda or Kafka cluster available to connect to. Redpanda Console requires a cluster to function. For instructions on deploying a Redpanda cluster, see xref:deploy:redpanda/manual/index.adoc[].
+* You must have a running Redpanda or Kafka cluster available to connect to. Redpanda Console requires a cluster to function. For instructions on deploying a Redpanda cluster, see xref:deploy:redpanda/manual/production/index.adoc[].
 * Review the xref:deploy:console/linux/requirements.adoc[system requirements for Redpanda Console on Linux].

 == Deploy with Docker
New file (5 additions):

= Disaster Recovery
:description: Learn about Shadowing with cross-region replication for disaster recovery.
:env-linux: true
:page-layout: index
:page-categories: Management, High Availability, Disaster Recovery
New file (274 additions):

= Failover Runbook
:description: Step-by-step emergency guide for failing over Redpanda shadow links during disasters.
:page-aliases: deploy:redpanda/manual/resilience/shadowing-guide.adoc
:env-linux: true
:page-categories: Management, High Availability, Disaster Recovery, Emergency Response

include::shared:partial$enterprise-license.adoc[]

This guide provides step-by-step procedures for emergency failover when your primary Redpanda cluster becomes unavailable. Follow these procedures only during active disasters when immediate failover is required.

// TODO: All command output examples in this guide need verification by running actual commands in test environment

[IMPORTANT]
====
This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:./failover.adoc[]. Ensure you have completed the disaster readiness checklist in xref:./overview.adoc#disaster-readiness-checklist[] before an emergency occurs.
====

== Emergency failover procedure

Follow these steps during an active disaster:

1. <<assess-situation,Assess the situation>>
2. <<verify-shadow-status,Verify shadow cluster status>>
3. <<document-state,Document current state>>
4. <<initiate-failover,Initiate failover>>
5. <<monitor-progress,Monitor failover progress>>
6. <<update-applications,Update application configuration>>
7. <<verify-functionality,Verify application functionality>>
8. <<cleanup-stabilize,Clean up and stabilize>>

[[assess-situation]]
=== Assess the situation

Confirm that failover is necessary:

[,bash]
----
# Check if the primary cluster is responding
rpk cluster info --brokers prod-cluster-1.example.com:9092,prod-cluster-2.example.com:9092

# If the primary cluster is down, check shadow cluster health
rpk cluster info --brokers shadow-cluster-1.example.com:9092,shadow-cluster-2.example.com:9092
----

**Decision point**: If the primary cluster is responsive, consider whether failover is actually needed. Partial outages may not require full disaster recovery.

**Examples that require full failover:**

* Primary cluster is completely unreachable (network partition, regional outage)
* Multiple broker failures preventing writes to critical topics
* Data center failure affecting the majority of brokers
* Persistent authentication or authorization failures across the cluster

**Examples that may NOT require failover:**

* Single broker failure with sufficient replicas remaining
* Temporary network connectivity issues affecting some clients
* High latency or performance degradation (but cluster still functional)
* Non-critical topic or partition unavailability
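When deciding whether the primary is really down, it helps to distinguish a transient blip from a sustained outage. A small retry loop makes that check repeatable; this is a minimal sketch in which the probe command is a stand-in (in practice, substitute the `rpk cluster info` call shown above):

[,bash]
----
# Retry a health probe a few times before declaring the cluster unreachable.
# PROBE is a stand-in command; replace it with your real health check,
# for example: rpk cluster info --brokers <broker-list>
PROBE="false"   # `false` always fails, to demonstrate the unreachable path
RETRIES=3
HEALTHY=no

for i in $(seq 1 "$RETRIES"); do
  if $PROBE >/dev/null 2>&1; then
    HEALTHY=yes
    break
  fi
  sleep 1
done

echo "healthy=$HEALTHY after up to $RETRIES probes"
----

If the probe never succeeds across several spaced attempts, treat the primary as unreachable and continue with the runbook.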
[[verify-shadow-status]]
=== Verify shadow cluster status

Check the health of your shadow links:

[,bash]
----
# List all shadow links
rpk shadow list

# Check the configuration of your shadow link
rpk shadow describe <shadow-link-name>

# Check the status of your disaster recovery link
rpk shadow status <shadow-link-name>
----

Verify that the following conditions exist before proceeding with failover:

* Shadow link state should be `ACTIVE`.
* Topics should be in `ACTIVE` state (not `FAULTED`).
* Replication lag should be reasonable for your RPO requirements.

**Understanding replication lag:**

Use `rpk shadow status <shadow-link-name>` to check lag, which shows the message count difference between source and shadow partitions:

* **Acceptable lag examples**: 0-1000 messages for low-throughput topics, 0-10000 messages for high-throughput topics
* **Concerning lag examples**: Growing lag over 50,000 messages, or lag that continuously increases without recovering
* **Critical lag examples**: Lag exceeding your data loss tolerance (for example, if you can only afford to lose 1 minute of data, lag should represent less than 1 minute of typical message volume)
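One way to make the "critical lag" judgment concrete is to convert your data loss tolerance into a maximum acceptable message lag. This sketch uses hypothetical numbers; the tolerance, throughput, and observed lag are assumptions to substitute with your own RPO and measured rates:

[,bash]
----
# Hypothetical values: adjust to your own RPO and measured throughput
TOLERANCE_SECONDS=60    # you can afford to lose at most 60s of data
MSGS_PER_SECOND=500     # typical per-partition produce rate
OBSERVED_LAG=12000      # Lag column from `rpk shadow status`

# Maximum lag that still fits inside the tolerance window
MAX_ACCEPTABLE_LAG=$((TOLERANCE_SECONDS * MSGS_PER_SECOND))

if [ "$OBSERVED_LAG" -le "$MAX_ACCEPTABLE_LAG" ]; then
  echo "lag OK ($OBSERVED_LAG <= $MAX_ACCEPTABLE_LAG)"
else
  echo "lag exceeds tolerance ($OBSERVED_LAG > $MAX_ACCEPTABLE_LAG)"
fi
----

With these example numbers, 12,000 messages of lag is within the 30,000-message budget implied by a 60-second tolerance at 500 messages per second.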
[[document-state]]
=== Document current state

Record the current lag and status before proceeding:

[,bash]
----
# Capture current status for post-mortem analysis
rpk shadow status <shadow-link-name> > failover-status-$(date +%Y%m%d-%H%M%S).log
----

// TODO: Verify this output format by running actual rpk shadow status command
Example output showing healthy replication before failover:

----
Shadow Link: <shadow-link-name>

Overview:
  NAME   <shadow-link-name>
  UID    <uid>
  STATE  ACTIVE

Tasks:
  Name         Broker_ID  State   Reason
  <task-name>  1          ACTIVE
  <task-name>  2          ACTIVE

Topics:
  Name: <topic-name>, State: ACTIVE

  Partition  SRC_LSO  SRC_HWM  DST_HWM  Lag
  0          1234     1468     1456     12
  1          2345     2579     2568     11
----

IMPORTANT: Note the replication lag to estimate potential data loss during failover.
[[initiate-failover]]
=== Initiate failover

A complete cluster failover is appropriate if the source cluster is no longer reachable:

[,bash]
----
# Fail over all topics in the shadow link
rpk shadow failover <shadow-link-name> --all
----

For selective topic failover (when only specific services are affected):

[,bash]
----
# Fail over individual topics (repeat for each affected topic)
rpk shadow failover <shadow-link-name> --topic <topic-name>
----
[[monitor-progress]]
=== Monitor failover progress

Track the failover process:

[,bash]
----
# Monitor status until all topics show FAILED_OVER
watch -n 5 "rpk shadow status <shadow-link-name>"

# Check detailed topic status and lag during emergency
rpk shadow status <shadow-link-name> --print-topic
----

// TODO: Verify this output format by running actual rpk shadow status command during failover
Example output during successful failover:

----
Shadow Link: <shadow-link-name>

Overview:
  NAME   <shadow-link-name>
  UID    <uid>
  STATE  ACTIVE

Tasks:
  Name         Broker_ID  State   Reason
  <task-name>  1          ACTIVE
  <task-name>  2          ACTIVE

Topics:
  Name: <topic-name>, State: FAILED_OVER
  Name: <topic-name>, State: FAILED_OVER
  Name: <topic-name>, State: FAILING_OVER
----

**Wait for**: All critical topics to reach `FAILED_OVER` state before proceeding.
[[update-applications]]
=== Update application configuration

Redirect your applications to the shadow cluster:

1. Update connection strings in your applications to point to the shadow cluster brokers.
2. If you use DNS-based service discovery, update DNS records accordingly.
3. Restart applications to pick up the new connection settings.
4. Verify connectivity from application hosts to the shadow cluster.
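As one illustration, if your applications read their bootstrap servers from an environment variable, the redirect can be a single configuration change. The variable name and hostnames below are hypothetical, not part of this guide's setup:

[,bash]
----
# Hypothetical: applications read their bootstrap servers from this variable
export KAFKA_BOOTSTRAP_SERVERS="shadow-cluster-1.example.com:9092,shadow-cluster-2.example.com:9092"

# Confirm the new value before restarting the applications
echo "$KAFKA_BOOTSTRAP_SERVERS"
----

If the broker list lives in a config-management system or a Kubernetes ConfigMap instead, make the equivalent change there and roll the applications so they reconnect.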
[[verify-functionality]]
=== Verify application functionality

Test critical application workflows:

[,bash]
----
# Verify applications can produce messages
rpk topic produce <topic-name> --brokers <shadow-cluster-address>:9092

# Verify applications can consume messages
rpk topic consume <topic-name> --brokers <shadow-cluster-address>:9092 --num 1
----

Test message production and consumption, consumer group functionality, and critical business workflows to ensure everything is working properly.
[[cleanup-stabilize]]
=== Clean up and stabilize

After all applications are running normally:

[,bash]
----
# Optional: Delete the shadow link (no longer needed)
rpk shadow delete <shadow-link-name>
----

Document the time of failover initiation and completion, applications affected and recovery times, data loss estimates based on replication lag, and issues encountered during failover.
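A lightweight way to capture that record while details are fresh is a templated incident log. This sketch is illustrative (the file name and field list are assumptions); it writes a skeleton covering the items listed above:

[,bash]
----
# Illustrative incident-record template; fill in the placeholders after failover
INCIDENT_LOG="failover-incident-$(date +%Y%m%d-%H%M%S).md"

cat > "$INCIDENT_LOG" <<'EOF'
# Failover incident record
- Failover initiated (UTC):
- Failover completed (UTC):
- Applications affected / recovery times:
- Estimated data loss (from replication lag):
- Issues encountered:
EOF

echo "created $INCIDENT_LOG"
----

Pairing this with the `failover-status-*.log` captured earlier gives the post-mortem both the timeline and the replication state at the moment of failover.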
== Troubleshoot common issues

=== Topics stuck in FAILING_OVER state

**Problem**: Topics remain in `FAILING_OVER` state for extended periods.

**Solution**: Check shadow cluster logs for specific error messages and ensure sufficient cluster resources (CPU, memory, disk space) are available on the shadow cluster. Verify network connectivity between shadow cluster nodes, and confirm that all shadow topic partitions have elected leaders and that the controller partition is properly replicated with an active leader.

If topics remain stuck after addressing these cluster health issues and you need immediate failover, you can force delete the shadow link to fail over all topics:

[,bash]
----
# Force delete the shadow link to fail over all topics
rpk shadow delete <shadow-link-name> --force
----

[WARNING]
====
Force deleting a shadow link immediately fails over all topics in the link. This action is irreversible and should only be used when topics are stuck and you need immediate access to all replicated data.
====

=== Topics in FAULTED state

**Problem**: Topics show `FAULTED` state and are not replicating.

**Solution**: Check for authentication issues, network connectivity problems, or source cluster unavailability. Verify that the shadow link service account still has the required permissions on the source cluster. Review shadow cluster logs for specific error messages about the faulted topics.

=== Application connection failures

**Problem**: Applications cannot connect to the shadow cluster after failover.

**Solution**: Verify shadow cluster broker endpoints are correct and check security group and firewall rules. Confirm authentication credentials are valid for the shadow cluster and test network connectivity from application hosts.

=== Consumer group offset issues

**Problem**: Consumers start from the beginning or from the wrong positions.

**Solution**: Verify consumer group offsets were replicated (check your filters) and use `rpk group describe <group-name>` to check offset positions. If necessary, manually reset offsets to appropriate positions. See link:https://support.redpanda.com/hc/en-us/articles/23499121317399-How-to-manage-consumer-group-offsets-in-Redpanda[How to manage consumer group offsets in Redpanda^] for detailed reset procedures.

== Next steps

After successful failover, focus on recovery planning and process improvement. Begin by assessing the source cluster failure and determining whether to restore the original cluster or permanently promote the shadow cluster as your new primary.

**Immediate recovery planning:**

1. **Assess source cluster**: Determine the root cause of the outage
2. **Plan recovery**: Decide whether to restore the source cluster or promote the shadow cluster permanently
3. **Data synchronization**: Plan how to synchronize any data produced during failover
4. **Fail forward**: Create a new shadow link with the failed-over shadow cluster as source to maintain a DR cluster

**Process improvement:**

1. **Document the incident**: Record timeline, impact, and lessons learned
2. **Update runbooks**: Improve procedures based on what you learned
3. **Test regularly**: Schedule regular disaster recovery drills
4. **Review monitoring**: Ensure monitoring caught the issue appropriately
