docs/guides/websites/hosting/introduction-to-high-availability/index.md
76 additions & 33 deletions
@@ -39,6 +39,8 @@ Disaster recovery is a process that is employed in the event of a wider-ranging
39
39
40
40
A disaster recovery plan documents key information and procedures that should be adhered to in these scenarios. This can include lists of staff that are responsible for the plan, inventories of systems and software, activation of backup sites and systems, criteria that should be met during the recovery operation (including [RTO and RPO](#rtorpo)), and other considerations.
41
41
42
+
Our [Creating a Disaster Recovery Plan: A Definitive Guide](/docs/guides/disaster-recovery/) contains further guidance for creating a disaster recovery plan.
43
+
42
44
## High Availability Architecture
43
45
44
46
This section describes an example of a high availability architecture that features a WordPress website running in a single data center. There are redundant copies of each component in the architecture, and the health of each set of components is continually monitored. If any component fails, automatic failover is triggered and other healthy components are promoted.
@@ -57,11 +59,11 @@ This specific architecture is implemented in the [host a website with high avail
57
59
58
60
1. When a WordPress plugin is installed, or when an image or other asset is uploaded to WordPress, it is added to the document root. When this happens in this architecture, the application server actually adds these files to one (and only one) of the servers in the GlusterFS cluster. GlusterFS then replicates these changes across the GlusterFS cluster.
59
61
60
-
1. WordPress PHP files from the document root are executed by the application server. These PHP files make requests on a database to retrieve website data. These database requests are fulfilled by a cluster of database servers running Percona XtraDB. One database server within the cluster is the master, and requests are routed to this server.
62
+
1. WordPress PHP files from the document root are executed by the application server. These PHP files make requests on a database to retrieve website data. These database requests are fulfilled by a cluster of database servers running Percona XtraDB. One database server within the cluster is the primary, and requests are routed to this server.
61
63
62
64
1. The database servers use the Galera software to replicate data across the database cluster.
63
65
64
-
1. The Keepalived service runs on each database server and monitors for database failures. If the master database server fails, the Keepalived service reassigns its private IP address to one of the other databases in the cluster, and that database starts responding to requests from WordPress.
66
+
1. The Keepalived service runs on each database server and monitors for database failures. If the primary database server fails, the Keepalived service reassigns its private IP address to one of the other databases in the cluster, and that database starts responding to requests from WordPress.
65
67
66
68
### Systems and Components
67
69
@@ -145,7 +147,7 @@ Different kinds of redundancy can be considered:
145
147
146
148
### Monitoring and Failover
147
149
148
-
In a highly available setup, the system needs to be able to *monitor* itself for failure. This means that there are regular *health checks* to ensure that all components are working properly. *Failover* is the process by which a secondary component becomes primary when monitoring reveals that a primary component has failed.
150
+
In a highly available architecture, the system needs to be able to *monitor* itself for failure. This means that there are regular *health checks* to ensure that all components are working properly. *Failover* is the process by which a secondary component becomes primary when monitoring reveals that a primary component has failed.
149
151
150
152
There are different kinds of health checks that can be performed, including:
151
153
@@ -155,53 +157,90 @@ There are different kinds of health checks that can be performed, including:
155
157
156
158
Akamai offers multiple tools to assist with monitoring and failover, including:
157
159
158
-
-**[NodeBalancers]()** performs health checks on a set of backend application servers within a data center, and can route traffic around backend servers that experience downtime.
160
+
- **[NodeBalancers](https://techdocs.akamai.com/cloud-computing/docs/nodebalancer)** perform health checks on a set of backend application servers within a data center, and can route traffic around backend servers that experience downtime.
161
+
162
+
- **[Global Traffic Management (GTM)](https://techdocs.akamai.com/gtm/docs/welcome-to-global-traffic-management)** continuously monitors the health of application clusters running in multiple regions. If a cluster fails health checks, GTM updates DNS routes for users in real time and redirects traffic to healthy clusters.
163
+
164
+
- **[Linode Kubernetes Engine (LKE)](https://techdocs.akamai.com/cloud-computing/docs/linode-kubernetes-engine)**, Akamai's managed Kubernetes service: the Kubernetes control plane natively performs monitoring of Pods and other resources in your cluster. For [LKE Enterprise](https://techdocs.akamai.com/cloud-computing/docs/linode-kubernetes-engine#lke-enterprise), the control plane itself has built-in monitoring and failover that is managed by Akamai.
165
+
166
+
- **[IP Sharing and BGP-based failover](https://techdocs.akamai.com/cloud-computing/docs/configure-failover-on-a-compute-instance)**: features that support failover of a service between Linodes.
159
167
160
-
-**Global Traffic Management (GTM)** continuously monitors the health of application clusters running in multiple regions. If a cluster fails health checks, GTM updates DNS routes for users in real-time and redirects traffic to healthy clusters.
168
+
Open source software and tools can support monitoring and failover, including:
169
+
170
+
- [Keepalived](https://www.keepalived.org/): a software package that can run periodic health checks and run notification scripts that are triggered when the result of a health check changes. These notification scripts can then interact with features of your cloud platform (like [IP Sharing and BGP-based failover](https://techdocs.akamai.com/cloud-computing/docs/use-keepalived-health-checks-with-bgp-based-failover) on Akamai Cloud) to support failover of infrastructure. In the [high availability architecture](#high-availability-architecture) example in this guide, the database cluster runs Keepalived to monitor for failures of the primary database server and then promote a backup database to be the new primary.
171
+
172
+
- [HAProxy](/docs/guides/how-to-configure-haproxy-http-load-balancing-and-health-checks/): a dedicated reverse proxy software solution. HAProxy can perform health checks of backend servers and stop routing traffic to backends that experience failures.
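
The monitoring and failover loop described in this section can be sketched in a few lines of Python. This is a minimal illustration and not how NodeBalancers, GTM, or Keepalived are implemented. The `Node` class, the `check` callables, and the `max_failures` threshold are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    name: str
    check: Callable[[], bool]  # hypothetical health check; returns True when healthy
    failures: int = 0          # consecutive failed checks


def run_health_checks(nodes: list[Node]) -> None:
    """Run one round of checks and track consecutive failures per node."""
    for node in nodes:
        node.failures = 0 if node.check() else node.failures + 1


def choose_primary(nodes: list[Node], current: Node, max_failures: int = 3) -> Optional[Node]:
    """Keep the current primary until it fails max_failures checks in a row,
    then promote the first other node that passed its latest check."""
    if current.failures < max_failures:
        return current
    for node in nodes:
        if node is not current and node.failures == 0:
            return node
    return None  # no healthy node is available
```

Requiring several consecutive failures before failover is a common way to avoid promoting a backup because of a single dropped check. The right threshold depends on how often checks run and how much downtime is acceptable.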
161
173
162
174
### Load Balancing
163
175
164
-
The load balancing component of a high availability system acts as the first barrier to handle traffic from users to the application servers. Load balancing evenly distributes traffic among multiple backend servers, which reduces the chance that any given server fails from being overburdened.
176
+
Load balancing distributes traffic among multiple backend servers or compute regions, which reduces the chance that any given server fails from being overburdened. Different algorithms can be used by different kinds of load balancers to route traffic, including:
177
+
178
+
- **Round-Robin**: Traffic is routed evenly between clusters or servers.
179
+
180
+
- **Weighted Traffic**: A larger share of traffic is routed to preferred clusters or servers.
181
+
182
+
- **Geo-Location Routing**: With DNS load balancing tools like Akamai GTM, traffic can be routed to the nearest cluster for reduced latency.
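
The three routing strategies listed above can be illustrated with a short Python sketch. The function names and the example backend names are hypothetical, and real load balancers implement these strategies with considerably more machinery (connection tracking, health state, DNS resolution).

```python
import itertools
import random


def round_robin(backends):
    """Cycle through backends so each receives an equal share of requests."""
    return itertools.cycle(backends)


def weighted_choice(weights, rng=random):
    """Pick a backend at random, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def nearest(latency_ms):
    """Pick the backend with the lowest measured latency
    (a simple stand-in for geo-location routing)."""
    return min(latency_ms, key=latency_ms.get)
```

For example, `weighted_choice({"east": 9, "west": 1})` sends roughly nine of every ten requests to `east`, which is one way to favor a larger or closer cluster while keeping a second one warm.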
165
183
166
184
Akamai offers multiple tools to assist with load balancing, including:
167
185
168
-
- Use [**Node Balancers**](https://techdocs.akamai.com/cloud-computing/docs/getting-started-with-nodebalancers#putting-the-nodebalancer-in-charge) to distribute traffic evenly across clusters within a data center.
186
+
- **[NodeBalancers](https://techdocs.akamai.com/cloud-computing/docs/nodebalancer)**: A cloud load balancer service that distributes traffic between servers within a data center.
187
+
188
+
- **[Global Traffic Management (GTM)](https://techdocs.akamai.com/gtm/docs/welcome-to-global-traffic-management)**: GTM provides DNS load balancing, which allows traffic to be dynamically routed across multiple regions, including Akamai Cloud regions.
189
+
190
+
- **[Linode Kubernetes Engine (LKE)](https://techdocs.akamai.com/cloud-computing/docs/linode-kubernetes-engine)**, Akamai's managed Kubernetes service: Kubernetes clusters created with LKE have the [Linode Cloud Controller Manager (CCM)](https://github.com/linode/linode-cloud-controller-manager/) preinstalled, which provides an interface for your cluster to interact with Linode resources. In particular, any Kubernetes [LoadBalancer service](https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer) that you declare is created as a NodeBalancer.
191
+
192
+
Open source software and tools can support load balancing, including:
169
193
170
-
-[LKE](https://techdocs.akamai.com/cloud-computing/docs/linode-kubernetes-engine) and Kubernetes-based load balancing simplifies traffic management and ensures consistent performance.
194
+
- Web servers, like NGINX and Apache: these can be configured as [reverse proxies](/docs/guides/use-nginx-reverse-proxy/#what-is-a-reverse-proxy) for backend servers.
171
195
172
-
-[Akamai GTM](https://techdocs.akamai.com/gtm/docs/welcome-to-global-traffic-management) enables dynamic traffic routing based on real-time health checks and failover policies:
196
+
- [HAProxy](/docs/guides/how-to-configure-haproxy-http-load-balancing-and-health-checks/): a dedicated reverse proxy software solution.
173
197
174
-
-**Round-Robin** – Distributes traffic evenly between clusters.
175
-
-**Weighted Traffic** – Directs more traffic to preferred clusters.
176
-
-**Geo-Location Routing** – Routes traffic to the nearest cluster for reduced latency.
177
-
-**Failover Mode** – Automatically redirects traffic if a cluster becomes unhealthy.
198
+
- [Kubernetes](/docs/guides/beginners-guide-to-kubernetes-part-1-introduction/): [a range of load balancing functionality](https://kubernetes.io/docs/concepts/services-networking/) is offered by Kubernetes. [Services](https://kubernetes.io/docs/concepts/services-networking/service/) are an abstraction layer for a set of Pods that run your application code, and traffic is collectively routed across them. [LoadBalancer-type Services](https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer) are available.
178
199
179
200
### Replication
180
201
181
-
-[**Akamai Object Storage**](https://techdocs.akamai.com/cloud-computing/docs/object-storage): Enhance redundancy by [synchronizing bucket data across regions using rclone](https://www.linode.com/docs/guides/replicate-bucket-contents-with-rclone/) for robust data replication.
202
+
*Replication* is the process of copying data between redundant servers and systems. Data replication can be synchronous or asynchronous:
182
203
183
-
-[**Block Storage**](https://techdocs.akamai.com/cloud-computing/docs/block-storage): Use multi-attach or cross-AZ replication for persistent data.
204
+
- Synchronous replication prioritizes immediate data consistency across components.
205
+
- Asynchronous replication prioritizes performance of a primary component, and data is eventually copied to a secondary component.
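
The difference between the two modes can be shown with a toy in-memory model. This is a sketch for illustration only. The `Primary` and `Replica` classes are hypothetical and do not correspond to any specific database or storage product.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Primary:
    def __init__(self, replica, synchronous):
        self.data = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending = deque()  # writes not yet copied to the replica

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Synchronous: the write returns only after the replica has it.
            self.replica.apply(key, value)
        else:
            # Asynchronous: return immediately and copy the write later.
            self.pending.append((key, value))

    def flush(self):
        """Ship queued writes to the replica (asynchronous mode)."""
        while self.pending:
            self.replica.apply(*self.pending.popleft())
```

In asynchronous mode, anything still in `pending` when the primary fails is lost. This is the trade-off against the lower write latency on the primary.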
184
206
185
-
-**Database Replication**: Ensure automated backups and replication. Although known for its stability, MySQL is even more reliable if [source-replica replication is configured](https://www.linode.com/docs/guides/configure-source-replica-replication-in-mysql/)
207
+
Replication supports high availability strategies and disaster recovery strategies:
186
208
187
-
-**Networked filesystems**, like GlusterFS.
209
+
- In a high availability architecture, synchronous data replication across redundant components allows each component to serve user requests. These components can be load balanced and run in parallel, or they can be run in a primary-backup configuration with immediate failover to the backup component.
188
210
189
-
### Distributed Application Design
211
+
- For disaster recovery scenarios, data should be replicated to regions and systems that are separated (by geography and/or by system architecture) from primary systems. This copied data can be used in the recovery process.
190
212
191
-
* Use microservices and distributed architectures to minimize the impact of individual component failures.
192
-
* Design for **graceful degradation** so unaffected services remain available even if one component fails.
213
+
Multiple Akamai services provide data replication, or can be used to support data replication workflows:
214
+
215
+
- [Object Storage](https://techdocs.akamai.com/cloud-computing/docs/object-storage): Akamai's object storage has [an internal replication system](https://www.linode.com/products/object-storage/#accordion-7252094bf6-item-97b2f59293) to ensure that data is highly available.
216
+
217
+
Users can enhance redundancy of their object storage data by [synchronizing bucket data across regions using rclone](/docs/guides/replicate-bucket-contents-with-rclone/), which can support high availability, disaster recovery, and load balancing strategies.
218
+
219
+
Users can also [back up files from a Linode to Object Storage](/docs/guides/rclone-object-storage-file-sync/), which can play a role in backup and recovery.
220
+
221
+
- [Managed Databases](https://techdocs.akamai.com/cloud-computing/docs/aiven-database-clusters): All database clusters created with Akamai's Managed Databases receive daily backups. For 3-node clusters, built-in data replication, redundancy, and automatic failover are provided.
222
+
223
+
- [Block Storage](https://techdocs.akamai.com/cloud-computing/docs/block-storage): Users can choose to attach multiple Block Storage volumes to a Linode instance, and they can replicate data from one volume to another. If a Linode that a volume is attached to is destroyed, the volume persists, so it can be attached and used with another Linode.
224
+
225
+
- [Net Storage](https://techdocs.akamai.com/netstorage/docs/welcome-to-netstorage): Net Storage provides [controls for replication across geographic zones](https://techdocs.akamai.com/netstorage/docs/create-a-storage-group#geo-replication-settings).
226
+
227
+
Open source software that supports replication includes:
228
+
229
+
- Database replication tools: Some tools are built into the database system, like MySQL's [source-replica replication mechanism](/docs/guides/configure-source-replica-replication-in-mysql/). Other tools, like [Galera](https://galeracluster.com/), can be additionally installed to support replication.
230
+
231
+
- Networked filesystems, like [GlusterFS](https://www.gluster.org/): these are used to create distributed storage systems across multiple block storage devices, like a Linode's built-in storage disk, or a Block Storage volume.
232
+
233
+
- [Command-line data transfer utilities](/docs/guides/comparing-data-transfer-utilities/) like [rsync](/docs/guides/introduction-to-rsync/) and [rclone](https://www.linode.com/docs/guides/rclone-object-storage-file-sync/).
193
234
194
235
### RTO/RPO
195
236
196
-
Align your architecture with **Recovery Time Objective** (RTO) and **Recovery Point Objective** (RPO) requirements:
237
+
**Recovery Time Objective** (RTO) and **Recovery Point Objective** (RPO) are requirements that should be met in disaster recovery scenarios. RTO refers to the maximum time it should take for your organization to recover from an outage, and RPO refers to the maximum amount of data that may be lost when recovering from an outage. RTO and RPO are influenced by your architecture and by your disaster recovery plan. General categories for your RTO and RPO approach could include:
197
238
198
-
| Approach | RTO | RPO | Complexity | Cost | Use Case |
199
-
| ----- | ----- | ----- | ----- | ----- | ----- |
200
-
|**Backup & Restore**| Minutes to hours | Minutes to hours | Low | $ | Non-critical apps, dev/test environments |
201
-
|**Light/Warm Standby**| Tens of minutes | Seconds to minutes | Moderate | $$ | Faster recovery with minimal data loss |
202
-
|**Multi-Site Active-Active**| Near zero | Near zero | High | $$$$ | Mission-critical apps requiring real-time data sync |
239
+
-**Backup and restore**: placeholder
203
240
204
-
For mission-critical apps, use **Multi-Site Active-Active** with Akamai GTM for real-time failover.
|**Linodes Solutions**| Linode Backup Service (VM), Linode Object Storage for data backups | Linode Backup Service (VM) with scheduled automated backup, data replication at DB level | Warm standby VMs or standby LKE clusters. VM with cross-region data replication | Multi-region LKE clusters, Akamai GTM for traffic management |
217
256
257
+
### Placement Groups
218
258
219
-
### Anti-Affinity Groups
259
+
[*Placement groups*](https://techdocs.akamai.com/cloud-computing/docs/work-with-placement-groups) specify where your compute instances should be created within a data center. An *anti-affinity rule* is used to spread workloads across multiple devices within the same data center, reducing the risk of correlated failures. Akamai Cloud [supports placement groups](https://techdocs.akamai.com/cloud-computing/docs/work-with-placement-groups#create-a-placement-group).
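
As a rough illustration of what an anti-affinity rule enforces, the Python sketch below assigns each instance to a different host and refuses the placement when there are not enough hosts. The function name and the instance and host labels are hypothetical. This is not a description of how Akamai Cloud schedules instances internally.

```python
def place_with_anti_affinity(instances, hosts):
    """Assign each instance to a distinct host.

    Raises ValueError when there are fewer hosts than instances, because
    two instances would then have to share a host and fail together.
    """
    if len(instances) > len(hosts):
        raise ValueError("anti-affinity needs at least one host per instance")
    return dict(zip(instances, hosts))
```

With three database nodes and three hosts, each node lands on its own host, so the loss of one host removes at most one node from the cluster.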
220
260
221
-
Use anti-affinity rules to spread workloads across multiple devices within the same data center, reducing the risk of correlated failures. In Akamai Cloud regions, these rules can be expressed with [placement groups](https://techdocs.akamai.com/cloud-computing/docs/work-with-placement-groups).
261
+
### Distributed Application Design
262
+
263
+
* Use microservices and distributed architectures to minimize the impact of individual component failures.
264
+
* Design for **graceful degradation** so unaffected services remain available even if one component fails.
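
One simple form of graceful degradation is to wrap a call to a non-critical dependency so that a failure produces a reduced response instead of an error. The sketch below assumes a hypothetical `with_fallback` helper. The service names in the usage note are illustrative only.

```python
def with_fallback(primary, fallback):
    """Return primary(); if it raises, return fallback() instead of failing."""
    try:
        return primary()
    except Exception:
        return fallback()
```

For example, a product page might call `with_fallback(fetch_recommendations, lambda: [])` so that an outage in a hypothetical recommendations service results in an empty list of suggestions while the rest of the page still renders.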
222
265
223
266
### Live Migrations
224
267
225
-
Akamai Cloud Computing supports [**Linode live migrations**](https://techdocs.akamai.com/cloud-computing/docs/compute-migrations) to minimize downtime during maintenance.
268
+
Akamai Cloud supports [**Linode live migrations**](https://techdocs.akamai.com/cloud-computing/docs/compute-migrations) to minimize downtime during maintenance.
226
269
227
270
### Scaling
228
271
229
272
Scaling ensures performance and availability during increased demand:
230
273
231
-
***Horizontal Scaling**\- Add more instances of an application to handle load.
232
-
***Vertical Scaling**\- Increase resource limits per instance.
233
-
***Auto-Scaling**\- Configure LKE/Kubernetes to adjust resources based on load.
274
+
* **Horizontal Scaling**: Add more instances of an application to handle load.
275
+
* **Vertical Scaling**: Increase resource limits per instance.
276
+
* **Auto-Scaling**: Configure LKE/Kubernetes to adjust resources based on load.
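
To make the auto-scaling idea concrete, the sketch below applies the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count scaled by the ratio of the observed metric to the target metric, rounded up. The function name and the `min_replicas` and `max_replicas` defaults are assumptions for this example, and the sketch omits the tolerance and stabilization behavior that the real autoscaler applies.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count by the metric ratio, rounded up and clamped."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas averaging 90% CPU against a 60% target, the calculation asks for 6 replicas. If the average drops to 30%, it asks for 2.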