211 changes: 206 additions & 5 deletions lab/cloudChronicles_Lab001_Disaster_Recovery_Detective.ipynb
@@ -21,7 +21,7 @@
"source": [
"# Let's begin by printing your name to personalize the notebook\n",
"your_name = \"\"\n",
"print(f\"Welcome to the lab, {your_name}!\")"
"print(f\"Welcome to the lab, {Jaylen Frazier}!\")"
]
},
{
@@ -32,16 +32,217 @@
"## 🔍 STAR Method Lab Prompt\n",
"\n",
"**Situation:** \n",
"[Define your scenario here.]\n",
"[ Google Cloud announces a significant, widespread outage affecting multiple services in the us-central1 region.
Our Cloud Monitoring dashboards show critical services in us-central1 as unavailable (e.g., high error rates, timeouts, GCE instances not responding).
Pub/Sub alerts, configured via Cloud Monitoring log-based metrics or custom metrics (e.g., application health checks failing), have triggered notifications to the on-call SRE/Cloud Operations team.]\n",
"\n",
"**Task:** \n",
"[Define what the user is expected to solve.]\n",
"[Objective: Restore critical application services in the designated DR region (us-east1) with minimal data loss (low RPO) and within the defined Recovery Time Objective (RTO)..]\n",
"\n",
"**Action:** \n",
"[Step-by-step instructions using GCP tools.]\n",
"[]\n",Okay, as a Cloud Architect, here's a STAR-based Disaster Recovery (DR) plan for a regional outage in us-central1, leveraging Google Cloud tools.

Assumptions & Prerequisites (Crucial for DR Success):

DR Region Identified: us-east1 is our designated DR region.

Infrastructure as Code (IaC): Terraform or Cloud Deployment Manager scripts exist to deploy application infrastructure (Compute Engine, GKE, etc.) in us-east1.

Standby Tier: Application servers (VMs/GKE clusters) are either pre-provisioned and scaled down in us-east1 (warm standby) or deployed on demand from IaC (cold standby). For critical services, a warm standby is preferred.

Global Load Balancer: A Google Cloud Global HTTP(S) Load Balancer is used, with backend services configured for both us-central1 and us-east1.

DNS TTL: Critical DNS records have reasonably low Time-To-Live (TTL) values.

Regular DR Drills: The DR plan is tested periodically.

Monitoring & Alerting Baseline: Cloud Monitoring is configured with dashboards and uptime checks for key services.

STAR-Based Disaster Recovery Plan: us-central1 Outage

SITUATION:

Event: Google Cloud announces a significant, widespread outage affecting multiple services in the us-central1 region.

Detection:

Our Cloud Monitoring dashboards show critical services in us-central1 as unavailable (e.g., high error rates, timeouts, GCE instances not responding).

Pub/Sub alerts, configured via Cloud Monitoring log-based metrics or custom metrics (e.g., application health checks failing), have triggered notifications to the on-call SRE/Cloud Operations team.

External monitoring tools also confirm service unavailability.
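
As a minimal sketch of how on-call tooling might read one of these alerts, the Python below decodes a Pub/Sub push envelope and extracts the incident state. The `incident.policy_name` and `incident.state` field names and the sample payload are assumptions for illustration; verify them against the notification format your channel actually emits.

```python
import base64
import json


def decode_alert(envelope: dict) -> dict:
    """Decode a Pub/Sub push envelope that carries a monitoring alert."""
    # Pub/Sub push delivery places the base64-encoded body under message.data.
    raw = base64.b64decode(envelope["message"]["data"])
    incident = json.loads(raw).get("incident", {})
    state = incident.get("state", "unknown")
    return {
        "policy": incident.get("policy_name", "unknown"),
        "state": state,
        "is_open": state == "open",
    }


# Hypothetical payload, for illustration only.
sample = {"incident": {"policy_name": "us-central1 health check", "state": "open"}}
envelope = {
    "message": {"data": base64.b64encode(json.dumps(sample).encode()).decode()}
}
print(decode_alert(envelope))
```

An open incident from several policies at once is one signal that the problem is regional rather than application-specific.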

Impact: Our primary application services, including web servers, application logic, and the primary Cloud SQL database instance, hosted in us-central1, are inaccessible or severely degraded. Customers are experiencing service disruptions.

TASK:

Objective: Restore critical application services in the designated DR region (us-east1) with minimal data loss (low RPO) and within the defined Recovery Time Objective (RTO).

Key Goals:

Confirm the scope and nature of the outage.

Activate the DR plan.

Failover the database to the cross-region replica.

Ensure application instances in us-east1 are operational and scaled appropriately.

Redirect user traffic to us-east1.

Verify service restoration and communicate status.

ACTION (Failover Process):

1. Incident Assessment & DR Declaration (Time: T+0 to T+15 mins):

The on-call team convenes (virtual war room).

Verify the outage severity and confirm it's not a localized application issue using Cloud Monitoring, Google Cloud Status Dashboard, and our Pub/Sub-triggered alerts.

Based on the assessment and predefined triggers (e.g., outage duration exceeding X minutes, confirmation from Google), the Incident Commander declares a disaster and formally activates the DR plan.

Internal and external stakeholder communication protocols are initiated.
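
The predefined triggers in this step can be written down as an explicit rule so the declaration is not improvised under pressure. The sketch below is one possible encoding; the 15-minute threshold, the field names, and the choice to treat either trigger as sufficient are assumptions, not values from the plan.

```python
from dataclasses import dataclass


@dataclass
class OutageAssessment:
    minutes_unavailable: int   # how long services have been down
    provider_confirmed: bool   # Google Cloud Status Dashboard confirms the outage
    localized_app_issue: bool  # evidence points to our own deployment instead


def should_declare_disaster(a: OutageAssessment, max_minutes: int = 15) -> bool:
    """Return True when the assumed DR triggers are satisfied."""
    # Never fail over for a problem that a regional failover would not fix.
    if a.localized_app_issue:
        return False
    # Assumed policy: either trigger alone is enough to declare.
    return a.provider_confirmed or a.minutes_unavailable > max_minutes


print(should_declare_disaster(OutageAssessment(20, False, False)))
```

The Incident Commander still makes the call; a rule like this only documents the thresholds agreed in advance.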

2. Database Failover (Time: T+15 to T+45 mins):

Tool: Cloud SQL cross-region replica.

Action:

Access the Google Cloud Console or use gcloud CLI.

Navigate to the Cloud SQL read replica in us-east1 (the us-central1 primary may be unreachable during the outage).

Promote that cross-region read replica to become the new standalone primary instance. Promotion breaks replication from the old primary and cannot be undone.

Note: Cross-region replication is asynchronous, so data loss equals the replication lag at the time of the outage, typically seconds to a few minutes.

Record the IP address/connection name of the newly promoted Cloud SQL instance in us-east1.
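
For the gcloud CLI path, the promotion can be sketched as a small helper that builds the `gcloud sql instances promote-replica` command. The replica name and project ID below are hypothetical placeholders.

```python
import shlex
from typing import List


def promote_replica_command(replica_name: str, project: str) -> List[str]:
    """Build the gcloud command that promotes a Cloud SQL read replica."""
    return [
        "gcloud", "sql", "instances", "promote-replica",
        replica_name,
        f"--project={project}",
    ]


# Hypothetical names, for illustration only.
cmd = promote_replica_command("app-db-replica-east1", "my-project")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it. Building the command separately makes it easy to review in the war room before an irreversible step.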

3. Application Tier Activation & Configuration (Time: T+30 to T+75 mins):

Tools: Compute Engine (Instance Templates, MIGs), GKE, Secret Manager.

Action:

If using a warm standby (MIGs/GKE in us-east1 scaled to minimum):

Scale up the Managed Instance Groups (MIGs) or GKE node pools/deployments in us-east1 to handle production load.

If using a cold standby:

Execute IaC scripts (Terraform/Deployment Manager) to provision the application infrastructure in us-east1.

Update application configurations (e.g., via environment variables, ConfigMaps in GKE, or by fetching from Secret Manager) to point to the new Cloud SQL primary instance in us-east1. This might involve updating connection strings.
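
A rough sketch of the warm-standby path: one helper builds the `gcloud compute instance-groups managed resize` command for a regional MIG, and another reads the database endpoint from environment variables so that repointing the application is a configuration change only. The group name, target size, and the `DB_HOST`/`DB_NAME` variable names are assumptions.

```python
import os
from typing import Dict, List, Mapping


def resize_mig_command(group: str, region: str, size: int) -> List[str]:
    """Build the gcloud command that resizes a regional managed instance group."""
    if size < 1:
        raise ValueError("size must be at least 1 to serve production traffic")
    return [
        "gcloud", "compute", "instance-groups", "managed", "resize",
        group,
        f"--size={size}",
        f"--region={region}",
    ]


def database_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read the database endpoint from the environment (assumed variable names)."""
    return {"host": env["DB_HOST"], "name": env.get("DB_NAME", "app")}


# Hypothetical values, for illustration only.
print(" ".join(resize_mig_command("web-mig", "us-east1", 6)))
print(database_settings({"DB_HOST": "10.20.0.5"}))
```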

4. Static Asset & User Data Access (Time: Concurrent with Step 3):

Tool: Multi-Region Cloud Storage.

Action: No direct action is typically required for the data itself. The application uses multi-region Cloud Storage buckets, which store data redundantly across regions. Application instances in us-east1 read from and write to the same bucket names without reconfiguration.

5. Traffic Redirection (Time: T+60 to T+90 mins):

Tool: Google Cloud Global Load Balancer, Cloud DNS.

Action:

The Global Load Balancer, configured with backend services/MIGs/NEGs in both us-central1 and us-east1, should automatically detect that backends in us-central1 are unhealthy.

It will stop sending traffic to us-central1 and direct 100% of traffic to the now healthy and scaled-up backends in us-east1.

Monitor load balancer health checks and traffic flow via Cloud Monitoring.

If any services rely on direct DNS CNAMEs/A records (not behind the Global LB), update these DNS records in Cloud DNS to point to resources in us-east1. The low TTL set during prerequisites will ensure quick propagation.
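
The failover behavior described in this step can be illustrated with a deliberately simplified model: traffic goes only to backends that pass health checks. This is a toy model, not the load balancer's actual algorithm, which also weighs client proximity and backend capacity.

```python
from typing import Dict


def route_traffic(backend_health: Dict[str, bool]) -> Dict[str, float]:
    """Split 100% of traffic evenly across healthy backends (toy model)."""
    healthy = [region for region, ok in backend_health.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy backends: traffic cannot be served")
    share = 100.0 / len(healthy)
    return {region: (share if ok else 0.0) for region, ok in backend_health.items()}


print(route_traffic({"us-central1": False, "us-east1": True}))
# → {'us-central1': 0.0, 'us-east1': 100.0}
```

The model makes the dependency explicit: traffic moves only after us-east1 backends report healthy, which is why the application tier is scaled up before this step.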

6. Verification and Monitoring (Time: T+75 to T+120 mins):

Tools: Cloud Monitoring, application-level logging, synthetic tests.

Action:

Thoroughly test application functionality in us-east1.

Monitor Cloud Monitoring dashboards for us-east1 resources: CPU, memory, error rates, latency, database connections.

Check Pub/Sub alert subscriptions for any new critical alerts originating from the us-east1 environment.

Confirm data integrity by checking recent transactions if possible.
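
One way to make the verification repeatable is to compare observed metrics against agreed limits. The sketch below assumes the metric values have already been pulled from Cloud Monitoring; the metric names and thresholds are hypothetical.

```python
from typing import Dict, List


def failed_checks(metrics: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Return the names of metrics that exceed their limit or are missing."""
    # A missing metric is treated as a failure rather than silently passing.
    return [
        name
        for name, limit in limits.items()
        if metrics.get(name, float("inf")) > limit
    ]


# Hypothetical limits and observations, for illustration only.
limits = {"error_rate": 0.01, "p95_latency_ms": 800, "cpu_utilization": 0.85}
observed = {"error_rate": 0.002, "p95_latency_ms": 950, "cpu_utilization": 0.60}
print(failed_checks(observed, limits))
# → ['p95_latency_ms']
```

An empty list is the signal to move from active failover to monitoring.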

7. Communication (Ongoing):

Provide regular updates to internal stakeholders and customers regarding service status and expected resolution.

RESULT (Failover Complete):

Critical application services are successfully restored and operational in the us-east1 DR region.

User traffic is now being served entirely from us-east1.

The RTO target (e.g., 2 hours) has been met.

Data loss is minimal, contained within the RPO defined by Cloud SQL replication lag (e.g., < 5 minutes).

The system is stable, though potentially operating at a slightly reduced capacity or with higher latency for users geographically distant from us-east1.

The team transitions from active failover to monitoring the DR environment.
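
Whether the RTO and RPO were met is simple arithmetic on timestamps, sketched below. The 2-hour and 5-minute defaults mirror the example targets above; measuring recovery time from the start of the outage, and the sample timestamps, are assumptions.

```python
from datetime import datetime, timedelta


def objectives_met(
    outage_start: datetime,
    service_restored: datetime,
    replication_lag: timedelta,
    rto: timedelta = timedelta(hours=2),
    rpo: timedelta = timedelta(minutes=5),
) -> dict:
    """Compare actual recovery time and data-loss window with the targets."""
    recovery_time = service_restored - outage_start
    return {
        "recovery_time": recovery_time,
        "rto_met": recovery_time <= rto,
        "rpo_met": replication_lag <= rpo,
    }


# Hypothetical incident timeline, for illustration only.
report = objectives_met(
    outage_start=datetime(2025, 1, 1, 10, 0),
    service_restored=datetime(2025, 1, 1, 11, 50),
    replication_lag=timedelta(seconds=40),
)
print(report)
```

Recording these figures at the end of the incident gives the post-mortem concrete numbers to review.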

RECOVERY (Failback Process - once us-central1 is fully restored):

This process is initiated once Google Cloud confirms us-central1 is fully stable and operational.

Preparation:

Confirm on the Google Cloud Status Dashboard that the us-central1 incident is closed.

Assess the state of the original infrastructure in us-central1. It might need to be rebuilt or restored.

Data Synchronization:

The Cloud SQL instance in us-east1 (currently primary) needs to replicate its data back to us-central1.

Create a new Cloud SQL read replica in us-central1 with the us-east1 instance as its primary.

The original us-central1 instance typically cannot be reattached as a replica once it has diverged, so retire it after confirming it holds no data that is missing from us-east1.

Allow time for full synchronization.
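
Creating the failback replica from the CLI might look like the following sketch, which builds a `gcloud sql instances create` command with `--master-instance-name` pointing at the current us-east1 primary. Both instance names are hypothetical.

```python
from typing import List


def create_replica_command(replica_name: str, primary_name: str, region: str) -> List[str]:
    """Build the gcloud command that creates a Cloud SQL read replica."""
    return [
        "gcloud", "sql", "instances", "create",
        replica_name,
        f"--master-instance-name={primary_name}",
        f"--region={region}",
    ]


# Hypothetical names, for illustration only.
print(" ".join(create_replica_command("app-db-central1", "app-db-replica-east1", "us-central1")))
```

Replication lag on the new replica should be near zero before the failback window is scheduled.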

Failback Execution (Scheduled Maintenance Window):

Database Failback: Pause writes to the us-east1 primary and confirm that replication lag is zero, then promote the Cloud SQL replica in us-central1 to be the primary. Update application configurations in us-central1 to point to this new local primary.

Application Tier Activation: Ensure application instances in us-central1 are healthy and configured correctly.

Traffic Shift: Gradually shift traffic back to us-central1 using the Global Load Balancer (e.g., by adjusting backend capacities or health checks). Monitor closely.

Once 100% of traffic is on us-central1 and stable, proceed.
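
The gradual shift can be planned as a fixed sequence of traffic splits, with a monitoring pause between steps. The sketch below only generates the schedule; the step percentages are assumptions, and applying each split (for example, through backend capacity settings) is left out.

```python
from typing import Dict, Iterator, Sequence


def shift_schedule(steps: Sequence[int] = (10, 25, 50, 100)) -> Iterator[Dict[str, int]]:
    """Yield traffic splits that move load from us-east1 back to us-central1."""
    # Steps must be positive, strictly increasing, and finish at 100%.
    if (
        not steps
        or steps[0] <= 0
        or list(steps) != sorted(set(steps))
        or steps[-1] != 100
    ):
        raise ValueError("steps must be positive, strictly increasing, and end at 100")
    for pct in steps:
        yield {"us-central1": pct, "us-east1": 100 - pct}


for split in shift_schedule():
    print(split)
```

If error rates rise at any step, the previous split is the rollback target.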

Demobilize DR Environment:

Scale down or de-provision application resources in us-east1 to their warm standby state or completely if using cold standby.

Re-establish the Cloud SQL cross-region replica from us-central1 (primary) to us-east1 (replica) for future DR preparedness.

Verify all Pub/Sub alerts related to the DR event are resolved.

Post-Mortem:

Conduct a thorough post-mortem of both the outage event and the DR execution.

Identify lessons learned, update documentation, and refine DR procedures and automation.

Review RPO/RTO achievement and adjust if necessary.

This plan provides a structured approach to handling a regional outage, emphasizing the use of specified Google Cloud tools for a robust and efficient recovery.
"\n",
"**Expected Result:** \n",
"[A defined deliverable such as a DR plan, diagram, MVP, etc.]"
"[Critical application services are successfully restored and operational in the us-east1 DR region.
User traffic is now being served entirely from us-east1.
The RTO target (e.g., 2 hours) has been met.
Data loss is minimal, contained within the RPO defined by Cloud SQL replication lag (e.g., < 5 minutes).
The system is stable, though potentially operating at a slightly reduced capacity or with higher latency for users geographically distant from us-east1.
The team transitions from active failover to monitoring the DR environment.]"
]
},
{