diff --git a/lab/cloudChronicles_Lab001_Disaster_Recovery_Detective.ipynb b/lab/cloudChronicles_Lab001_Disaster_Recovery_Detective.ipynb
index 6cfd525..33d7ab8 100644
--- a/lab/cloudChronicles_Lab001_Disaster_Recovery_Detective.ipynb
+++ b/lab/cloudChronicles_Lab001_Disaster_Recovery_Detective.ipynb
@@ -21,7 +21,7 @@
 "source": [
 "# Let's begin by printing your name to personalize the notebook\n",
 "your_name = \"\"\n",
- "print(f\"Welcome to the lab, {your_name}!\")"
+ "print(\"Welcome to the lab, Jaylen Frazier!\")"
 ]
 },
 {
@@ -32,16 +32,217 @@
 "## 🔍 STAR Method Lab Prompt\n",
 "\n",
 "**Situation:** \n",
- "[Define your scenario here.]\n",
+ "[Google Cloud announces a significant, widespread outage affecting multiple services in the us-central1 region.
+Our Cloud Monitoring dashboards show critical services in us-central1 as unavailable (e.g., high error rates, timeouts, GCE instances not responding).
+Pub/Sub alerts, configured via Cloud Monitoring log-based metrics or custom metrics (e.g., application health checks failing), have triggered notifications to the on-call SRE/Cloud Operations team.]\n",
 "\n",
 "**Task:** \n",
- "[Define what the user is expected to solve.]\n",
+ "[Objective: Restore critical application services in the designated DR region (us-east1) with minimal data loss (low RPO) and within the defined Recovery Time Objective (RTO).]\n",
 "\n",
 "**Action:** \n",
- "[Step-by-step instructions using GCP tools.]\n",
+ "[]\n",STAR-based Disaster Recovery (DR) plan for a regional outage in us-central1, leveraging Google Cloud tools.
+
+Assumptions & Prerequisites (Crucial for DR Success):
+
+DR Region Identified: us-east1 is our designated DR region.
+
+Infrastructure as Code (IaC): Terraform or Cloud Deployment Manager scripts exist to deploy application infrastructure (Compute Engine, GKE, etc.) in us-east1.
+
+Warm/Hot Standby: Application servers (VMs/GKE clusters) are either pre-provisioned and scaled down in us-east1 (warm) or can be rapidly deployed (cold, but IaC makes this faster). For critical services, a warm standby is preferred.
+
+Global Load Balancer: A Google Cloud Global HTTP(S) Load Balancer is used, with backend services configured for both us-central1 and us-east1.
+
+DNS TTL: Critical DNS records have reasonably low Time-To-Live (TTL) values.
+
+Regular DR Drills: The DR plan is tested periodically.
+
+Monitoring & Alerting Baseline: Cloud Monitoring is configured with dashboards and uptime checks for key services.
+
+STAR-Based Disaster Recovery Plan: us-central1 Outage
+
+SITUATION:
+
+Event: Google Cloud announces a significant, widespread outage affecting multiple services in the us-central1 region.
+
+Detection:
+
+Our Cloud Monitoring dashboards show critical services in us-central1 as unavailable (e.g., high error rates, timeouts, GCE instances not responding).
+
+Pub/Sub alerts, configured via Cloud Monitoring log-based metrics or custom metrics (e.g., application health checks failing), have triggered notifications to the on-call SRE/Cloud Operations team.
+
+External monitoring tools also confirm service unavailability.
+
+Impact: Our primary application services, including web servers, application logic, and the primary Cloud SQL database instance, hosted in us-central1, are inaccessible or severely degraded. Customers are experiencing service disruptions.
+
+TASK:
+
+Objective: Restore critical application services in the designated DR region (us-east1) with minimal data loss (low RPO) and within the defined Recovery Time Objective (RTO).
+
+Key Goals:
+
+Confirm the scope and nature of the outage.
+
+Activate the DR plan.
+
+Fail over the database to the cross-region replica.
+
+Ensure application instances in us-east1 are operational and scaled appropriately.
+
+Redirect user traffic to us-east1.
+
+Verify service restoration and communicate status.
+
+ACTION (Failover Process):
+
+Incident Assessment & DR Declaration (Time: T+0 to T+15 mins):
+
+The on-call team convenes (virtual war room).
+
+Verify the outage severity and confirm it's not a localized application issue using Cloud Monitoring, Google Cloud Status Dashboard, and our Pub/Sub-triggered alerts.
+
+Based on the assessment and predefined triggers (e.g., outage duration exceeding X minutes, confirmation from Google), the Incident Commander declares a disaster and formally activates the DR plan.
+
+Internal and external stakeholder communication protocols are initiated.
+
+Database Failover (Time: T+15 to T+45 mins):
+
+Tool: Cloud SQL cross-region replica.
+
+Action:
+
+Access the Google Cloud Console or use gcloud CLI.
+
+Navigate to the Cloud SQL cross-region read replica in us-east1 (the primary in us-central1 may be unreachable during the outage).
+
+Promote the cross-region read replica located in us-east1 to become the new standalone primary instance. This action breaks replication from the old primary.
+
+Note: The RPO will be minimal, typically seconds or a few minutes, depending on replication lag at the time of the outage.
+
+Record the IP address/connection name of the newly promoted Cloud SQL instance in us-east1.
+
+Application Tier Activation & Configuration (Time: T+30 to T+75 mins):
+
+Tools: Compute Engine (Instance Templates, MIGs), GKE, Secret Manager.
+
+Action:
+
+If using a warm standby (MIGs/GKE in us-east1 scaled to minimum):
+
+Scale up the Managed Instance Groups (MIGs) or GKE node pools/deployments in us-east1 to handle production load.
+
+If using a cold standby:
+
+Execute IaC scripts (Terraform/Deployment Manager) to provision the application infrastructure in us-east1.
+
+Update application configurations (e.g., via environment variables, ConfigMaps in GKE, or by fetching from Secret Manager) to point to the new Cloud SQL primary instance in us-east1. This might involve updating connection strings.
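The database failover and the configuration update described above could be scripted roughly as follows. This is a minimal sketch, not the lab's actual tooling: the instance name `app-db-replica-east`, the project `example-project`, and the `DB_CONNECTION_NAME` variable are hypothetical placeholders. The sketch only builds the `gcloud sql instances promote-replica` command and the updated environment; it does not execute anything against Google Cloud.

```python
from typing import Dict, List


def promote_replica_command(replica: str, project: str) -> List[str]:
    """Build the gcloud invocation that promotes a Cloud SQL read replica."""
    return [
        "gcloud", "sql", "instances", "promote-replica", replica,
        f"--project={project}",
        "--quiet",  # skip the interactive confirmation prompt
    ]


def repoint_database(env: Dict[str, str], new_connection_name: str) -> Dict[str, str]:
    """Return a copy of the app environment that targets the promoted instance.

    DB_CONNECTION_NAME is a hypothetical variable name used for illustration.
    """
    updated = dict(env)  # copy so the original settings stay available for failback
    updated["DB_CONNECTION_NAME"] = new_connection_name
    return updated


if __name__ == "__main__":
    # Hypothetical names; substitute the real replica and project.
    command = promote_replica_command("app-db-replica-east", "example-project")
    print(" ".join(command))

    current_env = {"DB_CONNECTION_NAME": "example-project:us-central1:app-db"}
    new_env = repoint_database(
        current_env, "example-project:us-east1:app-db-replica-east"
    )
    print(new_env["DB_CONNECTION_NAME"])
```

Keeping the command construction separate from execution lets the on-call engineer review the exact command before running it, since promoting a replica cannot be undone.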
+
+Static Asset & User Data Access (Time: Concurrent with Step 3):
+
+Tool: Multi-Region Cloud Storage.
+
+Action: No direct action is typically required for the data itself. Our application uses multi-region Cloud Storage buckets. Data written to these buckets is geo-redundantly stored across multiple regions. The application instances in us-east1 will be able to read from and write to these buckets seamlessly, as the endpoint is global or resolves to the nearest available location.
+
+Traffic Redirection (Time: T+60 to T+90 mins):
+
+Tool: Google Cloud Global Load Balancer, Cloud DNS.
+
+Action:
+
+The Global Load Balancer, configured with backend services/MIGs/NEGs in both us-central1 and us-east1, should automatically detect that backends in us-central1 are unhealthy.
+
+It will stop sending traffic to us-central1 and direct 100% of traffic to the now healthy and scaled-up backends in us-east1.
+
+Monitor load balancer health checks and traffic flow via Cloud Monitoring.
+
+If any services rely on direct DNS CNAMEs/A records (not behind the Global LB), update these DNS records in Cloud DNS to point to resources in us-east1. The low TTL set during prerequisites will ensure quick propagation.
+
+Verification and Monitoring (Time: T+75 to T+120 mins):
+
+Tools: Cloud Monitoring, application-level logging, synthetic tests.
+
+Action:
+
+Thoroughly test application functionality in us-east1.
+
+Monitor Cloud Monitoring dashboards for us-east1 resources: CPU, memory, error rates, latency, database connections.
+
+Check Pub/Sub alert queues for any new critical alerts originating from the us-east1 environment.
+
+Confirm data integrity by checking recent transactions if possible.
+
+Communication (Ongoing):
+
+Provide regular updates to internal stakeholders and customers regarding service status and expected resolution.
+
+RESULT (Failover Complete):
+
+Critical application services are successfully restored and operational in the us-east1 DR region.
+
+User traffic is now being served entirely from us-east1.
+
+The RTO target (e.g., 2 hours) has been met.
+
+Data loss is minimal, contained within the RPO defined by Cloud SQL replication lag (e.g., < 5 minutes).
+
+The system is stable, though potentially operating at a slightly reduced capacity or with higher latency for users geographically distant from us-east1.
+
+The team transitions from active failover to monitoring the DR environment.
+
+RECOVERY (Failback Process - once us-central1 is fully restored):
+
+This process is initiated once Google Cloud confirms us-central1 is fully stable and operational.
+
+Preparation:
+
+Google Cloud confirms us-central1 stability.
+
+Assess the state of the original infrastructure in us-central1. It might need to be rebuilt or restored.
+
+Data Synchronization:
+
+The Cloud SQL instance in us-east1 (currently primary) needs to replicate its data back to us-central1.
+
+Create a new Cloud SQL instance in us-central1 (or use the old one if recoverable and empty).
+
+Configure this new/old us-central1 instance as a replica of the us-east1 primary.
+
+Allow time for full synchronization.
+
+Failback Execution (Scheduled Maintenance Window):
+
+Database Failback: Promote the Cloud SQL replica in us-central1 to be the primary. Update application configurations in us-central1 to point to this new local primary.
+
+Application Tier Activation: Ensure application instances in us-central1 are healthy and configured correctly.
+
+Traffic Shift: Gradually shift traffic back to us-central1 using the Global Load Balancer (e.g., by adjusting backend capacities or health checks). Monitor closely.
+
+Once 100% of traffic is on us-central1 and stable, proceed.
+
+Demobilize DR Environment:
+
+Scale down or de-provision application resources in us-east1 to their warm standby state or completely if using cold standby.
+
+Re-establish the Cloud SQL cross-region replica from us-central1 (primary) to us-east1 (replica) for future DR preparedness.
+
+Verify all Pub/Sub alerts related to the DR event are resolved.
+
+Post-Mortem:
+
+Conduct a thorough post-mortem of both the outage event and the DR execution.
+
+Identify lessons learned, update documentation, and refine DR procedures and automation.
+
+Review RPO/RTO achievement and adjust if necessary.
+
+This plan provides a structured approach to handling a regional outage, emphasizing the use of specified Google Cloud tools for a robust and efficient recovery.
 "\n",
 "**Expected Result:** \n",
- "[A defined deliverable such as a DR plan, diagram, MVP, etc.]"
+ "[Critical application services are successfully restored and operational in the us-east1 DR region.
+User traffic is now being served entirely from us-east1.
+The RTO target (e.g., 2 hours) has been met.
+Data loss is minimal, contained within the RPO defined by Cloud SQL replication lag (e.g., < 5 minutes).
+The system is stable, though potentially operating at a slightly reduced capacity or with higher latency for users geographically distant from us-east1.
+The team transitions from active failover to monitoring the DR environment.]"
 ]
 },
 {