🧪 Lab 24: Simulated Incident Drill

📝 Lab Summary

This lab focused on simulating operational incidents and responding with repeatable troubleshooting, documentation, and recovery workflows. It combined service validation, failure injection, runbook usage, and post-incident reporting in a realistic Linux environment.

🎯 Objectives

  • Practice structured incident response procedures
  • Trigger and diagnose common system failures
  • Follow runbook procedures for incident resolution
  • Document incident response activities professionally
  • Develop muscle memory for troubleshooting workflows

📌 Prerequisites

  • Basic Linux command line proficiency
  • Understanding of system services and processes
  • Familiarity with log file locations
  • Basic scripting knowledge (Bash or Python)
  • Understanding of web servers and databases

🖥️ Lab Environment

  • Platform: Ubuntu 24.04 LTS cloud lab environment
  • User: toor
  • Host: ip-172-31-10-184
  • Shell: Bash

🛠️ Task Overview

Task 1: Create Incident Scenarios

  • Set Up Baseline Services
      • Create a simple web application
  • Create Incident Trigger Scripts
      • Create a script to simulate a disk space issue
      • Create a script to simulate a service crash
      • Create a script to simulate high CPU load
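
As an illustration of the disk space trigger, here is a minimal Python sketch of a bounded, reversible failure injection. It is not the contents of `scripts/trigger-disk-full.sh`, which is not reproduced in this README; the function names `fill_disk` and `cleanup`, the filler file name, and the chunk size are all assumptions made for the example.

```python
import os
import tempfile

CHUNK = 1024 * 1024  # write in 1 MiB chunks (assumed value)


def fill_disk(target_dir: str, size_bytes: int, name: str = "drill-filler.bin") -> str:
    """Create a filler file of exactly size_bytes in target_dir and return its path.

    The size is bounded by the caller, so the drill cannot consume more
    space than intended. The file name is hypothetical.
    """
    path = os.path.join(target_dir, name)
    remaining = size_bytes
    with open(path, "wb") as fh:
        while remaining > 0:
            n = min(CHUNK, remaining)
            fh.write(b"\0" * n)  # real writes, so blocks are allocated on most filesystems
            remaining -= n
        fh.flush()
        os.fsync(fh.fileno())  # make sure the data reaches the disk before returning
    return path


def cleanup(path: str) -> None:
    """Remove the filler file; a missing file is not treated as an error."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


if __name__ == "__main__":
    # Demo against a scratch directory rather than a real mount point.
    with tempfile.TemporaryDirectory() as scratch:
        filler = fill_disk(scratch, 5 * CHUNK)
        print(f"created {filler} ({os.path.getsize(filler)} bytes)")
        cleanup(filler)
```

Keeping the trigger and its cleanup side by side makes the injected failure easy to reverse, which matters when the same host is reused for the next scenario.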

Task 2: Create Runbooks

  • Create Runbook Templates
      • Create a master runbook
      • Create a specific runbook for disk space incidents
      • Create a runbook for service failures
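
A disk space runbook usually opens with a detection step. The sketch below shows one way that step could be automated in Python with `shutil.disk_usage`. The function names, the returned dictionary keys, and the 90% threshold are assumptions for illustration and are not taken from `runbooks/runbook-disk-space.md`.

```python
import shutil


def usage_percent(total: int, used: int) -> float:
    """Return used space as a percentage of total, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(used / total * 100, 1)


def check_disk(path: str = "/", threshold_pct: float = 90.0) -> dict:
    """Report usage for the filesystem containing path.

    'breach' is True when usage is at or above threshold_pct.
    Note: this divides by the total size, so the figure can differ
    slightly from the Use% column printed by `df`.
    """
    usage = shutil.disk_usage(path)
    pct = usage_percent(usage.total, usage.used)
    return {
        "path": path,
        "used_pct": pct,
        "free_bytes": usage.free,
        "breach": pct >= threshold_pct,
    }


if __name__ == "__main__":
    print(check_disk("/"))
```

Returning a structured result rather than printing text lets the same check feed both the responder's terminal and the incident record.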

Task 3: Execute Incident Response Drill

  • Create Incident Documentation System
  • Create Drill Execution Script
  • Practice Full Incident Response
      • Follow runbook
      • Resolve incident
      • Post-incident verification
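
The incident documentation step can be sketched as a small timestamped logger. This is not the actual `scripts/incident_logger.py`, whose contents are not shown in this README. The class name `IncidentLog`, its methods, and the four phase names are assumptions chosen to mirror the baseline, detection, remediation, and verification stages of the drill.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed phase names, in the order a drill normally moves through them.
PHASES = ("baseline", "detection", "remediation", "verification")


@dataclass
class IncidentLog:
    title: str
    entries: list = field(default_factory=list)

    def record(self, phase: str, message: str) -> None:
        """Append a UTC-timestamped entry under one of the known phases."""
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase!r}")
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        self.entries.append((stamp, phase, message))

    def to_markdown(self) -> str:
        """Render the log as a Markdown report grouped by phase."""
        lines = [f"# Incident: {self.title}", ""]
        for phase in PHASES:
            lines.append(f"## {phase.capitalize()}")
            matching = [e for e in self.entries if e[1] == phase]
            if not matching:
                lines.append("- (nothing recorded)")
            for stamp, _, message in matching:
                lines.append(f"- {stamp} {message}")
            lines.append("")
        return "\n".join(lines)


if __name__ == "__main__":
    log = IncidentLog("web service unavailable")
    log.record("detection", "health check against the web app failed")
    log.record("remediation", "restarted the web server")
    print(log.to_markdown())
```

Rejecting unknown phases keeps every entry attributable to a stage of the response, and empty sections in the report make skipped stages visible during review.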

📁 Repository Structure

lab24-simulated-incident-drill/
├── README.md
├── artifacts/
│   └── drill-checklist.txt
├── commands.sh
├── configs/
│   └── nginx/
│       └── incident-app.conf
├── interview_qna.md
├── output.txt
├── runbooks/
│   ├── post-incident-template.md
│   ├── runbook-disk-space.md
│   ├── runbook-service-failure.md
│   └── runbook-template.md
├── scripts/
│   ├── execute-drill.sh
│   ├── incident_logger.py
│   ├── trigger-cpu-spike.sh
│   ├── trigger-disk-full.sh
│   └── trigger-service-crash.sh
├── troubleshooting.md
└── web/
    └── incident-app/
        └── index.html

✅ Verification & Validation

  • Confirmed the environment and toolchain were installed correctly
  • Validated the core workflow with command execution and captured outputs
  • Preserved scripts, configuration files, and supporting artifacts used during the lab
  • Documented common failure paths and remediation steps in the troubleshooting guide (troubleshooting.md)

📚 What I Learned

  • How to prepare incident runbooks before failures happen
  • How to document baseline, detection, remediation, and verification clearly
  • How to restore service safely after intentional failure injection
  • Why post-incident validation matters as much as the initial fix
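
To make the point about post-incident validation concrete, the sketch below probes an HTTP endpoint several times in a row instead of trusting a single success, since a service that was only just restarted may still be flapping. The function names, the default of three checks, and the example URL are assumptions; only standard-library calls are used.

```python
import http.client
import time
import urllib.error
import urllib.request


def service_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True only if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False


def stays_healthy(probe, checks: int = 3, delay: float = 1.0) -> bool:
    """Call probe() `checks` times, pausing `delay` seconds between calls.

    Returns False at the first failed probe, True if every probe passes.
    """
    for i in range(checks):
        if not probe():
            return False
        if i < checks - 1:
            time.sleep(delay)
    return True


# Example use (the URL is a placeholder for wherever the lab app is served):
#   ok = stays_healthy(lambda: service_healthy("http://localhost/"), checks=3)
```

Passing the probe in as a callable keeps the retry logic independent of HTTP, so the same loop could wrap a process check or a disk check.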

🌍 Why This Matters

Structured incident drills build operational confidence before real outages happen and reduce response friction when services fail unexpectedly.

🚀 Real-World Applications

  • Incident response drills
  • Production service recovery
  • Ops runbook design
  • Post-incident review workflows

🔎 Real-World Relevance

The workflow in this lab maps well to practical cloud, DevOps, software assurance, and security operations responsibilities where repeatable procedures and evidence-backed validation matter.

✅ Result

A complete simulated incident workflow was documented, executed, and verified, and the affected service was restored to a healthy state.

🏁 Conclusion

You have successfully created and executed a simulated incident response drill. In this walkthrough, you practiced:

  • Setting up realistic incident scenarios
  • Creating and following structured runbooks
  • Documenting incident response activities
  • Executing investigation and resolution procedures
  • Performing post-incident analysis

These are the core DevOps and operations skills the lab is intended to build: structured troubleshooting, disciplined documentation, and recovery validation.