📌 Lecture 8 – SRE & Monitoring: Reliability Is an Engineering Discipline


📍 Slide 1 – 💥 The Outage You Didn't Know About

  • 🗓️ A typical Tuesday: your QuickNotes deploy goes out at 14:03. CI is green. The Lab 7 deploy succeeded. You go to lunch
  • 🚨 At 14:11, the error rate jumps from 0.1% to 4.2%. p95 latency triples. Saturation creeps up
  • 🪦 No alert fires. No one looks at a dashboard. Customers slowly notice ("the app is slow today…")
  • 🕓 By 16:30, when someone finally looks at a graph, the service has been degraded for roughly 2 hours 20 minutes
  • 🎓 Lesson: Code tells you what your service should do. Monitoring tells you what it actually did

🤔 Think: Lab 3 caught the bug at PR time. Lab 8 catches the bug after the deploy. Both matter. Today is about the second one.


📍 Slide 2 – 🎯 Learning Outcomes

| # | 🎓 Outcome |
|---|---|
| 1 | ✅ Distinguish SRE from DevOps; explain reliability as a feature |
| 2 | ✅ Cite the Four Golden Signals: latency, traffic, errors, saturation |
| 3 | ✅ Define SLI, SLO, SLA, error budget |
| 4 | ✅ Compare metrics, logs, traces (the three pillars of observability) |
| 5 | ✅ Read PromQL; set up Prometheus + Grafana for QuickNotes |
| 6 | ✅ Write an alert that doesn't wake you at 3 a.m. for no reason |

📍 Slide 3 – 🗺️ Lecture Overview

graph LR
    A["📐 Reliability as a feature"] --> B["🎯 SLI / SLO / SLA"]
    B --> C["🌟 Golden Signals"]
    C --> D["📊 Prometheus"]
    D --> E["📈 Grafana"]
    E --> F["🚨 Alerting hygiene"]
    F --> G["💥 Real Incidents"]

  • 📍 Slides 1-6: Why SRE exists; SLOs and error budgets
  • 📍 Slides 7-10: Metrics, logs, traces; Prometheus model
  • 📍 Slides 11-14: Dashboards, alerts, on-call
  • 📍 Slides 15-19: Real incidents, lab preview, takeaways

📍 Slide 4 – 📜 Where SRE Came From

  • 🏢 2003: Ben Treynor Sloss joins Google to run production engineering. His pitch: "Hire software engineers to run operations"
  • 📚 2016: Google publishes Site Reliability Engineering, ed. Beyer, Jones, Petoff, Murphy; free at sre.google
  • 🛠️ 2018: The Site Reliability Workbook adds applied examples and postmortem templates
  • 🎓 In 2026, SRE is mainstream: SLOs, error budgets, and on-call rotations are standard practice at many product companies

💬 "Hope is not a strategy." (Google SRE motto)


📍 Slide 5 – ⚖️ DevOps vs SRE: Cousins, Not Rivals

| | DevOps | SRE |
|---|---|---|
| Mantra | "You build it, you run it" | "class SRE implements DevOps" |
| Focus | Speed + collaboration | Reliability + automation |
| Tooling | CI/CD, IaC, observability | SLOs, runbooks, postmortems |
| Where you find it | Most product teams | Larger orgs with an explicit SRE function |
| Source | Patrick Debois, 2009 movement | Google ~2003, popularized 2016 |

  • 🤝 DevOps is the broader culture. SRE is one prescriptive way to implement parts of it
  • 🎯 In a 10-engineer startup, the same person wears both hats. At 10,000 engineers, you have an SRE team

📍 Slide 6 – 🎯 SLI, SLO, SLA, Error Budget

| Term | What it is | Example for QuickNotes |
|---|---|---|
| SLI (Indicator) | A measured number | "% of GET /notes returning 2xx in < 300ms" |
| SLO (Objective) | The internal target | "99.9% over a 28-day window" |
| SLA (Agreement) | The external promise (legal) | "99.5% or we refund 10%" (looser than the SLO) |
| Error budget | (1 - SLO) × window | 0.1% × 28 d ≈ 40 min of failure per 28-day window |

graph LR
    Real["📈 Reality (SLI measured)"] --> Compare{vs SLO?}
    Compare -- "above SLO" --> Spend["✅ Spend error budget<br/>ship faster"]
    Compare -- "below SLO" --> Freeze["🛑 Freeze features<br/>fix reliability"]

  • 🪪 Error budget = freedom to take risks. Used wisely, it lets you ship at speed
  • 💸 An SLA breach often costs money. An SLO breach should cost a feature freeze
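
The error-budget arithmetic in the table can be checked in a few lines. This is a minimal sketch, not part of the lab code: the function names and the request counts in the usage lines are made up for illustration, and the "ship"/"freeze" policy is a simplified reading of the diagram above.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Time the service may be 'bad' within the window: (1 - SLO) x window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return window * (1.0 - slo)


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of a request-based error budget still unspent (negative = overspent)."""
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def release_policy(slo: float, good: int, total: int) -> str:
    """Simplified policy: keep shipping while budget remains, otherwise freeze."""
    return "ship" if budget_remaining(slo, good, total) > 0 else "freeze"


budget = error_budget(0.999, timedelta(days=28))
budget_minutes = budget.total_seconds() / 60
print(f"{budget_minutes:.1f} minutes of budget per 28-day window")

# Hypothetical traffic: 1,000,000 requests, 500 of them bad -> half the budget left.
print(release_policy(0.999, good=999_500, total=1_000_000))
```

A request-based budget (good vs. total requests) is usually easier to compute from counters than a time-based one, which is why the second function works on counts.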

📍 Slide 7 – 🌟 The Four Golden Signals

The Google SRE book (Chapter 6) recommends measuring these for every user-facing service:

| # | Signal | What it tells you | QuickNotes example |
|---|---|---|---|
| 1 | ⏱️ Latency | Time per request (separate success vs failure!) | p50 / p95 / p99 of GET /notes |
| 2 | 🚦 Traffic | Demand on the system | Requests per second |
| 3 | 🐛 Errors | Rate of failed requests | Rate of 5xx + 4xx (depends on intent) |
| 4 | 📈 Saturation | How "full" the service is | CPU %, memory %, queue depth |

  • 🔎 Get these four right, and you catch most production problems
  • 🚦 The "RED" method (Tom Wilkie, Weaveworks): Rate, Errors, Duration. The same idea, without saturation
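
To make the four signals concrete, here is a small sketch that derives them from a batch of request records. The `Request` type, the nearest-rank percentile, and the choice of in-flight requests over capacity as the saturation measure are all assumptions for illustration; in the lab these numbers come from Prometheus, not from application code like this.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float
    status: int


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one observation window into the four golden signals."""
    # Latency is computed on successful requests only, so that fast
    # failures do not make the service look healthier than it is.
    ok_durations = [r.duration_s for r in requests if r.status < 500]
    failed = [r for r in requests if r.status >= 500]
    return {
        "latency_p95_s": percentile(ok_durations, 0.95) if ok_durations else None,
        "traffic_rps": len(requests) / window_s,
        "error_ratio": len(failed) / len(requests) if requests else 0.0,
        "saturation": in_flight / capacity,
    }


# Hypothetical 10-second window: 95 fast successes, 5 slow server errors.
sample = [Request(0.1, 200)] * 95 + [Request(2.0, 500)] * 5
signals = golden_signals(sample, window_s=10, in_flight=8, capacity=10)
print(signals)
```

Here only 5xx responses count as errors; whether 4xx responses belong in the error ratio depends on intent, as the table notes.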

📍 Slide 8 – 🔍 The Three Pillars of Observability

graph TB
    O["👁️ Observability"]
    O --> M["📊 Metrics<br/>cheap, aggregated"]
    O --> L["📜 Logs<br/>rich, expensive"]
    O --> T["🧵 Traces<br/>per-request paths"]

| Pillar | Best for | Tools |
|---|---|---|
| 📊 Metrics | "How is the system now?" | Prometheus, OpenMetrics |
| 📜 Logs | "What happened to this request?" | journald, Loki, ELK |
| 🧵 Traces | "Where did time go across services?" | OpenTelemetry, Jaeger, Tempo |

  • 🎯 DevOps-Intro covers metrics deeply (this lecture). Logs were Lecture 4. Traces are covered in SRE-Intro Lab 8

📍 Slide 9 – 📊 Prometheus: The Pull Model

graph LR
    P["📊 Prometheus Server"] -- "HTTP GET /metrics every 15s" --> S1["🟢 QuickNotes (:8080)"]
    P -- "scrape" --> S2["🟢 node-exporter"]
    P --> TSDB["⏳ Time-series DB"]
    G["📈 Grafana"] -- "PromQL" --> P

  • 🆕 Prometheus (CNCF, graduated 2018) pulls metrics from targets, the opposite of push-based StatsD/Graphite
  • 🆔 Each target exposes /metrics in text format (your QuickNotes already does; see handlers.go)
  • ⏳ Stored as time series: metric_name{label1="v",label2="v"} value timestamp
  • 🆓 Open source, single binary, no clustering required for small setups
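
The text format a target serves on /metrics is simple enough to render by hand. The sketch below does that for one counter; it is an illustration of the format only, not how QuickNotes (a Go service) produces its metrics. The helper names, the help text, and the label values are assumptions, and the optional timestamp field is omitted.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict, value) -> str:
    """Render one sample line: metric_name{label="v",...} value."""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


def render_counter(name: str, help_text: str, samples) -> str:
    """Render a counter family: HELP line, TYPE line, then one line per sample."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    lines += [format_sample(name, labels, value) for labels, value in samples]
    return "\n".join(lines) + "\n"


exposition = render_counter(
    "quicknotes_http_requests_total",
    "Total HTTP requests handled.",  # hypothetical help text
    [({"route": "/notes", "code": "200"}, 1027)],  # hypothetical labels and value
)
print(exposition, end="")
```

In a real service a Prometheus client library does this rendering for you; the point here is only that a scrape is a plain HTTP GET returning lines like these.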

📍 Slide 10 – 📝 PromQL: The Three Queries You Need

# 1) Per-second request rate, averaged over the last minute
rate(quicknotes_http_requests_total[1m])

# 2) p95 latency of /notes
histogram_quantile(0.95,
  sum by (le) (rate(quicknotes_http_request_duration_seconds_bucket{route="/notes"}[5m]))
)

# 3) Error ratio (5xx + 4xx) over 5 minutes
sum(rate(quicknotes_http_responses_by_code_total{code=~"5..|4.."}[5m]))
  /
sum(rate(quicknotes_http_requests_total[5m]))
  • 🔧 Building blocks: rate(), sum by(), histogram_quantile(), and the comparison operators < and > for alert thresholds
  • 🪤 rate() works on counters only; for gauges use avg_over_time() / delta()
  • 📚 PromQL cheat sheet: promlabs.com/promql-cheat-sheet/
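
It helps to see what rate() and histogram_quantile() compute underneath. The sketch below is a simplified model, not Prometheus's implementation: `simple_rate` handles counter resets but skips the extrapolation to the window edges that the real rate() performs, and `histogram_quantile` assumes cumulative buckets sorted by upper bound, at least one finite bucket, and q in (0, 1]. All sample values are invented.

```python
import math


def simple_rate(samples):
    """Per-second increase of a counter from (timestamp_s, value) samples.

    A drop in value is treated as a counter reset (restart from zero).
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Linear interpolation inside the bucket that contains the target rank;
    the last bucket must have upper bound math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for index, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate into +Inf: report the highest finite bound.
                return buckets[index - 1][0]
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Counter scraped every 15 s; the process restarts between t=15 and t=30.
scrapes = [(0, 100), (15, 130), (30, 10), (45, 40), (60, 70)]
request_rate = simple_rate(scrapes)

# Cumulative latency buckets: le=0.1, le=0.3, le=1.0, le=+Inf.
latency_buckets = [(0.1, 50), (0.3, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency_buckets)
print(request_rate, p95)
```

Because the estimate is interpolated within a bucket, its accuracy depends on bucket boundaries: a p95 that lands in a wide bucket such as 0.3 s to 1.0 s is only a rough figure.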

📍 Slide 11 – 📈 Grafana: Picture Worth a Thousand Logs

  • 📊 Open-source dashboarding (since 2014), data-source-agnostic (Prometheus, InfluxDB, Loki, …)
  • 🪟 A dashboard is a JSON file, so it can be provisioned, diffed, and code-reviewed
  • 🎁 The Lab 8 plumbing ships a grafana/provisioning/dashboards/golden-signals.json; you fill in the panels
  • 🚨 Grafana also does alerting (unified alerting since v8, 2021); many teams use it as one-stop monitoring + alerting

| Panel type | Best for |
|---|---|
| Time series | The default: rates, latencies, saturation over time |
| Stat | Big-number current value ("requests/sec right now") |
| Gauge | Bounded percentages ("CPU %") |
| Table | Top-N (slowest endpoints, biggest tenants) |
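
Since a dashboard is just JSON, it can be generated and reviewed like any other artifact. The sketch below builds a small dashboard dictionary and serializes it. It is not the file shipped with Lab 8: the field names follow Grafana's dashboard JSON model as commonly exported (title, uid, panels, gridPos, targets), but the exact fields your Grafana version expects should be checked against an exported dashboard. The uid, panel titles, and the `process_resident_memory_bytes` saturation query are assumptions.

```python
import json


def panel(panel_id, title, panel_type, expr, x, y):
    """One panel on Grafana's 24-column grid, with a single PromQL target."""
    return {
        "id": panel_id,
        "title": title,
        "type": panel_type,
        "gridPos": {"h": 8, "w": 12, "x": x, "y": y},
        "targets": [{"refId": "A", "expr": expr}],
    }


dashboard = {
    "title": "QuickNotes Golden Signals",
    "uid": "quicknotes-golden-signals",  # hypothetical identifier
    "panels": [
        panel(1, "Traffic (req/s)", "timeseries",
              "sum(rate(quicknotes_http_requests_total[1m]))", 0, 0),
        panel(2, "p95 latency (s)", "timeseries",
              "histogram_quantile(0.95, sum by (le) (rate("
              'quicknotes_http_request_duration_seconds_bucket{route="/notes"}[5m])))',
              12, 0),
        panel(3, "Error ratio", "stat",
              'sum(rate(quicknotes_http_responses_by_code_total{code=~"5..|4.."}[5m]))'
              " / sum(rate(quicknotes_http_requests_total[5m]))", 0, 8),
        # Assumed metric: many Prometheus client libraries export this by default.
        panel(4, "Memory (bytes)", "timeseries",
              "process_resident_memory_bytes", 12, 8),
    ],
}

dashboard_json = json.dumps(dashboard, indent=2)
print(dashboard_json)
```

Keeping the generator (or the JSON itself) in version control is what makes dashboards diffable and reviewable.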

📍 Slide 12 – 🚨 Alerting Hygiene

🪤 The most common reliability problem isn't too few alerts. It's too many.

| 🔥 Bad alert | ✅ Good alert |
|---|---|
| Fires on a single error | Fires after sustained breach (≥ 5 min) |
| CPU > 80% | Latency exceeds SLO for 5 of last 10 minutes |
| 30 alerts per outage (cascade) | 1 actionable alert at the root cause |
| No link to a runbook | Links to a step-by-step docs/runbook/X.md |
| Wakes someone at 3 a.m. for a non-urgent issue | Pages only for user-impacting, can't-wait-till-morning issues |

  • 🎯 Symptom-based, not cause-based. Alert on "users are seeing errors", not "disk is 91% full"
  • 🛌 Alert fatigue is real: once on-call ignores half the pages, you're worse off than having no alerts
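
The difference between "fires on a single error" and "fires after sustained breach" comes down to a little state. The sketch below models it as a streak counter over periodic evaluations. It is an illustration of the idea only: the class name is made up, and the exact timing of a real Prometheus or Grafana rule with a hold duration differs in detail from "N consecutive evaluations".

```python
class SustainedAlert:
    """Fires only after the value has exceeded the threshold on
    `required_breaches` consecutive evaluations."""

    def __init__(self, threshold: float, required_breaches: int):
        self.threshold = threshold
        self.required_breaches = required_breaches
        self.consecutive = 0

    def observe(self, value: float) -> bool:
        """Record one evaluation and return True if the alert is firing."""
        if value > self.threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0  # any healthy evaluation resets the streak
        return self.consecutive >= self.required_breaches


# Hypothetical error ratios, one evaluation per minute, threshold 5%.
error_ratios = [0.01, 0.09, 0.02, 0.06, 0.07, 0.08, 0.09, 0.10, 0.01]
alert = SustainedAlert(threshold=0.05, required_breaches=5)
firing = [alert.observe(ratio) for ratio in error_ratios]
print(firing)
```

The one-minute spike at the second evaluation never pages anyone; only the five-minute run does, and the alert clears as soon as the ratio recovers.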

📍 Slide 13 – 🌐 Synthetic Monitoring & Checkly

graph LR
    P["🌎 Checkly probe (Tokyo)"] -- "GET https://quicknotes.example/health every 60s" --> S["🟢 QuickNotes prod"]
    P2["🌎 Checkly probe (Frankfurt)"] -- "GET ..." --> S
    P3["🌎 Checkly probe (Sao Paulo)"] -- "GET ..." --> S

  • 🌍 Synthetic monitoring = a robot hits your site every minute from multiple regions
  • 🛰️ Real users come from everywhere; metrics inside the cluster don't see what they see
  • 🛠️ Tools: Checkly, Pingdom, AWS Route 53 health checks, Better Stack
  • 🎁 Lab 8 Bonus task wires Checkly free-tier against your deployed QuickNotes
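
A synthetic check is conceptually just a timed HTTP request run on a schedule from outside your cluster. The sketch below shows a single probe using only the standard library. It is not how Checkly works internally; the function name, the result fields, and the injectable `opener` parameter (there so the probe can be exercised without a network) are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def probe(url: str, timeout_s: float = 5.0, opener=urllib.request.urlopen) -> dict:
    """Issue one GET and report status, success, and wall-clock latency."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with a 4xx/5xx
    except OSError:
        status = None  # DNS failure, refused connection, timeout, ...
    latency_s = time.monotonic() - started
    return {
        "url": url,
        "status": status,
        "ok": status is not None and 200 <= status < 300,
        "latency_s": latency_s,
    }
```

Calling `probe("https://quicknotes.example/health")` once a minute from several regions, and alerting when `ok` is false from more than one of them, is the essence of what the hosted tools above provide, plus scheduling, dashboards, and notification routing.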

📍 Slide 14 – 🩺 The On-Call Mindset

graph LR
    A["🚨 Alert fires"] --> B["📜 Runbook"]
    B --> C{Mitigate?}
    C -- "yes" --> D["🪡 Fix the symptom now"]
    D --> E["📝 Postmortem"]
    C -- "no" --> F["🆘 Escalate"]
    E --> G["🛠️ Action items<br/>(prevent next time)"]

  • 🎯 First mitigate, then fix. A 5-minute hack that stops the bleeding beats a 5-hour root-cause fix
  • 📝 Every page → a postmortem (Lecture 1 covered the blameless format). The action items are the real value
  • 🤝 Healthy rotations: ≤ 1 page per shift on average, week-on/week-off, follow-the-sun across timezones

📍 Slide 15 – 📜 Real Story: The Slack 2022 Outage

  • 🗓️ February 22, 2022 (yes, "2/22/22"): Slack offline for ~5 hours
  • 🤖 Triggered by an automated config push that broke a database. Cascading failures took out web, messaging, file uploads
  • 🩺 The status page itself struggled because it was hosted on the same infrastructure
  • 📝 Slack's public postmortem is a textbook example: blameless, specific, with dated action items
  • 🎓 Lesson: Sub-systems you treat as "magically reliable" (config push, status page, auth) deserve the same SLOs as the front door

📍 Slide 16 – ❌ Monitoring Antipatterns

| 🔥 Antipattern | ✅ Better |
|---|---|
| 50 dashboards, none updated since 2023 | One "Golden Signals" dashboard, kept honest |
| Alerts on every individual host | Alerts on the service-level SLI |
| 2xx ratio close to 100% but no latency tracking | Latency p95 + error rate together |
| Logging full request bodies (PII risk + cost) | Sample; redact; structure (JSON) |
| print() debugging in production | Structured logs at appropriate levels |
| CPU > 80% page that fires for backups | Alert on user-facing symptoms |
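
The "redact; structure (JSON)" advice can be sketched with the standard logging module. This is an illustrative formatter, not QuickNotes code: the `fields` attribute passed through `extra`, the set of sensitive keys, and the logger name are all assumptions, and sampling is left out for brevity.

```python
import json
import logging

# Hypothetical list of keys that must never reach the log stream.
SENSITIVE_KEYS = {"password", "token", "authorization", "body"}


def redact(fields: dict) -> dict:
    """Replace values of sensitive keys with a fixed placeholder."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in fields.items()
    }


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, with redacted structured fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        payload.update(redact(getattr(record, "fields", {})))
        return json.dumps(payload, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("quicknotes")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "request handled",
    extra={"fields": {"route": "/notes", "status": 201, "body": "my secret note"}},
)
```

One JSON object per line keeps logs machine-parseable for tools such as Loki or ELK, and redacting at the formatter means no call site can forget to do it.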

📍 Slide 17 – 🧪 Lab 8 Preview: Observability for QuickNotes

  • 📊 Task 1 (6 pts): Run Prometheus + Grafana via Compose against your QuickNotes container. Provision a "Golden Signals" dashboard (rate, error %, p95 latency, memory)
  • 🚨 Task 2 (4 pts): Write one good alert: error rate > 5% for 5 minutes. Trigger it deliberately by setting FAIL_RATE=1.0. Verify the alert fires, then document the runbook step
  • 🌍 Bonus (2 pts): Spin up a Checkly free-tier synthetic check against your QuickNotes (deployed via Lab 7 or running locally with ngrok). Compare the external view with internal metrics
  • 📜 Deliverable: submissions/lab8.md with dashboard screenshots, alert evidence, written analysis

📍 Slide 18 – 🧠 Key Takeaways

  1. 📐 Reliability is a feature: budget for it, like any other feature
  2. 🌟 The Four Golden Signals (Latency, Traffic, Errors, Saturation) catch most incidents
  3. 🎯 SLI → SLO → error budget is the framework that makes "reliable enough" measurable
  4. 📊 Metrics + Logs + Traces are complementary, not redundant
  5. 🚨 Fewer, sharper alerts beat many noisy ones; alert on user-visible symptoms
  6. 📝 Every page → a runbook → a postmortem; closing the loop is how systems get better

📍 Slide 19 – 🚀 What's Next + 📚 Resources

  • 📍 Next lecture: DevSecOps. Shift security left; scan deps and images
  • 🧪 Lab 8: Prometheus + Grafana for QuickNotes, one good alert, Bonus: Checkly synthetic
  • 📖 Read this week:
  • 🛠️ Tools this week: Prometheus v3.x, Grafana 13.x, optionally Checkly free-tier
graph LR
    P["🔧 Week 7<br/>Ansible CM"] --> Y["📍 You Are Here<br/>SRE & Monitoring"]
    Y --> N["🛡️ Week 9<br/>DevSecOps"]
    N --> M["☁️ Week 10<br/>Cloud"]

🎯 Remember: A service without monitoring is a service running in the dark. With the Four Golden Signals dashboard up, every incident response starts with "what does the data say?", not "who knows the most about this code?"