Lecture 8 - SRE & Monitoring: Reliability Is an Engineering Discipline
Slide 1 - The Outage You Didn't Know About
A typical Tuesday: your QuickNotes deploy goes out at 14:03. CI is green. The Lab 7 deploy succeeded. You go to lunch.
At 14:11, the error rate jumps from 0.1% to 4.2%. p95 latency triples. Saturation creeps up.
No alert fires. No one looks at a dashboard. Customers slowly notice ("the app is slow today...").
By 16:30, when someone finally sees a graph, you've degraded service for about 2.5 hours.
Lesson: Code tells you what your service should do. Monitoring tells you what it actually did.
Think: Lab 3 caught the bug at PR time. Lab 8 catches the bug after the deploy. Both matter. Today is about the second one.
Slide 2 - Learning Outcomes
| # | Outcome |
|---|---------|
| 1 | Distinguish SRE from DevOps; explain reliability as a feature |
| 2 | Cite the Four Golden Signals: latency, traffic, errors, saturation |
| 3 | Define SLI, SLO, SLA, error budget |
| 4 | Compare metrics, logs, traces (the three pillars of observability) |
| 5 | Read PromQL, set up Prometheus + Grafana for QuickNotes |
| 6 | Write an alert that doesn't wake you at 3 a.m. for no reason |
Slide 3 - Lecture Overview
graph LR
A["Reliability as a feature"] --> B["SLI / SLO / SLA"]
B --> C["Golden Signals"]
C --> D["Prometheus"]
D --> E["Grafana"]
E --> F["Alerting hygiene"]
F --> G["Real Incidents"]
Error budget = freedom to take risks. Used wisely, it lets you ship at speed.
An SLA breach often costs money. An SLO breach should cost a feature freeze.
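To make the error-budget idea concrete, here is a minimal Python sketch of the arithmetic. It assumes an availability-style SLO expressed as a fraction (for example 0.999) and a 30-day window; the function names `error_budget_minutes` and `budget_remaining` are illustrative and not part of QuickNotes or any library.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes for an availability SLO over a time window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of a request-based error budget still unspent.

    1.0 means nothing spent, 0.0 means fully spent, negative means overspent.
    """
    allowed_bad = (1.0 - slo) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100% SLO leaves no budget at all (hypothetical edge case).
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows about 43.2 minutes of unavailability.
    print(round(error_budget_minutes(0.999), 1))
    # 500 failed requests out of 1,000,000 spends about half of a 99.9% budget.
    print(round(budget_remaining(0.999, good=999_500, total=1_000_000), 2))
```

A team might treat a remaining fraction near zero as the trigger for the feature freeze mentioned above, though the exact policy is a team decision.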
Slide 7 - The Four Golden Signals
From the Google SRE book, Chapter 6 ("Monitoring Distributed Systems"): for every user-facing service, measure these:
| # | Signal | What it tells you | QuickNotes example |
|---|--------|-------------------|--------------------|
| 1 | Latency | Time per request (track successes and failures separately!) | p50 / p95 / p99 of GET /notes |
| 2 | Traffic | Demand on the system | Requests per second |
| 3 | Errors | Rate of failed requests | Rate of 5xx (and 4xx, depending on intent) |
| 4 | Saturation | How "full" the service is | CPU %, memory %, queue depth |
Get these four right, and you've caught roughly 80% of production problems.
The "RED" method (Tom Wilkie, Weaveworks): Rate, Errors, Duration. The same idea, shortened to the three request-level signals (no saturation).
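As a rough illustration of how the four signals in the table relate to raw request data, the Python sketch below derives them from an in-memory list of requests. The `Request` record, the nearest-rank percentile, and the `in_flight / capacity` saturation measure are all assumptions made for this example; a real service would export counters and histograms rather than keep every request in memory.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since the start of the window
    duration: float   # seconds
    status: int       # HTTP status code


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Compute the four signals for one window of requests."""
    ok_durations = [r.duration for r in requests if r.status < 500]
    failed = [r for r in requests if r.status >= 500]
    return {
        # Latency of successful requests only, so fast failures do not hide slowness.
        "latency_p95_ok": percentile(ok_durations, 0.95) if ok_durations else None,
        "traffic_rps": len(requests) / window_seconds,
        "error_ratio": len(failed) / len(requests) if requests else 0.0,
        # Saturation modeled here as in-flight requests over a hypothetical capacity.
        "saturation": in_flight / capacity,
    }


if __name__ == "__main__":
    sample = [Request(i * 0.5, i / 100, 200) for i in range(1, 20)]
    sample.append(Request(9.9, 2.0, 500))
    print(golden_signals(sample, window_seconds=10, in_flight=3, capacity=10))
```

Only 5xx responses count as errors in this sketch; whether 4xx should count depends on intent, as the table notes.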
Slide 8 - The Three Pillars of Observability
graph TB
O["Observability"]
O --> M["Metrics<br/>cheap, aggregated"]
O --> L["Logs<br/>rich, expensive"]
O --> T["Traces<br/>per-request paths"]
| Pillar | Best for | Tools |
|--------|----------|-------|
| Metrics | "How is the system now?" | Prometheus, OpenMetrics |
| Logs | "What happened to this request?" | journald, Loki, ELK |
| Traces | "Where did time go across services?" | OpenTelemetry, Jaeger, Tempo |
DevOps-Intro covers metrics deeply (this lecture). Logs were Lecture 4. Traces are SRE-Intro Lab 8.
Slide 9 - Prometheus: The Pull Model
graph LR
P["Prometheus Server"] -- "HTTP GET /metrics every 15s" --> S1["QuickNotes (:8080)"]
P -- "scrape" --> S2["node-exporter"]
P --> TSDB["Time-series DB"]
G["Grafana"] -- "PromQL" --> P
Prometheus (CNCF, graduated 2018) pulls metrics from targets, the opposite of push-based StatsD/Graphite.
Each target exposes /metrics in a text format (your QuickNotes already does; see handlers.go).
Stored as time series: metric_name{label1="v",label2="v"} value timestamp
Open source, single binary, no clustering required for small setups.
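To show what a scrape of /metrics returns, here is a small Python sketch that renders one counter in the Prometheus text exposition format. It is only an illustration of the format: QuickNotes itself exposes metrics from handlers.go, and the helper names and the `method` label below are assumptions made for the example.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample line: metric_name{label="v",...} value"""
    if labels:
        body = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"


def render_counter(name, help_text, samples):
    """Render HELP/TYPE metadata plus one line per (labels, value) pair."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    lines += [render_sample(name, labels, value) for labels, value in samples]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "quicknotes_http_requests_total",
        "Total HTTP requests.",
        [({"route": "/notes", "method": "GET"}, 42)],
    ), end="")
```

The timestamp is omitted here, which is the common case: Prometheus stamps each sample with the scrape time when the target does not supply one.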
Slide 10 - PromQL: The Three Queries You Need
# 1) Per-second request rate, averaged over the last minute
rate(quicknotes_http_requests_total[1m])
# 2) p95 latency of /notes
histogram_quantile(0.95,
  sum by (le) (rate(quicknotes_http_request_duration_seconds_bucket{route="/notes"}[5m]))
)
# 3) Error ratio (5xx + 4xx) over 5 minutes
sum(rate(quicknotes_http_responses_by_code_total{code=~"5..|4.."}[5m]))
/
sum(rate(quicknotes_http_requests_total[5m]))
Operators: rate(), sum by (), histogram_quantile(), and the comparison operators < and > for alert thresholds.
rate() works on counters only; for gauges use avg_over_time() or delta().
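The Python sketch below approximates what the first two queries compute, which may help when reading PromQL results. It is a simplification under stated assumptions: `simple_rate` handles counter resets but skips the window-edge extrapolation that Prometheus applies, and `histogram_quantile` assumes cumulative buckets sorted by upper bound and ending in +Inf. Both are illustrative re-implementations, not Prometheus code.

```python
import math


def simple_rate(samples):
    """Per-second increase of a counter given (timestamp, value) samples."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset (e.g. a restart): count up from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) buckets."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile lands in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Linear interpolation inside the bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    counter = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]  # reset at t=45
    print(simple_rate(counter))
    buckets = [(0.1, 50), (0.25, 90), (0.5, 98), (float("inf"), 100)]
    print(histogram_quantile(0.95, buckets))
```

The interpolation step explains why the p95 can only be as precise as the bucket boundaries allow: a value inside a wide bucket is assumed to be uniformly distributed within it.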
graph LR
A["Alert fires"] --> B["Runbook"]
B --> C{Mitigate?}
C -- "yes" --> D["Fix the symptom now"]
D --> E["Postmortem"]
C -- "no" --> F["Escalate"]
E --> G["Action items<br/>(prevent next time)"]
First mitigate, then fix. A 5-minute hack that stops the bleeding beats a 5-hour root-cause fix.
Every page → a postmortem (Lecture 1 covered the blameless format). The action items are the real value.
Healthy rotations: ≤ 1 page per shift on average, week-on/week-off, follow-the-sun across timezones.
Slide 15 - Real Story: The Slack 2022 Outage
February 22, 2022 (yes, "2/22/22"): Slack offline for ~5 hours.
Triggered by an automated config push that broke a database. Cascading failures took out web, messaging, and file uploads.
Task 1 (6 pts): Run Prometheus + Grafana via Compose against your QuickNotes container. Provision a "Golden Signals" dashboard (request rate, error %, p95 latency, memory).
Task 2 (4 pts): Write one good alert: error rate > 5% for 5 minutes. Trigger it deliberately by setting FAIL_RATE=1.0. Verify that the alert fires, and document the runbook step.
Bonus (2 pts): Spin up a Checkly free-tier synthetic check against your QuickNotes (deployed via Lab 7 or running locally with ngrok). Compare the external view with the internal metrics.
Deliverable: submissions/lab8.md with dashboard screenshots, alert evidence, and written analysis.
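The "for 5 minutes" part of the Task 2 alert is what keeps a brief spike from paging anyone. The Python sketch below models that behavior as a small state machine: the condition must hold continuously for the hold period before the alert moves from pending to firing. The `AlertRule` class and its field names are hypothetical, written only to illustrate the semantics; in the lab the rule itself lives in Prometheus or Grafana.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold: float = 0.05       # error ratio above which the condition is true
    hold_seconds: float = 300.0   # how long it must stay true before firing
    pending_since: Optional[float] = None

    def evaluate(self, now: float, error_ratio: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation tick."""
        if error_ratio <= self.threshold:
            # Any dip below the threshold resets the timer.
            self.pending_since = None
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.hold_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    rule = AlertRule()
    # With FAIL_RATE=1.0 every request fails, so the ratio stays at 1.0.
    for t in range(0, 361, 60):
        print(t, rule.evaluate(t, 1.0))
```

With one evaluation per minute, the alert stays pending through t=240 and fires at t=300, which is worth keeping in mind when collecting the alert evidence: the firing state should appear about five minutes after the failure rate is raised.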
Slide 18 - Key Takeaways
Reliability is a feature: budget for it, like any other feature.
The Four Golden Signals (Latency, Traffic, Errors, Saturation) catch roughly 80% of incidents.
SLI → SLO → error budget is the framework that makes "reliable enough" measurable.
Metrics + Logs + Traces are complementary, not redundant.
Fewer, sharper alerts beat many noisy ones: alert on user-visible symptoms.
Every page → a runbook → a postmortem: closing the loop is how systems get better.
Slide 19 - What's Next + Resources
Next lecture: DevSecOps. Shift security left; scan deps and images.
Lab 8: Prometheus + Grafana for QuickNotes, one good alert. Bonus: Checkly synthetic check.
graph LR
P["Week 7<br/>Ansible CM"] --> Y["You Are Here<br/>SRE & Monitoring"]
Y --> N["Week 9<br/>DevSecOps"]
N --> M["Week 10<br/>Cloud"]
Remember: A service without monitoring is a service running in the dark. With the Four Golden Signals dashboard up, every incident response starts with "what does the data say?", not "who knows the most about this code?"