6 changes: 6 additions & 0 deletions app/.dockerignore
@@ -0,0 +1,6 @@
data/
*.pcap
lab4-trace*
sudo
wsl
.git
32 changes: 32 additions & 0 deletions app/Dockerfile
@@ -0,0 +1,32 @@
# syntax=docker/dockerfile:1

FROM golang:1.24.5-alpine AS builder
WORKDIR /src

COPY go.mod ./
RUN go mod download

COPY . .
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build \
    -trimpath -ldflags='-s -w' -o /quicknotes .

# Tiny static HTTP probe — distroless has no shell/curl for HEALTHCHECK
RUN printf '%s\n' \
    'package main' \
    'import ("net/http"; "os")' \
    'func main() {' \
    ' r, err := http.Get("http://127.0.0.1:8080/health")' \
    ' if err != nil || r == nil || r.StatusCode != http.StatusOK { os.Exit(1) }' \
    '}' \
    > /healthcheck.go && \
    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build \
    -trimpath -ldflags='-s -w' -o /healthcheck /healthcheck.go

FROM gcr.io/distroless/static:nonroot
COPY --from=builder /quicknotes /quicknotes
COPY --from=builder /healthcheck /healthcheck
COPY seed.json /seed.json

EXPOSE 8080
USER nonroot
ENTRYPOINT ["/quicknotes"]
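The inline probe generated in the builder stage is hard to read as a chain of `printf` arguments. A minimal sketch of the same idea as a standalone Go file follows, assuming a client timeout and an explicit body close would be acceptable additions (neither is in the Dockerfile). The `probe` and `demoServer` names and the 2-second timeout are hypothetical, and `main` runs against an in-process test server only so the example works without QuickNotes; the real probe would call `probe("http://127.0.0.1:8080/health", ...)` and `os.Exit(1)` on error.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// probe performs one GET and returns nil only when the response is HTTP 200.
// The timeout is an assumption; the original probe uses http.Get without one.
func probe(url string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Drain a bounded amount of the body before closing the connection.
	_, _ = io.Copy(io.Discard, io.LimitReader(resp.Body, 4096))
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

// demoServer is a hypothetical stand-in for QuickNotes: /health answers 200,
// every other path answers 404.
func demoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/health" {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusNotFound)
	}))
}

func main() {
	srv := demoServer()
	defer srv.Close()

	fmt.Println(probe(srv.URL+"/health", 2*time.Second))  // prints <nil>
	fmt.Println(probe(srv.URL+"/missing", 2*time.Second)) // prints unexpected status 404
}
```

Keeping the client timeout below the `timeout: 3s` set in the Compose healthcheck would let the probe fail with its own error rather than being killed by Docker.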
69 changes: 69 additions & 0 deletions compose.yaml
@@ -0,0 +1,69 @@
services:
  vol-init:
    image: busybox:1.36
    user: "0"
    volumes:
      - quicknotes-data:/data
    command: ["sh", "-c", "mkdir -p /data && chown 65532:65532 /data"]
    restart: "no"

  quicknotes:
    build:
      context: ./app
      dockerfile: Dockerfile
    image: quicknotes:lab6
    depends_on:
      vol-init:
        condition: service_completed_successfully
    ports:
      - "8080:8080"
    environment:
      ADDR: ":8080"
      DATA_PATH: "/data/notes.json"
      SEED_PATH: "/seed.json"
    volumes:
      - quicknotes-data:/data
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "/healthcheck"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 5s
    cap_drop:
      - ALL
    read_only: true
    tmpfs:
      - /tmp
    security_opt:
      - no-new-privileges:true

  prometheus:
    image: prom/prometheus:v3.2.1
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/prometheus/rules:/etc/prometheus/rules:ro
    depends_on:
      quicknotes:
        condition: service_healthy
    restart: unless-stopped

  grafana:
    image: grafana/grafana:13.0.3
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: lab8-grafana-admin
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning:ro
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards:ro
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  quicknotes-data:
29 changes: 29 additions & 0 deletions docs/runbook/high-error-rate.md
@@ -0,0 +1,29 @@
# Runbook: High HTTP Error Rate

## What this alert means

More than 5% of QuickNotes HTTP responses have been 4xx or 5xx for at least five minutes, so users are likely seeing failed requests.

## Triage steps

1. **Confirm the alert** — open Prometheus (`http://localhost:9090/alerts`) or Grafana and verify `HighErrorRate` is `Firing`; note the start time.
2. **Check QuickNotes health** — `curl -s http://localhost:8080/health` and `docker compose ps quicknotes`; confirm the container is `healthy` and `status` is `ok`.
3. **Inspect recent logs** — `docker compose logs --tail=100 quicknotes` for panics, permission errors, or repeated 4xx patterns.
4. **Check the error ratio query** — in Prometheus, run:
   ```promql
   sum(rate(quicknotes_http_responses_by_code_total{code=~"4..|5.."}[5m]))
   /
   sum(rate(quicknotes_http_requests_total[5m]))
   ```
   Break the numerator down by the `code` label to see whether errors are mostly 4xx (bad clients) or 5xx (server faults).
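To make the arithmetic behind the ratio query concrete, here is a minimal Go sketch of the same calculation over already-computed per-second rates. It is not how Prometheus evaluates the expression; `errorRatio`, its parameters, and the sample rates are hypothetical, and the 5% threshold is taken from the alert description in this runbook.

```go
package main

import "fmt"

// errorRatio mirrors the query's arithmetic: the summed rate of 4xx and 5xx
// responses divided by the total request rate. ok is false when there is no
// traffic, because the ratio is undefined in that case.
func errorRatio(rateByCode map[string]float64, totalRate float64) (ratio float64, ok bool) {
	if totalRate <= 0 {
		return 0, false
	}
	var errRate float64
	for code, rate := range rateByCode {
		// Same match as code=~"4..|5..": three characters starting with 4 or 5.
		if len(code) == 3 && (code[0] == '4' || code[0] == '5') {
			errRate += rate
		}
	}
	return errRate / totalRate, true
}

func main() {
	// Hypothetical per-code rates (responses per second) over a 5-minute window.
	rates := map[string]float64{"200": 9.0, "400": 0.5, "404": 0.25, "500": 0.25}
	ratio, ok := errorRatio(rates, 10.0)
	fmt.Printf("ratio=%.2f ok=%v firing=%v\n", ratio, ok, ok && ratio > 0.05)
	// prints: ratio=0.10 ok=true firing=true
}
```

In this sample the 4xx codes account for 0.75 of the 1.0 errors per second, which would point toward bad client requests rather than server faults.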

## Mitigations

1. **Restart QuickNotes** — `docker compose restart quicknotes` to clear a stuck process or bad in-memory state while you investigate.
2. **Stop bad traffic** — if a script or client is sending malformed `POST /notes` bodies, pause or throttle it; errors should fall below 5% within the next evaluation window.

## Post-incident

1. Write a **blameless postmortem** using the format in [Lecture 1 — postmortems](../../lectures/lec1.md) (what happened, why, action items with owners and dates).
2. Add or tighten tests/alerts if the root cause was preventable (e.g., validation bug, missing rate limit).
3. Update this runbook if any triage step was missing or misleading.
55 changes: 55 additions & 0 deletions monitoring/docs/bonus-checkly-setup.md
@@ -0,0 +1,55 @@
# Lab 8 Bonus — Checkly + ngrok setup

## 1. Expose QuickNotes publicly

QuickNotes must be running (`docker compose up -d`).

In a **new PowerShell terminal** (keep it open):

```powershell
ngrok http 8080
```

Copy the **Forwarding** HTTPS URL, e.g. `https://abc123.ngrok-free.app`

Test it:

```powershell
Invoke-RestMethod https://YOUR-NGROK-URL/health
```

## 2. Create Checkly API check (free account)

1. Sign up at https://www.checklyhq.com/
2. **Checks → Add check → API check**
3. Settings:
   - **Name:** `QuickNotes health (Lab 8)`
   - **URL:** `https://YOUR-NGROK-URL/health`
   - **Method:** GET
   - **Frequency:** 1 minute
   - **Locations:** pick **2 regions** (e.g. `Frankfurt (eu-central-1)` + `Singapore (ap-southeast-1)`)
   - **Assertion:** status code equals `200`
   - **Assertion:** response time less than `2000` ms
4. Save and enable the check.
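The two assertions configured above can be read as a simple predicate over each check run. The following Go sketch illustrates that logic only; it is not Checkly's implementation, and the `checkResult` type and `failures` function are hypothetical names introduced for the example.

```go
package main

import (
	"fmt"
	"time"
)

// checkResult holds what a single run of the API check observes.
type checkResult struct {
	Status  int
	Elapsed time.Duration
}

// failures lists which of the two configured assertions a result violates:
// status code equals 200, and response time less than 2000 ms.
func failures(r checkResult) []string {
	var out []string
	if r.Status != 200 {
		out = append(out, fmt.Sprintf("status code %d, want 200", r.Status))
	}
	if r.Elapsed >= 2000*time.Millisecond {
		out = append(out, fmt.Sprintf("response time %d ms, want < 2000 ms", r.Elapsed.Milliseconds()))
	}
	return out
}

func main() {
	// A fast 200 passes both assertions; a slow 502 fails both.
	fmt.Println(failures(checkResult{Status: 200, Elapsed: 180 * time.Millisecond})) // prints []
	fmt.Println(failures(checkResult{Status: 502, Elapsed: 2500 * time.Millisecond}))
}
```

A run counts as failed if either assertion is violated, so a slow `200` through the ngrok tunnel would still show up as a failure for that region.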

## 3. Let it run >= 30 minutes

Leave ngrok + Checkly running. Optionally generate light traffic:

```bash
bash monitoring/scripts/generate-traffic.sh
```

## 4. Collect numbers for `submissions/lab8.md`

**Prometheus (internal):**

```bash
bash monitoring/scripts/bonus-prometheus-snapshot.sh
```

**Checkly (external):** open the check → **Check results** / **Metrics** → note p50/p95 latency and failures per region over the same 30-minute window.

## 5. Stop ngrok when done

`Ctrl+C` in the ngrok terminal.
180 changes: 180 additions & 0 deletions monitoring/grafana/dashboards/golden-signals.json
@@ -0,0 +1,180 @@
{
  "annotations": {
    "list": []
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "reqps"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 0
      },
      "id": 1,
      "options": {
        "legend": {
          "displayMode": "list",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "expr": "rate(quicknotes_http_requests_total[1m])",
          "legendFormat": "requests/sec (latency proxy)",
          "refId": "A"
        }
      ],
      "title": "Latency (proxy: request rate)",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "reqps"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 0
      },
      "id": 2,
      "options": {
        "legend": {
          "displayMode": "list",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "expr": "rate(quicknotes_http_requests_total[5m])",
          "legendFormat": "traffic",
          "refId": "A"
        }
      ],
      "title": "Traffic",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "max": 1,
          "min": 0,
          "unit": "percentunit"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 0,
        "y": 8
      },
      "id": 3,
      "options": {
        "legend": {
          "displayMode": "list",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "expr": "sum(rate(quicknotes_http_responses_by_code_total{code=~\"4..|5..\"}[5m])) / sum(rate(quicknotes_http_requests_total[5m]))",
          "legendFormat": "error ratio",
          "refId": "A"
        }
      ],
      "title": "Errors (4xx+5xx / total)",
      "type": "timeseries"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "prometheus"
      },
      "fieldConfig": {
        "defaults": {
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 12,
        "x": 12,
        "y": 8
      },
      "id": 4,
      "options": {
        "legend": {
          "displayMode": "list",
          "placement": "bottom"
        },
        "tooltip": {
          "mode": "single"
        }
      },
      "targets": [
        {
          "expr": "quicknotes_notes_total",
          "legendFormat": "notes stored",
          "refId": "A"
        }
      ],
      "title": "Saturation (notes stored)",
      "type": "timeseries"
    }
  ],
  "refresh": "10s",
  "schemaVersion": 39,
  "tags": [
    "lab8",
    "golden-signals"
  ],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-15m",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "browser",
  "title": "QuickNotes Golden Signals",
  "uid": "quicknotes-golden-signals",
  "version": 1
}
12 changes: 12 additions & 0 deletions monitoring/grafana/provisioning/dashboards/dashboard.yml
@@ -0,0 +1,12 @@
apiVersion: 1

providers:
  - name: golden-signals
    orgId: 1
    folder: QuickNotes
    type: file
    disableDeletion: false
    editable: true
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards