📌 Lecture 4 – Operating Systems & Networking: The Substrate Underneath Everything


📍 Slide 1 – 💥 The Day Facebook Disappeared from the Internet

  • 🗓️ October 4, 2021, 15:39 UTC: Facebook engineers push a BGP configuration change that withdraws all of Facebook's prefixes from the public internet
  • 🚪 Within minutes, DNS servers (which lived inside those prefixes) become unreachable. Facebook, Instagram, WhatsApp, Oculus: all gone
  • 🪦 Six hours offline. Engineers couldn't even badge into the building to physically fix it, because the access-control system used the same DNS
  • 💵 Estimated cost: $60 million in lost revenue, plus a 5% stock drop
  • 🎓 Lesson: Every layer of your stack rides on networking and an operating system. When they fail, the world stops

🤔 Think: Where does your deployment end and the substrate begin? When something breaks, who debugs the OSI layer 3 problem: Dev or Ops?


📍 Slide 2 – 🎯 Learning Outcomes

| # | 🎓 Outcome |
|---|---|
| 1 | ✅ Sketch the OSI / TCP-IP stack and place real protocols on it |
| 2 | ✅ Trace a packet from curl to a remote server: DNS → TCP → TLS → HTTP |
| 3 | ✅ Read ss, ip, dig, tcpdump, journalctl output without flinching |
| 4 | ✅ Explain how systemd runs a service and what systemctl status reveals |
| 5 | ✅ Reason about UNIX permissions (rwx, ugo, chmod/chown) |
| 6 | ✅ Capture and analyze a packet trace of QuickNotes traffic |

📍 Slide 3 – 🗺️ Lecture Overview

graph LR
    A["🌐 Networking Stack"] --> B["🔍 DNS"]
    B --> C["🔐 TLS"]
    C --> D["⚖️ Load Balancing"]
    D --> E["🐧 Linux Essentials"]
    E --> F["🔧 systemd"]
    F --> G["📜 Logs & Debugging"]

  • 📍 Slides 1-9: Networking from OSI to TLS to load balancers
  • 📍 Slides 10-15: Linux (filesystem, processes, systemd, permissions)
  • 📍 Slides 16-18: Debugging the substrate
  • 📍 Slides 19-21: Real incidents, Lab 4, takeaways

📍 Slide 4 – 📚 The OSI Model & Where Real Things Live

| Layer | Name | Real example | What you debug here |
|---|---|---|---|
| 7 | Application | HTTP, gRPC | curl, status codes |
| 6 | Presentation | TLS, JSON encoding | Cert chains |
| 5 | Session | (mostly folded in) | Sticky sessions |
| 4 | Transport | TCP, UDP, QUIC | ss -tn, retransmits |
| 3 | Network | IP, ICMP, BGP | ip route, traceroute |
| 2 | Data Link | Ethernet, ARP | ip neigh |
| 1 | Physical | Copper, fiber, radio | The blinking light |

  • 🎯 In practice, DevOps work happens at L3 (IP routing), L4 (TCP/load balancers), L7 (HTTP)
  • 🤣 The joke: "It's never DNS. Until it is. Then it's always DNS." And DNS itself is an L7 protocol, usually carried over UDP at L4, over IP at L3

📍 Slide 5 – 🌍 DNS: The First Thing That Breaks

# ✅ what IP does a name resolve to?
$ dig +short github.com
140.82.112.4

# ✅ which DNS server gave that answer?
$ dig github.com | grep SERVER
;; SERVER: 1.1.1.1#53(1.1.1.1)

# ✅ trace the resolution chain end-to-end
$ dig +trace github.com

| Record | Holds | Used for |
|---|---|---|
| A | IPv4 address | Most lookups |
| AAAA | IPv6 address | IPv6 |
| CNAME | Alias to another name | "www → github.com" |
| MX | Mail exchange | Mail routing |
| TXT | Arbitrary text | SPF, DKIM, domain ownership |

  • ⏳ TTL: how long a resolver may cache the answer. Low TTL = fast cutover, but more queries
  • 🧪 In Lab 4 you'll run dig and host against the QuickNotes test domain

📍 Slide 6 – 📡 HTTP and What curl Actually Sends

# ✅ raw request
GET /notes HTTP/1.1
Host: localhost:8080
User-Agent: curl/8.5.0
Accept: */*

# ✅ response
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 245

| Code class | Meaning | Example |
|---|---|---|
| 2xx | Success | 200 OK, 201 Created, 204 No Content |
| 3xx | Redirection | 301 Moved Permanently, 304 Not Modified |
| 4xx | Client error | 400 Bad Request, 401 Unauthorized, 404 Not Found |
| 5xx | Server error | 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable |

  • 🆕 HTTP/2 (2015): multiplexing, header compression. HTTP/3 (2022): runs on QUIC over UDP, which removes TCP-level head-of-line blocking between streams
  • 🧪 curl -v http://localhost:8080/health shows you everything underneath the JSON body: connection setup, request headers, response headers
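The request and status-class tables above can be reproduced from Go without sending anything over the network. This is a minimal sketch: `statusClass` and `rawRequest` are hypothetical helper names, and the headers Go adds (for example its own User-Agent) will differ from the curl output shown on the slide.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
)

// statusClass maps a status code to the class names used in the table above.
func statusClass(code int) string {
	switch code / 100 {
	case 1:
		return "Informational"
	case 2:
		return "Success"
	case 3:
		return "Redirection"
	case 4:
		return "Client error"
	case 5:
		return "Server error"
	default:
		return "Unknown"
	}
}

// rawRequest renders the bytes a Go client would put on the wire for a GET,
// including the headers the standard transport adds. Nothing is sent.
func rawRequest(url string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	dump, err := httputil.DumpRequestOut(req, false)
	if err != nil {
		return "", err
	}
	return string(dump), nil
}

func main() {
	raw, err := rawRequest("http://localhost:8080/notes")
	if err != nil {
		fmt.Println("dump failed:", err)
		return
	}
	fmt.Print(raw)

	for _, code := range []int{200, 304, 404, 502} {
		fmt.Println(code, statusClass(code))
	}
}
```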

📍 Slide 7 – 🔐 TLS in 90 Seconds

sequenceDiagram
    participant C as 💻 Client
    participant S as 🖥️ Server
    C->>S: ClientHello (TLS 1.3, ciphers, SNI, key share)
    S->>C: ServerHello + key share + cert + Finished
    C->>S: Finished (keys derived via HKDF)
    C->>S: 🔒 GET /notes (encrypted)

  • 🪪 The certificate proves the server controls the domain (issued by a trusted CA)
  • 🤝 TLS 1.3 (2018): one round trip, because the server sends its Finished in the same flight as ServerHello; old TLS 1.2 needed two
  • 🆓 Let's Encrypt (generally available since 2016) issues free, automated certs, which kills the "pay $300/year for HTTPS" excuse
  • 🔍 Check a server's cert:
openssl s_client -connect github.com:443 -servername github.com </dev/null | openssl x509 -noout -dates

💡 The padlock in your browser only means encryption + identity-of-domain. Not "this site is honest."


📍 Slide 8 – ⚖️ Load Balancing: One Service, Many Backends

graph LR
    U["👤 Users"] --> LB["⚖️ Load Balancer<br/>nginx / HAProxy / ALB"]
    LB --> B1["🟢 backend 1"]
    LB --> B2["🟢 backend 2"]
    LB --> B3["🟢 backend 3"]

| L4 LB | L7 LB |
|---|---|
| Routes by IP:port | Routes by URL path, host, header |
| Faster, dumber | Smarter, can do TLS termination |
| Examples: iptables, AWS NLB | Examples: nginx, HAProxy, AWS ALB |

  • 🔁 Algorithms: round-robin, least-connections, IP-hash (sticky)
  • 💊 Health checks decide which backends get traffic; bad health-check thresholds caused the 2017 GitHub git-backend outage
  • 🐧 In Lab 4 you'll inspect how Docker Compose itself runs a tiny L4 LB via its embedded DNS

📍 Slide 9 – 🔍 Debugging the Network: Five Commands

# 1) what's listening?
ss -tln

# 2) what routes do I have?
ip route show

# 3) can I reach the target?
mtr -rwc 5 github.com    # combines ping + traceroute

# 4) what's the wire actually carrying?
sudo tcpdump -i lo -nn -A 'tcp port 8080'

# 5) is DNS the problem?
dig +short example.com @1.1.1.1
  • 🪤 netstat is deprecated; ss is the modern replacement (faster, more detailed output)
  • 🔍 Wireshark / tcpdump decode packets all the way up to L7, which is invaluable for "the response was wrong" bugs
  • 🆓 Lab 4 Task 1 runs all five against a live QuickNotes server

📍 Slide 10 – 🐧 Linux File System Hierarchy

/             # root of everything
├── bin/      # essential user binaries (ls, cat, sh)
├── boot/     # kernel + bootloader files
├── etc/      # system configuration
│   ├── systemd/
│   ├── ssh/
│   └── nginx/
├── home/     # per-user homes
├── var/
│   ├── log/      # logs (text, often the FIRST place to look)
│   ├── lib/      # service data (postgres, docker)
│   └── cache/    # cacheable, regenerable
├── tmp/      # ephemeral; often a tmpfs
├── proc/     # virtual: process & kernel info
├── sys/      # virtual: kernel objects
└── usr/      # installed programs, libraries, shared data (local installs go in /usr/local)

  • 🤓 FHS (Filesystem Hierarchy Standard): a real spec, not folklore
  • 🧪 /proc/$PID/status and /proc/$PID/limits expose per-process state; reading files under /proc is how ps and top actually get their data
  • 📁 "Everything is a file", including network sockets, which are listed in /proc/$PID/net/tcp
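Because /proc entries are plain text, a few lines of code are enough to read what `ps` reads. The sketch below parses the `Key:<tab>value` layout of a status file; `parseStatus` is a hypothetical helper, and the program simply reports an error on systems without /proc.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parseStatus turns the "Key:\tvalue" lines of a /proc/<pid>/status file into a map.
func parseStatus(text string) map[string]string {
	fields := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		key, value, found := strings.Cut(sc.Text(), ":")
		if !found {
			continue // not a key/value line
		}
		fields[key] = strings.TrimSpace(value)
	}
	return fields
}

func main() {
	// /proc/self always refers to the process doing the reading.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("could not read /proc (not Linux?):", err)
		return
	}
	st := parseStatus(string(data))
	fmt.Println("Name: ", st["Name"])
	fmt.Println("Pid:  ", st["Pid"])
	fmt.Println("VmRSS:", st["VmRSS"])
}
```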

📍 Slide 11 – 🔄 Processes & Signals

# ✅ tree of processes
ps auxf | less

# ✅ live view
top    # or `htop` for the modern, color one

# ✅ stop a process politely (SIGTERM, give it time to clean up)
kill 12345

# ⚠️ stop it brutally (SIGKILL, no chance to flush)
kill -9 12345

| Signal | Number | Meaning |
|---|---|---|
| SIGHUP | 1 | Reload config (convention) |
| SIGINT | 2 | Ctrl-C |
| SIGTERM | 15 | Default for kill; catchable, so shutdown can be graceful |
| SIGKILL | 9 | Uncatchable, immediate |
| SIGSTOP | 19 | Pause until SIGCONT |

  • 🛡️ QuickNotes (main.go) traps SIGTERM and runs an http.Server.Shutdown, which is a graceful drain. Lab 4 Task 1 verifies it.

📍 Slide 12 – 🔧 systemd: How Linux Runs Services

# /etc/systemd/system/quicknotes.service
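[Unit]
Description=QuickNotes API
# After= only orders startup; Wants= is what actually pulls the target in
Wants=network-online.target
After=network-online.target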

[Service]
ExecStart=/usr/local/bin/quicknotes
Restart=on-failure
RestartSec=2
User=quicknotes
Environment=ADDR=:8080

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now quicknotes
sudo systemctl status quicknotes
journalctl -u quicknotes -f       # tail logs

  • 🎯 systemd unifies init, logging (journald), cron-style scheduling (timers), and process supervision
  • 🧪 Lab 7 (Ansible) will write this exact unit file via a playbook

📍 Slide 13 – 🔐 UNIX Permissions

$ ls -lh app/main.go
-rw-r--r-- 1 quicknotes quicknotes 1.2K Mar 12 10:42 app/main.go
│└┬┘└┬┘└┬┘   └───┬────┘ └───┬────┘
│ │  │  │        │          └ group: quicknotes
│ │  │  │        └ owner: quicknotes
│ │  │  └ other: r--
│ │  └ group: r--
│ └ owner: rw-
└ file type: - (regular file)

| Mode | Symbolic | Octal | Means |
|---|---|---|---|
| Owner rw, group r, other r | rw-r--r-- | 644 | Common for files |
| Owner rwx, group rx, other rx | rwxr-xr-x | 755 | Common for binaries / dirs |
| Owner rwx, no one else | rwx------ | 700 | Private directories such as ~/.ssh (key files themselves use 600) |
| setuid bit | rwsr-xr-x | 4755 | Runs as the file's owner (e.g. passwd) |

  • 🛡️ Never chmod 777 to "make it work": it's the security equivalent of git push --force to main
  • 🧰 chmod, chown, umask: practice all three in Lab 4

📍 Slide 14 – 📜 Logs: journalctl + /var/log/

# everything since boot
journalctl -b

# only the quicknotes unit, follow live
journalctl -u quicknotes -f

# JSON output for piping into jq
journalctl -u quicknotes -o json | jq '.MESSAGE'

# old-school text logs still exist
tail -f /var/log/nginx/access.log
  • 🗃️ journald stores structured logs; text files in /var/log/ are still common for daemons that pre-date systemd
  • 🔁 Log rotation: logrotate(8) rotates and compresses old logs, typically once a day. Forgetting to configure it = disk full = outage
  • 🧪 In Lab 8 you'll ship these logs to Grafana Loki, but Lab 4 just teaches you to read them

📍 Slide 15 – 🛠️ Useful Linux Tools Cheat Sheet

| Need to | Tool | Modern? |
|---|---|---|
| List files | ls -la | eza (Rust rewrite) |
| Find a file | find . -name 'main.go' | fd |
| Search file contents | grep -rn TODO . | ripgrep (rg) |
| Watch a metric | watch -n 1 'free -h' | btop, htop |
| Pretty cat | cat file | bat |
| Diff | diff -u a b | delta |

  • ⚙️ Most of the modern tools are written in Rust and tend to be faster, with better UX. None are required; the classics still work
  • 🐢 Avoid find / 2>/dev/null because it scans your entire filesystem. Always start from a directory you know

📍 Slide 16 – 🩻 Debugging Substrate Failures

graph TD
    P["❓ QuickNotes returns 502"] --> Q1["💻 Is the binary running?<br/>systemctl status"]
    Q1 -- "yes" --> Q2["🌐 Is it listening?<br/>ss -tln | grep 8080"]
    Q2 -- "yes" --> Q3["🔗 Can you reach it from the host?<br/>curl localhost:8080/health"]
    Q3 -- "yes" --> Q4["🔥 Is the firewall blocking?<br/>iptables -L / nft list"]
    Q4 -- "no" --> Q5["🪪 DNS resolving correctly?<br/>dig +short quicknotes.example.com"]
    Q1 -- "no" --> R["📜 journalctl -u quicknotes"]

  • 🧠 Always work outside-in: start with the symptom, peel back layers
  • 🧪 Lab 4 Task 2 walks this exact chain for a deliberately-broken QuickNotes deploy

📍 Slide 17 – ❌ Common Antipatterns

| 🔥 Antipattern | ✅ Better |
|---|---|
| sudo everywhere because "it works" | Diagnose the permission, then add the narrow capability |
| chmod 777 /var/something to fix an upload | Find the right user/group, chown + chmod 770 |
| Running services as root | Dedicated user; User=quicknotes in the systemd unit |
| Editing /etc/hosts to "fix DNS" | Fix DNS instead; host overrides drift between machines |
| Logging to a file in /tmp (cleared on reboot) | Log to stdout (journald captures it) |
| ps aux \| grep myproc to "check if running" | Ask the supervisor: systemctl status myproc |

📍 Slide 18 – 🔥 Real Story: It Wasn't DNS… It Was BGP

  • 🗓️ October 4, 2021: Facebook's planned BGP maintenance withdraws all routes to FB's data centers
  • 🪦 As a side effect, DNS servers (inside those data centers) become unreachable. Every cached answer eventually expires. The internet forgets Facebook exists
  • 🚪 Engineers can't fix it remotely (no VPN), and the badge readers used the same DNS, so those on site can't physically enter the building
  • ⌛ Total outage: ~6 hours, ~$60M direct loss
  • 🎓 The lesson is dependency mapping: your monitoring, your access control, and your remote console all live on the same network they're supposed to help you fix

📝 Read: Cloudflare's blog on the FB outage (Oct 2021), the best outside narrative of what happened


📍 Slide 19 – 🧪 Lab 4 Preview

  • 🛠️ Task 1 (6 pts): Run QuickNotes locally. Use ss, dig, tcpdump, curl -v to map exactly what the network does when a single POST /notes happens
  • 🩺 Task 2 (4 pts): Walk a deliberately-broken deploy through the outside-in debugging chain: find the root cause, then document it as a mini-postmortem
  • 🎁 Bonus (2 pts): Capture the TLS handshake with tcpdump, then decode ClientHello + ServerHello with Wireshark
  • 📜 Deliverable: submissions/lab4.md with packet timestamps, output snippets, and a one-paragraph "what surprised me"

📍 Slide 20 – 🧠 Key Takeaways

  1. 🌐 Every deploy rides on networking and an OS: when those fail, your app fails, but the fix is at a lower layer
  2. 🔍 DNS is almost always involved, but it's worth checking BGP, firewall, and TLS too
  3. 🐧 systemd is the default front door: unit files, journalctl, restarts; learn them once, use them everywhere
  4. 🛠️ ss, dig, tcpdump, journalctl are your tier-1 debugging tools; every DevOps engineer should be fluent
  5. 🛡️ Permissions matter: 777 is not a fix
  6. 🤝 The dependency graph is recursive: your monitoring is a service. Your access control is a service. Plan for them to fail

📍 Slide 21 – 🚀 What's Next + 📚 Resources

  • 📍 Next lecture: Virtualization (running QuickNotes inside a VM, then comparing to a container)
  • 🧪 Lab 4: Network capture + Linux debugging on QuickNotes (Task 1 + Task 2 + Bonus TLS dump)
  • 📖 Read this week:
  • 🛠️ Tools to install this week: dig, ss, tcpdump, mtr, htop, jq, ripgrep

graph LR
    P["🤖 Week 3<br/>CI/CD"] --> Y["📍 You Are Here<br/>OS + Networking"]
    Y --> N["📦 Week 5<br/>Virtualization"]
    N --> M["🐳 Week 6<br/>Containers"]

🎯 Remember: The substrate isn't glamorous, but it's where every "production outage" actually lives. The engineers who can debug it become very valuable, very fast.