Lecture 4 - Operating Systems & Networking: The Substrate Underneath Everything
Slide 1 - The Day Facebook Disappeared from the Internet
October 4, 2021, 15:39 UTC: Facebook engineers push a BGP configuration change that withdraws all of Facebook's prefixes from the public internet
Within minutes, the DNS servers (which lived inside those prefixes) become unreachable. Facebook, Instagram, WhatsApp, Oculus: all gone
Six hours offline. Engineers couldn't even badge into the building to fix it physically, because the access-control system used the same DNS
Estimated cost: $60 million in lost revenue, plus a 5% stock drop
Lesson: Every layer of your stack rides on networking and an operating system. When they fail, the world stops
Think: Where does your deployment end and the substrate begin? When something breaks, who debugs the OSI layer 3 problem, Dev or Ops?
Slide 2 - Learning Outcomes
| # | Outcome |
| --- | --- |
| 1 | Sketch the OSI / TCP-IP stack and place real protocols on it |
| 2 | Trace a packet from curl to a remote server: DNS → TCP → TLS → HTTP |
| 3 | Read ss, ip, dig, tcpdump, journalctl output without flinching |
| 4 | Explain how systemd runs a service and what systemctl status reveals |
| 5 | Reason about UNIX permissions (rwx, ugo, chmod/chown) |
| 6 | Capture and analyze a packet trace of QuickNotes traffic |
Slide 3 - Lecture Overview
graph LR
A["Networking Stack"] --> B["DNS"]
B --> C["TLS"]
C --> D["Load Balancing"]
D --> E["Linux Essentials"]
E --> F["systemd"]
F --> G["Logs & Debugging"]
Slides 1-9: Networking from OSI to TLS to load balancers
Slides 19-21: Real incidents, Lab 4, takeaways
Slide 4 - The OSI Model & Where Real Things Live
| Layer | Name | Real example | What you debug here |
| --- | --- | --- | --- |
| 7 | Application | HTTP, gRPC | curl, status codes |
| 6 | Presentation | TLS, JSON encoding | Cert chains |
| 5 | Session | (mostly folded in) | Sticky sessions |
| 4 | Transport | TCP, UDP, QUIC | ss -tn, retransmits |
| 3 | Network | IP, ICMP, BGP | ip route, traceroute |
| 2 | Data Link | Ethernet, ARP | ip neigh |
| 1 | Physical | Copper, fiber, radio | The blinking light |
In practice, most DevOps work happens at L3 (IP routing), L4 (TCP, load balancers), and L7 (HTTP)
The joke: "It's never DNS. Until it is. Then it's always DNS." And DNS is L7 over UDP at L4 over IP at L3
Slide 5 - DNS: The First Thing That Breaks
# what IP does a name resolve to?
$ dig +short github.com
140.82.112.4
# which DNS server gave that answer?
$ dig github.com | grep SERVER
;; SERVER: 1.1.1.1#53(1.1.1.1)
# trace the resolution chain end-to-end
$ dig +trace github.com
| Record | Holds | Used for |
| --- | --- | --- |
| A | IPv4 address | Most lookups |
| AAAA | IPv6 address | IPv6 |
| CNAME | Alias to another name | "www → github.com" |
| MX | Mail exchange | Mail routing |
| TXT | Arbitrary text | SPF, DKIM, domain ownership |
TTL: how long a resolver may cache the answer. Low TTL = fast cutover, more queries
In Lab 4 you'll run dig and host against the QuickNotes test domain
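To make "DNS is L7 over UDP" concrete, here is a minimal Python sketch of the bytes a resolver would put into a UDP datagram for an A-record query. The function names and the query ID are illustrative assumptions, not part of the lab code; the layout follows the standard DNS message format (a 12-byte header followed by the question).

```python
import struct

QTYPE_A = 1    # A record (IPv4 address)
QCLASS_IN = 1  # Internet class


def encode_qname(name: str) -> bytes:
    """Encode 'github.com' as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError(f"bad label length: {label!r}")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a single-question DNS query with the 'recursion desired' flag set."""
    flags = 0x0100  # RD=1, all other flag bits zero (standard query)
    # Header: id, flags, 1 question, 0 answers, 0 authority, 0 additional
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    question = encode_qname(name) + struct.pack("!HH", QTYPE_A, QCLASS_IN)
    return header + question


packet = build_query("github.com")
print(len(packet), packet.hex())
```

Sending `packet` to port 53 of a resolver with a UDP socket would be the L4 step; it is left out here so the sketch runs offline.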
Slide 6 - HTTP and What curl Actually Sends
# raw request
GET /notes HTTP/1.1
Host: localhost:8080
User-Agent: curl/8.5.0
Accept: */*
# response
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 245
| Code class | Meaning | Example |
| --- | --- | --- |
| 2xx | Success | 200 OK, 201 Created, 204 No Content |
| 3xx | Redirection | 301 Moved Permanently, 304 Not Modified |
| 4xx | Client error | 400 Bad Request, 401 Unauthorized, 404 Not Found |
| 5xx | Server error | 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable |
HTTP/2 (2015): multiplexing, header compression. HTTP/3 (2022): QUIC over UDP, no TCP-level head-of-line blocking
curl -v http://localhost:8080/health shows everything around the JSON body: the connection, the request headers, and the response headers
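The request and response text on this slide is literally what travels over the TCP connection. As a rough sketch (the helper names are hypothetical, and this is not how curl itself is implemented), the following Python builds such a request and splits a raw response into its status line and headers:

```python
def build_request(path: str, host: str) -> bytes:
    """Assemble a minimal HTTP/1.1 GET request; lines end in CRLF, a blank line ends the head."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}", "Accept: */*", "", ""]
    return "\r\n".join(lines).encode("ascii")


def parse_response_head(raw: bytes):
    """Split a raw response into (status code, reason phrase, headers dict)."""
    head, _, _body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        key, _, value = line.partition(":")
        headers[key.strip().lower()] = value.strip()  # header names are case-insensitive
    return int(code), reason, headers


def status_class(code: int) -> str:
    """Map a status code to the class names used in the table above."""
    classes = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    return classes.get(code // 100, "other")


raw = b"HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nContent-Length: 245\r\n\r\n"
code, reason, headers = parse_response_head(raw)
print(code, reason, status_class(code), headers["content-length"])
```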
Slide 7 - TLS in 90 Seconds
sequenceDiagram
participant C as Client
participant S as Server
C->>S: ClientHello (TLS 1.3, ciphers, SNI, key share)
S->>C: ServerHello + key share + cert + Finished
C->>S: Finished (keys derived via HKDF)
C->>S: GET /notes (encrypted)
The certificate proves the server controls the domain (issued by a trusted CA)
TLS 1.3 (2018): one round trip; old TLS 1.2 needed two
Let's Encrypt (since 2016) issues free, automated certs, which kills the "pay $300/year for HTTPS" excuse
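A client-side view of the same handshake, as a hedged sketch using Python's standard `ssl` module: the context verifies the certificate chain and hostname, and refuses anything older than TLS 1.3. The function names are illustrative; `negotiated_version` needs network access and is only defined here, not called.

```python
import socket
import ssl


def make_tls13_context() -> ssl.SSLContext:
    """Client context: verify the cert chain and hostname, refuse anything below TLS 1.3."""
    ctx = ssl.create_default_context()  # loads the system CA store, enables verification
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx


def negotiated_version(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect, complete the handshake, and report the protocol version (needs network)."""
    ctx = make_tls13_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # server_hostname is sent as SNI and checked against the certificate
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version()


ctx = make_tls13_context()
print(ctx.check_hostname, ctx.verify_mode == ssl.CERT_REQUIRED)
```

Calling `negotiated_version("example.com")` on a machine with internet access would report the version the server agreed to; a server that only speaks TLS 1.2 would make the handshake fail instead.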
Health checks decide which backends get traffic; bad health-check thresholds caused the 2017 GitHub git-backend outage
In Lab 4 you'll inspect how Docker Compose itself runs a tiny L4 LB via its embedded DNS
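Why thresholds matter can be sketched in a few lines of Python. This is a toy model, not the behavior of any particular load balancer: the `fall` and `rise` parameters (consecutive failures before a backend is removed, consecutive successes before it returns) and the class names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    healthy: bool = True
    fail_streak: int = 0
    ok_streak: int = 0


class HealthTracker:
    """Round-robin over backends that have not failed `fall` checks in a row."""

    def __init__(self, names, fall=3, rise=2):
        self.backends = {name: Backend(name) for name in names}
        self.fall = fall  # consecutive failures before marking unhealthy
        self.rise = rise  # consecutive successes before marking healthy again
        self._next = 0

    def record(self, name, ok):
        """Feed in one health-check result for a backend."""
        b = self.backends[name]
        if ok:
            b.ok_streak += 1
            b.fail_streak = 0
            if not b.healthy and b.ok_streak >= self.rise:
                b.healthy = True
        else:
            b.fail_streak += 1
            b.ok_streak = 0
            if b.healthy and b.fail_streak >= self.fall:
                b.healthy = False

    def pick(self):
        """Return the next healthy backend name, or None if all are down."""
        healthy = [b.name for b in self.backends.values() if b.healthy]
        if not healthy:
            return None
        choice = healthy[self._next % len(healthy)]
        self._next += 1
        return choice


lb = HealthTracker(["a", "b"], fall=3, rise=2)
for _ in range(3):
    lb.record("a", ok=False)
print([lb.pick() for _ in range(3)])
```

With `fall=1`, a single slow response would eject a backend; with a very large `fall`, a dead backend keeps receiving traffic. Tuning sits between those two failure modes.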
Slide 9 - Debugging the Network: Five Commands
# 1) what's listening?
ss -tln
# 2) what routes do I have?
ip route show
# 3) can I reach the target?
mtr -rwc 5 github.com   # combines ping + traceroute
# 4) what's the wire actually carrying?
sudo tcpdump -i lo -nn -A 'tcp port 8080'
# 5) is DNS the problem?
dig +short example.com @1.1.1.1
netstat is deprecated; ss is the modern replacement (faster, more detail)
Wireshark / tcpdump decode packets all the way up to L7, which is invaluable for "the response was wrong" bugs
Lab 4 Task 1 runs all five against a live QuickNotes server
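When the CLI tools are not installed (a minimal container, say), the "is anything listening, and can I reach it?" question can be approximated from Python. The helper below is a sketch with a hypothetical name; it only tests whether a TCP handshake completes, which is a small subset of what `ss` and `mtr` report.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP handshake to host:port completes within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return sock.connect_ex((host, port)) == 0


# Demo on loopback: open a listener on a free port, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 asks the kernel for any free port
listener.listen()
demo_port = listener.getsockname()[1]
print(port_is_open("127.0.0.1", demo_port))
listener.close()
```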
Slide 10 - Linux File System Hierarchy
/                    # root of everything
├── bin/             # essential user binaries (ls, cat, sh)
├── boot/            # kernel + bootloader files
├── etc/             # system configuration
│   ├── systemd/
│   ├── ssh/
│   └── nginx/
├── home/            # per-user homes
├── var/
│   ├── log/         # logs (text, often the FIRST place to look)
│   ├── lib/         # service data (postgres, docker)
│   └── cache/       # cacheable, regenerable
├── tmp/             # ephemeral; often a tmpfs
├── proc/            # virtual: process & kernel info
├── sys/             # virtual: kernel objects
└── usr/             # installed programs and libraries (local installs go in /usr/local)
FHS (Filesystem Hierarchy Standard): a real spec, not folklore
/proc/$PID/status and /proc/$PID/limits are where ps, top, and debuggers actually get their data
"Everything is a file", including network sockets in /proc/$PID/net/tcp
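The socket table in `/proc/net/tcp` is plain text with hex-encoded addresses. A minimal Python sketch of decoding one row follows; the sample line is made up for illustration, the helper names are hypothetical, and the address decoding assumes a little-endian host such as x86-64.

```python
import socket
import struct

# Subset of the kernel's TCP state codes as they appear in the "st" column
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}


def decode_addr(hex_addr: str):
    """Turn '0100007F:1F90' into ('127.0.0.1', 8080). Assumes a little-endian host."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def parse_tcp_line(line: str):
    """Return (local address, remote address, state) for one data row."""
    fields = line.split()
    local, remote, state = fields[1], fields[2], fields[3]
    return decode_addr(local), decode_addr(remote), TCP_STATES.get(state, state)


# Hypothetical row in the /proc/net/tcp layout: a listener on 127.0.0.1:8080
sample = "0: 0100007F:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000 1000 0 12345"
print(parse_tcp_line(sample))
```

On a Linux machine, feeding the lines of `/proc/net/tcp` (minus the header row) through `parse_tcp_line` gives roughly the information `ss -tn` shows for IPv4.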
Slide 11 - Processes & Signals
# tree of processes
ps auxf | less
# live view
top        # or `htop` for the modern, color one
# stop a process politely (SIGTERM, give it time to clean up)
kill 12345
# stop it brutally (SIGKILL, no chance to flush)
kill -9 12345
| Signal | Number | Meaning |
| --- | --- | --- |
| SIGHUP | 1 | Reload config (convention) |
| SIGINT | 2 | Ctrl-C |
| SIGTERM | 15 | Default kill, graceful |
| SIGKILL | 9 | Uncatchable, immediate |
| SIGSTOP | 19 | Pause until SIGCONT |
QuickNotes (main.go) traps SIGTERM and calls http.Server.Shutdown, which is a graceful drain. Lab 4 Task 1 verifies it.
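The trap-then-drain pattern looks roughly like this in Python. This is an analogy, not the QuickNotes code (which is Go): the handler only records the request, and the main loop finishes the item in flight before stopping. All names below are illustrative.

```python
import signal
import threading

stop_requested = threading.Event()


def handle_stop(signum, frame):
    """Signal handler: only record the request; the main loop does the cleanup."""
    stop_requested.set()


def install_handlers():
    """Register the handler for SIGTERM and SIGINT (call from the main thread)."""
    signal.signal(signal.SIGTERM, handle_stop)
    signal.signal(signal.SIGINT, handle_stop)
    # SIGKILL and SIGSTOP cannot be caught: registering a handler raises OSError.


def serve(work_items, handle):
    """Process items until done or until a stop is requested; return the count handled."""
    handled = 0
    for item in work_items:
        if stop_requested.is_set():
            break  # stop taking new work; the item in flight already finished
        handle(item)
        handled += 1
    return handled


# Simulate a SIGTERM arriving while item 2 is being processed.
seen = []

def process(item):
    seen.append(item)
    if item == 2:
        handle_stop(signal.SIGTERM, None)

print(serve([1, 2, 3, 4], process), seen)
```

A real service would call `install_handlers()` at startup, so that `kill 12345` triggers the same path; `kill -9` bypasses it entirely, which is why buffered writes can be lost.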
Slide 12 - systemd: How Linux Runs Services
# /etc/systemd/system/quicknotes.service
[Unit]
Description=QuickNotes API
After=network-online.target

[Service]
ExecStart=/usr/local/bin/quicknotes
Restart=on-failure
RestartSec=2
User=quicknotes
Environment=ADDR=:8080

[Install]
WantedBy=multi-user.target
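A unit file is INI-like, so its structure can be inspected with Python's `configparser`. The sketch below is an approximation rather than a faithful systemd parser (systemd allows repeated keys and line continuations that `configparser` treats differently), and `lint_unit` is a hypothetical helper that checks two things a unit like this one should have: a non-root `User=` and a `Restart=` policy.

```python
import configparser

UNIT_TEXT = """\
[Unit]
Description=QuickNotes API
After=network-online.target

[Service]
ExecStart=/usr/local/bin/quicknotes
Restart=on-failure
RestartSec=2
User=quicknotes
Environment=ADDR=:8080

[Install]
WantedBy=multi-user.target
"""


def parse_unit(text: str) -> configparser.ConfigParser:
    """Parse unit-file text; only '=' separates keys from values."""
    parser = configparser.ConfigParser(
        interpolation=None, strict=False, delimiters=("=",)
    )
    parser.optionxform = str  # keep key case as written (configparser lowercases by default)
    parser.read_string(text)
    return parser


def lint_unit(text: str) -> list:
    """Return human-readable warnings for a few common omissions."""
    unit = parse_unit(text)
    service = unit["Service"] if unit.has_section("Service") else {}
    warnings = []
    if "ExecStart" not in service:
        warnings.append("no ExecStart=: nothing to run")
    if "User" not in service:
        warnings.append("no User=: a system service would run as root")
    if service.get("Restart", "no") == "no":
        warnings.append("no Restart= policy: a crashed service stays down")
    return warnings


print(lint_unit(UNIT_TEXT))
```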
# everything since boot
journalctl -b
# only the quicknotes unit, follow live
journalctl -u quicknotes -f
# JSON output for piping into jq
journalctl -u quicknotes -o json | jq '.MESSAGE'
# old-school text logs still exist
tail -f /var/log/nginx/access.log
journald stores structured logs; text files in /var/log/ are still common for daemons that pre-date systemd
Log rotation: logrotate(8), typically run once a day, rotates and compresses old logs. Forgetting to configure it = disk full = outage
In Lab 8 you'll ship these logs to Grafana Loki, but Lab 4 just teaches you to read them
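`journalctl -o json` emits one JSON object per line, which makes it easy to filter in a script as well as in jq. The sketch below runs on made-up sample entries (the messages are hypothetical); `MESSAGE`, `PRIORITY`, and `_SYSTEMD_UNIT` are standard journal fields, and priorities follow syslog numbering, where lower means more severe.

```python
import json

# Hypothetical entries in the shape produced by `journalctl -u quicknotes -o json`
SAMPLE = """\
{"MESSAGE": "listening on :8080", "PRIORITY": "6", "_SYSTEMD_UNIT": "quicknotes.service"}
{"MESSAGE": "GET /notes 200", "PRIORITY": "6", "_SYSTEMD_UNIT": "quicknotes.service"}
{"MESSAGE": "database unreachable", "PRIORITY": "3", "_SYSTEMD_UNIT": "quicknotes.service"}
"""


def messages_at_or_above(lines, max_priority=3):
    """Yield MESSAGE fields whose syslog priority is <= max_priority (3 = err)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Journal JSON stores numbers as strings; default to 6 (info) if absent
        if int(entry.get("PRIORITY", "6")) <= max_priority:
            yield entry.get("MESSAGE", "")


print(list(messages_at_or_above(SAMPLE.splitlines())))
```

In practice the input would come from a pipe, for example `journalctl -u quicknotes -o json | python3 filter.py` with the function reading `sys.stdin`; the script name is an assumption.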
graph TD
P["QuickNotes returns 502"] --> Q1["Is the binary running?<br/>systemctl status"]
Q1 -- "yes" --> Q2["Is it listening?<br/>ss -tln | grep 8080"]
Q2 -- "yes" --> Q3["Can you reach it from the host?<br/>curl localhost:8080/health"]
Q3 -- "yes" --> Q4["Is the firewall blocking?<br/>iptables -L / nft list ruleset"]
Q4 -- "no" --> Q5["DNS resolving correctly?<br/>dig +short quicknotes.example.com"]
Q1 -- "no" --> R["journalctl -u quicknotes"]
Always work outside-in: start with the symptom, then peel back layers
Lab 4 Task 2 walks this exact chain for a deliberately broken QuickNotes deploy
Slide 17 - Common Antipatterns
| Antipattern | Better |
| --- | --- |
| sudo everywhere because "it works" | Diagnose the permission, then add the narrow capability |
| chmod 777 /var/something to fix an upload | Find the right user/group, chown + chmod 770 |
| Running services as root | Dedicated user; User=quicknotes in the systemd unit |
| Editing /etc/hosts to "fix DNS" | Fix DNS instead; host overrides drift between machines |
| Logging to a file in /tmp (cleared on reboot) | Log to stdout (journald captures it) |
| `ps aux \| grep myproc` to "check if running" | Ask the service manager: `systemctl status` on the unit |
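The difference between 777 and 770 is one octal digit: the permissions granted to everyone who is neither the owner nor in the group. A small Python sketch on a throwaway directory shows it; the `uploads` name and helper functions are illustrative, and a Linux filesystem is assumed.

```python
import os
import stat
import tempfile


def describe(path: str) -> str:
    """Return the ls-style mode string, e.g. 'drwxrwx---'."""
    return stat.filemode(os.stat(path).st_mode)


def world_writable(path: str) -> bool:
    """True if users outside the owner and group may write to the path."""
    return bool(os.stat(path).st_mode & stat.S_IWOTH)


with tempfile.TemporaryDirectory() as tmp:
    upload_dir = os.path.join(tmp, "uploads")
    os.mkdir(upload_dir)

    os.chmod(upload_dir, 0o777)  # the antipattern: anyone on the box can write here
    print(describe(upload_dir), world_writable(upload_dir))

    os.chmod(upload_dir, 0o770)  # owner and group only
    print(describe(upload_dir), world_writable(upload_dir))
```

The 770 mode only helps once ownership is right, which is why the table pairs it with chown: the service's user or group has to be the one that owns the directory.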
Slide 18 - Real Story: It Wasn't DNS... It Was BGP
October 4, 2021: Facebook's planned BGP maintenance withdraws all routes to FB's data centers
As a side effect, DNS servers (inside those data centers) become unreachable. Every cached answer eventually expires. The internet forgets Facebook exists
Engineers can't fix it remotely (no VPN), and the badge readers used the same DNS, so they can't physically enter the building either
Total outage: ~6 hours, ~$60M direct loss
The lesson is dependency mapping: your monitoring, your access control, and your remote console all live on the same network they're supposed to help you fix
Tools to install this week: dig, ss, tcpdump, mtr, htop, jq, ripgrep
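Dependency mapping can start as a very small graph exercise. The Python sketch below uses a simplified, illustrative graph loosely modeled on the story above (it is not Facebook's actual architecture, and the node names, including the `out-of-band-console`, are assumptions) and asks which systems go down when one dependency fails.

```python
# Each key depends on the systems in its list (illustrative, not a real inventory)
DEPENDS_ON = {
    "dns": ["backbone-routes"],
    "vpn": ["dns"],
    "badge-readers": ["dns"],
    "monitoring": ["dns"],
    "out-of-band-console": [],
}


def transitive_deps(graph, node, seen=None):
    """Return every system that `node` depends on, directly or indirectly."""
    seen = set() if seen is None else seen
    for dep in graph.get(node, []):
        if dep not in seen:
            seen.add(dep)
            transitive_deps(graph, dep, seen)
    return seen


def blast_radius(graph, failed):
    """List the systems that stop working when `failed` goes away."""
    return sorted(n for n in graph if failed in transitive_deps(graph, n))


print(blast_radius(DEPENDS_ON, "backbone-routes"))
```

The useful output is what is missing from the list: a recovery path is only trustworthy if it stays outside the blast radius of the thing it is meant to repair.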
graph LR
P["Week 3<br/>CI/CD"] --> Y["You Are Here<br/>OS + Networking"]
Y --> N["Week 5<br/>Virtualization"]
N --> M["Week 6<br/>Containers"]
Remember: The substrate isn't glamorous, but it's where every "production outage" actually lives. The engineers who can debug it become very valuable, very fast.