1. [Data Engineering & Data Scientists Vocab 101](#data-engineering-data-scientists-vocab-101)
1. [Data Management in Distributed Systems (Partitioning, Shuffling and Bucketing)](#data-management-in-distributed-systems-partitioning-shuffling-and-bucketing)
@@ -23,12 +24,14 @@
1. [Gartner's PACE Layered Application Strategy](#gartners-pace-layered-application-strategy)
%% NOTE: syscall only triggered at system resource boundary
NoteSyscall["⚑ User code is handed over to kernel via SYSCALL<br/>Kernel manages system interactions (files · network · memory)<br/>User gets result back in rax"]
class Firmware,Bootloader,KernelInit,InitProcess boot
class IP,Registers cpu
class MMU,PageTable,KernelPages,UserPages mem
class IDT,Scheduler kernel
class Timer hw
class Fork,Exec proc
class NoteRing3,NoteSyscall note
```

🔹 **Fetch-Execute Cycle**: The CPU holds an **instruction pointer**, a register containing the RAM address of the next instruction. It endlessly repeats: fetch instruction → execute → advance pointer. Jump instructions overwrite the pointer; this is how control flow works.
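
The loop can be sketched with a toy interpreter. This is an illustration only: the three opcodes (`ADD`, `JNZ`, `HALT`) and the single accumulator register are made up for the example and do not correspond to any real instruction set.

```python
def run(program):
    """Run a tiny, hypothetical instruction set and return the accumulator."""
    ip = 0    # instruction pointer: index of the next instruction
    acc = 0   # a single general-purpose register
    while True:
        op, arg = program[ip]   # fetch
        ip += 1                 # advance to the next instruction by default
        if op == "ADD":         # execute
            acc += arg
        elif op == "JNZ":       # jump if acc is non-zero: overwrite the pointer
            if acc != 0:
                ip = arg
        elif op == "HALT":
            return acc
        else:
            raise ValueError(f"unknown opcode {op!r}")


# Load 3, then keep subtracting 1 until the accumulator reaches zero.
countdown = [("ADD", 3), ("ADD", -1), ("JNZ", 1), ("HALT", 0)]
```

The `JNZ` branch is the only place where the pointer moves anywhere other than forward by one, which is all a loop or an `if` needs at this level.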

🔹 **Registers**: Small, extremely fast storage buckets inside the CPU (e.g., `eax`, `ebx`). One special register is the instruction pointer. Others control CPU modes and permission levels.

🔹 **Privilege Rings (Kernel vs User mode)**: Modern CPUs have at least two modes.
- **Kernel mode (Ring 0)**: unrestricted; any instruction, any memory.
- **User mode (Ring 3)**: limited; no direct I/O, no arbitrary memory access, no changing CPU settings.
The kernel runs in Ring 0; user programs run in Ring 3. The CPU starts in kernel mode at boot; the OS switches to user mode before running programs.

🔹 **System Calls (Syscalls)**: The only safe way for user-mode code to request kernel services (open file, allocate memory, spawn process, etc.).
1. At boot, the OS registers its handler addresses: software-interrupt handlers go in the **Interrupt Descriptor Table (IDT)**, and the `SYSCALL` / `SYSENTER` entry points go in model-specific registers (MSRs).
2. The program triggers a **software interrupt** (`INT 0x80`) or executes a `SYSCALL` / `SYSENTER` instruction.
3. The CPU switches to kernel mode and jumps to the registered handler.
4. The kernel does the work, then executes `IRET` / `SYSRET` to return to user mode.
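
From a high-level language the four steps are hidden behind library wrappers. A minimal sketch, assuming a CPython runtime where `os.pipe`, `os.write`, and `os.read` are thin wrappers over the corresponding kernel calls; the helper name `echo_through_kernel` is made up for the example.

```python
import os


def echo_through_kernel(payload: bytes) -> bytes:
    """Send bytes into a kernel-managed pipe and read them back."""
    read_fd, write_fd = os.pipe()               # kernel allocates the pipe buffer
    try:
        os.write(write_fd, payload)             # user -> kernel: copy bytes in
        return os.read(read_fd, len(payload))   # kernel -> user: copy bytes out
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

Each of the calls above crosses the user/kernel boundary once; the Python code never touches the pipe buffer directly. The sketch assumes a small payload that fits in the pipe buffer.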

🔹 **Paging & Virtual Memory**: Every memory address a program uses is a **virtual address**. The **Memory Management Unit (MMU)** translates it to a physical RAM address using a **page table** (a lookup structure stored in RAM whose base address is held in a CPU register, `CR3` on x86). Benefits:
- Each process has its own isolated address space (e.g., two processes can both use `0x400000` pointing to different physical memory).
- The kernel marks its own pages as ring-0-only, so user-mode code cannot read kernel memory even though kernel addresses are present in the virtual map.
- **Demand paging**: pages are only loaded into physical RAM when first accessed (page fault → kernel loads the page → CPU retries the instruction).
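
The translation and the demand-paging path can be modelled in a few lines. This is a simplified sketch: it uses a flat dictionary as the page table (real page tables are multi-level), assumes 4 KiB pages, and the `AddressSpace` class and `next_free_frame` allocator are hypothetical names.

```python
import itertools

PAGE_SIZE = 4096  # assumed page size in bytes


class AddressSpace:
    """One process's view of memory: virtual page number -> physical frame."""

    def __init__(self):
        self.page_table = {}
        self.page_faults = 0

    def translate(self, vaddr, allocate_frame):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.page_table:                   # page fault
            self.page_faults += 1
            self.page_table[vpn] = allocate_frame()      # "kernel" maps a frame
        return self.page_table[vpn] * PAGE_SIZE + offset


_frames = itertools.count()


def next_free_frame():
    """Hand out physical frame numbers 0, 1, 2, ... (toy allocator)."""
    return next(_frames)


proc_a, proc_b = AddressSpace(), AddressSpace()
phys_a = proc_a.translate(0x400000, next_free_frame)
phys_b = proc_b.translate(0x400000, next_free_frame)
```

Both processes use the same virtual address, yet `phys_a` and `phys_b` differ because each address space has its own table. A second access to the same page does not fault again.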

🔹 **Preemptive Multitasking**: A **hardware timer** (historically the PIT; on modern x86 systems typically the local APIC timer) fires a **hardware interrupt** every few milliseconds. The CPU switches to kernel mode, and the OS scheduler saves the current process state (registers, instruction pointer) and restores another process. This swap is the **context switch**. Timeslices on Linux are typically 0.75 – 6 ms.
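
A round-robin simulation shows the effect of the timer. It is a sketch of the idea only, not of the Linux scheduler: time is counted in abstract ticks, and the function name and job format are assumptions for the example.

```python
from collections import deque


def round_robin(jobs, timeslice):
    """Simulate preemption. `jobs` maps a process name to the ticks it needs.

    Returns the timeline as a list of (name, ticks_run) entries.
    """
    ready = deque(jobs.items())
    timeline = []
    while ready:
        name, remaining = ready.popleft()    # restore the next process
        ran = min(timeslice, remaining)      # runs until the timer fires or it exits
        timeline.append((name, ran))
        if remaining > ran:                  # preempted: save state and requeue
            ready.append((name, remaining - ran))
    return timeline
```

For example, `round_robin({"A": 5, "B": 2}, timeslice=2)` interleaves the two jobs, so the short job `B` finishes without waiting for all of `A`.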

🔹 **Boot → Run sequence**:
`Firmware (UEFI/BIOS)` → `Bootloader (GRUB)` → `Kernel init` → `Page tables set up, interrupts registered` → `init process (PID 1, e.g. systemd)` → `fork/exec` → user programs running

🔹 **fork & exec pattern**:
- `fork()`: clones the current process. The call returns `0` in the child and the child's PID in the parent. Memory pages are marked **copy-on-write (COW)**; no physical copy is made until a write occurs.
- `exec()`: replaces the current process image with a new program (parsed from an ELF binary: load `.text`, `.data`, `.bss` sections into virtual memory, jump to entry point).
Every user-space process on Linux traces its ancestry back to PID 1 via fork-exec.
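
The pattern can be tried from Python on a POSIX system, where `os.fork`, `os.execvp`, and `os.waitpid` wrap the underlying calls. A minimal sketch; the helper name `spawn_and_wait` and the exit code `127` for a failed exec are conventions chosen for the example.

```python
import os
import sys


def spawn_and_wait(argv):
    """Fork, exec `argv` in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: fork() returned 0. Replace this process image with the program.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)   # only reached if exec failed
    # Parent: fork() returned the child's PID. Wait for it to finish.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


exit_code = spawn_and_wait([sys.executable, "-c", "raise SystemExit(7)"])
```

The parent and child take different branches purely on the return value of `fork()`, which is why the distinction between "returns 0" and "has PID 0" matters.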
@@ -486,6 +585,14 @@ public static void consume(List<? super Shape> shapes) {

---

#### Idempotent, Backfill
<a id="idempotent-backfill-2"></a>

🔹 **Idempotent**: An operation whose effect is the same whether it is applied once or many times. For example, a database upsert or an HTTP PUT request. Critical for safe retries in distributed systems.
🔹 **Backfill**: The process of reprocessing or reloading historical data into a system, often used in data pipelines to populate missing or updated records retroactively.
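
The two ideas combine naturally: a backfill is safe to re-run when each write is an idempotent upsert. A minimal sketch, assuming an in-memory dict stands in for the target table and one row per calendar day; `upsert`, `backfill`, and `compute` are hypothetical names.

```python
from datetime import date, timedelta


def upsert(table, key, row):
    """Insert or overwrite: applying it twice leaves the same state as once."""
    table[key] = row


def backfill(table, start, end, compute):
    """Recompute every daily partition in [start, end] and upsert the result."""
    day = start
    while day <= end:
        upsert(table, day, compute(day))
        day += timedelta(days=1)


daily_totals = {}
backfill(daily_totals, date(2024, 1, 1), date(2024, 1, 3), lambda d: d.day * 10)
first_run = dict(daily_totals)

# Re-running the same backfill (e.g. after a retry) changes nothing.
backfill(daily_totals, date(2024, 1, 1), date(2024, 1, 3), lambda d: d.day * 10)
```

Had `upsert` been an append instead, the second run would have duplicated every row.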

---

#### JIT vs AOT
<a id="jit-vs-aot-2"></a>
🔹 [JIT vs AOT](https://stackoverflow.com/questions/32653951/when-does-ahead-of-time-aot-compilation-happen): **JIT** (just-in-time) and **AOT** (ahead-of-time) are two compilation strategies that differ in when a program is translated from one language to another: JIT compiles at run time, AOT at build time.
@@ -545,6 +652,105 @@ Memory consistency model: [A Primer on Memory Consistency and Cache Coherence](h

---

#### Network Design 101
<a id="network-design-101-2"></a>

###### Routing Protocols
1. Internal routing
   - RIPv2 – Distance-vector, small networks
   - OSPF – Link-state, fast internal routing
   - EIGRP – Cisco hybrid, efficient IGP
2. External routing
   - BGP – Inter-domain routing, policy-based

###### Network Functions / Devices
- NAT – Private ↔ public IP translation
- PAT (NAPT) – Many private IPs → one public IP, distinguished by port
- L2 Switch – MAC-based forwarding
- L3 Switch – Routing + switching combined
- VLAN – Logical network segmentation
- ICMP – Network error & reachability checks
- SNMP – Device monitoring & alerts
- ARP – IP → MAC address resolution
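
PAT is the least obvious entry in the list above, so here is a toy translation table. It is a sketch under simplifying assumptions: one public IP, sequential port allocation, no timeouts or protocol tracking. The class name `Pat` and the addresses (taken from documentation and private ranges) are illustrative.

```python
class Pat:
    """Map many (private IP, port) pairs onto ports of one public IP."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.outbound = {}   # (private_ip, private_port) -> public_port
        self.inbound = {}    # public_port -> (private_ip, private_port)

    def translate_out(self, private_ip, private_port):
        """Rewrite the source of an outgoing packet."""
        key = (private_ip, private_port)
        if key not in self.outbound:
            self.outbound[key] = self.next_port
            self.inbound[self.next_port] = key
            self.next_port += 1
        return self.public_ip, self.outbound[key]

    def translate_in(self, public_port):
        """Reverse lookup for a reply; None if no mapping exists."""
        return self.inbound.get(public_port)


edge = Pat("203.0.113.5")
src_a = edge.translate_out("10.0.10.21", 51000)
src_b = edge.translate_out("10.0.10.22", 51000)
```

Two hosts using the same private port end up on different public ports of the same public IP, and the reverse table is what lets return traffic find its way back.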

###### VLAN vs VNET vs VPC
In classic networking, VLANs segment internal traffic at Layer 2, while a virtual network (referred to here as VNet) focuses on subnetting and routing. In public cloud, VPC (AWS/GCP) and VNet (Azure) represent tenant-scoped network, service, and security boundaries. These constructs operate at different abstraction levels and serve different roles in each context, so they should not be treated as the same object.

###### E2E Network flow
```mermaid
sequenceDiagram
autonumber

participant Host as Client Endpoint<br/>(PC / Laptop, VLAN 10)
participant L2 as Access Switch<br/>(L2 Switch)
participant L3 as Distribution Switch<br/>(L3 Switch / Router)
participant Core as Core Router<br/>(Core Routing)
participant Edge as Edge Firewall / Router<br/>(NAT / PAT)
participant ISP as ISP Router<br/>(Internet Gateway)
participant Remote as External Server<br/>(Public Service)

%% --- Participant Notes (Layman) ---
Note over Host: User device that sends and receives data
Note over L2: Connects devices and forwards frames by MAC
Note over L3: Routes traffic between local IP networks
Note over Core: High-speed backbone for internal traffic
Note over Edge: Internet exit that translates addresses
Note over ISP: Provider router carrying Internet traffic
Note over Remote: Remote system providing the service

%% --- Design Rationale ---
Note over Host,L2: L2 access retained for endpoint scale,<br/>VLAN segmentation, and broadcast control

%% --- L2 / VLAN / ARP ---
Host->>L2: Ethernet Frame (VLAN 10)
Note right of L2: 802.1Q VLAN tagging

Host->>L2: ARP Request (Who is default gateway?)
L2->>L3: Forward ARP request (VLAN 10)

%% --- SVI ---
Note right of L3: SVI (Vlan10)<br/>Virtual L3 interface<br/>Default gateway for VLAN 10

L3->>L2: ARP Reply (SVI MAC)
L2->>Host: ARP Reply delivered

%% --- L3 Routing ---
Host->>L2: IP Packet to default gateway
L2->>L3: Frame forwarded to SVI
Note right of L3: Inter-VLAN routing via SVI

%% --- Internal Routing ---
Note over L3,Core: IGP (OSPF / EIGRP / RIP)<br/>Fast routing inside one network
L3->>Core: Forward packet (best internal path)

%% --- IGP vs BGP Explanation ---
Note over Core,Edge: IGP = internal path selection<br/>BGP = external path & policy control

%% --- Edge / NAT ---
Core->>Edge: Forward to perimeter
Edge->>Edge: NAT / PAT translation
Note right of Edge: Private IP → Public IP

%% --- External Routing ---
Note over Edge,ISP: BGP (External Routing)<br/>Policy-based Internet path selection
Edge->>ISP: Forward packet
ISP->>Remote: Deliver packet

%% --- Return Traffic ---
Remote->>ISP: Response
ISP->>Edge: Return packet
Edge->>Edge: Reverse NAT
Edge->>Core: Forward
Core->>L3: Forward
L3->>L2: Frame to VLAN 10
L2->>Host: Packet delivered

%% --- Monitoring ---
Note over L3,Edge: SNMP monitoring (health & counters)
```

---

#### OLAP vs OLTP
<a id="olap-vs-oltp-2"></a>
🔹 **OLAP**: Used for complex data analysis and business reporting, such as financial analysis and sales forecasting.
| **1 – Single Server** | 0 – 100 | Ship fast | Everything on one VM | Dev speed, no load yet | Monolith, single VM + DB (e.g. $20–50/mo VPS), reverse proxy (Nginx) | Optimize for iteration speed, not scalability. Don't over-engineer. |
| **2 – Separate DB** | 100 – 1K | Stabilize | App server + dedicated DB | App & DB compete for same CPU/memory | Move DB to its own server (managed: RDS/Supabase), connection pooling (PgBouncer) | Isolate DB resource contention; use managed services to save ops time. |
| **3 – Load Balancer + Horizontal Scale** | 1K – 10K | Handle burst | Stateless app tier behind LB | Single app server is a SPOF | Add load balancer, 2+ stateless app servers, shared session store (Redis), auto-scaling group | Make app tier stateless so any server can handle any request. |
| **4 – Caching + CDN** | 10K – 100K | Protect DB | Read-heavy architecture | DB read saturation | CDN for static assets, cache-aside with Redis/Memcached, read replicas, DB query optimization | 80–90%+ of reads can be served from cache; CDN removes static load entirely. |
**Slowly Changing Dimensions** are dimension attributes that change over time, but slowly and unpredictably. For example, a customer's address in a retail business.

| Type | Strategy | Description | Trade-off |
| ---- | -------- | ----------- | --------- |
| **SCD Type 0** | Retain original | Dimension values never change; original value is always preserved. | No history; ignores real-world changes. |
| **SCD Type 1** | Overwrite | Old value is replaced with the new value; no history kept. | Simple to implement; history is lost. |
| **SCD Type 2** | Add new row | A new row is inserted for each change; old row is marked inactive (with `start_date` / `end_date` or `is_current` flag). | Full history preserved; table can grow large. |
| **SCD Type 3** | Add new column | A new column stores the previous value alongside the current value. | Limited history (only one prior value). |
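
Type 2 is the variant that most benefits from a worked example. The sketch below keeps the dimension as a list of dicts rather than a database table; the column names follow the table above, while `apply_scd2` and the `key` / `value` fields are hypothetical.

```python
from datetime import date


def apply_scd2(rows, key, new_value, change_date):
    """Close the current row for `key` (if the value changed) and add a new one."""
    for row in rows:
        if row["key"] == key and row["is_current"]:
            if row["value"] == new_value:
                return rows                  # nothing changed, so keep history as-is
            row["end_date"] = change_date    # expire the old version
            row["is_current"] = False
    rows.append({
        "key": key,
        "value": new_value,
        "start_date": change_date,
        "end_date": None,
        "is_current": True,
    })
    return rows


customers = []
apply_scd2(customers, "cust-1", "Berlin", date(2023, 1, 1))
apply_scd2(customers, "cust-1", "Paris", date(2024, 6, 1))
```

After the two calls the list holds both addresses: the Berlin row is closed on the change date and the Paris row is flagged as current. A Type 1 implementation would instead overwrite `value` in place and keep a single row.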

---

#### Software Defined Networking (SDN)
@@ -894,3 +1127,20 @@ graph TD
```

**[`^ back to top ^`](#index)**

---

#### Zanzibar
<a id="zanzibar-2"></a>

**Zanzibar** is Google's global authorization system (published 2019) that underpins access control for Google Drive, YouTube, Maps, and other services.

🔹 **Tuple-based approach**: Permissions are stored as relationship tuples `(object#relation@user)`, e.g., `doc:readme#owner@user:alice`. This makes relationships explicit and queryable.
🔹 **Zookie**: An opaque consistency token that encodes a timestamp, returned on each write. Clients pass it back on subsequent reads so those reads are evaluated at a snapshot at least as fresh as the write, giving "read-your-writes" consistency without requiring full global linearizability on every read.
🔹 **Configuration language**: A schema DSL defines object types, relations, and permission inheritance rules (e.g., "viewer inherits from editor"), making access policies auditable and reusable.
🔹 **Leopard**: An indexing subsystem inside Zanzibar that pre-computes and caches transitive group membership, optimizing large fan-out permission checks.
🔹 **Spanner**: Zanzibar uses Google Spanner as its underlying storage, providing globally distributed, externally consistent transactions via TrueTime.
🔹 **External consistency**: Reads and writes are globally ordered using Spanner's TrueTime API, ensuring no stale permission grants across distributed replicas.
🔹 **Open-source implementations**: OpenFGA (CNCF), SpiceDB (AuthZed), and Ory Keto are popular open-source projects inspired by Zanzibar.
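
The tuple model and the inheritance rule can be sketched as a small in-memory check. This is an illustration of the idea, not Zanzibar's API or algorithm: the tuples are stored as Python triples instead of the `object#relation@user` string form, and the `IMPLIED_BY` mapping is a hypothetical stand-in for the configuration language.

```python
# Relationship tuples as (object, relation, subject). A subject of the form
# "object#relation" is a userset, e.g. "all members of group:eng".
TUPLES = {
    ("doc:readme", "owner", "user:alice"),
    ("doc:readme", "viewer", "group:eng#member"),
    ("group:eng", "member", "user:bob"),
}

# Hypothetical schema: owners count as editors, editors count as viewers.
IMPLIED_BY = {"viewer": ["editor"], "editor": ["owner"]}


def check(obj, relation, user, seen=None):
    """Return True if `user` has `relation` on `obj`, directly or transitively."""
    if seen is None:
        seen = set()
    if (obj, relation) in seen:      # guard against cyclic group definitions
        return False
    seen.add((obj, relation))
    for t_obj, t_rel, subject in TUPLES:
        if t_obj != obj or t_rel != relation:
            continue
        if subject == user:          # direct grant
            return True
        if "#" in subject:           # userset: recurse into the group
            sub_obj, sub_rel = subject.split("#", 1)
            if check(sub_obj, sub_rel, user, seen):
                return True
    # Inherited grant: a stronger relation implies this one.
    return any(check(obj, parent, user, seen)
               for parent in IMPLIED_BY.get(relation, []))
```

Here `user:alice` is a viewer because owners inherit down to viewer, and `user:bob` is a viewer through group membership. Walking groups recursively on every check is the fan-out cost that an index like Leopard is meant to avoid.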