Commit 32a825b

1 parent e062f95 commit 32a825b

60 files changed

Lines changed: 791 additions & 1293 deletions

_CONTENT/eng/data/acid.md

Lines changed: 9 additions & 47 deletions
@@ -1,59 +1,21 @@
 ---
 ---
-atomicity: tx
-consistency: invariants, FK, constraints
-isolation: concurrent txs do not interfere
-durability: WAL, replicas, fsync, power hardware
-
-## tx
-group ops. to atomic units
-no partial failure
-consistency
-fk integrity
-data sync
-
-## isolation problems
-lost updates
+```
+acid: tx, FK, CHECK, ..
 
+isolation problems
 1. dirty-read: see uncommitted
 2. non-repeatable read: same query, different result
 3. phantom read: new rows appear or disappear
 4. lost updates: conc. writes lose data
 5. read-skew: existing rows change between reads
 6. write skew: two tx reads same data to decide writes
-may violate invariants, eg. no doctors remaining
-needs serial isolation
-
-# solutions
-
-## read committed
-only see committed rows
-read-skew possible
-
-## repeatable read
-snapshot
-only see data committed before tx began
-
-solves read-skew
 
-impl. by MVCC
-multi-version conc. control
-each tx gets a snapshot (a set of tx ids)
-
-analytics, backups
-
-## serializable
-prevent all races
-solves phantoms and write-skew
-
-MVCC + SSI
-20% perf cost
+solutions
+1. read committed
+2. snapshot/repeatable read: MVCC, must for analytics, backups
+3. serializable MVCC + SSI
 
 SSI: serializable snapshot isolation
-predicate locks
-dependency cycles
-
-## dist tx
-saga: commit or compensate
-
-or replicate entire transaction log through consensus
+predicate locks + dep. cycles
+```
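The write-skew entry in acid.md (two transactions read the same data to decide their writes, which may leave no doctors remaining) can be illustrated with a toy model. This is a minimal sketch, not a real database: `SnapshotTx`, `begin`, `commit`, and `run_write_skew` are hypothetical names, and the conflict check only approximates first-committer-wins on keys that both transactions write.

```python
from dataclasses import dataclass, field


@dataclass
class SnapshotTx:
    """Toy transaction: reads from a private snapshot and buffers its writes."""
    snapshot: dict
    writes: dict = field(default_factory=dict)

    def on_call_count(self) -> int:
        return sum(1 for on_call in self.snapshot.values() if on_call)

    def go_off_call(self, doctor: str) -> bool:
        # The invariant "at least one doctor stays on call" is checked
        # against the snapshot, not against the live data.
        if self.on_call_count() >= 2:
            self.writes[doctor] = False
            return True
        return False


def begin(db: dict) -> SnapshotTx:
    return SnapshotTx(snapshot=dict(db))


def commit(db: dict, tx: SnapshotTx) -> None:
    # Approximate write-write conflict detection: abort if a key we want to
    # write changed since our snapshot. Reads are not tracked at all.
    for key in tx.writes:
        if db[key] != tx.snapshot[key]:
            raise RuntimeError(f"write-write conflict on {key}")
    db.update(tx.writes)


def run_write_skew() -> dict:
    db = {"alice": True, "bob": True}
    t1, t2 = begin(db), begin(db)
    t1.go_off_call("alice")   # sees 2 doctors on call, allowed
    t2.go_off_call("bob")     # also sees 2 doctors on call, allowed
    commit(db, t1)
    commit(db, t2)            # write sets are disjoint, so no conflict fires
    return db
```

Both transactions commit because their write sets do not overlap, yet the combined result breaks the invariant. Catching this requires tracking what each transaction read, which is the role the notes assign to serializable isolation (SSI).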

_CONTENT/eng/data/dataflow.md

Lines changed: 51 additions & 87 deletions
@@ -1,89 +1,53 @@
 ---
 ---
-## encoding
-backward comp: old data, new code
-
-breaking:
-deleting required fields
-changing field types
-
-keep unknown fields
-tags vs names: compact + rename later
-
-schema evolution
-avro
-protobuf
-
-## ipc
-db
-api
-msg passing
-
-push: pubsub, ws, sse, webhook
-pull: query, poll
-q: decouple, buffer
-
-MPI
-message passing interface
-no central coordinator
-nodes communicate directly
-
-## delivery guarantees
-at-most-once
-at-least-once: retries + idempotent receiver
-exactly-once: at-least once + dedup
-
-producer: add uniq msg ids, retry, track sent msgs on outbox table
-consumer: store seen ids, dedup, process+ack in tx
-
-producer retry + consumer dedup
-
-## batch
-immutable inputs
-atomic ops
-
-batch
-partition
-compose
-
-data locality
-sequential i/o
-vectorize
-columnar
-
-pre-compute expensive ops
-checkpoints
-
-declerative apis = better optimization
-
-spark df
-Delta lake: parquet + transaction log + metadata
-
-## stream
-immutable events
-side effects
-
-log compaction
-
-event time
-delivery time
-e2e latency
-
-consumer lag
-checkpoint
-
-grace period
-watermark
-
-backpressure
-circuit breaker
-
-exactly once: idempotence + tx commits
-
-probabilistic dsa like bloomfilter, hyperloglog
-
-windows: fixed, overlapping, sliding, session
-
-stream + stream : window
-stream + table : enrich
-table + table : materialized view of join
+## comm.
+```
+ipc
+db
+services, api, rpc, http
+msg passing, q
+
+push: pubsub, ws, sse, webhook
+pull: query, poll
+q: decouple, buffer
+
+MPI, message passing interface
+no central coordinator
+nodes communicate directly
+
+delivery semantics
+exactly once
+producer retry + consumer dedup
+producer outbox + consumer ack in tx
+```
+
+## dataproc
+```
+batch
+atomic ops on seq. data
+delta lake: parquet + transaction log + metadata
+
+stream
+immutable events
+
+time
+event
+delivery
+processing
+
+flow control
+backpressure
+circuit breaker
+
+consumer lag
+checkpoint
+watermark
+grace period
+publish correction
+
+windows: fixed, overlapping, sliding, session
+
+log compaction
+joins
+probabilistic dsa
+```
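The "exactly once" entry in dataflow.md (producer retry + consumer dedup) can be sketched in a few lines. This is an in-memory illustration under stated assumptions: `DedupConsumer` and `send_with_retry` are hypothetical names, the lost acknowledgement is simulated with a flag, and a real consumer would record the message id and apply the side effect in one database transaction, as the notes say.

```python
import uuid


class DedupConsumer:
    """Applies each message id at most once, but acks every delivery."""

    def __init__(self) -> None:
        self.seen_ids: set[str] = set()
        self.balance = 0

    def handle(self, msg: dict) -> bool:
        if msg["id"] in self.seen_ids:
            return True  # duplicate: ack again, skip the side effect
        # Assumption: in a real system these two steps share one transaction.
        self.balance += msg["amount"]
        self.seen_ids.add(msg["id"])
        return True


def send_with_retry(consumer: DedupConsumer, amount: int,
                    drop_first_ack: bool = True, max_attempts: int = 3) -> int:
    """Retry with the same message id until an ack arrives; return attempts used."""
    msg = {"id": str(uuid.uuid4()), "amount": amount}
    for attempt in range(1, max_attempts + 1):
        acked = consumer.handle(msg)
        ack_lost = drop_first_ack and attempt == 1  # simulated network loss
        if acked and not ack_lost:
            return attempt
    raise RuntimeError("no ack received, giving up")
```

The producer alone gives at-least-once delivery, since the retry redelivers a message the consumer already processed. The stable message id plus the consumer's seen-id set turns that into an exactly-once effect.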

_CONTENT/eng/data/dist.md

Lines changed: 58 additions & 19 deletions
@@ -1,30 +1,72 @@
 ---
 ---
-https://muratbuffalo.blogspot.com/2023/10/hints-for-distributed-systems-design.html
+## replication
+```
+(part x, rep y)
+replicate WAL or rows
 
-## consistency
-linearizable
-single copy illusion
-single leader
-election consensus
-sync replication or quorum
+leader
+single
+multi leader cause write conflicts
+none, dynamo, cassandra, quorums are still eventual
+
+failover: detect, elect, fence
+
+lag
+read your writes
+monotonic reads
+consistent prefix reads
+
+conflicts
+avoid: CRDT, OT, same-writer
+resolve: read-repair, anti-entropy, app logic
+
+other
+order events by version vectors
+different sort order per replica
+topology
+
+```
+
+## partitioning
+```
+types
+key range
+hash of key
+(hashkey, sortkey)
+
+hot spots: random prefix suffix
+
+indexes
+local vs global
+scatter-gatter: tail-latency amp.
+
+rebalancing is expensive
+fixed # of parts
+dynamic, split large, merge small. good for key-range
+hybrid
 
-causal
-vector clocks + dependency tracking
+service discovery, request routing
+```
 
-eventual$$
-async replication
-conflict resolution (LWW, CRDTs, version vectors)
+
+## consistency
+linearizable: single copy illusion, single leader + election consensus + sync replication
+causal: vector clocks + dependency tracking
+eventual: async replication + conflict resolution
 
 ## consensus
-raft: majority ack, one leader per term
-split brain: lease + fencing token
+raft: majority ack, term number fencing
+
 paxos
 pbft: o(n2)
 
+## atomic commit
+2PC: ask all, commit if they all ack, like marriage, coordinator spof
+practical: 2pc + raft for coordinator failover
+k
 ## time and order
-NTP
-GPS
+NTP, GPS
 
 lamport clock: single counter per process, can only tell if A happens-before B
 vector clocks: list of counters per process, can detect concurrency, detects conflicts
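The vector-clock note above (one counter per process, able to detect concurrency) can be made concrete with a small comparison routine. This is a minimal sketch: clocks are plain dicts from process name to counter, and `tick`, `merge`, and `compare` are hypothetical helper names.

```python
def tick(clock: dict, node: str) -> dict:
    """Return a copy of clock with node's own counter incremented (local event)."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(local: dict, received: dict, node: str) -> dict:
    """On message receipt: take the pointwise max, then count the receive event."""
    nodes = set(local) | set(received)
    merged = {n: max(local.get(n, 0), received.get(n, 0)) for n in nodes}
    return tick(merged, node)


def compare(a: dict, b: dict) -> str:
    """Classify a relative to b: 'before', 'after', 'equal', or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

A "concurrent" result is the case a single Lamport counter cannot distinguish: neither clock dominates the other, so the two events are unordered and may represent a conflict that needs resolution.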
@@ -35,6 +77,3 @@ heartbeat pings with timeout, adapt to network conditions
 lease with ttl
 gossip
 
-## atomic commit
-2PC: ask all, commit if they all ack, like marriage, coordinator spof
-practical: 2pc + raft for coordinator failover
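The 2PC note (ask all, commit only if they all ack) can be sketched as a single-process simulation. This is a minimal illustration with hypothetical names (`Participant`, `two_phase_commit`). It has no networking, timeouts, or durable log, so it does not show the coordinator single point of failure the note warns about.

```python
class Participant:
    """Toy resource manager that votes in phase 1 and obeys in phase 2."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # A "yes" vote is a promise to commit later if asked to.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list) -> str:
    # Phase 1: ask everyone. The list comprehension collects every vote
    # rather than stopping at the first "no".
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous "yes", otherwise abort everyone.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

If the coordinator crashed between the two phases, participants would be left in the "prepared" state with no way to decide on their own. That is the gap the note proposes closing by replicating the coordinator with Raft.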

_CONTENT/eng/data/ml/ml.md

Lines changed: 0 additions & 5 deletions
This file was deleted.

_CONTENT/eng/data/rep.md

Lines changed: 0 additions & 46 deletions
This file was deleted.
