PDC Visual Cheat Sheet
Spring 2026 · ESE · Dr. Khuram Shahzad
Scope W02 · W03 (P2P only) · W04–W07 · W10–W14
Pattern 4 × 25 marks → Q1 MCQ · Q2-Q4 multi-part theory + cases + activities
W02 Distributed Architectures
Software Architecture
Logical organization — components, interfaces, how they connect & communicate.
System Architecture
Physical realization — which component runs on which machine.
4 Architectural Styles
Layered
Stacked layers, only adjacent talk. OSI, TCP/IP
Object-based
Objects + interfaces + encapsulation. Client-server
Data-centered
Shared data store; passive DB or active blackboard.
Event-based
Pub/Sub event propagation. Kafka
System Architectures
Centralized
Client-server, vertical hierarchy, request-reply. LAN=connectionless; WAN=TCP/IP.
Decentralized
P2P, horizontal, symmetric. Every node = servant.
Hybrid
Mixes C/S + P2P. Edge servers, BitTorrent
Delivery Semantics & Idempotency
| Semantic | Meaning |
|---|---|
| At-most-once | 0 or 1 delivery — may LOSE |
| At-least-once | 1+ delivery — may DUPLICATE |
| Exactly-once | Exactly 1 — no loss, no dup |
Idempotent = safe to repeat (e.g., "return X"). Non-idempotent = "increment X", "charge card".
🔗 Chord DHT (Quiz 1 favorite!)
m-bit ring. Key k stored at successor(k) = smallest node id ≥ k. Finger table has m rows.
Lookup: O(log N) hops. Doubling the network adds 1 finger-table row (m → m+1).
Asymmetric Distance
m=4, A=2, B=15: dist(A,B)=13; dist(B,A)=2+16−15=3
N8 Finger Table (m=6)
| i | 8 + 2^(i−1) | successor |
|---|---|---|
| 1 | 9 | N14 |
| 2 | 10 | N14 |
| 3 | 12 | N14 |
| 4 | 16 | N21 |
| 5 | 24 | N32 |
| 6 | 40 | N42 |
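A minimal Python sketch of these rules. The full ring {1, 8, 14, 21, 32, 38, 42, 48, 51, 56} is an assumed node set, chosen only to be consistent with the N8 table above.

```python
# Chord helpers: asymmetric distance, successor, finger table (m = 6).
M = 6
RING = 2 ** M
NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])  # assumed ring

def dist(a, b, ring=RING):
    """Asymmetric clockwise distance from a to b."""
    return (b - a) % ring

def successor(k):
    """Smallest node id >= k, wrapping past the top of the ring."""
    return next((n for n in NODES if n >= k), NODES[0])

def finger_table(n):
    """Row i targets n + 2^(i-1); each target is stored at its successor."""
    return [(i, successor((n + 2 ** (i - 1)) % RING)) for i in range(1, M + 1)]

print(dist(2, 15, 16), dist(15, 2, 16))  # 13 3 -> the m=4 example above
print(finger_table(8))  # rows point at N14, N14, N14, N21, N32, N42
```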
Other W02 Concepts
Multi-tier
Two-tier: thin-client vs fat-client. Three-tier: Presentation / Logic / Data.
Vertical vs Horizontal
Vertical = split layers across machines. Horizontal = replicate same layer (load balance).
Blockchain
Append-only chain, immutable blocks, massive replication. Need consensus on who appends.
Cloud Layers
Hardware → Infrastructure (VMs) → Platform (S3) → Application.
W03 P2P Systems
Common primitives: Join · Publish · Search · Fetch
The Big 5 — Compare & Contrast
| System | Architecture | Search | Decentralization | Key Innovation |
|---|---|---|---|---|
| Napster | Centralized index | Central server | None | Easy UI |
| Skype | Hybrid + supernodes | Supernode lookup | Partial | NAT traversal |
| Gnutella | Pure P2P | Query flooding | Full | No central server |
| KaZaA | Hybrid + supernodes | Supernode-routed | Partial | FastTrack supernodes |
| BitTorrent | Hybrid (tracker/DHT) | External/DHT | High | Swarming + tit-for-tat |
🌊 P2P Architecture Visual
Gnutella — 5 Message Types
Ping (probe) · Pong (reply with address + shared files) · Query (flooded search) · QueryHit (match, routed back) · Push (transfer for firewalled peers).
⚠️ Flooding Explosion
With TTL=7 and b=5 neighbors per peer: 5 + 5·4 + 5·4² + … + 5·4⁶ messages — exponential. This is why KaZaA introduced supernodes and BitTorrent uses swarming.
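A quick sanity check of that count, assuming the source sends to b neighbors and every later hop fans out to b−1 new links:

```python
# Message count for TTL-limited flooding with b neighbors per peer.
def flood_messages(b, ttl):
    # hop 1 sends b messages; hop h sends b * (b-1)^(h-1)
    return sum(b * (b - 1) ** (h - 1) for h in range(1, ttl + 1))

print(flood_messages(5, 7))  # 5 + 5*4 + ... + 5*4^6 = 27305 messages
```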
BitTorrent Vocabulary
| Term | Meaning |
|---|---|
| Torrent file | Metadata + tracker URL |
| Tracker | Coordinates peers in swarm (optional with DHT) |
| Seeder | Peer with complete file |
| Leecher | Peer still downloading |
| Swarm | Peers sharing same file |
| Tit-for-tat | Upload fairness; rewards uploaders |
W04 Time & Clocks
Cristian's Algorithm (Physical, Passive Server)
Client sends request at T1; server receives at T2 and sends the reply at T3; client receives at T4. Offset θ = ((T2−T1)+(T3−T4))/2; delay δ = ((T4−T1)−(T3−T2))/2. NTP: repeats 8 times, takes the min-delay sample for the best estimate.
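A minimal sketch of the offset/delay formulas; the four timestamps are hypothetical:

```python
# Cristian/NTP-style estimation from one request/reply exchange.
def cristian_offset(t1, t2, t3, t4):
    offset = ((t2 - t1) + (t3 - t4)) / 2  # how far the client lags the server
    delay = ((t4 - t1) - (t3 - t2)) / 2   # estimated one-way network delay
    return offset, delay

print(cristian_offset(10.0, 10.8, 10.9, 11.3))  # (0.2, 0.6)
```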
Berkeley Algorithm
No machine has accurate clock. Master polls slaves → averages → sends OFFSETS (not absolute times) back. Sending offsets avoids fresh RTT uncertainty at slave.
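A sketch of the averaging step, with made-up clock readings:

```python
# Berkeley-style averaging: master polls slaves, averages all clocks
# (including its own), and returns per-machine OFFSETS, not absolute times.
def berkeley(master_time, slave_times):
    clocks = [master_time] + slave_times
    target = sum(clocks) / len(clocks)
    return [target - t for t in clocks]  # offset for master, then each slave

print(berkeley(3.00, [3.25, 2.75]))  # [0.0, -0.25, 0.25]
```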
⏱️ Lamport's Logical Clocks
Happens-Before (→)
- Same process, a before b ⇒ a → b
- a = send, b = matching receive ⇒ a → b
- Transitive: a→b, b→c ⇒ a→c
Concurrent (a ‖ b): neither a→b nor b→a.
Update Rules
- Before each event at Pᵢ: Cᵢ = Cᵢ + 1
- On send: ts(m) = Cᵢ
- On receive at Pⱼ: Cⱼ = max(Cⱼ, ts(m)) + 1
⚠️ Lamport Limitation
If a → b then C(a) < C(b) ✓ but the converse does NOT hold. C(a) < C(b) doesn't prove a→b — could be concurrent.
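A minimal Lamport clock in Python, following the update rules above:

```python
# One Lamport clock per process; rules as listed under "Update Rules".
class LamportClock:
    def __init__(self):
        self.c = 0

    def tick(self):             # local event: C_i = C_i + 1
        self.c += 1
        return self.c

    def send(self):             # send counts as an event; ts(m) = C_i
        self.c += 1
        return self.c

    def receive(self, ts):      # C_j = max(C_j, ts(m)) + 1
        self.c = max(self.c, ts) + 1
        return self.c

p1, p2 = LamportClock(), LamportClock()
m = p1.send()         # p1: C = 1, message carries ts = 1
print(p2.receive(m))  # p2: max(0, 1) + 1 = 2
```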
📍 Vector Clocks
Each process keeps VCᵢ[1..n]. VCᵢ[i] = own events; VCᵢ[j] = what i knows about j.
Rules
- Local event / send at Pᵢ: VCᵢ[i]++; message carries ts(m) = VCᵢ
- Receive at Pⱼ: VCⱼ[k] = max(VCⱼ[k], ts(m)[k]) ∀k, then VCⱼ[j]++
- Concurrent ⟺ neither vector dominates the other (sketch below)
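A sketch of the vector-clock rules; the final check mirrors MCQ 4 below:

```python
# Vector clocks as plain lists of length n.
def vc_event(vc, i):
    vc[i] += 1                       # local event or send: VC_i[i]++

def vc_receive(vc, ts, j):
    for k in range(len(vc)):         # component-wise max with ts(m)
        vc[k] = max(vc[k], ts[k])
    vc[j] += 1                       # then count the receive itself

def concurrent(a, b):
    # a || b  <=>  neither vector dominates the other
    return any(x > y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

print(concurrent([2, 0, 0], [0, 0, 1]))  # True: concurrent events
```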
Matrix Clocks
n×n matrix per process. M[i][*] = own vector clock. M[j][*] = what i knows about j's vector clock. Used when you need to know what others know (e.g., log GC).
W05 Synchronization Algorithms
Totally-Ordered Multicast
🏦 Bank Problem
$1000 in NY & SF. NY adds 1%, SF adds $100. Without ordering → NY: $1111, SF: $1110. Need same order at every replica.
Algorithm
- Multicast each update with its Lamport timestamp (to everyone, including self)
- Every receiver queues messages sorted by timestamp and multicasts an ACK
- Deliver only the message at the head of the queue, once ACKed by all
Ties broken by process ID. This guarantees same delivery order at every node (sketch below).
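A sketch of the delivery condition, assuming N processes and an ACK-collection map (all names are illustrative):

```python
# Totally-ordered multicast: deliver the head of the timestamp-sorted
# queue only once every process has acknowledged it.
import heapq

N = 3                         # number of processes (assumed)
queue = []                    # entries: (ts, pid, msg); (ts, pid) breaks ties
acks = {}                     # (ts, pid) -> set of processes that ACKed

def on_multicast(ts, pid, msg):
    heapq.heappush(queue, (ts, pid, msg))

def on_ack(ts, pid, from_pid):
    acks.setdefault((ts, pid), set()).add(from_pid)

def deliverable():
    while queue and len(acks.get(queue[0][:2], ())) == N:
        yield heapq.heappop(queue)[2]   # same order at every replica
```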
Causality vs Total Order
Causality
a→b means a may affect b
Partial Order
Only causally-related events ordered
Total Order
ALL events ordered (concurrent via tie-break)
Schiper-Eggli-Sandoz (SES)
Causal ordering with unicast only (no broadcast needed).
Each message carries a vector of "what sender has sent to each process".
Trade-off: fewer messages, but bigger messages + more state. Clock increments only on receive.
Matrix Clocks (recap)
Needed when application requires knowing what other processes know — e.g., garbage-collect message log entries that everyone has already seen.
W06 Mutual Exclusion & Election
3 Required Properties
Safety
≤1 process in CS at a time
Liveness
Every request eventually granted
Fairness
Requests served in arrival order (logical clock)
📊 Mutual Exclusion Algorithms — Compared
| Algorithm | Type | Messages/CS | Notes |
|---|---|---|---|
| Centralized | Coordinator | 3 | SPOF |
| Lamport | Permission | 3(N−1) | REQUEST + REPLY + RELEASE |
| Ricart-Agrawala | Permission | 2(N−1) | No separate release |
| Token Ring | Token | 1 to ∞ | No starvation |
| Suzuki-Kasami | Token+broadcast | 0 or N | Counter-based |
| Decentralized (voting) | Quorum | 2m·N | Possible starvation |
Lamport's Mutex
3 message types: REQUEST, REPLY, RELEASE. Channels must be FIFO.
Enter CS when:
- L1: Received a msg with TS > my request TS from all other sites
- L2: My request is at head of RQᵢ (sorted by TS)
Exit: Remove from own RQ, broadcast RELEASE → others remove it.
Ricart-Agrawala (improved Lamport)
2 message types: REQUEST, REPLY.
On receiving REQUEST <Tⱼ, Pⱼ> at Pᵢ:
- If state = Held → queue it
- If state = Wanted AND (Tᵢ,i) < (Tⱼ,j) → queue it
- Else → send REPLY immediately
Enter CS when received REPLY from ALL N−1 others. On exit → REPLY to queued.
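A sketch of the receive rule; fed Case B's timestamps (below), it reproduces P1 replying to P2:

```python
# Ricart-Agrawala: queue or immediately REPLY to REQUEST <req_ts, req_id>.
def on_request(state, my_ts, my_id, req_ts, req_id, deferred):
    mine_first = (my_ts, my_id) < (req_ts, req_id)   # lexicographic tie-break
    if state == "HELD" or (state == "WANTED" and mine_first):
        deferred.append(req_id)       # defer REPLY until we exit the CS
        return None
    return ("REPLY", req_id)          # not competing, or their request wins

deferred = []
print(on_request("WANTED", 10, 1, 8, 2, deferred))  # P1 REPLYs to P2's <8,2>
```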
Suzuki-Kasami (Token-based)
Each site has request counter RNᵢ[]; token holds LN[] (last completed per site).
Request: broadcast REQUEST(i, n) with n = RNᵢ[i]+1.
Token holder sends token to Pⱼ if RN[j] = LN[j]+1.
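A sketch of the token-holder test, with RN/LN as defined above (the sample arrays are hypothetical):

```python
# Suzuki-Kasami: site j has an outstanding request iff RN[j] = LN[j] + 1.
def should_send_token(RN, LN, j):
    return RN[j] == LN[j] + 1

RN = [0, 2, 1]   # highest request number seen per site
LN = [0, 1, 1]   # last completed request per site (carried in the token)
print(should_send_token(RN, LN, 1))  # True: site 1 is waiting for the token
```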
🗳️ Election Algorithms
Bully Algorithm
The biggest alive ID wins. On noticing coordinator failure, P sends ELECTION to every higher ID. Each higher process replies OK (knocking P out) and starts its own election; a process that hears no OK wins and broadcasts COORDINATOR.
Ring Algorithm
Processes in a logical ring. P notices failure → builds ELECTION message with own ID, sends to successor. Each node appends its ID, forwards. When message returns to originator → pick max ID → recirculate as COORDINATOR.
Chang-Roberts (uniform, N unknown)
Unidirectional ring; every process circulates its ID. A node forwards IDs larger than its own and discards smaller ones; the process whose own ID comes back around wins. Average O(N log N) messages, worst case O(N²).
Hirschberg-Sinclair (bidirectional ring)
Phase r: probe left & right to distance 2ʳ. Survive if largest in 2ʳ-neighborhood. Messages: O(N log N).
W07 Global State & Chandy-Lamport Snapshot
Global state = state of every process + state of every channel (msgs in transit).
Uses
Checkpoint
Restart on failure
GC
Find orphaned objects
Deadlock
Detect cycles
Termination
Detect job done
Assumptions (Chandy-Lamport)
- Channels are FIFO
- No process or channel failures
- Messages arrive intact, no duplication
🎯 Algorithm
Initiator Pᵢ
- Records own state Sᵢ
- Sends Marker on each outgoing channel
- Starts recording incoming msgs on every incoming channel
On receiving Marker at Pᵢ from channel C_ki
If first marker:
- Record own state
- Mark C_ki = ∅
- Send Markers on all outgoing channels
- Start recording on all other incoming channels
Else (already saw a marker):
- C_ki state = msgs received since recording started
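A per-process sketch of the marker rules; the message transport (send/receive plumbing) is assumed:

```python
# Chandy-Lamport bookkeeping for one process.
class Snapshot:
    def __init__(self, incoming_channels):
        self.recorded = False
        self.state = None
        self.chan = {c: [] for c in incoming_channels}  # recorded channel msgs
        self.open = set()                               # channels still recording

    def record_local(self, local_state, send_markers):
        self.recorded, self.state = True, local_state
        send_markers()               # marker on every outgoing channel
        self.open = set(self.chan)   # start recording all incoming channels

    def on_message(self, c, msg, local_state, send_markers):
        if msg == "MARKER":
            if not self.recorded:                       # first marker
                self.record_local(local_state, send_markers)
            self.open.discard(c)     # channel c's state is now final (∅ if first)
        elif c in self.open:
            self.chan[c].append(msg)  # an in-transit message belongs to c's state
```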
✂️ Consistent Cut
A cut is consistent iff: for every event e in the cut, every event f with f→e is also in the cut.
Inconsistent: A receive is in the cut but its send is outside — a message from the "future"!
✅ Chandy-Lamport always produces a consistent cut.
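One way to test a cut mechanically, using vector clocks; the frontier encoding and the sample values are my own illustration, not from the slides:

```python
# A cut is given by one frontier event per process (its vector clock).
# Consistent iff no process "knows" more of P_i than the cut includes.
def consistent(frontier):
    n = len(frontier)
    return all(frontier[j][i] <= frontier[i][i]
               for i in range(n) for j in range(n))

print(consistent([[2, 0], [1, 1]]))  # True
print(consistent([[1, 0], [2, 1]]))  # False: receive without its send
```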
W10 Hadoop & MapReduce
Open-source impl of Google MapReduce (Doug Cutting, 2005). Master-slave, shared-nothing, commodity hardware.
Two-Layer Architecture
Processing layer: MapReduce engine (JobTracker + TaskTrackers). Storage layer: HDFS (NameNode + DataNodes).
MapReduce Phases
Input splits → Map (emit key/value pairs) → Shuffle & Sort (group by key) → Reduce (aggregate) → Output. (Word-count sketch below.)
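A hypothetical word count in the MapReduce style, with the shuffle simulated by a dict:

```python
# Map emits (word, 1); the framework groups by key (shuffle & sort);
# reduce sums the counts per word.
from collections import defaultdict

def map_fn(line):
    return [(w.lower(), 1) for w in line.split()]

def reduce_fn(word, counts):
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog"]
groups = defaultdict(list)               # shuffle & sort stand-in
for line in lines:
    for k, v in map_fn(line):
        groups[k].append(v)
print(sorted(reduce_fn(k, v) for k, v in groups.items()))
```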
HDFS Quick Facts
NameNode
Master; metadata; file → block → datanode map
DataNode
Slaves; store blocks (64 or 128 MB), replicated 3×
Hadoop vs RDBMS
| | RDBMS | Hadoop |
|---|---|---|
| Schema | Fixed, ACID | Schema-on-read |
| Mode | Read/Write | Mostly read |
| Hardware | Expensive servers | Commodity |
| Failures | Rare | Normal |
| Unit of work | Transaction | Job |
📈 PageRank
PR(p) = (1−d)/N + d · Σ_{q→p} PR(q)/out(q). Damping (d): models a random surfer who jumps to a random page with prob (1−d). Iterate until convergence.
| Page | Init | Iter 1 | Iter 2 | Rank |
|---|---|---|---|---|
| A | 1 | 1.5 | 1.5 | 🥇 1 |
| B | 1 | 1.25 | 1.375 | 🥈 2 |
| D | 1 | 0.75 | 0.625 | 🥉 3 |
| C | 1 | 0.5 | 0.5 | 4 |
W11 Paxos — Distributed Consensus
Goal & Properties
Validity
Chosen value must be one that was proposed
Majority
(N+1)/2 votes needed. N = 2m+1 tolerates m failures
3 Roles
Proposer
Receives client req, drives consensus
Acceptor
Votes on proposals
Learner
Announces outcome
In practice, all 3 roles co-exist on each node.
🎯 Two-Phase Protocol
Phase 1: PREPARE → PROMISE
- Proposer picks unique ID N (e.g., counter.pid)
- Sends Prepare(N) to majority
- Acceptor: if N > all previously seen → return Promise(N, U) where U = highest already-accepted (or none)
- Promise = vow to refuse anything < N
Phase 2: ACCEPT → ACCEPTED
- If proposer got majority promises:
- If any promise carries accepted U → must propose U's value (validity!)
- Else propose own value V
- Send Accept(N, V) → acceptors confirm
- Majority Accepted → consensus reached (acceptor sketch below)
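A single-decree acceptor sketch following the two phases; transport is assumed, and the trace mirrors Case F below:

```python
# Paxos acceptor: promise the highest Prepare seen, accept if not outbid.
class Acceptor:
    def __init__(self):
        self.promised = -1        # highest proposal ID promised
        self.accepted = None      # (N, value) already accepted, or None

    def on_prepare(self, n):
        if n > self.promised:
            self.promised = n
            return ("PROMISE", n, self.accepted)   # carries any prior value
        return ("NACK", n)

    def on_accept(self, n, v):
        if n >= self.promised:    # no higher promise outstanding
            self.promised = n
            self.accepted = (n, v)
            return ("ACCEPTED", n, v)
        return ("NACK", n)

a = Acceptor()
print(a.on_prepare(10))       # PROMISE 10
print(a.on_prepare(12))       # PROMISE 12 (higher ID)
print(a.on_accept(10, "X"))   # NACK: already promised 12
print(a.on_accept(12, "Y"))   # ACCEPTED -> Case F's outcome
```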
🇵🇰 Worked Example: NADRA CNIC Update
5 acceptors R1–R5. P1 (ID=15.1, "Islamabad") vs P2 (ID=15.2, "Karachi"). Higher ID wins promises.
- P2's Prepare(15.2) arrives → all promise (15.2 > nothing)
- P1's Prepare(15.1) → rejected (already promised 15.2)
- P2's Accept(15.2, Karachi) → majority accepts → consensus: Karachi
- P1 retries with ID=16.1; promises now carry accepted "Karachi" → P1 MUST use Karachi (validity)
R3 crashes: still 4 alive ≥ 3 majority → ✅ tolerated. 3 crashes: only 2 alive < 3 → halt.
W12 Fault Tolerance
Fault → Error → Failure: a fault (defect) causes an error (invalid internal state), which may surface as a failure (observable deviation from the spec).
Fault Types by Duration
Transient
Temporary, vanishes (cosmic ray, glitch)
Intermittent
Random/repeated (loose cable)
Permanent
Stays until repair (dead disk)
Fault Behavior
✅ Fail-silent (fail-stop)
Stops outputting → easy to detect
☠️ Byzantine
Wrong/malicious output → HARD to detect
4 Handling Strategies
| Strategy | What it does | Example |
|---|---|---|
| Prevention | Avoid the fault | Better HW, careful coding |
| Tolerance | Mask the fault | Redundancy, voting |
| Removal | Reduce frequency | Testing, patches |
| Forecasting | Estimate future | Monitoring, ML |
Redundancy Types
Information
Extra bits (parity, Hamming, ECC)
Time
Retry on failure
Physical
Backup HW/SW (RAID, replicas)
🎯 Triple Modular Redundancy (TMR)
Three modules compute the same inputs in parallel; a majority voter masks any single faulty module (sketch below).
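A minimal voter sketch, assuming the three modules return comparable values:

```python
# TMR voter: return the value at least two of the three modules agree on.
from collections import Counter

def tmr(module_outputs):
    value, votes = Counter(module_outputs).most_common(1)[0]
    return value if votes >= 2 else None   # None: no majority (2+ faults)

print(tmr([42, 42, 7]))  # 42: one faulty module is masked
```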
Primary-Backup
Primary handles requests; backup gets heartbeats; on primary failure → backup takes over.
Availability
| Nines | Annual downtime |
|---|---|
| 99% (two) | ~3.65 days |
| 99.9% (three) | ~8.76 hours |
| 99.99% (four) | ~52 minutes |
| 99.999% (five) | ~5.26 minutes |
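The table follows from one line of arithmetic:

```python
# Annual downtime from availability ("nines"); reproduces the table above.
def annual_downtime_hours(availability):
    return (1 - availability) * 365 * 24

for a in (0.99, 0.999, 0.9999, 0.99999):
    print(a, round(annual_downtime_hours(a), 2), "hours")
# 87.6 h (~3.65 days), 8.76 h, 0.88 h (~52 min), 0.09 h (~5.26 min)
```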
W13 CAP, Consistency, Eventual
ACID (single DB)
Atomic
All-or-nothing
Consistent
Preserves invariants
Isolated
No interference
Durable
Commit = permanent
2PC — Two-Phase Commit
Phase 1 (voting): coordinator sends Prepare; participants vote Commit/Abort and hold locks. Phase 2 (decision): coordinator broadcasts global Commit (all voted yes) or Abort.
⚠️ Blocking Problem
If coordinator crashes after Phase 1 but before Phase 2 → participants stuck "in-doubt", resources locked. Fix: 3PC or Paxos/Raft.
🔺 CAP Theorem
Partitions are inevitable in real networks → you MUST pick P → real choice is C vs A.
CP Banking, ZooKeeper, HBase
Refuse minority side during partition → no split-brain. Safety over uptime.
AP Cassandra, DynamoDB
Accept writes both sides → reconcile later. Uptime over safety.
BASE (alternative to ACID)
Basically Available
Uptime first
Soft state
State may shift without input (bg sync)
Eventual consistency
Replicas eventually converge
Conflict Resolution (Eventual Consistency)
LWW
Last Writer Wins by timestamp → may lose data
Vector Clocks
Detect causality, surface real conflicts
CRDTs
Auto-merge data types (Counter, Set)
⚖️ Quorum Math
N = total replicas, W = write quorum, R = read quorum. Rule: strong consistency needs W + R > N.
Why? Read set and write set must overlap on ≥1 node → reader sees latest write.
Examples
- N=3, W=2, R=2: 4 > 3 ✓ strong
- N=3, W=1, R=1: 2 < 3 ✗ eventual only
- N=5, W=3, R=3: 6 > 5 ✓ tolerates 2 failures
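The check behind these examples, in one function:

```python
# Quorum overlap test: read and write sets must intersect on >= 1 node.
def strong(n, w, r):
    return w + r > n

for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3)]:
    print(f"N={n} W={w} R={r}:", "strong" if strong(n, w, r) else "eventual")
```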
Summary Table
| Protocol | Strength | Weakness | Use Case |
|---|---|---|---|
| 2PC | Strict ACID | Blocking | Local Tx |
| Paxos/Raft | Strong consistency | Latency | Distributed ledger |
| Eventual | Massive scale | Divergence | Feeds, carts |
| Quorum | Configurable | Tuning complex | NoSQL |
W14 Distributed AI — Billion-Parameter Training
💾 Memory Math (memorize!)
Per parameter with Adam optimizer + FP16 training:
| Component | Bytes/param | 30B model | 175B model |
|---|---|---|---|
| Weights (FP16) | 2 | 60 GB | 350 GB |
| Gradients (FP16) | 2 | 60 GB | 350 GB |
| Adam states (FP32: master+momentum+variance) | 12 | 360 GB | 2.1 TB |
| Total | 16 | ~480 GB | ~2.8 TB |
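The table reduces to one multiplication (GB per billion parameters):

```python
# FP16 training with Adam: 2 (weights) + 2 (grads) + 12 (FP32 master +
# momentum + variance) = 16 bytes per parameter.
def training_gb(params_billion):
    return params_billion * (2 + 2 + 12)   # 1e9 params * bytes -> GB

for p in (30, 175):
    print(f"{p}B params -> {training_gb(p)} GB")  # 480 GB, 2800 GB (~2.8 TB)
```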
Interconnect Hierarchy
| Link | Bandwidth | Scope |
|---|---|---|
| NVLink | 600 GB/s | Intra-node (8 GPUs) |
| InfiniBand | 50–100 GB/s | Inter-node |
| PCIe Gen4/5 | 32–64 GB/s | GPU ↔ CPU |
🎲 3D Parallelism
Data Parallel (replicate the model, split the batch) × Tensor Parallel (split each layer's matrices) × Pipeline Parallel (split the layer stack into stages).
⚡ Topology Rule (Activity 2)
TP must stay intra-node — communicates every layer, needs NVLink 600 GB/s.
PP can span nodes — only sends activations between stages, tolerates InfiniBand 50 GB/s.
NCCL Collective Primitives
| Primitive | What it does |
|---|---|
| All-Reduce | Sum from all → result to all (core of DP) |
| Broadcast | Rank 0 → everyone |
| Scatter / Gather | Split / collect |
| Reduce-Scatter | Reduce, then distribute shards |
| All-Gather | Collect shards from all |
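A toy All-Reduce over Python lists, decomposed as Reduce-Scatter followed by All-Gather the way a ring implementation does; the ranks and gradients are simulated:

```python
# All-Reduce (sum) across n simulated ranks: element i is treated as
# the shard owned by rank i (reduce-scatter), then shards are gathered.
def all_reduce(ranks):
    n = len(ranks)
    shard_sums = [sum(r[i] for r in ranks) for i in range(n)]  # reduce-scatter
    return [shard_sums[:] for _ in range(n)]                   # all-gather

grads = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # one gradient vector per rank
print(all_reduce(grads))                    # every rank ends with [12, 15, 18]
```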
FSDP & ZeRO
FSDP (Fully Sharded Data Parallel)
Shards weights + grads + optimizer across GPUs. 1/N memory vs DDP.
- Forward: All-Gather weights → compute → discard
- Backward: All-Gather + Reduce-Scatter grads
ZeRO Stages (DeepSpeed)
| Stage | Shards | Saving |
|---|---|---|
| 1 | Optimizer | ~4× |
| 2 | + Gradients | ~8× |
| 3 | + Parameters | ~64× on 64 GPUs (= FSDP) |
ZeRO-Offload: push to CPU RAM / NVMe → 10× larger models, slower (PCIe latency).
Other Tricks
BF16 / FP16
BF16 = same range as FP32; 2× throughput on Tensor Cores; 50% memory.
Activation Checkpointing
Discard activations → recompute on backward. 33% extra compute for 5× memory.
DDP vs DataParallel
DDP (multi-process, no GIL) overlaps comm + compute. DataParallel is GIL-bound.
Industry Cases
Llama-2 (Meta, 2000 A100)
HW failures certain → frequent checkpoints to NVMe + DCGM monitoring + auto-drain via Slurm.
BLOOM Straggler
20% slower nodes blocked All-Reduce → NCCL_DEBUG profiling → found 5 faulty IB cables → replaced → 100% speed.
⚡ Practice MCQs (Q1 style)
30 questions across all topics.
1. Which P2P system uses a centralized index but P2P file transfer?
2. In Lamport timestamps, if a→b then:
3. If C(a) < C(b), does a→b necessarily hold?
4. Vector clocks (2,0,0) vs (0,0,1) are:
5. Cristian's delay estimate:
6. Berkeley algorithm differs from Cristian's because:
7. Ricart-Agrawala messages per CS:
8. Lamport's mutex messages per CS:
9. Suzuki-Kasami is:
10. In the Bully algorithm, who wins?
11. Chandy-Lamport requires channels to be:
12. The marker message in Chandy-Lamport is used to:
13. A consistent cut requires:
14. HDFS default replication factor:
15. Shuffle & Sort happens:
16. Paxos can tolerate m failures with how many nodes?
17. In Paxos, a proposer whose promises contain accepted values:
18. CAP theorem — during a partition you must:
19. BASE stands for:
20. For strong consistency with quorums:
21. TMR can mask how many faulty modules?
22. Fail-silent faults are:
23. Byzantine faults:
24. Tensor Parallelism is typically confined to:
25. ZeRO Stage 3 / FSDP shards:
26. NCCL primitive used to sum gradients in DDP:
27. In Chord with m-bit IDs, lookup complexity:
28. Idempotent operation example:
29. 2PC blocking problem occurs when:
30. Vector clock receive rule:
📝 Worked Cases (Q2-Q4 style)
Case A: Lamport Trace
P1 sends m1 to P2 at C1=2. P2 sends m2 to P3 at next event. P3 sends m3 back to P1.
P1: 0 → 1 → 2 (send m1, ts=2)
P2: on receive: max(0,2)+1 = 3 → 4 (send m2)
P3: on receive: max(0,4)+1 = 5 → 6 (send m3)
P1: on receive: max(2,6)+1 = 7
Case B: Ricart-Agrawala
P1 (ts=10), P2 (ts=8), P3 idle.
- P1 multicasts REQ<10,1>; P2 multicasts REQ<8,2>
- P2 receives <10,1>: P2 Wanted, (8,2)<(10,1) → queues
- P1 receives <8,2>: P1 Wanted, (10,1)>(8,2) → sends REPLY
- P3 has no request → REPLYs to both
- P2 has all REPLYs → enters CS
- P2 exits → REPLYs to queued P1 → P1 enters CS
Case C: Bully Election
5 processes {1..5}, P5 crashes, P2 notices.
P2 sends ELECTION to {3,4,5} → P3 and P4 reply OK and start their own elections. P3 sends to {4,5} → P4 replies OK. P4 sends to {5} → no reply (P5 dead) → P4 wins → broadcasts COORDINATOR.
Case D: 2PC Blocking
Coordinator C with P1, P2, P3. C → Prepare. All vote Commit, enter wait. C crashes. P1-P3 stuck "in-doubt" — must hold locks until C recovers. Throughput on affected records = 0. Fix: 3PC or Paxos/Raft.
Case E: Quorum Design (TV Voting)
N=5, want strong consistency, tolerate failures.
W=3, R=3 → W+R=6 > N=5 ✓
With 2 node failures: 3 alive ≥ 3 → ✅ still works
With 3 failures: 2 alive < 3 → halt (correctness preserved)
Case F: Paxos Contention
P1 (ID=10) and P2 (ID=12). 3 acceptors.
- P1 sends Prepare(10). All promise.
- P2 sends Prepare(12). All promise (12>10).
- P1's Accept(10, "X") → rejected (promised ≥12)
- P2's Accept(12, "Y") → accepted → consensus on Y
- P1 retries with Prepare(13). Promises now carry ("Y"). P1 must use "Y" (validity)
Case G: PageRank One Iteration
Pages A,B,C,D with links A→B, A→C, B→C, C→A, D→C. d=0.85. Init PR=1.
Out: |A|=2, |B|=1, |C|=1, |D|=1.
- PR(A) = 0.0375 + 0.85·(1/1) = 0.8875
- PR(B) = 0.0375 + 0.85·(1/2) = 0.4625
- PR(C) = 0.0375 + 0.85·(0.5 + 1 + 1) = 2.1625
- PR(D) = 0.0375 + 0 = 0.0375
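A sketch that reproduces these numbers, with the graph hard-coded from the case:

```python
# One PageRank iteration: PR(p) = (1-d)/N + d * sum(PR(q)/out(q)) over q->p.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
d, N = 0.85, 4
pr = {p: 1.0 for p in links}                  # init PR = 1

new = {}
for p in links:
    inbound = sum(pr[q] / len(links[q]) for q in links if p in links[q])
    new[p] = (1 - d) / N + d * inbound
print(new)  # A: 0.8875, B: 0.4625, C: 2.1625, D: 0.0375
```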
Case H: 13B Llama Memory (BF16)
- Weights: 13 × 2 = 26 GB
- Gradients: 13 × 2 = 26 GB
- Adam (FP32, 12 B/param): 13 × 12 = 156 GB
- Total ≈ 208 GB
One 80 GB H100 can't hold it → need FSDP/ZeRO across multi-GPU.