A Field Manual № 01 · Spring MMXXVI · Prepared for the ESE · NUST SEECS

Parallel & Distributed Computing

A condensed but unhurried reading on time, order, consensus, and the strange politics of machines that must agree across an unreliable wire.

Course
CS / BSCS-13AB-2k23
Instructor
Dr. M. Khuram Shahzad
Scope
Weeks II, III (P2P), IV–VII, X–XIV
Paper
100 marks · 4 × 25 · MCQ + Theory
I · Week 02

Architectures of the Distributed

Before time, before consensus, the first question: where do the components live, and how do they speak? A distributed system is a collection of independent computers that, by some sleight of middleware, appears to its users as a single coherent thing.

Two definitions sit at the root of the field. A decentralized system is one in which processes and resources are necessarily spread across multiple computers. A distributed system is one in which they are sufficiently spread — with the goal of presenting, to the user, the illusion of a single machine.

Two views guide its construction: the integrative, in which existing networked computers are knit into one larger system, and the expansive, in which an existing networked system is grown by addition of more computers.

Software versus System Architecture

Software architecture

The logical organization of components: their interfaces, the data they exchange, and the manner of their connection. A middleware layer that hides distribution is software architecture in the small.

System architecture

The physical realization: which component runs on which machine. A centralized client-server is one such system; a fully decentralized peer-to-peer mesh is another.

Four Styles, Plainly Stated

The principal architectural styles
Style | Premise | Canonical example
Layered | Components in a strict stack; only adjacent layers converse. | OSI, TCP/IP
Object-based | Objects encapsulate data and expose methods through well-defined interfaces. | Client–server, CORBA
Data-centred | Processes communicate by reading and writing a shared repository. | Database (passive); Blackboard (active)
Event-based | Communication by propagation of events. Often fused with data-centred to make shared data spaces. | Publish / subscribe; Kafka

Goals of the Design

Resource sharing
storage · files · media
Transparency
access · location · failure
Openness
interfaces · portability
Dependability
availability · reliability
Security
confidentiality · integrity
Scalability
size · geography · admin

Seven Transparencies

Transparency | Hides
Access | Differences in data representation and how an object is accessed.
Location | Where an object is located.
Migration | That an object may move to another location.
Relocation | That an object may be moved while in use.
Replication | That an object is replicated.
Concurrency | That an object may be shared by several users.
Failure | The failure and recovery of an object.

Delivery Semantics & Idempotency

Three contracts a delivery mechanism may keep:

  • At-most-once — zero or one delivery. Messages may be lost.
  • At-least-once — one or more deliveries. Messages may be duplicated.
  • Exactly-once — exactly one. Neither lost nor duplicated. The most expensive guarantee.

An operation is idempotent when performing it twice has the same effect as performing it once. READ X is idempotent; INCREMENT X is not. In an unreliable network, the temptation is to retransmit; the danger is in retransmitting that which is not safely repeated.

Fail-safe rule · Design APIs to be idempotent at the boundary. Then "at-least-once" plus "deduplicate-on-arrival" becomes, for free, "exactly-once" in effect.
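The rule can be illustrated with a minimal sketch of a receiver that deduplicates on arrival. The class name `DedupReceiver`, the message ids and the amounts are hypothetical, and the "network" is just a list containing one duplicated message.

```python
class DedupReceiver:
    """Applies each message at most once, keyed by a sender-chosen message id."""

    def __init__(self):
        self.balance = 0
        self.seen = set()  # ids of messages already applied

    def deliver(self, msg_id, amount):
        # A retransmission carries the same id, so it is dropped here.
        if msg_id in self.seen:
            return False
        self.seen.add(msg_id)
        self.balance += amount  # the non-idempotent effect runs once per id
        return True


receiver = DedupReceiver()
# At-least-once delivery: "m1" arrives twice, as if its acknowledgement was lost.
for msg_id, amount in [("m1", 100), ("m1", 100), ("m2", 50)]:
    receiver.deliver(msg_id, amount)
print(receiver.balance)  # 150
```

The increment itself stays non-idempotent; it is the `deliver` boundary that becomes safe to call repeatedly.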

The Chord Ring — A Distributed Hash Table

Nodes are organized on a logical ring of 2ᵐ positions. Each node has an m-bit identifier; each data item is hashed to an m-bit key. The item with key k is stored at the smallest node whose identifier is at least k, that is, the successor of k.

Definition · Finger Table
For node n, the i-th finger points to the successor of (n + 2ⁱ⁻¹) mod 2ᵐ, for indices i = 1 … m.
[Figure 1.1: Chord ring with nodes N1, N8, N14, N21, N32, N42, N48, N51, showing the fingers from N8. Lookup cost: O(log N).]
Finger table · node 8
i | n + 2ⁱ⁻¹ | succ.
1 | 9 | N14
2 | 10 | N14
3 | 12 | N14
4 | 16 | N21
5 | 24 | N32
6 | 40 | N42
fig. 1.1 The Chord ring with eight nodes. Node N8 maintains six fingers; lookups proceed by repeated halving of the remaining arc, yielding logarithmic time. Distance is asymmetric: dist(A,B) = (B − A) mod 2ᵐ.
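The finger-table definition can be checked with a short sketch, using the node ids from fig. 1.1 and assuming m = 6 (six fingers). The helper names `successor` and `finger_table` are illustrative and not part of any Chord library.

```python
from bisect import bisect_left


def successor(nodes, key, m):
    """First node id >= key on the ring of 2**m positions, wrapping around."""
    ring = sorted(nodes)
    key %= 2 ** m
    idx = bisect_left(ring, key)
    return ring[idx % len(ring)]


def finger_table(n, nodes, m):
    """Rows (i, start, successor) with start = (n + 2**(i-1)) mod 2**m."""
    rows = []
    for i in range(1, m + 1):
        start = (n + 2 ** (i - 1)) % 2 ** m
        rows.append((i, start, successor(nodes, start, m)))
    return rows


NODES = [1, 8, 14, 21, 32, 42, 48, 51]
for row in finger_table(8, NODES, m=6):
    print(row)
```

The successors printed for node 8 are 14, 14, 14, 21, 32, 42, matching the figure's table.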

Two- and Three-Tiered Realizations

Two-tier

A thin client keeps only the display work on the client; the server handles processing and data. Easier to manage, at some cost in client-side performance. A fat client moves processing and some data to the client: this reduces server load and scales further, but is harder to administer.

Three-tier

Presentation, processing, data; each a separate machine. Vertical distribution splits logical layers across machines; horizontal distribution replicates the same layer for load.

Cloud, Edge, Blockchain

Cloud computing is layered: hardware (the metal), infrastructure (virtualization), platform (e.g. S3-style buckets), application. Edge-server systems push servers to the network's boundary — closer to ISP, closer to the user. Blockchains are append-only chains of immutable, massively replicated blocks; their hard problem is not the chain but the question of who may append.

The Eight Fallacies · Many distributed systems are needlessly complex, repaired post-hoc. The recurring sins: the network is reliable; the network is secure; the network is homogeneous; the topology does not change; latency is zero; bandwidth is infinite; transport cost is zero; there is one administrator. None of these are true.
II · Week 03

Peer-to-Peer Systems

A history of the last twenty-five years told as a single argument: how to find a file when no one is in charge. Each system proposes an answer; each answer is broken by the next.

Every peer-to-peer system answers four primitive verbs: join the network, publish what one has, search for what one wants, fetch what one finds. The interesting differences live in the second and third.

Five Architectures, in Order of Their Defeats

[Figure 2.1 panels: Napster (central index; peer-to-peer transfer) · Gnutella (flooded mesh; query, TTL, flood) · KaZaA (hierarchical supernodes; 60–150 leaves each) · Skype (supernodes + login server; NAT traversal) · BitTorrent (tracker + swarm; pieces, tit-for-tat)]
fig. 2.1 Five P2P architectures arranged by historical succession. Each addresses the failure mode of its predecessor: Napster's single index; Gnutella's exponential flooding; KaZaA's brittle supernode election; Skype's NAT problem; BitTorrent's swarm coordination.
Comparison — the five canonical systems
System | Architecture | Search | Decentralization | Innovation
Napster | Centralized index | Central server | None | Easy UI; first scale
Gnutella | Pure P2P | Query flood with TTL | Full | No central authority
KaZaA | Hybrid · FastTrack | Through supernodes | Partial | Hierarchical search
Skype | Hybrid · supernodes + login | Supernode discovery | Partial | NAT traversal
BitTorrent | Hybrid · tracker / DHT | Out-of-band sites or DHT | High | Swarming, tit-for-tat

Gnutella in Detail

The protocol's five messages:

Ping | Probe the network for other peers.
Pong | Reply to Ping; carries an IP and port.
Query | Search request, propagated to neighbours until TTL expires.
QueryHit | Returned along the reverse path when a match is found.
Push | Used when the supplier sits behind a firewall.
Result · Flooding Explosion
With TTL = 7 and b = 5 neighbours per peer, the maximum number of query messages generated is 5 + 5·4 + 5·4² + … + 5·4⁶ — exponential. The price of full decentralization is bandwidth.
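The size of that sum is easy to underestimate. A small sketch (the function name `max_query_messages` is illustrative) evaluates the bound b + b(b−1) + … + b(b−1)^(TTL−1) directly:

```python
def max_query_messages(b, ttl):
    """Upper bound b + b(b-1) + ... + b(b-1)**(ttl-1) on flooded query messages."""
    return sum(b * (b - 1) ** hop for hop in range(ttl))


print(max_query_messages(5, 7))  # 27305
```

A single query with the stated parameters can therefore generate up to 27,305 messages.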

BitTorrent's Vocabulary

Torrent file | Metadata referring to a tracker.
Tracker | Server keeping account of swarm membership.
Seeder · Leecher | Peer with the complete file · peer still downloading.
Swarm | The peers sharing one file.
DHT | Replaces the tracker for fully decentralized discovery.
Tit-for-tat | The incentive mechanism. Uploaders earn priority as downloaders.
III · Week 04

Time, Clocks & the Order of Things

Two computers will never agree on the exact time. The question is whether they need to. The answer, almost always, is no — they need to agree only on the order of events that matter.

Cristian's Algorithm

The server is passive and carries an accurate clock. The client estimates round-trip delay and applies half of it as a correction. The four observed times are recorded:

  1. Client A sends a request at T1.
  2. Server B receives at T2 and replies at T3, piggybacking both.
  3. Client A records the arrival at T4.
delay = ((T2 − T1) + (T4 − T3)) / 2

NTP repeats this eight times and takes the minimum-delay sample as its best estimate.
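The arithmetic on the four timestamps can be sketched as follows. The delay formula is the one stated above; the offset formula, offset = T3 + delay − T4, is the correction listed in the appendix. The function name `cristian` and the sample timestamps are illustrative assumptions.

```python
def cristian(t1, t2, t3, t4):
    """Return (delay, offset) from the four timestamps, assuming symmetric paths."""
    delay = ((t2 - t1) + (t4 - t3)) / 2
    offset = t3 + delay - t4  # amount to add to the client clock
    return delay, offset


# Hypothetical run: the server clock is 5 units ahead, the one-way delay is 2,
# and the server spends 1 unit between receiving and replying.
print(cristian(t1=100, t2=107, t3=108, t4=105))  # (2.0, 5.0)
```

The estimate is exact here only because the outbound and return delays were chosen to be equal; any asymmetry shows up as error in the offset.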

Berkeley alternative · When no machine has an accurate clock, an elected master polls slaves, averages, and sends offsets back. Sending an offset avoids fresh RTT uncertainty at the slave.
[Figure 3.1: client A sends REQUEST at T1; server B receives at T2 and sends REPLY <T2, T3> at T3; A receives at T4.]
fig. 3.1 Cristian's protocol: four observations, one symmetric assumption about path delay, one corrected clock.

Lamport's Logical Clocks

Lamport's insight: processes that do not communicate need not agree on time at all. Among those that do, we need only agree on the order of events that touch them.

Definition · Happens-before
a → b when (i) a precedes b on the same process; or (ii) a is the sending of a message and b its receipt; or (iii) by transitivity.

Events with neither a → b nor b → a are concurrent. The clock rules then write themselves:

  1. Before any event at process Pi: Ci ← Ci + 1.
  2. On sending a message: stamp it with Ci.
  3. On receiving a message m at Pj: Cj ← max(Cj, ts(m)) + 1.
[Figure 3.2: timelines for P₁, P₂, P₃ with Lamport stamps 1 to 7; messages m₁ (ts = 2), m₂ (ts = 4), m₃ (ts = 6).]
fig. 3.2 A three-process Lamport trace. The receiver of m sets its clock to max(local, ts(m)) + 1; here, P3's clock advances from 0 to 5 upon receiving m₂ with stamp 4.
The Limitation · a → b implies C(a) < C(b), but the converse does not hold: a smaller Lamport stamp does not imply a causal relation. To distinguish concurrency from precedence, we need vectors.
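The three clock rules fit in a few lines. The sketch below is a minimal illustration, not a full protocol: the class name `LamportClock` is hypothetical, and the receive rule is applied exactly as stated, with max(local, ts) + 1 counting the receive event itself.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Rule 1: advance before a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Rule 2: a send is an event; its new clock value is the stamp."""
        return self.tick()

    def receive(self, ts):
        """Rule 3: jump past both the local clock and the message stamp."""
        self.time = max(self.time, ts) + 1
        return self.time


p2, p3 = LamportClock(), LamportClock()
for _ in range(3):
    p2.tick()            # three earlier events on P2
stamp = p2.send()        # message leaves P2 with stamp 4
print(p3.receive(stamp)) # 5: P3 moves from 0 to max(0, 4) + 1
```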

Vector Clocks

Each process holds a vector VCi[1…n]. Position i counts its own events; position j records what Pi knows about Pj's clock.

  1. Before any event: VCi[i] ← VCi[i] + 1.
  2. On send: attach VCi.
  3. On receive m: for each k, VCj[k] ← max(VCj[k], ts(m)[k]); then VCj[j]++.
Property · Causality detected exactly
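a → b ⟺ VC(a) < VC(b), where VC(a) < VC(b) means every component of VC(a) is ≤ the matching component of VC(b) and at least one is strictly smaller. If neither vector dominates, the events are concurrent.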
[Figure 3.3: timelines for P₁, P₂, P₃ with vector stamps (1,0,0), (2,0,0), (2,1,0), (2,2,0), (2,2,1), (2,2,2); P₁'s later event (3,0,0) is concurrent with (2,2,2).]
fig. 3.3 Vector clocks make concurrency visible. P1's late event at (3,0,0) and P3's (2,2,2) are concurrent: neither dominates the other.
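The comparison and the receive rule can be written down directly. This is a sketch with illustrative function names and 0-based process indices; vectors are plain tuples.

```python
def happened_before(a, b):
    """True when a <= b in every component and the vectors differ."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a, b):
    """True when neither vector dominates the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


def merge_on_receive(local, stamp, j):
    """Receive rule at process j: componentwise max, then bump own slot."""
    merged = [max(x, y) for x, y in zip(local, stamp)]
    merged[j] += 1
    return tuple(merged)


print(concurrent((3, 0, 0), (2, 2, 2)))              # True
print(happened_before((2, 0, 0), (2, 1, 0)))         # True
print(merge_on_receive((0, 0, 0), (2, 0, 0), j=1))   # (2, 1, 0)
```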

Matrix Clocks — What Others Know

Each process maintains an n × n matrix. The row Mi[i] is the process's own vector clock, the principal vector. Other rows record what Pi knows of what others know. This is precisely the structure required to garbage-collect message logs, decide stable predicates, or run causal multicast protocols.

IV · Week 05

Synchronization & Ordered Multicast

If every replica must see the same updates in the same order — and updates do not commute — then logical clocks alone are insufficient. We must agree on a total order, even for concurrent events.

Motivating Problem · A bank account stands at $1,000. New York applies a 1 % interest; San Francisco deposits $100. If NY runs first, the balance becomes $1,110; if SF runs first, $1,111. With unordered multicast, the two replicas diverge. The order must be the same everywhere.

Totally-Ordered Multicast — Extended Lamport

  1. The sender stamps its update with its Lamport clock and multicasts to all processes, itself included.
  2. Each receiver places the message in a local queue, sorted by timestamp (ties broken by sender id).
  3. Each receiver multicasts a timestamped acknowledgement to all processes.
  4. A message is delivered to the application only when it stands at the head of the queue and has been acknowledged by every other process.
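The effect of the sorted queue can be seen on the bank example. The sketch below covers only the ordering step: it assumes every message has already been acknowledged by all processes, and the timestamps and sender ids (1 for New York, 2 for San Francisco) are hypothetical. Balances are kept in whole dollars so the arithmetic is exact.

```python
def deliver_in_total_order(received, balance=1000):
    """Sort queued updates by (Lamport timestamp, sender id), then apply them."""
    for _ts, _sender, op in sorted(received, key=lambda m: (m[0], m[1])):
        balance = op(balance)
    return balance


interest = (1, 1, lambda bal: bal + bal // 100)  # timestamp 1, sender 1: +1 %
deposit = (1, 2, lambda bal: bal + 100)          # timestamp 1, sender 2: +$100

replica_a = deliver_in_total_order([deposit, interest])  # deposit arrived first
replica_b = deliver_in_total_order([interest, deposit])  # interest arrived first
print(replica_a, replica_b)  # 1110 1110
```

Both replicas apply the interest first because the tie on timestamp 1 is broken by the lower sender id, regardless of arrival order.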

Ordering — the Distinctions

Causal
a → b means a may affect b
Partial order
only causally related
Total order
all events, with tie-break

Schiper–Eggli–Sandoz

For causal ordering when broadcast is unavailable. Each message carries a vector of "what this sender has sent to each other process". The receiver delivers only when its own state shows all causally prior messages have arrived. Trades message size for the absence of broadcast; clock advances only on receive.

Matrix Clocks — redux

Stronger than vector clocks where the application needs to know what other processes know. The principal vector is, on any process, larger or equal to every non-principal vector.

V · Week 06

Mutual Exclusion & Election

A shared resource demands one user at a time. Across a network, without a shared semaphore, this becomes a problem of agreement, of message economy, of failure handling. Then: who shall preside?

Three Requirements

Safety
at most one in CS
Liveness
every request eventually granted
Fairness
served in logical order

The Algorithms, in Increasing Subtlety

Mutual exclusion — costs & characteristics
Algorithm | Class | Messages / CS | Notes
Centralized coordinator | token at a master | 3 | SPOF; trivial fairness
Lamport | permission, non-token | 3 (N − 1) | REQUEST · REPLY · RELEASE
Ricart & Agrawala | permission, non-token | 2 (N − 1) | Reply replaces release
Token ring | token, circulating | 1 to ∞ | No starvation; token loss is hard
Suzuki–Kasami | token, broadcast request | 0 or N | Counter-based
Decentralized voting | quorum | 2m · N | Starvation possible

Lamport's Mutex

Every site keeps a local request queue ordered by timestamp. Channels must be FIFO. Three message types — REQUEST, REPLY, RELEASE — carry the protocol.

Entry Conditions · L1 and L2
L1. Site Si has received from every other site a message with timestamp greater than its request's timestamp.
L2. Its own request stands at the head of its local queue.

Ricart & Agrawala

The same idea, refined. The RELEASE message vanishes; its work is folded into the deferred REPLY. On receiving a competing request:

  1. If state is Held, queue the request.
  2. If state is Wanted and one's own (Ti, i) is lexicographically less than the incoming (Tj, j), queue the request.
  3. Otherwise, send REPLY at once.

Enter the critical section when REPLY has been received from all N − 1 others. On exit, send REPLY to all queued requests.
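The receiver's three-way decision can be sketched as a pure function. The state names and the function name `on_request` are illustrative; requests are (timestamp, site id) pairs, and Python's tuple comparison supplies the lexicographic order.

```python
def on_request(state, own_request, incoming):
    """Return 'defer' or 'reply' for an incoming (timestamp, site id) request.

    state is 'released', 'wanted' or 'held'; own_request is this site's
    (timestamp, site id) pair, or None when it is not competing.
    """
    if state == "held":
        return "defer"
    if state == "wanted" and own_request < incoming:  # lexicographic tuple order
        return "defer"
    return "reply"


print(on_request("wanted", (3, 1), (3, 2)))  # defer: equal stamps, lower id wins
print(on_request("wanted", (5, 1), (3, 2)))  # reply: the incoming request is older
```

Deferred requests are the ones that receive their REPLY on exit from the critical section.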

Suzuki–Kasami

The single token holds an array LN[] (the latest request number serviced for each site). Each site holds RN[] (the highest request number seen). To request entry, a site broadcasts REQUEST(i, n) with n = RNi[i] + 1. The token holder forwards the token to Pj exactly when RN[j] = LN[j] + 1.

Performance Metrics

Message complexity
messages per CS
Sync. delay
exit → next entry
Response time
request → finish
Throughput
1 / (SD + E)

Election — Who Shall Preside

Many algorithms presume a coordinator. When one fails, an election must produce another. Three classical strategies follow.

The Bully Algorithm

The biggest live process wins. When a process notices the coordinator has gone silent, it sends ELECTION to every higher-numbered process. If none replies, it declares itself; otherwise, it stands down.

[Figure 5.1: processes P₁ to P₅; P₅ has crashed. P₂ sends ELECTION to {P₃, P₄}; P₄ has no live higher process and wins.]
fig. 5.1 A Bully election after P5's death. P2 initiates; higher-numbered processes participate; P4, finding no higher process responding, declares itself coordinator.
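A simplified sketch of the outcome follows. It serializes the election (one higher process takes over at a time), whereas in the protocol every higher process that answers starts its own round; the winner is the same. The function name `bully_election` is illustrative.

```python
def bully_election(initiator, alive):
    """Return the coordinator chosen when `initiator` starts an election.

    alive is the set of process ids still responding.
    """
    candidate = initiator
    while True:
        higher = [p for p in alive if p > candidate]
        if not higher:
            return candidate      # nobody higher answered: declare victory
        candidate = min(higher)   # a higher process takes over the election


print(bully_election(2, alive={1, 2, 3, 4}))  # 4, as in fig. 5.1
```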

Ring & Chang–Roberts

Processes are arranged in a logical ring; messages flow in one direction. In the classical ring algorithm, an ELECTION message accumulates every visited process's id; when it returns to the originator, the maximum id wins, and a second pass announces the coordinator.

Chang–Roberts is uniform — the number of processes need not be known. Each node sends its id to the left. On receipt, if the incoming id exceeds one's own, forward; if less, discard; if equal to one's own, declare oneself the leader. Cost: O(N²) worst, O(N log N) average.

[Figure 5.2: ring of nodes 1 to 8; node 8 receives its own id and becomes leader.]
fig. 5.2 Chang–Roberts on a ring of eight. Each id propagates leftward, surviving only as long as it exceeds the receiver. The highest id is the only message that ever completes a full circuit.
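The forwarding rule and its message cost can be simulated in rounds. This is a sketch under simplifying assumptions: every node starts the election at once, ids are unique, and `ring` lists the ids in the direction messages travel. The function name `chang_roberts` is illustrative.

```python
def chang_roberts(ring):
    """Simulate one election on a unidirectional ring; return (leader, messages)."""
    n = len(ring)
    in_flight = [(i, ring[i]) for i in range(n)]  # (current position, id carried)
    messages = 0
    while in_flight:
        next_round = []
        for pos, ident in in_flight:
            receiver = (pos + 1) % n
            messages += 1
            if ident == ring[receiver]:
                return ident, messages                 # own id came back: leader
            if ident > ring[receiver]:
                next_round.append((receiver, ident))   # forward larger ids
            # smaller ids are discarded
        in_flight = next_round


print(chang_roberts([1, 2, 3, 4, 5, 6, 7, 8]))  # (8, 15)
print(chang_roberts([8, 7, 6, 5, 4, 3, 2, 1]))  # (8, 36)
```

With ids increasing along the direction of travel, every id except the largest dies after one hop (2N − 1 messages); with ids decreasing, id k survives k hops, giving N(N + 1)/2 messages, the quadratic worst case.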

Hirschberg–Sinclair

On a bidirectional ring. In phase r, a candidate probes both directions to distance 2ʳ; only candidates that are the largest in their 2ʳ-neighbourhood survive to phase r + 1. Cost: O(N log N) messages in the worst case, better than Chang–Roberts; bidirectional links are the price.

VI · Week 07

Global State & the Chandy–Lamport Snapshot

A photograph of a distributed system must capture every process and every channel — without halting either. Chandy & Lamport's algorithm achieves this by inserting markers into the flow of ordinary messages.

The global state is the union of every process's local state and every channel's contents (messages in flight). The naive approach — synchronize clocks, ask each process to record at the same wall-clock instant — fails twice: clock skew makes the instant fuzzy, and the in-flight messages go unobserved.

Why One Records the State

Checkpoint
restart after failure
Garbage
unreachable objects
Deadlock
detect cycles in waits
Termination
is the job done
Assumptions · Channels are FIFO; no failures of processes or channels; messages arrive intact and exactly once. Subsequent work relaxes these; the original argument requires them.

The Algorithm

At the initiator Pi
  1. Record local state Si.
  2. Send a Marker on each outgoing channel.
  3. Begin recording every incoming channel.
On receiving a Marker on Ck → i

If this is the first marker Pi has seen: record own state; mark Ck → i as empty; send markers on all outgoing channels; begin recording every other incoming channel.

Otherwise: the state of Ck → i is precisely the sequence of messages received on that channel since recording began.

[Figure 6.1: P₁ records S₁ and sends MARKER; P₂ and P₃ record S₂ and S₃ on first receipt; subsequent markers close channels; recorded messages = channel state.]
fig. 6.1 The Chandy–Lamport propagation. P1 initiates; the marker fans across every outgoing channel. Each first receipt triggers a process to record its own state and forward markers in turn. Subsequent marker arrivals close their channels with whatever messages had arrived since recording began.
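The marker rules can be traced on a two-process toy system that passes money over FIFO channels. Everything here is an illustrative assumption: the `Process` class, the balances, and the hand-scripted message schedule. A correct snapshot should account for the full 150 units, whether they sit in a process or in a channel.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, name, balance):
        self.name, self.balance = name, balance
        self.out = {}            # peer name -> deque (FIFO outgoing channel)
        self.recorded = None     # local state captured by the snapshot
        self.recording = {}      # incoming peer -> messages logged so far
        self.channel_state = {}  # incoming peer -> recorded channel contents

    def send(self, peer, amount):
        self.balance -= amount
        self.out[peer].append(amount)

    def start_snapshot(self, incoming_peers):
        self.recorded = self.balance
        for channel in self.out.values():
            channel.append(MARKER)
        self.recording = {peer: [] for peer in incoming_peers}

    def receive(self, sender, incoming_peers):
        msg = sender.out[self.name].popleft()
        if msg == MARKER:
            if self.recorded is None:              # first marker seen
                self.start_snapshot(incoming_peers)
            # The marker closes this channel with whatever was logged on it.
            self.channel_state[sender.name] = self.recording.pop(sender.name)
        else:
            self.balance += msg
            if sender.name in self.recording:      # still recording this channel
                self.recording[sender.name].append(msg)


p, q = Process("P", 100), Process("Q", 50)
p.out["Q"], q.out["P"] = deque(), deque()
p.send("Q", 10)            # 10 in flight on P->Q
p.start_snapshot(["Q"])    # P records 90; the marker follows the 10
q.send("P", 5)             # Q has not seen the marker yet
q.receive(p, ["P"])        # Q gets the 10
q.receive(p, ["P"])        # Q gets the marker: records 55, P->Q is empty
p.receive(q, ["Q"])        # P gets the 5 while recording Q->P
p.receive(q, ["Q"])        # P gets Q's marker: Q->P = [5]
print(p.recorded, q.recorded, p.channel_state, q.channel_state)
```

The recorded state is P = 90, Q = 55, with 5 in the Q→P channel and nothing in P→Q: 150 in total, although no single instant of the run looked exactly like this.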

The Consistent Cut

A cut partitions every process's events into "before" and "after". A cut is consistent when, for every event e in the cut, every event f with f → e is also in the cut. Equivalently: the cut never contains the receipt of a message without also containing its sending.
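The message-based form of the definition is straightforward to test. In this sketch a cut is given as the number of events included per process (a prefix of each timeline), and messages are described by their send and receive positions; the representation and the name `is_consistent` are assumptions made for illustration.

```python
def is_consistent(cut, messages):
    """cut[p] = number of events of process p inside the cut.

    messages is a list of (sender, send_index, receiver, recv_index) with
    1-based event indices. The cut is consistent when every receive inside
    it has its matching send inside it as well.
    """
    for sender, send_idx, receiver, recv_idx in messages:
        received_in_cut = recv_idx <= cut[receiver]
        sent_in_cut = send_idx <= cut[sender]
        if received_in_cut and not sent_in_cut:
            return False
    return True


msgs = [("P1", 2, "P2", 1)]  # P1's 2nd event sends; P2's 1st event receives
print(is_consistent({"P1": 2, "P2": 1}, msgs))  # True
print(is_consistent({"P1": 1, "P2": 1}, msgs))  # False: received but not sent
print(is_consistent({"P1": 2, "P2": 0}, msgs))  # True: message is in flight
```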

Theorem · Chandy–Lamport
Every state recorded by the algorithm corresponds to a consistent cut of the global execution — even though the recording was not synchronous.
VII · Week 10

Hadoop & the MapReduce Paradigm

Storage and compute, both at the scale of a warehouse, designed on commodity hardware around the certainty of failure. The result is a programming model so spare it can be taught in two functions.

Hadoop is the open-source rendering of Google's MapReduce, born from Doug Cutting's effort to scale the Nutch search engine in 2005. Its philosophy: any data will fit; failure is the norm; compute moves to the data, not the reverse.

The Architecture, in Two Layers

[Figure 7.1: master node (NameNode · JobTracker) above slaves 1 … N (each DataNode · TaskTracker) holding blocks B₁ to B₅; each block replicated ×3.]
fig. 7.1 HDFS topology. A single NameNode holds the file-to-block-to-DataNode map. Blocks (64 or 128 MB) are replicated three times by default, spread across the cluster.

The Compute Layer · MapReduce

[Figure 7.2: Input (HDFS) → Map ×3 → Shuffle + Sort → Reduce ×2 → Output (HDFS); data flows as <K, V> → <K, [V]> → <K′, V′>.]
fig. 7.2 The MapReduce pipeline. Mappers emit key-value pairs; the shuffle phase groups by key; reducers process one key and its list of values. The programmer provides only the map and reduce functions.
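The division of labour can be mimicked in a single process. The sketch below is a word count written as a map function and a reduce function, with a tiny in-memory stand-in for the shuffle; it illustrates the programming model only and uses no Hadoop API. All names are illustrative.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Emit <word, 1> for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Collapse <word, [1, 1, ...]> into <word, total>."""
    return word, sum(counts)


def run_job(lines):
    grouped = defaultdict(list)               # shuffle: group values by key
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            grouped[key].append(value)
    # sort by key, then hand each key and its value list to the reducer
    return dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))


print(run_job(["the quick fox", "the lazy dog", "The end"]))
```

Only `map_fn` and `reduce_fn` correspond to what the programmer writes; `run_job` plays the part of the framework.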

Hadoop versus Relational Databases

Aspect | RDBMS | Hadoop
Schema | Fixed; ACID | Schema on read
Mode | Read & write | Mostly read
Hardware | Expensive servers | Commodity
Failures | Rare | Normal
Work unit | Transaction | Job

PageRank, As MapReduce

PR(p) = (1 − d) / N + d · Σ PR(q) / L(q)   for q ∈ in-links(p)

The damping factor d (commonly 0.85) models the probability that a random surfer follows a link rather than jumping. PageRank is iterated until convergence; in MapReduce, mappers emit contributions to each out-neighbour, reducers sum incoming contributions.

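PageRank · four pages · d = 0.5 · initial PR = 1 (the values follow the unnormalized form PR(p) = (1 − d) + d · Σ PR(q) / L(q), so the ranks sum to 4 at every iteration shown)
Page | Iter 0 | Iter 1 | Iter 2 | Rank
A | 1.000 | 1.500 | 1.500 | 1
B | 1.000 | 1.250 | 1.375 | 2
D | 1.000 | 0.750 | 0.625 | 3
C | 1.000 | 0.500 | 0.500 | 4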
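The iteration can be reproduced in a few lines. Two assumptions are made here and should be read as such. First, the table's numbers match the unnormalized update (1 − d) + d · Σ PR(q) / L(q), not the (1 − d) / N form, so that is what the sketch uses. Second, the notes do not give the link graph; `LINKS` below is a hypothetical structure chosen only because it reproduces the table.

```python
def pagerank_step(ranks, out_links, d):
    """One synchronous update of the unnormalized form (1 - d) + d * sum."""
    incoming = {page: 0.0 for page in ranks}
    for q, targets in out_links.items():   # "map": emit PR(q) / L(q) per out-link
        for p in targets:
            incoming[p] += ranks[q] / len(targets)
    return {p: (1 - d) + d * incoming[p] for p in ranks}  # "reduce": sum per page


# Hypothetical link structure (an assumption, not taken from the notes).
LINKS = {"A": ["B"], "B": ["A"], "C": ["B", "D"], "D": ["A"]}

ranks = {page: 1.0 for page in LINKS}
history = [ranks]
for _ in range(2):
    ranks = pagerank_step(ranks, LINKS, d=0.5)
    history.append(ranks)
print(history[1])
print(history[2])
```

The inner loop is what each mapper would emit for one page, and the final dictionary comprehension is the per-key sum a reducer performs.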
VIII · Week 11

Paxos & the Politics of Consensus

When multiple processes propose values, and a majority must agree on exactly one, Paxos provides the protocol. It is among the simplest distributed algorithms — and, famously, among the most misunderstood.

The Three Roles

Proposer
drives consensus
Acceptor
votes on proposals
Learner
announces outcome

In practice, each node plays all three; the roles are logical separations within a single process.

Required Properties

Concurrent proposals | More than one proposer may act at once.
Validity | The chosen value must be one that was proposed, not invented.
Majority rule | To tolerate m failures the protocol requires N = 2m + 1 acceptors.
Unicast | No reliance on atomic multicast.

The Two Phases

[Figure 8.1: Proposer, Acceptors (majority), Learner. Phase 1: PREPARE(N), then PROMISE(N, accepted U). Phase 2: ACCEPT(N, V), then ACCEPTED.]
fig. 8.1 Paxos in two phases. Prepare and Promise establish the right to lead; Accept and Accepted establish the value. If any Promise reports a previously accepted value, the proposer must re-propose that value (not its own preferred one); this is what keeps an already chosen value from being displaced.

Phase 1 · Prepare & Promise

  1. The proposer chooses a unique, monotonically increasing identifier N — typically counter.pid.
  2. It sends Prepare(N) to at least a majority of acceptors.
  3. An acceptor: if N exceeds every identifier it has seen, replies Promise(N, U) where U is the highest-numbered proposal already accepted (or none); it vows to refuse anything lower than N. Otherwise, it ignores the request.

Phase 2 · Accept & Accepted

  1. If the proposer receives Promises from a majority:
  2. If any Promise reported an accepted value U, the proposer must propose the value of the highest-numbered such U; agreement demands it, since that value may already have been chosen.
  3. Otherwise, it proposes its own value V.
  4. It sends Accept(N, V) to the majority.
  5. An acceptor accepts unless it has, since, promised a higher number.
  6. When a majority accepts, consensus is reached and learners are notified.
Result · Fault tolerance
To tolerate m simultaneous failures, deploy N = 2m + 1 acceptors. Five acceptors survive two; three acceptors survive one.

A Worked Contention — NADRA CNIC Update

Five acceptors R1…R5. Two district offices propose simultaneously: P1 with id 15.1 ("Update to Islamabad"), P2 with id 15.2 ("Update to Karachi").

  1. P2's Prepare(15.2) arrives first; all acceptors promise (nothing previously accepted).
  2. P1's Prepare(15.1) is now too low and is rejected.
  3. P2's Accept(15.2, Karachi) is accepted by a majority. Consensus on Karachi.
  4. P1 retries with id 16.1. Promises now report "Karachi" as accepted — P1 must re-propose Karachi.
  5. R3 crashes: four acceptors remain — majority of three still attainable. R3 and R4 crash: three remain — majority still possible. A third failure halts progress; correctness is preserved.
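The contention can be replayed with a minimal single-decree sketch. It is a simplification under stated assumptions: no crashes, no message loss, and each call to `propose` runs both phases back to back, so P1's rejected Prepare(15.1) lands after P2's Accept, not between P2's two phases. Proposal ids are modelled as (counter, pid) tuples; the `Acceptor` class and `propose` function are illustrative names.

```python
class Acceptor:
    def __init__(self):
        self.promised = None   # highest proposal id promised so far
        self.accepted = None   # (id, value) of the last accepted proposal

    def prepare(self, n):
        """Phase 1: promise if n beats every id seen; report any accepted proposal."""
        if self.promised is None or n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return None            # too low: ignored

    def accept(self, n, value):
        """Phase 2: accept unless a higher id has been promised since."""
        if self.promised is None or n >= self.promised:
            self.promised, self.accepted = n, (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run both phases against all acceptors; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    promises = [r for r in (a.prepare(n) for a in acceptors) if r is not None]
    if len(promises) < majority:
        return None
    prior = [acc for _, acc in promises if acc is not None]
    if prior:
        # Must adopt the value of the highest-numbered accepted proposal.
        value = max(prior, key=lambda proposal: proposal[0])[1]
    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes >= majority else None


acceptors = [Acceptor() for _ in range(5)]           # R1 ... R5
first = propose(acceptors, (15, 2), "Karachi")       # P2 wins
second = propose(acceptors, (15, 1), "Islamabad")    # P1's id is too low
third = propose(acceptors, (16, 1), "Islamabad")     # P1 retries, must adopt Karachi
print(first, second, third)  # Karachi None Karachi
```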
IX · Week 12

Fault Tolerance & Redundancy

Failure is not exceptional; it is the working condition. The discipline of fault tolerance is the discipline of building systems whose correctness survives the inevitable misbehaviour of their parts.

The Chain

Fault (root cause) → Error (incorrect state) → Failure (service lost)
fig. 9.1 The fault chain. A latent fault produces an erroneous internal state, which in turn produces an externally visible failure.

By Duration

Transient | Appears once and is gone. A cosmic-ray bit flip.
Intermittent | Recurrent and unpredictable. A loose cable.
Permanent | Persists until repair. A dead disk.

By Behaviour

Fail-silent (fail-stop)

The component produces no output, or stops. Easy to detect — the absence of signal is itself a signal.

Byzantine

The component produces wrong, arbitrary, or malicious output indistinguishable from correct. Hard to detect. Requires either trust, attestation, or quorum.

Strategies of Handling

Prevention | Avoid the fault at the source. | Careful engineering
Tolerance | Mask the fault when it occurs. | Redundancy, voting
Removal | Reduce frequency through correction. | Testing, patches
Forecasting | Estimate future incidence. | Monitoring, telemetry

Three Kinds of Redundancy

Information
parity · Hamming · ECC
Time
retry on failure
Physical
backup hardware · replicas

Triple Modular Redundancy

[Figure 9.2: Module A (correct), Module B (faulty) and Module C (correct) feed a Voter.]
fig. 9.2 Triple Modular Redundancy: three identical modules feed a majority voter. The output remains correct so long as no more than one module fails. The pattern scales to N-Modular Redundancy.
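The voter itself is a one-liner in spirit. A minimal sketch, with the function name `majority_vote` and the sample outputs chosen for illustration:

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


print(majority_vote([42, 17, 42]))  # 42: the single faulty module is outvoted
print(majority_vote([1, 2, 3]))     # None: no majority, the fault is not masked
```

The same function serves N-Modular Redundancy unchanged, since the threshold is computed from the number of outputs.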

Primary–Backup

The primary alone handles requests. The backup observes heartbeats; on the primary's silence, it assumes the role. The model trades double the hardware for continuous availability.

Availability, By the Nines

Nines | Annual downtime | Roughly
99% | 3.65 days | internal tools
99.9% | 8.76 hours | web applications
99.99% | 52 minutes | financial services
99.999% | 5.26 minutes | telephony, payments
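The downtime column follows from one multiplication. A short sketch, assuming a 365-day year (the function name is illustrative):

```python
def annual_downtime_minutes(availability_percent):
    """Minutes of downtime per 365-day year at the given availability."""
    return (1 - availability_percent / 100) * 365 * 24 * 60


for nines in (99.0, 99.9, 99.99, 99.999):
    minutes = annual_downtime_minutes(nines)
    print(f"{nines}%: {minutes:.2f} min = {minutes / 60:.2f} h = {minutes / 1440:.3f} days")
```

Each additional nine divides the allowed downtime by ten.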
X · Week 13

CAP & the Eventual

A theorem about three desirable properties — consistency, availability, partition tolerance — states that one may have at most two. Since partitions are inevitable, the practical choice is between consistency and availability. Eric Brewer, 2000.

ACID, Recalled

Atomic
all or nothing
Consistent
invariants kept
Isolated
no interference
Durable
commit is forever

Two-Phase Commit

[Figure 10.1: Coordinator and Participants. Phase 1: PREPARE, then VOTE (COMMIT / ABORT). Phase 2: GLOBAL COMMIT, then ACK.]
fig. 10.1 Two-Phase Commit. Prepare gathers votes; Commit (or Abort) enforces the decision. Its weakness is the blocking problem: if the coordinator dies between phases, participants hold their locks indefinitely.

The CAP Theorem

[Figure 10.2: triangle with corners Consistency, Availability, Partition tolerance. Edges: CP (banking · ZK), AP (Cassandra), CA (single node).]
fig. 10.2 The CAP triangle. Network partitions will occur. The honest choice, in a real distributed system, is therefore between the CP corner (refuse minority writes) and the AP corner (accept divergent writes, reconcile later).

BASE — the Alternative to ACID

Basically
always available
Soft state
may shift unprompted
Eventual
replicas converge in time

Conflict Resolution

Last writer wins | Decide by timestamp. Simple; may discard real work.
Vector clocks | Surface genuine conflicts and defer to the application.
CRDTs | Conflict-free replicated data types; merge automatically. Examples: G-Counter, OR-Set.

Quorum Arithmetic

With N replicas, W required for write, R required for read:

Quorum overlap
W + R > N  ⟺  strong consistency. The write set and read set must share at least one node.
[Figure 10.3: write quorum and read quorum overlapping in at least one node; W + R > N.]
fig. 10.3 Quorum overlap. The intersection guarantees that any reader, looking at R nodes, will encounter at least one node that participated in the latest write.
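The overlap test is a single comparison. The sketch below also reports the number of crashes survived using the rule N − max(W, R) from the appendix; the function name `quorum_report` is illustrative.

```python
def quorum_report(n, w, r):
    """Return (overlap_guaranteed, crashes_tolerated) for an N / W / R setting."""
    return w + r > n, n - max(w, r)


print(quorum_report(3, 2, 2))  # (True, 1)
print(quorum_report(5, 3, 3))  # (True, 2)
print(quorum_report(3, 1, 1))  # (False, 2): fast, but reads may miss the latest write
```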

Protocols, Compared

Protocol | Strength | Weakness | Use case
2PC | Strict ACID | Blocking | Local transactions
Paxos · Raft | Strong consistency | Latency | Distributed ledger
Eventual | Massive scale | Temporary divergence | Feeds, carts
Quorum | Tunable | Complex to operate | NoSQL
XI · Week 14

Distributed Intelligence at Scale

Training a model of a hundred billion parameters is a problem of distributed systems first, machine learning second. The compute fits on no single device; the cluster must agree, partition, replicate, and survive failure on the scale of a small data centre.

The Arithmetic of Memory

For N parameters in FP16 with an Adam optimizer, the per-parameter bookkeeping is rigid:

Component | Bytes per parameter | 30 B model | 175 B model
Weights · FP16 | 2 | 60 GB | 350 GB
Gradients · FP16 | 2 | 60 GB | 350 GB
Adam states · FP32 (master + m + v) | 12 | 360 GB | 2.1 TB
Total | 16 | 480 GB | 2.8 TB
Training FLOPs ≈ 6 · N · D (2 forward + 4 backward per parameter per token); N parameters, D tokens
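The table is pure multiplication, which a short sketch makes explicit. The byte counts are those listed above; GB is taken as 10⁹ bytes, and the function names are illustrative.

```python
BYTES_PER_PARAM = {"weights_fp16": 2, "gradients_fp16": 2, "adam_fp32": 12}


def training_memory_gb(params_billions):
    """Per-component memory in GB (10**9 bytes) for FP16 training with Adam."""
    budget = {name: b * params_billions for name, b in BYTES_PER_PARAM.items()}
    budget["total"] = sum(budget.values())
    return budget


def training_flops(n_params, n_tokens):
    """Approximate training cost: 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


print(training_memory_gb(30))   # total: 480 GB
print(training_memory_gb(175))  # total: 2800 GB = 2.8 TB
```

These figures exclude activations, which depend on batch size and sequence length.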

Interconnect Hierarchy

Link | Bandwidth | Scope
NVLink | 600 GB/s | Intra-node · 8 GPUs
InfiniBand | 50–100 GB/s | Inter-node
PCIe Gen 4/5 | 32–64 GB/s | GPU ↔ CPU

Three-Dimensional Parallelism

[Figure 11.1 panels: Data parallel (replicate model, split batch; all-reduce gradients) · Tensor parallel (split each layer, intra-node; requires NVLink) · Pipeline parallel (split layers L1-12, L13-24, L25-36; inter-node OK; pipeline bubbles). 3D parallelism = DP + TP + PP (Megatron-LM · DeepSpeed · industry standard for LLMs).]
fig. 11.1 The three axes of parallelism. Tensor parallelism, with its per-layer communication, must remain intra-node where NVLink furnishes the bandwidth. Pipeline parallelism, with its rarer transfers, tolerates inter-node links.

NCCL Primitives

All-Reduce | Reduce across all workers, result to all. The core of DP.
Broadcast | Rank 0 to every worker.
Scatter / Gather | Split a tensor / collect tensors.
Reduce-Scatter | Reduce, then distribute shards.
All-Gather | Collect shards from all workers into all.
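The relationship between three of these primitives can be shown with plain lists standing in for per-worker buffers. This is a semantic model only, with sum as the reduction: it does not call NCCL, the function names are illustrative, and it assumes the buffer length is divisible by the number of workers.

```python
def reduce_scatter(buffers):
    """Worker i ends up with the elementwise sum of shard i only."""
    n = len(buffers)
    shard = len(buffers[0]) // n
    return [[sum(buf[i * shard + k] for buf in buffers) for k in range(shard)]
            for i in range(n)]


def all_gather(shards):
    """Every worker ends up with the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def all_reduce(buffers):
    """All-Reduce expressed as Reduce-Scatter followed by All-Gather."""
    return all_gather(reduce_scatter(buffers))


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two workers, four gradient values each
print(reduce_scatter(grads))  # [[11, 22], [33, 44]]
print(all_reduce(grads))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
```

The decomposition is why sharded schemes can stop after the Reduce-Scatter step when each worker only needs its own shard of the gradients.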

FSDP & ZeRO

FSDP · Fully Sharded Data Parallel

Shards weights, gradients and optimizer state across GPUs. 1/N memory footprint relative to DDP. Forward: All-Gather weights for the current layer; compute; discard. Backward: All-Gather plus Reduce-Scatter for gradients.

ZeRO Stages · DeepSpeed

Stage 1: shard optimizer states (~4× saving).
Stage 2: shard optimizer + gradients (~8×).
Stage 3: shard everything (FSDP-equivalent).
Offload: push to CPU RAM or NVMe; ten times larger models, much slower.

Lesser Techniques That Buy A Lot

Mixed precision
BF16 · same FP32 range
Activation ckpt
+33% compute, ~5× less memory
DDP > DP
no GIL · overlap

Industry Notes

Llama-2 · 2 000 A100 · Hardware failures certain at scale. Frequent local checkpoints to NVMe; DCGM monitoring; Slurm auto-drain and replace. Sustained high utilization despite daily hardware drops.
BLOOM · The Straggler Problem · Synchronous All-Reduce travels at the speed of the slowest worker. NCCL_DEBUG profiling identified five faulty InfiniBand cables; cable replacement restored full throughput.
XII · Appendix

Formulas at a Glance

Every numeric rule worth memorizing, gathered. The kind of facts that lose marks when forgotten — message counts, fault tolerance bounds, memory budgets, bandwidth tiers — laid out for one last pass before walking in.

Time & Logical Clocks

Rule | Meaning
delay = ((T2−T1) + (T4−T3)) / 2 | Cristian's symmetric-path delay estimate
offset = T3 + delay − T4 | Correction applied to client clock
Cⱼ ← max(Cⱼ, ts(m)) + 1 | Lamport's receive rule
VCⱼ[k] ← max(VCⱼ[k], ts(m)[k]) ∀k | Vector clock receive · then VCⱼ[j]++
a → b ⟺ VC(a) < VC(b) | Causality test · componentwise & strict

Mutual Exclusion · Messages per CS

Algorithm | Messages
Centralized coordinator | 3
Lamport | 3 (N − 1)
Ricart & Agrawala | 2 (N − 1)
Token ring | 1 to ∞
Suzuki–Kasami | 0 (have token) or N (broadcast)
Decentralized voting | 2m · N
Throughput | 1 / (SD + E)

Election · Worst-Case Complexity

Algorithm | Messages
Bully | O(N²) · worst case
Ring | 2N messages · two passes
Chang–Roberts | O(N²) worst · O(N log N) avg
Hirschberg–Sinclair | O(N log N) · bidirectional

Chord DHT

Property | Value
Lookup complexity | O(log N)
Finger entry i | succ((n + 2ⁱ⁻¹) mod 2ᵐ)
Asymmetric distance | dist(A, B) = (B − A) mod 2ᵐ
Doubling network | +1 finger row
Finger table size | m rows for m-bit IDs

P2P · Gnutella Flooding

Quantity | Formula
Max query messages | b + b(b−1) + b(b−1)² + … + b(b−1)ᵀᵀᴸ⁻¹
For TTL = 7, b = 5 | 5 + 5·4 + 5·4² + … + 5·4⁶ · exponential
KaZaA leaves per SN | 60 – 150 · TTL typically 7

Hadoop & HDFS

Item | Value
Default replication factor | 3
Block size | 64 MB (older) or 128 MB
Storage for a file F | 3 × size(F)
PageRank | (1 − d) / N + d · Σ [ PR(q) / L(q) ]
Damping factor d | ≈ 0.85

Paxos

Rule | Value
Acceptors to tolerate m failures | N = 2m + 1
Majority of N | ⌊N / 2⌋ + 1
3 acceptors | tolerates 1 failure
5 acceptors | tolerates 2 failures
7 acceptors | tolerates 3 failures
Proposal ID | counter.pid · unique & monotonic

Fault Tolerance · Availability

Item | Value
TMR tolerates | 1 faulty module · majority vote
99 % | ~ 3.65 days / year
99.9 % | ~ 8.76 hours / year
99.99 % | ~ 52 minutes / year
99.999 % | ~ 5.26 minutes / year
Availability | uptime / (uptime + downtime)

CAP · Quorum Arithmetic

Rule | Meaning
W + R > N | Strong consistency · read & write sets overlap
Max failures = N − max(W, R) | How many crashes the cluster survives
QUORUM = ⌊N / 2⌋ + 1 | Cassandra-style majority
N = 3, W = 2, R = 2 | 4 > 3 ✓ · tolerates 1 failure
N = 5, W = 3, R = 3 | 6 > 5 ✓ · tolerates 2 failures

Distributed AI · The Numbers

Quantity | Value
Total memory per parameter · FP16 + Adam | 16 B
· Weights · FP16 | 2 B
· Gradients · FP16 | 2 B
· Adam states · FP32 (master + m + v) | 12 B (4 + 4 + 4)
Training FLOPs | ≈ 6 · N · D (2 fwd + 4 bwd)
NVLink | 600 GB/s · intra-node
InfiniBand | 50 – 100 GB/s · inter-node
PCIe Gen 4 / 5 | 32 – 64 GB/s · GPU ↔ CPU
Activation checkpoint | +33 % compute · ~5× less memory
ZeRO-1 / 2 / 3 saving | ~ 4× / 8× / 64× (stage 3 scales with the number of GPUs; 64× corresponds to 64 GPUs)
30 B model · total VRAM | ≈ 480 GB
175 B model · total VRAM | ≈ 2.8 TB
Universal Method · For any "calculate X" question: (1) identify the formula; (2) write it down before plugging numbers (this earns partial marks); (3) substitute carefully and show the step; (4) state the result with units; (5) name the rule in one short sentence ("by Lamport's protocol, total = 3 (N − 1)"). Examiners reward showing the formula and naming the rule, not just the final number.