A Field Manual on Parallel & Distributed Computing

I · Week 02

Architectures of the Distributed

Before time, before consensus, the first question: where do the components live, and how do they speak. A distributed system is a collection of independent computers that, by some sleight of middleware, appear to its user as a single coherent thing.

Two definitions sit at the root of the field. A decentralized system is one in which processes and resources are necessarily spread across multiple computers. A distributed system is one in which they are sufficiently spread — with the goal of presenting, to the user, the illusion of a single machine.

Two views guide its construction: the integrative, in which existing networked computers are knit into one larger system, and the expansive, in which an existing networked system is grown by addition of more computers.

Software versus System Architecture

Software architecture

The logical organization of components: their interfaces, the data they exchange, and the manner of their connection. A middleware layer that hides distribution is software architecture in the small.

System architecture

The physical realization: which component runs on which machine. A centralized client-server is one such system; a fully decentralized peer-to-peer mesh is another.

Four Styles, Plainly Stated

The principal architectural styles
Style	Premise	Canonical example
Layered	Components in a strict stack; only adjacent layers converse.	OSI, TCP/IP
Object-based	Objects encapsulate data and expose methods through well-defined interfaces.	Client–server, CORBA
Data-centred	Processes communicate by reading and writing a shared repository.	Database (passive); Blackboard (active)
Event-based	Communication by propagation of events. Often fused with data-centred to make shared data spaces.	Publish / subscribe; Kafka

Goals of the Design

Resource sharing

storage · files · media

Transparency

access · location · failure

Openness

interfaces · portability

Dependability

availability · reliability

Security

confidentiality · integrity

Scalability

size · geography · admin

Seven Transparencies

Transparency	Hides
Access	Differences in data representation and how an object is accessed.
Location	Where an object is located.
Migration	That an object may move to another location.
Relocation	That an object may be moved while in use.
Replication	That an object is replicated.
Concurrency	That an object may be shared by several users.
Failure	The failure and recovery of an object.

Delivery Semantics & Idempotency

Three contracts a delivery mechanism may keep:

At-most-once — zero or one delivery. Messages may be lost.
At-least-once — one or more deliveries. Messages may be duplicated.
Exactly-once — exactly one. Neither lost nor duplicated. The most expensive guarantee.

An operation is idempotent when performing it twice has the same effect as performing it once. READ X is idempotent; INCREMENT X is not. In an unreliable network, the temptation is to retransmit; the danger is in retransmitting that which is not safely repeated.

Fail-safe rule

Design APIs to be idempotent at the boundary. Then "at-least-once" plus "deduplicate-on-arrival" becomes, for free, "exactly-once" in effect.
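
The deduplicate-on-arrival idea can be sketched as follows. This is a minimal illustration, not a production design: the class name `DedupReceiver`, the message ids, and the unbounded `seen` set are all assumptions made for the example.

```python
class DedupReceiver:
    """Applies each message id at most once, so redelivery is harmless."""

    def __init__(self):
        self.seen = set()   # ids already applied (never expired in this sketch)
        self.counter = 0    # state changed by the non-idempotent INCREMENT

    def deliver(self, msg_id, amount):
        if msg_id in self.seen:
            return False    # duplicate: safe to acknowledge, not re-applied
        self.seen.add(msg_id)
        self.counter += amount
        return True


receiver = DedupReceiver()
# At-least-once delivery: "m1" arrives twice after a retransmission.
for msg_id, amount in [("m1", 1), ("m2", 1), ("m1", 1)]:
    receiver.deliver(msg_id, amount)
print(receiver.counter)  # 2
```

The sender may retransmit freely; the receiver turns INCREMENT into an operation that is safe to repeat.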

The Chord Ring — A Distributed Hash Table

Nodes are organized on a logical ring of 2^m positions. Each node has an m-bit identifier; each data item is hashed to an m-bit key. The item with key k is stored at the smallest node whose identifier is at least k — the successor of k.

Definition · Finger Table

For node n, the i-th finger points to the successor of (n + 2ⁱ⁻¹) mod 2^m, for indices i = 1 … m.

fig. 1.1 The Chord ring with eight nodes. Node N8 maintains six fingers; lookups proceed by repeated halving of the remaining arc, yielding logarithmic time. Distance is asymmetric: dist(A,B) = (B − A) mod 2^m.
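
The successor rule, the finger-table definition and the asymmetric distance can be checked with a short sketch. The eight node ids below are hypothetical (the figure's actual ids are not given in the text); m = 6 is chosen to match the six fingers mentioned in the caption.

```python
from bisect import bisect_left

M = 6                                    # identifier bits: 2**6 = 64 ring positions
NODES = [1, 8, 14, 21, 32, 38, 42, 51]   # hypothetical node ids, kept sorted


def successor(key):
    """Smallest node id >= key, wrapping past the top of the ring."""
    idx = bisect_left(NODES, key % 2 ** M)
    return NODES[idx % len(NODES)]


def finger_table(n):
    """Finger i (1..M) points at successor((n + 2**(i-1)) mod 2**M)."""
    return [successor(n + 2 ** (i - 1)) for i in range(1, M + 1)]


def dist(a, b):
    """Clockwise distance from a to b; not symmetric."""
    return (b - a) % 2 ** M


print(finger_table(8))            # [14, 14, 14, 21, 32, 42]
print(successor(54))              # 1 (wraps around)
print(dist(8, 14), dist(14, 8))   # 6 58
```

Each finger roughly doubles the reach of the previous one, which is where the logarithmic lookup bound comes from.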

Two- and Three-Tiered Realizations

Two-tier

A thin client does only the display work and leaves processing and data to the server. Easier to manage; performance loss at the client. A fat client moves processing and some data to the client: this reduces server load and scales further, but is harder to administer.

Three-tier

Presentation, processing, data; each a separate machine. Vertical distribution splits logical layers across machines; horizontal distribution replicates the same layer for load.

Cloud, Edge, Blockchain

Cloud computing is layered: hardware (the metal), infrastructure (virtualization), platform (e.g. S3-style buckets), application. Edge-server systems push servers to the network's boundary — closer to ISP, closer to the user. Blockchains are append-only chains of immutable, massively replicated blocks; their hard problem is not the chain but the question of who may append.

The Eight Fallacies

Many distributed systems are needlessly complex, repaired post-hoc. The recurring sins: the network is reliable; the network is secure; the network is homogeneous; the topology does not change; latency is zero; bandwidth is infinite; transport cost is zero; there is one administrator. None of these is true.

II · Week 03

Peer-to-Peer Systems

A history of the last twenty-five years told as a single argument: how to find a file when no one is in charge. Each system proposes an answer; each answer is broken by the next.

Every peer-to-peer system answers four primitive verbs: join the network, publish what one has, search for what one wants, fetch what one finds. The interesting differences live in the second and third.

Five Architectures, in Order of Their Defeats

fig. 2.1 Five P2P architectures arranged by historical succession. Each addresses the failure mode of its predecessor: Napster's single index; Gnutella's exponential flooding; KaZaA's brittle supernode election; Skype's NAT problem; BitTorrent's swarm coordination.

Comparison — the five canonical systems
System	Architecture	Search	Decentralization	Innovation
Napster	Centralized index	Central server	None	Easy UI; first scale
Gnutella	Pure P2P	Query flood with TTL	Full	No central authority
KaZaA	Hybrid · FastTrack	Through supernodes	Partial	Hierarchical search
Skype	Hybrid · supernodes + login	Supernode discovery	Partial	NAT traversal
BitTorrent	Hybrid · tracker / DHT	Out-of-band sites or DHT	High	Swarming, tit-for-tat

Gnutella in Detail

The protocol's five messages:

Ping	Probe the network for other peers.
Pong	Reply to Ping; carries an IP and port.
Query	Search request, propagated to neighbours until TTL expires.
QueryHit	Returned along the reverse path when a match is found.
Push	Used when the supplier sits behind a firewall.

Result · Flooding Explosion

With TTL = 7 and b = 5 neighbours per peer, the maximum number of query messages generated is 5 + 5·4 + 5·4² + … + 5·4⁶ — exponential. The price of full decentralization is bandwidth.
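
The series above is small enough to evaluate directly. A minimal sketch, where the function name `max_query_messages` is an assumption of this example:

```python
def max_query_messages(b, ttl):
    """b + b(b-1) + ... + b(b-1)**(ttl-1): first hop fans out to b, later hops to b-1."""
    return sum(b * (b - 1) ** hop for hop in range(ttl))


print(max_query_messages(5, 7))  # 27305
```

A single query can therefore generate more than twenty-seven thousand messages in the worst case.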

BitTorrent's Vocabulary

Torrent file	Metadata referring to a tracker.
Tracker	Server keeping account of swarm membership.
Seeder · Leecher	Peer with the complete file · peer still downloading.
Swarm	The peers sharing one file.
DHT	Replaces the tracker for fully decentralized discovery.
Tit-for-tat	The incentive mechanism. Uploaders earn priority as downloaders.

III · Week 04

Time, Clocks & the Order of Things

Two computers will never agree on the exact time. The question is whether they need to. The answer, almost always, is no — they need to agree only on the order of events that matter.

Cristian's Algorithm

The server is passive and carries an accurate clock. The client estimates round-trip delay and applies half of it as a correction. The four observed times are recorded:

Client A sends a request at T1.
Server B receives at T2 and replies at T3, piggybacking both.
Client A records the arrival at T4.

delay = ((T2 − T1) + (T4 − T3)) / 2

NTP repeats this eight times and takes the minimum-delay sample as its best estimate.
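
A single exchange can be worked through numerically. The sketch below uses the delay formula above together with the correction offset = T3 + delay − T4 from the formula appendix; the four timestamps are invented so that the client runs 8 units behind the server with a true one-way delay of 2.

```python
def cristian(t1, t2, t3, t4):
    """Return (one-way delay estimate, correction to add to the client clock)."""
    delay = ((t2 - t1) + (t4 - t3)) / 2
    offset = t3 + delay - t4
    return delay, offset


# Client sends at 100, server receives at 110 and replies at 112,
# client sees the reply at 106 on its own (slow) clock.
print(cristian(100, 110, 112, 106))  # (2.0, 8.0)
```

The estimate is exact here only because the two path delays were made equal; any asymmetry shows up as error in the offset.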

Berkeley alternative

When no machine has an accurate clock, an elected master polls slaves, averages, and sends offsets back. Sending an offset avoids fresh RTT uncertainty at the slave.

fig. 3.1 Cristian's protocol: four observations, one symmetric assumption about path delay, one corrected clock.

Lamport's Logical Clocks

Lamport's insight: processes that do not communicate need not agree on time at all. Among those that do, we need only agree on the order of events that touch them.

Definition · Happens-before

a → b when (i) a precedes b on the same process; or (ii) a is the sending of a message and b its receipt; or (iii) by transitivity.

Events with neither a → b nor b → a are concurrent. The clock rules then write themselves:

Before any event at process P_i: C_i ← C_i + 1.
On sending a message: stamp it with C_i.
On receiving a message m at P_j: C_j ← max(C_j, ts(m)) + 1.

fig. 3.2 A three-process Lamport trace. The receiver of m sets its clock to max(local, ts(m)) + 1; here, P₃'s clock advances from 0 to 5 upon receiving m₂ with stamp 4.
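
The three clock rules fit in a few lines. This is a minimal sketch; the class and method names are assumptions of the example.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Rule 1: increment before any local event."""
        self.time += 1
        return self.time

    def send(self):
        """Rule 2: sending is an event; the returned value is the message stamp."""
        return self.tick()

    def receive(self, stamp):
        """Rule 3: jump past the larger of the local clock and the stamp."""
        self.time = max(self.time, stamp) + 1
        return self.time


p3 = LamportClock()
print(p3.receive(4))  # 5, as in the trace: clock 0 receives stamp 4
```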

The Limitation

a → b implies C(a) < C(b), but the converse does not hold: a smaller Lamport stamp does not imply a causal relation. To distinguish concurrency from precedence, we need vectors.

Vector Clocks

Each process holds a vector VC_i[1…n]. Position i counts its own events; position j records what P_i knows about P_j's clock.

Before any event: VC_i[i] ← VC_i[i] + 1.
On send: attach VC_i.
On receive m: for each k, VC_j[k] ← max(VC_j[k], ts(m)[k]); then VC_j[j]++.

Property · Causality detected exactly

a → b ⟺ VC(a) < VC(b), where VC(a) < VC(b) means VC(a)[k] ≤ VC(b)[k] for every k and VC(a)[k] < VC(b)[k] for at least one k. If neither vector dominates, the events are concurrent.

fig. 3.3 Vector clocks make concurrency visible. P₁'s late event at (3,0,0) and P₃'s (2,2,2) are concurrent: neither dominates the other.
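
The comparison and the receive rule can be sketched as plain functions over tuples and lists. The function names are assumptions of this example; the two vectors tested are the ones named in the figure caption.

```python
def happened_before(a, b):
    """a -> b iff a <= b in every component and a < b in at least one."""
    pairs = list(zip(a, b))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)


def concurrent(a, b):
    """Distinct events are concurrent when neither vector dominates."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


def on_receive(local, stamp, j):
    """Receive rule at process j: componentwise max, then count the receive event."""
    merged = [max(x, y) for x, y in zip(local, stamp)]
    merged[j] += 1
    return merged


print(concurrent((3, 0, 0), (2, 2, 2)))      # True
print(on_receive([0, 0, 1], [2, 1, 0], 2))   # [2, 1, 2]
```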

Matrix Clocks — What Others Know

Each process maintains an n × n matrix. The row M_i[i] is the process's own vector clock, the principal vector. Other rows record what P_i knows of what others know. This is precisely the structure required to garbage-collect message logs, decide stable predicates, or run causal multicast protocols.

IV · Week 05

Synchronization & Ordered Multicast

If every replica must see the same updates in the same order — and updates do not commute — then logical clocks alone are insufficient. We must agree on a total order, even for concurrent events.

Motivating Problem

A bank account stands at $1,000. New York applies 1 % interest; San Francisco deposits $100. If NY runs first, the balance becomes $1,110; if SF runs first, $1,111. With unordered multicast, the two replicas diverge. The order must be the same everywhere.

Totally-Ordered Multicast — Extended Lamport

The sender stamps its update with its Lamport clock and multicasts to all processes, itself included.
Each receiver places the message in a local queue, sorted by timestamp (ties broken by sender id).
Each receiver replies with a timestamped acknowledgement.
A message is delivered to the application only when it stands at the head of the queue and has been acknowledged by every other process.
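
The effect of the ordering rule on the bank example can be shown in a few lines. This sketch covers only the sort by (timestamp, sender id); the acknowledgement step is omitted, and the timestamps are invented for the illustration.

```python
def interest(balance):       # New York: add 1 %
    return balance * 101 // 100


def deposit(balance):        # San Francisco: add $100
    return balance + 100


# (Lamport timestamp, sender id, operation); equal stamps force the tie-break.
NY = (3, "NY", interest)
SF = (3, "SF", deposit)


def apply_all(balance, updates):
    for _, _, op in updates:
        balance = op(balance)
    return balance


def deliver_in_total_order(balance, arrivals):
    ordered = sorted(arrivals, key=lambda u: (u[0], u[1]))
    return apply_all(balance, ordered)


# Arrival order differs per replica, so unordered delivery diverges.
print(apply_all(1000, [NY, SF]), apply_all(1000, [SF, NY]))   # 1110 1111
# Sorting by (timestamp, sender id) gives both replicas the same result.
print(deliver_in_total_order(1000, [NY, SF]),
      deliver_in_total_order(1000, [SF, NY]))                 # 1110 1110
```

Which order wins is arbitrary; what matters is that every replica picks the same one.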

Ordering — the Distinctions

Causal

a → b means a may affect b

Partial order

only causally related

Total order

all events, with tie-break

Schiper–Eggli–Sandoz

For causal ordering when broadcast is unavailable. Each message carries a vector of "what this sender has sent to each other process". The receiver delivers only when its own state shows all causally prior messages have arrived. Trades message size for the absence of broadcast; clock advances only on receive.

Matrix Clocks — redux

Stronger than vector clocks where the application needs to know what other processes know. On any process, the principal vector is componentwise greater than or equal to every non-principal vector.

V · Week 06

Mutual Exclusion & Election

A shared resource demands one user at a time. Across a network, without a shared semaphore, this becomes a problem of agreement, of message economy, of failure handling. Then: who shall preside?

Three Requirements

Safety

at most one in CS

Liveness

every request eventually granted

Fairness

served in logical order

The Algorithms, in Increasing Subtlety

Mutual exclusion — costs & characteristics
Algorithm	Class	Messages / CS	Notes
Centralized coordinator	token at a master	3	SPOF; trivial fairness
Lamport	permission, non-token	3 (N − 1)	REQUEST · REPLY · RELEASE
Ricart & Agrawala	permission, non-token	2 (N − 1)	Reply replaces release
Token ring	token, circulating	1 to ∞	No starvation; token loss is hard
Suzuki–Kasami	token, broadcast request	0 or N	Counter-based
Decentralized voting	quorum	2m · N	Starvation possible

Lamport's Mutex

Every site keeps a local request queue ordered by timestamp. Channels must be FIFO. Three message types — REQUEST, REPLY, RELEASE — carry the protocol.

Entry Conditions · L1 and L2

L1. Site S_i has received from every other site a message with timestamp greater than its request's timestamp.
L2. Its own request stands at the head of its local queue.

Ricart & Agrawala

The same idea, refined. The RELEASE message vanishes; its work is folded into the deferred REPLY. On receiving a competing request:

If state is Held, queue the request.
If state is Wanted and one's own (T_i, i) is lexicographically less than the incoming (T_j, j), queue the request.
Otherwise, send REPLY at once.

Enter the critical section when REPLY has been received from all N − 1 others. On exit, send REPLY to all queued requests.
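
The three-way decision on an incoming request reduces to one comparison of (timestamp, site id) pairs. A minimal sketch, with the state names written as strings and the function name `should_defer` assumed for the example:

```python
def should_defer(state, own_request, incoming_request):
    """Return True to queue the incoming request, False to REPLY at once.

    Requests are (timestamp, site_id) tuples, compared lexicographically.
    own_request is ignored unless state is "WANTED".
    """
    if state == "HELD":
        return True
    if state == "WANTED" and own_request < incoming_request:
        return True
    return False


print(should_defer("WANTED", (5, 1), (5, 2)))    # True: own request is older
print(should_defer("WANTED", (7, 1), (5, 2)))    # False: incoming request wins
print(should_defer("RELEASED", None, (5, 2)))    # False: not competing
```

Because the site id breaks timestamp ties, two sites can never each defer to the other.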

Suzuki–Kasami

The single token holds an array LN[] (the latest request number serviced for each site). Each site holds RN[] (the highest request number seen). To request entry, a site broadcasts REQUEST(i, n) with n = RN_i[i] + 1. The token holder forwards the token to P_j exactly when RN[j] = LN[j] + 1.

Performance Metrics

Message complexity

messages per CS

Sync. delay

exit → next entry

Response time

request → finish

Throughput

1 / (SD + E)

Election — Who Shall Preside

Many algorithms presume a coordinator. When one fails, an election must produce another. Three classical strategies follow.

The Bully Algorithm

The biggest live process wins. When a process notices the coordinator has gone silent, it sends ELECTION to every higher-numbered process. If none replies, it declares itself; otherwise, it stands down.

fig. 5.1 A Bully election after P₅'s death. P₂ initiates; higher-numbered processes participate; P₄, finding no higher process responding, declares itself coordinator.
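
The outcome of the election in the figure can be reproduced with a simplified sketch. It follows a single chain of hand-overs (each initiator yields to the lowest live higher-numbered process) rather than modelling every message; that simplification is an assumption of the example.

```python
def bully_election(alive, initiator):
    """Return the coordinator chosen when `initiator` starts an election."""
    candidate = initiator
    while True:
        higher = sorted(p for p in alive if p > candidate)
        if not higher:
            return candidate      # nobody higher replied: declare oneself
        candidate = higher[0]     # a higher process takes over the election


# P5 has died; P2 notices and starts the election.
print(bully_election({1, 2, 3, 4}, 2))  # 4
```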

Ring & Chang–Roberts

Processes are arranged in a logical ring; messages flow in one direction. In the classical ring algorithm, an ELECTION message accumulates every visited process's id; when it returns to the originator, the maximum id wins, and a second pass announces the coordinator.

Chang–Roberts is uniform — the number of processes need not be known. Each node sends its id to the left. On receipt, if the incoming id exceeds one's own, forward; if less, discard; if equal to one's own, declare oneself the leader. Cost: O(N²) worst, O(N log N) average.

fig. 5.2 Chang–Roberts on a ring of eight. Each id propagates leftward, surviving only as long as it exceeds the receiver. The highest id is the only message that ever completes a full circuit.
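
A round-based simulation makes the message counts concrete. This is a sketch under the assumption that every node starts at once and that the list order is the direction of travel; the function name is hypothetical.

```python
def chang_roberts(ring):
    """ring: unique ids in the order messages travel. Returns (leader, messages)."""
    n = len(ring)
    in_flight = [(pos, ring[pos]) for pos in range(n)]  # every node sends its own id
    messages = 0
    leader = None
    while in_flight:
        forwarded = []
        for pos, ident in in_flight:
            dest = (pos + 1) % n
            messages += 1
            if ident > ring[dest]:
                forwarded.append((dest, ident))   # larger id survives
            elif ident == ring[dest]:
                leader = ident                    # own id came back: elected
            # a smaller id is discarded
        in_flight = forwarded
    return leader, messages


print(chang_roberts([1, 2, 3, 4]))  # (4, 7)   best case, 2N - 1
print(chang_roberts([4, 3, 2, 1]))  # (4, 10)  worst case, N(N + 1) / 2
```

The worst case arises when ids decrease in the direction of travel, so each id travels as far as possible before meeting a larger one.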

Hirschberg–Sinclair

On a bidirectional ring. In phase r, a candidate probes both directions to distance 2^r; only those who are the largest in their 2^r-neighbourhood survive to phase r+1. Cost: O(N log N) — better than Chang–Roberts, the price of bidirectionality.

VI · Week 07

Global State & the Chandy–Lamport Snapshot

A photograph of a distributed system must capture every process and every channel — without halting either. Chandy & Lamport's algorithm achieves this by inserting markers into the flow of ordinary messages.

The global state is the union of every process's local state and every channel's contents (messages in flight). The naive approach — synchronize clocks, ask each process to record at the same wall-clock instant — fails twice: clock skew makes the instant fuzzy, and the in-flight messages go unobserved.

Why One Records the State

Checkpoint

restart after failure

Garbage

unreachable objects

Deadlock

detect cycles in waits

Termination

is the job done

Assumptions

Channels are FIFO; no failures of processes or channels; messages arrive intact and exactly once. Subsequent work relaxes these; the original argument requires them.

The Algorithm

At the initiator P_i

Record local state S_i.
Send a Marker on each outgoing channel.
Begin recording every incoming channel.

On receiving a Marker on C_{k → i}

If this is the first marker P_i has seen: record own state; mark C_{k → i} as empty; send markers on all outgoing channels; begin recording every other incoming channel.

Otherwise: the state of C_{k → i} is precisely the sequence of messages received on that channel since recording began.

fig. 6.1 The Chandy–Lamport propagation. P₁ initiates; the marker fans across every outgoing channel. Each first receipt triggers a process to record its own state and forward markers in turn. Subsequent marker arrivals close their channels with whatever messages had arrived since recording began.

The Consistent Cut

A cut partitions every process's events into "before" and "after". A cut is consistent when, for every event e in the cut, every event f with f → e is also in the cut. Equivalently: no message is received before it is sent.
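
The message form of the definition is easy to test mechanically. In this sketch a cut is represented as the number of events included from each process (a prefix of its history), and the single message is a made-up example.

```python
def is_consistent(cut, messages):
    """cut: process -> number of its events inside the cut.
    messages: ((sender, send_index), (receiver, recv_index)), indices 1-based."""
    for (sender, send_idx), (receiver, recv_idx) in messages:
        received_in_cut = recv_idx <= cut[receiver]
        sent_in_cut = send_idx <= cut[sender]
        if received_in_cut and not sent_in_cut:
            return False   # a receive without its send: inconsistent
    return True


# P1's 2nd event sends a message that P2's 1st event receives.
MESSAGES = [(("P1", 2), ("P2", 1))]

print(is_consistent({"P1": 1, "P2": 1}, MESSAGES))  # False
print(is_consistent({"P1": 2, "P2": 1}, MESSAGES))  # True
print(is_consistent({"P1": 2, "P2": 0}, MESSAGES))  # True: message is in flight
```

The third case is the one a snapshot must handle: the message belongs to the channel state, not to either process.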

Theorem · Chandy–Lamport

Every state recorded by the algorithm corresponds to a consistent cut of the global execution — even though the recording was not synchronous.

VII · Week 10

Hadoop & the MapReduce Paradigm

Storage and compute, both at the scale of a warehouse, designed on commodity hardware around the certainty of failure. The result is a programming model so spare it can be taught in two functions.

Hadoop is the open-source rendering of Google's MapReduce, born from Doug Cutting's effort to scale the Nutch search engine in 2005. Its philosophy: any data will fit; failure is the norm; compute moves to the data, not the reverse.

The Architecture, in Two Layers

fig. 7.1 HDFS topology. A single NameNode holds the file-to-block-to-DataNode map. Blocks (64 or 128 MB) are replicated three times by default, spread across the cluster.

The Compute Layer · MapReduce

fig. 7.2 The MapReduce pipeline. Mappers emit key-value pairs; the shuffle phase groups by key; reducers process one key and its list of values. The programmer provides only the map and reduce functions.
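
The pipeline can be imitated in a single process to show how little the programmer writes. This is a word-count sketch in plain Python, not Hadoop's API; `map_fn`, `reduce_fn` and `run` are names chosen for the example.

```python
from collections import defaultdict


def map_fn(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(key, values):
    """Reducer: one key and the list of its values."""
    return key, sum(values)


def run(lines):
    groups = defaultdict(list)          # the shuffle: group values by key
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


print(run(["the quick fox", "the lazy dog", "The fox"]))
# {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

Only `map_fn` and `reduce_fn` are application code; `run` stands in for the framework.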

Hadoop versus Relational Databases

	RDBMS	Hadoop
Schema	Fixed; ACID	Schema on read
Mode	Read & write	Mostly read
Hardware	Expensive servers	Commodity
Failures	Rare	Normal
Work unit	Transaction	Job

PageRank, As MapReduce

PR(p) = (1 − d) / N + d · Σ PR(q) / L(q) for q ∈ in-links(p)

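The damping factor d (commonly 0.85) models the probability that a random surfer follows a link rather than jumping. PageRank is iterated until convergence; in MapReduce, mappers emit contributions to each out-neighbour, reducers sum incoming contributions. The worked table that follows starts every page at 1 and uses the unnormalized variant PR(p) = (1 − d) + d · Σ PR(q) / L(q), so its ranks sum to N rather than to 1.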

PageRank · four pages · d = 0.5 · initial PR = 1
Page	Iter 0	Iter 1	Iter 2	Rank
A	1.000	1.500	1.500	1
B	1.000	1.250	1.375	2
D	1.000	0.750	0.625	3
C	1.000	0.500	0.500	4
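
The table can be reproduced by iterating the formula. Two assumptions are needed, since the text states neither: the link structure in `LINKS` is one graph inferred to be consistent with the numbers, and the update uses the unnormalized form PR(p) = (1 − d) + d · Σ PR(q) / L(q), which is what the tabulated values (summing to 4) require.

```python
# Hypothetical link structure: page -> pages it links to.
LINKS = {"A": ["B"], "B": ["A"], "C": ["B", "D"], "D": ["A"]}


def pagerank_step(ranks, links, d):
    contributions = {page: 0.0 for page in links}
    for q, outs in links.items():          # "map": share PR(q) among out-links
        for p in outs:
            contributions[p] += ranks[q] / len(outs)
    # "reduce": sum the incoming shares and apply damping
    return {p: (1 - d) + d * contributions[p] for p in links}


ranks = {page: 1.0 for page in LINKS}
history = [ranks]
for _ in range(2):
    ranks = pagerank_step(ranks, LINKS, d=0.5)
    history.append(ranks)

for page in "ABDC":
    print(page, [history[i][page] for i in range(3)])
# A [1.0, 1.5, 1.5]
# B [1.0, 1.25, 1.375]
# D [1.0, 0.75, 0.625]
# C [1.0, 0.5, 0.5]
```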

VIII · Week 11

Paxos & the Politics of Consensus

When multiple processes propose values, and a majority must agree on exactly one, Paxos provides the protocol. It is among the simplest distributed algorithms — and, famously, among the most misunderstood.

The Three Roles

Proposer

drives consensus

Acceptor

votes on proposals

Learner

announces outcome

In practice, each node plays all three; the roles are logical separations within a single process.

Required Properties

Concurrent proposals	More than one proposer may act at once.
Validity	The chosen value must be one that was proposed — not invented.
Majority rule	To tolerate m failures the protocol requires N = 2m + 1 acceptors.
Unicast	No reliance on atomic multicast.

The Two Phases

fig. 8.1 Paxos in two phases. Prepare and Promise establish the right to lead; Accept and Accepted establish the value. Safety requires that if any Promise reports a previously accepted value, the proposer re-proposes that value (the highest-numbered one reported) rather than its own preferred one.

Phase 1 · Prepare & Promise

The proposer chooses a unique, monotonically increasing identifier N — typically counter.pid.
It sends Prepare(N) to at least a majority of acceptors.
An acceptor: if N exceeds every identifier it has seen, replies Promise(N, U) where U is the highest-numbered proposal already accepted (or none); it vows to refuse anything lower than N. Otherwise, it ignores the request.

Phase 2 · Accept & Accepted

If the proposer receives Promises from a majority:
If any Promise reported an accepted value U, the proposer must propose the value of the highest-numbered such U; this preserves agreement, since that value may already have been chosen.
Otherwise, it proposes its own value V.
It sends Accept(N, V) to the majority.
An acceptor accepts unless it has, since, promised a higher number.
When a majority accepts, consensus is reached and learners are notified.

Result · Fault tolerance

To tolerate m simultaneous failures, deploy N = 2m + 1 acceptors. Five acceptors survive two; three acceptors survive one.

A Worked Contention — NADRA CNIC Update

Five acceptors R₁…R₅. Two district offices propose simultaneously: P₁ with id 15.1 ("Update to Islamabad"), P₂ with id 15.2 ("Update to Karachi").

P₂'s Prepare(15.2) arrives first; all acceptors promise (nothing previously accepted).
P₁'s Prepare(15.1) is now too low and is rejected.
P₂'s Accept(15.2, Karachi) is accepted by a majority. Consensus on Karachi.
P₁ retries with id 16.1. Promises now report "Karachi" as accepted — P₁ must re-propose Karachi.
R₃ crashes: four acceptors remain — majority of three still attainable. R₃ and R₄ crash: three remain — majority still possible. A third failure halts progress; correctness is preserved.
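
The first four steps of the contention can be replayed against a small acceptor model. This is a single-process sketch with no networking or failures; proposal ids are written as (counter, pid) tuples, and the class and function names are assumptions of the example.

```python
class Acceptor:
    def __init__(self):
        self.promised = None   # highest proposal id promised so far
        self.accepted = None   # (id, value) of the highest accepted proposal

    def prepare(self, n):
        if self.promised is None or n > self.promised:
            self.promised = n
            return ("promise", n, self.accepted)
        return None            # too low: ignore

    def accept(self, n, value):
        if self.promised is None or n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return ("accepted", n, value)
        return None            # a higher number was promised since


def choose_value(promises, own_value):
    """Re-propose the highest-numbered accepted value, if any was reported."""
    prior = [p[2] for p in promises if p is not None and p[2] is not None]
    return max(prior, key=lambda pv: pv[0])[1] if prior else own_value


acceptors = [Acceptor() for _ in range(5)]
promises_p2 = [a.prepare((15, 2)) for a in acceptors]       # all promise
rejected_p1 = [a.prepare((15, 1)) for a in acceptors]       # all ignored
value = choose_value(promises_p2, "Karachi")
acks = [a.accept((15, 2), value) for a in acceptors]        # consensus
promises_p1 = [a.prepare((16, 1)) for a in acceptors]       # P1 retries
print(choose_value(promises_p1, "Islamabad"))               # Karachi
```

P1's retry succeeds in Phase 1 but is obliged to carry Karachi forward, so the decision never changes.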

IX · Week 12

Fault Tolerance & Redundancy

Failure is not exceptional; it is the working condition. The discipline of fault tolerance is the discipline of building systems whose correctness survives the inevitable misbehaviour of their parts.

The Chain

fig. 9.1 The fault chain. A latent fault produces an erroneous internal state, which in turn produces an externally visible failure.

By Duration

Transient	Appears once and is gone. A cosmic-ray bit flip.
Intermittent	Recurrent and unpredictable. A loose cable.
Permanent	Persists until repair. A dead disk.

By Behaviour

Fail-silent (fail-stop)

The component produces no output, or stops. Easy to detect — the absence of signal is itself a signal.

Byzantine

The component produces wrong, arbitrary, or malicious output indistinguishable from correct. Hard to detect. Requires either trust, attestation, or quorum.

Strategies of Handling

Prevention	Avoid the fault at the source.	Careful engineering
Tolerance	Mask the fault when it occurs.	Redundancy, voting
Removal	Reduce frequency through correction.	Testing, patches
Forecasting	Estimate future incidence.	Monitoring, telemetry

Three Kinds of Redundancy

Information

parity · Hamming · ECC

Time

retry on failure

Physical

backup hardware · replicas

Triple Modular Redundancy

fig. 9.2 Triple Modular Redundancy: three identical modules feed a majority voter. The output remains correct so long as no more than one module fails. The pattern scales to N-Modular Redundancy.
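
The voter itself is a few lines. A minimal sketch, generalised to any odd number of modules; the function name `vote` is an assumption of the example.

```python
from collections import Counter


def vote(outputs):
    """Return the value produced by a strict majority of modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) // 2:
        return value
    raise ValueError("no majority: too many modules disagree")


print(vote([42, 42, 17]))  # 42: one faulty module is masked
```

With three modules, two simultaneous faults that disagree leave no majority, which is why TMR tolerates exactly one.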

Primary–Backup

The primary alone handles requests. The backup observes heartbeats; on the primary's silence, it assumes the role. The model trades double the hardware for continuous availability.

Availability, By the Nines

Nines	Annual downtime	Roughly
99%	3.65 days	internal tools
99.9%	8.76 hours	web applications
99.99%	52 minutes	financial services
99.999%	5.26 minutes	telephony, payments
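
The downtime column follows directly from the percentage. A short sketch, assuming a 365-day year; the function name is hypothetical.

```python
def annual_downtime_minutes(availability_percent):
    """Minutes per 365-day year during which the service may be down."""
    return (100 - availability_percent) / 100 * 365 * 24 * 60


for nines in (99, 99.9, 99.99, 99.999):
    minutes = annual_downtime_minutes(nines)
    print(f"{nines}%: {minutes:.2f} min = {minutes / 60:.2f} h = {minutes / 1440:.3f} days")
```

Each extra nine divides the budget by ten: roughly 3.65 days, 8.76 hours, 52.6 minutes and 5.26 minutes.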

X · Week 13

CAP & the Eventual

A theorem about three desirable properties — consistency, availability, partition tolerance — states that one may have at most two. Since partitions are inevitable, the practical choice is between consistency and availability. Eric Brewer, 2000.

ACID, Recalled

Atomic

all or nothing

Consistent

invariants kept

Isolated

no interference

Durable

commit is forever

Two-Phase Commit

fig. 10.1 Two-Phase Commit. Prepare gathers votes; Commit (or Abort) enforces the decision. Its weakness is the blocking problem: if the coordinator dies between phases, participants hold their locks indefinitely.

The CAP Theorem

fig. 10.2 The CAP triangle. Network partitions will occur. The honest choice, in a real distributed system, is therefore between the CP corner (refuse minority writes) and the AP corner (accept divergent writes, reconcile later).

BASE — the Alternative to ACID

Basically

always available

Soft state

may shift unprompted

Eventual

replicas converge in time

Conflict Resolution

Last writer wins	Decide by timestamp. Simple; may discard real work.
Vector clocks	Surface genuine conflicts and defer to the application.
CRDTs	Conflict-free replicated data types — merge automatically. Examples: G-Counter, OR-Set.
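
A G-Counter shows why a CRDT merge needs no coordination. This is a minimal sketch with a fixed number of replicas; the class layout is an assumption of the example.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by componentwise max."""

    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.counts = [0] * n_replicas

    def increment(self, amount=1):
        self.counts[self.replica_id] += amount   # a replica only touches its own slot

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]


a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(2)          # updates accepted independently on each side
b.increment(3)
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 5 5
```

Because max is commutative, associative and idempotent, replicas converge regardless of merge order or repetition.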

Quorum Arithmetic

With N replicas, W required for write, R required for read:

Quorum overlap

W + R > N ⟹ every read quorum overlaps every write quorum, the usual condition for strong consistency. The write set and read set must share at least one node.

fig. 10.3 Quorum overlap. The intersection guarantees that any reader, looking at R nodes, will encounter at least one node that participated in the latest write.
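
The overlap claim can be verified by brute force for small clusters. A sketch that enumerates every possible write set and read set; the function names are assumptions, and the failure bound N − max(W, R) is taken from the formula appendix.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if every W-subset of n nodes intersects every R-subset."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))


def max_failures(n, w, r):
    """Crashes survivable while both quorums remain reachable."""
    return n - max(w, r)


print(quorums_always_overlap(3, 2, 2), max_failures(3, 2, 2))  # True 1
print(quorums_always_overlap(5, 3, 3), max_failures(5, 3, 3))  # True 2
print(quorums_always_overlap(3, 1, 2))                         # False: W + R = N
```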

Protocols, Compared

Protocol	Strength	Weakness	Use case
2PC	Strict ACID	Blocking	Local transactions
Paxos · Raft	Strong consistency	Latency	Distributed ledger
Eventual	Massive scale	Temporary divergence	Feeds, carts
Quorum	Tunable	Complex to operate	NoSQL

XI · Week 14

Distributed Intelligence at Scale

Training a model of a hundred billion parameters is a problem of distributed systems first, machine learning second. The compute fits on no single device; the cluster must agree, partition, replicate, and survive failure on the scale of a small data centre.

The Arithmetic of Memory

For N parameters in FP16 with an Adam optimizer, the per-parameter bookkeeping is rigid:

Component	Bytes per parameter	30 B model	175 B model
Weights · FP16	2	60 GB	350 GB
Gradients · FP16	2	60 GB	350 GB
Adam states · FP32 (master + m + v)	12	360 GB	2.1 TB
Total	16	480 GB	2.8 TB

FLOPs_train ≈ 6 · N · D (2 · N · D forward + 4 · N · D backward); N parameters, D tokens
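
The table's totals and the FLOP estimate are plain multiplication. A short sketch using decimal gigabytes (10⁹ bytes), as the table does; the names and the token count in the last line are illustrative assumptions.

```python
BYTES_PER_PARAM = {"weights_fp16": 2, "gradients_fp16": 2, "adam_fp32": 12}


def training_memory_gb(params):
    """Weights + gradients + Adam states, in GB of 10**9 bytes."""
    return params * sum(BYTES_PER_PARAM.values()) / 1e9


def training_flops(params, tokens):
    """6 FLOPs per parameter per token: 2 forward + 4 backward."""
    return 6 * params * tokens


print(training_memory_gb(30e9))    # 480.0
print(training_memory_gb(175e9))   # 2800.0 (2.8 TB)
print(training_flops(30e9, 1e12))  # 1.8e+23 for a hypothetical 1 T tokens
```

Activations are not counted here, so the real footprint is larger still.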

Interconnect Hierarchy

Link	Bandwidth	Scope
NVLink	600 GB/s	Intra-node · 8 GPUs
InfiniBand	50–100 GB/s	Inter-node
PCIe Gen 4/5	32–64 GB/s	GPU ↔ CPU

Three-Dimensional Parallelism

fig. 11.1 The three axes of parallelism. Tensor parallelism, with its per-layer communication, must remain intra-node where NVLink furnishes the bandwidth. Pipeline parallelism, with its rarer transfers, tolerates inter-node links.

NCCL Primitives

All-Reduce	Reduce across all workers, result to all. The core of DP.
Broadcast	Rank 0 to every worker.
Scatter / Gather	Split a tensor / collect tensors.
Reduce-Scatter	Reduce, then distribute shards.
All-Gather	Collect shards from all workers into all.
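
The relationship between these primitives can be shown with lists standing in for GPU buffers. This is a model of the semantics only, not NCCL's API: there is no communication, and the buffer length is assumed to be divisible by the number of workers.

```python
def all_reduce(buffers):
    """Every worker ends with the elementwise sum of all buffers."""
    total = [sum(column) for column in zip(*buffers)]
    return [list(total) for _ in buffers]


def reduce_scatter(buffers):
    """Worker i ends with only the i-th shard of the elementwise sum."""
    n = len(buffers)
    total = [sum(column) for column in zip(*buffers)]
    shard = len(total) // n
    return [total[i * shard:(i + 1) * shard] for i in range(n)]


def all_gather(shards):
    """Every worker ends with the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]   # two workers, four gradients each
print(reduce_scatter(grads))               # [[11, 22], [33, 44]]
print(all_gather(reduce_scatter(grads)) == all_reduce(grads))  # True
```

All-Reduce is thus equivalent to Reduce-Scatter followed by All-Gather, the decomposition that sharded training exploits.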

FSDP & ZeRO

FSDP · Fully Sharded Data Parallel

Shards weights, gradients and optimizer state across GPUs. 1/N memory footprint relative to DDP. Forward: All-Gather weights for the current layer; compute; discard. Backward: All-Gather plus Reduce-Scatter for gradients.

ZeRO Stages · DeepSpeed

Stage 1: shard optimizer states (~4× saving).
Stage 2: shard optimizer + gradients (~8×).
Stage 3: shard everything (FSDP-equivalent).
Offload: push to CPU RAM or NVMe; ten times larger models, much slower.

Lesser Techniques That Buy A Lot

Mixed precision

BF16 · same FP32 range

Activation ckpt

+33% compute · 5× memory saving

DDP > DP

no GIL · overlap

Industry Notes

Llama-2 · 2 000 A100

Hardware failures are certain at scale. Frequent local checkpoints to NVMe; DCGM monitoring; Slurm auto-drain and replace. Sustained high utilization despite daily hardware drops.

BLOOM · The Straggler Problem

Synchronous All-Reduce travels at the speed of the slowest worker. NCCL_DEBUG profiling identified five faulty InfiniBand cables; cable replacement restored full throughput.

XII · Appendix

Formulas at a Glance

Every numeric rule worth memorizing, gathered. The kind of facts that lose marks when forgotten — message counts, fault tolerance bounds, memory budgets, bandwidth tiers — laid out for one last pass before walking in.

Time & Logical Clocks

Rule	Meaning
delay = ((T2−T1) + (T4−T3)) / 2	Cristian's symmetric-path delay estimate
offset = T3 + delay − T4	Correction applied to client clock
Cⱼ ← max(Cⱼ, ts(m)) + 1	Lamport's receive rule
VCⱼ[k] ← max(VCⱼ[k], ts(m)[k]) ∀k	Vector clock receive · then VCⱼ[j]++
a → b ⟺ VC(a) < VC(b)	Causality test · componentwise & strict

Mutual Exclusion · Messages per CS

Algorithm	Messages
Centralized coordinator	3
Lamport	3 (N − 1)
Ricart & Agrawala	2 (N − 1)
Token ring	1 to ∞
Suzuki–Kasami	0 (have token) or N (broadcast)
Decentralized voting	2m · N
Throughput	1 / (SD + E)

Election · Worst-Case Complexity

Algorithm	Messages
Bully	O ( N² ) · worst case
Ring	2 N messages · two passes
Chang–Roberts	O ( N² ) worst · O ( N log N ) avg
Hirschberg–Sinclair	O ( N log N ) · bidirectional

Chord DHT

Property	Value
Lookup complexity	O ( log N )
Finger entry i	succ ( ( n + 2ⁱ⁻¹ ) mod 2ᵐ )
Asymmetric distance	dist ( A, B ) = ( B − A ) mod 2ᵐ
Doubling network	+1 finger row
Finger table size	m rows for m-bit IDs

P2P · Gnutella Flooding

Quantity	Formula
Max query messages	b + b(b−1) + b(b−1)² + … + b(b−1)ᵀᵀᴸ⁻¹
For TTL = 7, b = 5	5 + 5·4 + 5·4² + … + 5·4⁶ · exponential
KaZaA leaves per SN	60 – 150 · TTL typically 7

Hadoop & HDFS

Item	Value
Default replication factor	3
Block size	64 MB (older) or 128 MB
Storage for a file F	|F| × 3
PageRank	(1 − d) / N + d · Σ [ PR(q) / L(q) ]
Damping factor d	≈ 0.85

Paxos

Rule	Value
Acceptors to tolerate m failures	N = 2m + 1
Majority of N	⌊N / 2⌋ + 1
3 acceptors	tolerates 1 failure
5 acceptors	tolerates 2 failures
7 acceptors	tolerates 3 failures
Proposal ID	counter . pid · unique & monotonic

Fault Tolerance · Availability

Item	Value
TMR tolerates	1 faulty module · majority vote
99 %	~ 3.65 days / year
99.9 %	~ 8.76 hours / year
99.99 %	~ 52 minutes / year
99.999 %	~ 5.26 minutes / year
Availability	uptime / (uptime + downtime)

CAP · Quorum Arithmetic

Rule	Meaning
W + R > N	Strong consistency · read & write sets overlap
Max failures = N − max(W, R)	How many crashes the cluster survives
QUORUM = ⌊N / 2⌋ + 1	Cassandra-style majority
N = 3, W = 2, R = 2	4 > 3 ✓ · tolerates 1 failure
N = 5, W = 3, R = 3	6 > 5 ✓ · tolerates 2 failures

Distributed AI · The Numbers

Quantity	Value
Total memory per parameter · FP16 + Adam	16 B
· Weights · FP16	2 B
· Gradients · FP16	2 B
· Adam states · FP32 (master + m + v)	12 B (4 + 4 + 4)
Training FLOPs	≈ 6 · N · D (2 fwd + 4 bwd)
NVLink	600 GB/s · intra-node
InfiniBand	50 – 100 GB/s · inter-node
PCIe Gen 4 / 5	32 – 64 GB/s · GPU ↔ CPU
Activation checkpoint	+33 % compute · 5× memory saving
ZeRO-1 / 2 / 3 saving	~ 4× / 8× / 64×
30 B model · total VRAM	≈ 480 GB
175 B model · total VRAM	≈ 2.8 TB

Universal Method

For any "calculate X" question: (1) identify the formula; (2) write it down before plugging in numbers (this earns partial marks); (3) substitute carefully and show the step; (4) state the result with units; (5) name the rule in one short sentence ("by Lamport's protocol, total = 3 (N − 1)"). Examiners reward showing the formula and naming the rule, not just the final number.
