Architectures of the Distributed
Before time, before consensus, the first question: where do the components live, and how do they speak? A distributed system is a collection of independent computers that, by some sleight of middleware, appears to its users as a single coherent thing.
Two definitions sit at the root of the field. A decentralized system is one in which processes and resources are necessarily spread across multiple computers. A distributed system is one in which they are sufficiently spread — with the goal of presenting, to the user, the illusion of a single machine.
Two views guide its construction: the integrative, in which existing networked computers are knit into one larger system, and the expansive, in which an existing networked system is grown by addition of more computers.
Software versus System Architecture
Software architecture
The logical organization of components: their interfaces, the data they exchange, and the manner of their connection. A middleware layer that hides distribution is software architecture in the small.
System architecture
The physical realization: which component runs on which machine. A centralized client-server is one such system; a fully decentralized peer-to-peer mesh is another.
Four Styles, Plainly Stated
| Style | Premise | Canonical example |
|---|---|---|
| Layered | Components in a strict stack; only adjacent layers converse. | OSI, TCP/IP |
| Object-based | Objects encapsulate data and expose methods through well-defined interfaces. | Client–server, CORBA |
| Data-centred | Processes communicate by reading and writing a shared repository. | Database (passive); Blackboard (active) |
| Event-based | Communication by propagation of events. Often fused with data-centred to make shared data spaces. | Publish / subscribe; Kafka |
Goals of the Design
Seven Transparencies
| Transparency | Hides |
|---|---|
| Access | Differences in data representation and how an object is accessed. |
| Location | Where an object is located. |
| Migration | That an object may move to another location. |
| Relocation | That an object may be moved while in use. |
| Replication | That an object is replicated. |
| Concurrency | That an object may be shared by several users. |
| Failure | The failure and recovery of an object. |
Delivery Semantics & Idempotency
Three contracts a delivery mechanism may keep:
- At-most-once — zero or one delivery. Messages may be lost.
- At-least-once — one or more deliveries. Messages may be duplicated.
- Exactly-once — exactly one. Neither lost nor duplicated. The most expensive guarantee.
An operation is idempotent when performing it twice has the same effect as performing it once. READ X is idempotent; INCREMENT X is not. In an unreliable network, the temptation is to retransmit; the danger is in retransmitting that which is not safely repeated.
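One common way to make a non-idempotent operation safe to retransmit is to tag each request with a unique id and have the receiver remember which ids it has already applied. The sketch below assumes that approach; the class and the request ids are hypothetical names chosen for illustration.

```python
class Counter:
    """A counter whose INCREMENT is made safe to retry by deduplicating request ids."""

    def __init__(self):
        self.value = 0
        self.seen = set()  # request ids already applied (hypothetical bookkeeping)

    def increment(self, request_id):
        # A retransmitted request carries the same id and is ignored.
        if request_id not in self.seen:
            self.seen.add(request_id)
            self.value += 1
        return self.value


counter = Counter()
counter.increment("req-1")
counter.increment("req-1")  # duplicate delivery of the same request
counter.increment("req-2")
print(counter.value)  # 2
```

With the id check in place, at-least-once delivery yields the effect of exactly-once execution, at the cost of remembering past ids.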
The Chord Ring — A Distributed Hash Table
Nodes are organized on a logical ring of 2ᵐ positions. Each node has an m-bit identifier; each data item is hashed to an m-bit key. The item with key k is stored at the smallest node whose identifier is at least k, wrapping around to the smallest identifier on the ring if there is none: the successor of k.
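The successor rule can be sketched in a few lines, assuming the full list of node identifiers is known locally (a real Chord node knows only its fingers). The ring below is a hypothetical example with m = 5; `ring_distance` computes the clockwise distance (B − A) mod 2ᵐ.

```python
from bisect import bisect_left


def successor(node_ids, key, m):
    """Return the node responsible for `key` on a ring of 2**m positions."""
    ring = sorted(node_ids)
    i = bisect_left(ring, key % (2 ** m))
    # Wrap around to the smallest node when no identifier is >= key.
    return ring[i % len(ring)]


def ring_distance(a, b, m):
    """Clockwise distance from a to b; note that it is not symmetric."""
    return (b - a) % (2 ** m)


nodes = [1, 4, 9, 11, 14, 18, 20, 21, 28]  # hypothetical ring, m = 5
print(successor(nodes, 12, 5))   # 14
print(successor(nodes, 30, 5))   # 1 (wraps around)
print(ring_distance(28, 1, 5))   # 5
```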
dist(A, B) = (B − A) mod 2ᵐ.
Two- and Three-Tiered Realizations
Two-tier
A thin client handles only display; the server handles processing and data. Easier to manage; performance loss at the client. A fat client moves processing and some data to the client, which reduces server load and scales further but is harder to administer.
Three-tier
Presentation, processing, data; each a separate machine. Vertical distribution splits logical layers across machines; horizontal distribution replicates the same layer for load.
Cloud, Edge, Blockchain
Cloud computing is layered: hardware (the metal), infrastructure (virtualization), platform (e.g. S3-style buckets), application. Edge-server systems push servers to the network's boundary: closer to the ISP, closer to the user. Blockchains are append-only chains of immutable, massively replicated blocks; their hard problem is not the chain but the question of who may append.
Peer-to-Peer Systems
A history of the last twenty-five years told as a single argument: how to find a file when no one is in charge. Each system proposes an answer; each answer is broken by the next.
Every peer-to-peer system answers four primitive verbs: join the network, publish what one has, search for what one wants, fetch what one finds. The interesting differences live in the second and third.
Five Architectures, in Order of Their Defeats
| System | Architecture | Search | Decentralization | Innovation |
|---|---|---|---|---|
| Napster | Centralized index | Central server | None | Easy UI; first scale |
| Gnutella | Pure P2P | Query flood with TTL | Full | No central authority |
| KaZaA | Hybrid · FastTrack | Through supernodes | Partial | Hierarchical search |
| Skype | Hybrid · supernodes + login | Supernode discovery | Partial | NAT traversal |
| BitTorrent | Hybrid · tracker / DHT | Out-of-band sites or DHT | High | Swarming, tit-for-tat |
Gnutella in Detail
The protocol's five messages:
| Message | Purpose |
|---|---|
| Ping | Probe the network for other peers. |
| Pong | Reply to Ping; carries an IP and port. |
| Query | Search request, propagated to neighbours until TTL expires. |
| QueryHit | Returned along the reverse path when a match is found. |
| Push | Used when the supplier sits behind a firewall. |
BitTorrent's Vocabulary
| Term | Meaning |
|---|---|
| Torrent file | Metadata referring to a tracker. |
| Tracker | Server keeping account of swarm membership. |
| Seeder · Leecher | Peer with the complete file · peer still downloading. |
| Swarm | The peers sharing one file. |
| DHT | Replaces the tracker for fully decentralized discovery. |
| Tit-for-tat | The incentive mechanism. Uploaders earn priority as downloaders. |
Time, Clocks & the Order of Things
Two computers will never agree on the exact time. The question is whether they need to. The answer, almost always, is no — they need to agree only on the order of events that matter.
Cristian's Algorithm
The server is passive and carries an accurate clock. The client estimates round-trip delay and applies half of it as a correction. The four observed times are recorded:
- Client A sends a request at T1.
- Server B receives it at T2 and replies at T3, piggybacking both.
- Client A records the arrival at T4.
NTP repeats this eight times and takes the minimum-delay sample as its best estimate.
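The arithmetic on the four timestamps can be sketched as follows, assuming the outbound and return paths take equal time. The sample values are hypothetical: a client clock 5 ms behind the server, with 10 ms per leg.

```python
def estimate(t1, t2, t3, t4):
    """Return (one-way delay, offset) from the four timestamps, assuming symmetric paths."""
    delay = ((t2 - t1) + (t4 - t3)) / 2
    offset = t3 + delay - t4  # amount to add to the client clock
    return delay, offset


def best_sample(samples):
    """Pick the (delay, offset) pair with the smallest delay, as in the minimum-delay rule."""
    return min(samples, key=lambda s: s[0])


delay, offset = estimate(t1=100, t2=115, t3=117, t4=122)
print(delay, offset)  # 10.0 5.0
```

The minimum-delay sample is preferred because a short round trip leaves the least room for asymmetry between the two legs.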
Lamport's Logical Clocks
Lamport's insight: processes that do not communicate need not agree on time at all. Among those that do, we need only agree on the order of events that touch them.
Events with neither a → b nor b → a are concurrent. The clock rules then write themselves:
- Before any event at process Pi: Ci ← Ci + 1.
- On sending a message: stamp it with Ci.
- On receiving a message m at Pj: Cj ← max(Cj, ts(m)) + 1.

Figure: the receive rule max(local, ts(m)) + 1; here, P3's clock advances from 0 to 5 upon receiving m₂ with stamp 4.

If a → b then C(a) < C(b), but the converse does not hold. A smaller Lamport stamp does not imply a causal relation. To distinguish concurrency from precedence, we need vectors.
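The three clock rules can be sketched as a small class. The class and method names are illustrative assumptions, not part of any library.

```python
class LamportClock:
    """A single process's logical clock (illustrative sketch)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: increment before the event."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is itself an event; the returned value is the message's stamp."""
        return self.tick()

    def receive(self, ts):
        """Receive rule: max(local, ts(m)) + 1."""
        self.time = max(self.time, ts) + 1
        return self.time


p3 = LamportClock()
print(p3.receive(4))  # 5: the clock jumps from 0 past the incoming stamp
```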
Vector Clocks
Each process holds a vector VCi[1…n]. Position i counts its own events; position j records what Pi knows about Pj's clock.
- Before any event: VCi[i] ← VCi[i] + 1.
- On send: attach VCi.
- On receive m: for each k, VCj[k] ← max(VCj[k], ts(m)[k]); then VCj[j]++.
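A minimal sketch of the vector rules, together with the comparison that Lamport stamps cannot offer: telling precedence from concurrency. All names are illustrative assumptions.

```python
class VectorClock:
    """Vector clock for process i out of n (illustrative sketch)."""

    def __init__(self, n, i):
        self.v = [0] * n
        self.i = i

    def tick(self):
        self.v[self.i] += 1

    def send(self):
        self.tick()
        return list(self.v)  # copy attached to the message

    def receive(self, ts):
        self.v = [max(a, b) for a, b in zip(self.v, ts)]
        self.v[self.i] += 1


def happened_before(a, b):
    """True when a <= b componentwise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a, b):
    return a != b and not happened_before(a, b) and not happened_before(b, a)


p0, p1, p2 = (VectorClock(3, i) for i in range(3))
m = p0.send()       # [1, 0, 0]
p1.receive(m)       # p1 is now [1, 1, 0]
p2.tick()           # p2 is now [0, 0, 1], unrelated to the others
print(happened_before(m, p1.v))  # True
print(concurrent(p1.v, p2.v))    # True
```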
Matrix Clocks — What Others Know
Each process maintains an n × n matrix. The row Mi[i] is the process's own vector clock, the principal vector. Other rows record what Pi knows of what others know. This is precisely the structure required to garbage-collect message logs, decide stable predicates, or run causal multicast protocols.
Synchronization & Ordered Multicast
If every replica must see the same updates in the same order — and updates do not commute — then logical clocks alone are insufficient. We must agree on a total order, even for concurrent events.
Totally-Ordered Multicast — Extended Lamport
- The sender stamps its update with its Lamport clock and multicasts to all processes, itself included.
- Each receiver places the message in a local queue, sorted by timestamp (ties broken by sender id).
- Each receiver replies with a timestamped acknowledgement.
- A message is delivered to the application only when it stands at the head of the queue and has been acknowledged by every other process.
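The delivery rule above can be sketched with a priority queue per replica. This is a simplified model under stated assumptions: timestamps are supplied by hand rather than by a Lamport clock, the network is replaced by direct method calls, and each replica waits for an acknowledgement from all n processes, its own included. The `Replica` class and the update payloads are hypothetical.

```python
import heapq


class Replica:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.queue = []      # min-heap of (timestamp, sender, payload)
        self.acks = {}       # (timestamp, sender) -> set of pids that acknowledged
        self.delivered = []  # payloads handed to the application, in order

    def on_update(self, ts, sender, payload):
        heapq.heappush(self.queue, (ts, sender, payload))

    def on_ack(self, ts, sender, from_pid):
        self.acks.setdefault((ts, sender), set()).add(from_pid)
        self._try_deliver()

    def _try_deliver(self):
        # Deliver only from the head of the queue, and only when fully acknowledged.
        while self.queue:
            ts, sender, payload = self.queue[0]
            if len(self.acks.get((ts, sender), ())) < self.n:
                break
            heapq.heappop(self.queue)
            self.delivered.append(payload)


replicas = [Replica(pid, 2) for pid in range(2)]
updates = [(1, 0, "add 100"), (1, 1, "add 1% interest")]  # concurrent, equal stamps
for u in updates:                 # replica 0 sees them in one order
    replicas[0].on_update(*u)
for u in reversed(updates):       # replica 1 sees them in the other
    replicas[1].on_update(*u)
for ts, sender, _ in reversed(updates):   # acknowledgements arrive in any order
    for r in replicas:
        for from_pid in range(2):
            r.on_ack(ts, sender, from_pid)
print(replicas[0].delivered == replicas[1].delivered)  # True
```

Both replicas apply the deposit before the interest because the tie on timestamp 1 is broken by sender id, whatever the arrival order.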
Ordering — the Distinctions
Schiper–Eggli–Sandoz
For causal ordering when broadcast is unavailable. Each message carries a vector of "what this sender has sent to each other process". The receiver delivers only when its own state shows all causally prior messages have arrived. Trades message size for the absence of broadcast; clock advances only on receive.
Matrix Clocks — redux
Stronger than vector clocks where the application needs to know what other processes know. The principal vector is, on any process, componentwise greater than or equal to every non-principal vector.
Mutual Exclusion & Election
A shared resource demands one user at a time. Across a network, without a shared semaphore, this becomes a problem of agreement, of message economy, of failure handling. Then: who shall preside?
Three Requirements
The Algorithms, in Increasing Subtlety
| Algorithm | Class | Messages / CS | Notes |
|---|---|---|---|
| Centralized coordinator | token at a master | 3 | SPOF; trivial fairness |
| Lamport | permission, non-token | 3 (N − 1) | REQUEST · REPLY · RELEASE |
| Ricart & Agrawala | permission, non-token | 2 (N − 1) | Reply replaces release |
| Token ring | token, circulating | 1 to ∞ | No starvation; token loss is hard |
| Suzuki–Kasami | token, broadcast request | 0 or N | Counter-based |
| Decentralized voting | quorum | 2m · N | Starvation possible |
Lamport's Mutex
Every site keeps a local request queue ordered by timestamp. Channels must be FIFO. Three message types — REQUEST, REPLY, RELEASE — carry the protocol.
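A site enters its critical section when two conditions hold:
L1. It has received a message with a timestamp larger than that of its own request from every other site.
L2. Its own request stands at the head of its local queue.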
Ricart & Agrawala
The same idea, refined. The RELEASE message vanishes; its work is folded into the deferred REPLY. On receiving a competing request:
- If state is Held, queue the request.
- If state is Wanted and one's own (Ti, i) is lexicographically less than the incoming (Tj, j), queue the request.
- Otherwise, send REPLY at once.
Enter the critical section when REPLY has been received from all N − 1 others. On exit, send REPLY to all queued requests.
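The decision taken on receiving a competing request reduces to one comparison of (timestamp, id) pairs. A minimal sketch, with hypothetical state names and function name:

```python
def on_request(state, own_request, incoming_request):
    """Return 'defer' or 'reply' for an incoming (timestamp, pid) request.

    state is 'RELEASED', 'WANTED' or 'HELD'; own_request is this site's
    (timestamp, pid) while state is 'WANTED', otherwise None.
    """
    if state == "HELD":
        return "defer"
    # Python compares tuples lexicographically: timestamp first, then pid.
    if state == "WANTED" and own_request < incoming_request:
        return "defer"
    return "reply"


print(on_request("WANTED", (3, 1), (3, 2)))  # defer: equal stamps, lower pid wins
print(on_request("WANTED", (5, 1), (3, 2)))  # reply: the incoming request is older
print(on_request("RELEASED", None, (3, 2)))  # reply
```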
Suzuki–Kasami
The single token holds an array LN[] (the latest request number serviced for each site). Each site holds RN[] (the highest request number seen). To request entry, a site broadcasts REQUEST(i, n) with n = RNi[i] + 1. The token holder forwards the token to Pj exactly when RN[j] = LN[j] + 1.
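The forwarding test can be sketched as a comparison of the two arrays. The function name and sample values are hypothetical; the sketch assumes the token holder is idle and consults its own RN against the LN carried by the token.

```python
def outstanding(RN, LN):
    """Sites whose latest request has not yet been serviced, as seen by the token holder."""
    return [j for j in range(len(RN)) if RN[j] == LN[j] + 1]


RN = [1, 2, 0]  # highest request number seen from each site
LN = [1, 1, 0]  # latest request number serviced for each site (carried by the token)
print(outstanding(RN, LN))  # [1]: site 1 has one unserved request
```

A stale REQUEST, one whose number is not exactly LN[j] + 1, is recognised as already serviced and does not attract the token.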
Performance Metrics
Election — Who Shall Preside
Many algorithms presume a coordinator. When one fails, an election must produce another. Three classical strategies follow.
The Bully Algorithm
The biggest live process wins. When a process notices the coordinator has gone silent, it sends ELECTION to every higher-numbered process. If none replies, it declares itself; otherwise, it stands down.
Ring & Chang–Roberts
Processes are arranged in a logical ring; messages flow in one direction. In the classical ring algorithm, an ELECTION message accumulates every visited process's id; when it returns to the originator, the maximum id wins, and a second pass announces the coordinator.
Chang–Roberts is uniform — the number of processes need not be known. Each node sends its id to the left. On receipt, if the incoming id exceeds one's own, forward; if less, discard; if equal to one's own, declare oneself the leader. Cost: O(N²) worst, O(N log N) average.
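The forward/discard/declare rule is easy to simulate, which also makes the gap between best and worst case visible. The sketch assumes unique ids, every node starting as a candidate, and position k sending to position k + 1; the final announcement round is not counted.

```python
def chang_roberts(ids):
    """Simulate an election on a unidirectional ring; return (leader, message count)."""
    n = len(ids)
    in_flight = [(pos, ids[pos]) for pos in range(n)]  # (sender position, id carried)
    leader, messages = None, 0
    while in_flight:
        pos, carried = in_flight.pop()
        dest = (pos + 1) % n
        messages += 1
        if carried > ids[dest]:
            in_flight.append((dest, carried))  # forward the larger id
        elif carried == ids[dest]:
            leader = carried                   # own id came back around
        # a smaller id is simply discarded
    return leader, messages


print(chang_roberts([5, 4, 3, 2, 1]))  # (5, 15): ids descending along the ring
print(chang_roberts([1, 2, 3, 4, 5]))  # (5, 9): ids ascending along the ring
```

With ids descending in the direction of travel, id k survives k hops, giving N(N + 1)/2 messages; ascending, all but the maximum die after one hop, giving 2N − 1.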
Hirschberg–Sinclair
On a bidirectional ring. In phase r, a candidate probes both directions to distance 2ʳ; only those who are the largest in their 2ʳ-neighbourhood survive to phase r+1. Cost: O(N log N), better than Chang–Roberts; the price is bidirectionality.
Global State & the Chandy–Lamport Snapshot
A photograph of a distributed system must capture every process and every channel — without halting either. Chandy & Lamport's algorithm achieves this by inserting markers into the flow of ordinary messages.
The global state is the union of every process's local state and every channel's contents (messages in flight). The naive approach — synchronize clocks, ask each process to record at the same wall-clock instant — fails twice: clock skew makes the instant fuzzy, and the in-flight messages go unobserved.
Why One Records the State
The Algorithm
At the initiator Pi
- Record local state Si.
- Send a Marker on each outgoing channel.
- Begin recording every incoming channel.
On receiving a Marker on Ck → i
If this is the first marker Pi has seen: record own state; mark Ck → i as empty; send markers on all outgoing channels; begin recording every other incoming channel.
Otherwise: the state of Ck → i is precisely the sequence of messages received on that channel between the moment Pi began recording it and the arrival of this marker.
The Consistent Cut
A cut partitions every process's events into "before" and "after". A cut is consistent when, for every event e in the cut, every event f with f → e is also in the cut. Equivalently: no message is recorded as received inside the cut unless its send is inside the cut as well.
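The message formulation of consistency lends itself to a direct check. In this sketch a cut is represented, by assumption, as a list giving how many events of each process fall in the "before" half, and each message as a tuple of 1-based event indices; both encodings are hypothetical.

```python
def is_consistent(cut, messages):
    """cut[p] = number of events of process p inside the cut.

    messages: (sender, send_index, receiver, recv_index), indices starting at 1.
    """
    for sender, send_idx, receiver, recv_idx in messages:
        received_in_cut = recv_idx <= cut[receiver]
        sent_in_cut = send_idx <= cut[sender]
        if received_in_cut and not sent_in_cut:
            return False  # an effect without its cause
    return True


messages = [(0, 2, 1, 1)]  # P0's 2nd event sends; P1's 1st event receives
print(is_consistent([2, 1], messages))  # True
print(is_consistent([1, 1], messages))  # False: receive recorded, send not
print(is_consistent([2, 0], messages))  # True: the message is in flight
```

The third case is the one a snapshot must handle with care: the cut is consistent, and the message belongs to the recorded channel state.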
Hadoop & the MapReduce Paradigm
Storage and compute, both at the scale of a warehouse, designed on commodity hardware around the certainty of failure. The result is a programming model so spare it can be taught in two functions.
Hadoop is the open-source rendering of Google's MapReduce, born from Doug Cutting's effort to scale the Nutch search engine in 2005. Its philosophy: any data will fit; failure is the norm; compute moves to the data, not the reverse.
The Architecture, in Two Layers
The Compute Layer · MapReduce
Hadoop versus Relational Databases
| | RDBMS | Hadoop |
|---|---|---|
| Schema | Fixed; ACID | Schema on read |
| Mode | Read & write | Mostly read |
| Hardware | Expensive servers | Commodity |
| Failures | Rare | Normal |
| Work unit | Transaction | Job |
PageRank, As MapReduce
The damping factor d (commonly 0.85) models the probability that a random surfer follows a link rather than jumping. PageRank is iterated until convergence; in MapReduce, mappers emit contributions to each out-neighbour, reducers sum incoming contributions.
| Page | Iter 0 | Iter 1 | Iter 2 | Rank |
|---|---|---|---|---|
| A | 1.000 | 1.500 | 1.500 | 1 |
| B | 1.000 | 1.250 | 1.375 | 2 |
| D | 1.000 | 0.750 | 0.625 | 3 |
| C | 1.000 | 0.500 | 0.500 | 4 |
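One iteration of PageRank in the map/reduce shape can be sketched in plain Python, with the shuffle done in memory. This assumes the normalised update PR(p) = (1 − d)/N + d · Σ PR(q)/L(q) with ranks initialised to 1/N, so the values sum to 1 and sit on a different scale from the table above. The three-page graph is hypothetical and is not the graph behind that table.

```python
from collections import defaultdict


def map_page(page, rank, out_links):
    """Mapper: emit an equal share of this page's rank to each out-neighbour."""
    for target in out_links:
        yield target, rank / len(out_links)


def reduce_page(contributions, n, d=0.85):
    """Reducer: combine incoming contributions with the damping term."""
    return (1 - d) / n + d * sum(contributions)


def iterate(graph, ranks, d=0.85):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for page, out_links in graph.items():
        for target, share in map_page(page, ranks[page], out_links):
            grouped[target].append(share)
    return {p: reduce_page(grouped[p], len(graph), d) for p in graph}


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical; no dangling pages
ranks = {p: 1 / len(graph) for p in graph}
for _ in range(20):
    ranks = iterate(graph, ranks)
print({p: round(r, 3) for p, r in ranks.items()})
```

Pages without out-links would leak rank under this sketch; a production job redistributes their mass, which is omitted here.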
Paxos & the Politics of Consensus
When multiple processes propose values, and a majority must agree on exactly one, Paxos provides the protocol. It is among the simplest distributed algorithms — and, famously, among the most misunderstood.
The Three Roles
In practice, each node plays all three; the roles are logical separations within a single process.
Required Properties
| Property | Requirement |
|---|---|
| Concurrent proposals | More than one proposer may act at once. |
| Validity | The chosen value must be one that was proposed — not invented. |
| Majority rule | To tolerate m failures the protocol requires N = 2m + 1 acceptors. |
| Unicast | No reliance on atomic multicast. |
The Two Phases
Phase 1 · Prepare & Promise
- The proposer chooses a unique, monotonically increasing identifier N, typically counter.pid.
- It sends Prepare(N) to at least a majority of acceptors.
- An acceptor: if N exceeds every identifier it has seen, replies Promise(N, U) where U is the highest-numbered proposal already accepted (or none); it vows to refuse anything lower than N. Otherwise, it ignores the request.
Phase 2 · Accept & Accepted
- If the proposer receives Promises from a majority:
  - If any Promise reported an accepted value U, the proposer must propose the value of the highest-numbered such U; safety demands it, since a value may already have been chosen.
  - Otherwise, it proposes its own value V.
- It sends Accept(N, V) to the majority.
- An acceptor accepts unless it has, since, promised a higher number.
- When a majority accepts, consensus is reached and learners are notified.
A Worked Contention — NADRA CNIC Update
Five acceptors R1…R5. Two district offices propose simultaneously: P1 with id 15.1 ("Update to Islamabad"), P2 with id 15.2 ("Update to Karachi").
- P2's Prepare(15.2) arrives first; all acceptors promise (nothing previously accepted).
- P1's Prepare(15.1) is now too low and is rejected.
- P2's Accept(15.2, Karachi) is accepted by a majority. Consensus on Karachi.
- P1 retries with id 16.1. Promises now report "Karachi" as accepted, so P1 must re-propose Karachi.
- R3 crashes: four acceptors remain, and a majority of three is still attainable. R3 and R4 crash: three remain, and a majority is still possible. A third failure halts progress; correctness is preserved.
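The two phases and the contention above can be replayed with a small single-decree sketch. It rests on several assumptions: proposal ids are modelled as (counter, pid) tuples, there is no network or failure, and each call to the hypothetical `propose` runs both phases back to back, so P1's rejected Prepare arrives after P2's Accept rather than before it. The outcome is the same.

```python
class Acceptor:
    def __init__(self):
        self.promised = None  # highest proposal id promised so far
        self.accepted = None  # (id, value) of the highest proposal accepted, if any

    def prepare(self, n):
        """Phase 1: return ('promise', previously accepted) or None to ignore."""
        if self.promised is None or n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return None

    def accept(self, n, value):
        """Phase 2: accept unless a higher id has been promised since."""
        if self.promised is None or n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run both phases against all acceptors; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None
    prior = [acc for _, acc in promises if acc is not None]
    if prior:
        # Adopt the value of the highest-numbered proposal already accepted.
        value = max(prior, key=lambda acc: acc[0])[1]
    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes >= majority else None


acceptors = [Acceptor() for _ in range(5)]             # R1..R5
print(propose(acceptors, (15, 2), "Karachi"))          # Karachi
print(propose(acceptors, (15, 1), "Islamabad"))        # None: id too low
print(propose(acceptors, (16, 1), "Islamabad"))        # Karachi: must re-propose
```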
Fault Tolerance & Redundancy
Failure is not exceptional; it is the working condition. The discipline of fault tolerance is the discipline of building systems whose correctness survives the inevitable misbehaviour of their parts.
The Chain
By Duration
| Kind | Description |
|---|---|
| Transient | Appears once and is gone. A cosmic-ray bit flip. |
| Intermittent | Recurrent and unpredictable. A loose cable. |
| Permanent | Persists until repair. A dead disk. |
By Behaviour
Fail-silent (fail-stop)
The component produces no output, or stops. Easy to detect — the absence of signal is itself a signal.
Byzantine
The component produces wrong, arbitrary, or malicious output indistinguishable from correct. Hard to detect. Requires either trust, attestation, or quorum.
Strategies of Handling
| Strategy | Aim | Means |
|---|---|---|
| Prevention | Avoid the fault at the source. | Careful engineering |
| Tolerance | Mask the fault when it occurs. | Redundancy, voting |
| Removal | Reduce frequency through correction. | Testing, patches |
| Forecasting | Estimate future incidence. | Monitoring, telemetry |
Three Kinds of Redundancy
Triple Modular Redundancy
Primary–Backup
The primary alone handles requests. The backup observes heartbeats; on the primary's silence, it assumes the role. The model trades double the hardware for continuous availability.
Availability, By the Nines
| Nines | Annual downtime | Roughly |
|---|---|---|
| 99% | 3.65 days | internal tools |
| 99.9% | 8.76 hours | web applications |
| 99.99% | 52 minutes | financial services |
| 99.999% | 5.26 minutes | telephony, payments |
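The downtime column follows directly from the percentage. A short sketch, assuming a 365-day year; the function name is illustrative.

```python
def annual_downtime_minutes(availability_percent):
    """Minutes of permitted downtime per 365-day year."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return (1 - availability_percent / 100) * minutes_per_year


for nines in (99, 99.9, 99.99, 99.999):
    minutes = annual_downtime_minutes(nines)
    print(f"{nines}%: {minutes:.2f} min = {minutes / 60:.2f} h = {minutes / 1440:.2f} days")
```

Each additional nine divides the budget by ten, which is why the step from four nines to five is usually an architectural change rather than an operational one.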
CAP & the Eventual
A theorem about three desirable properties — consistency, availability, partition tolerance — states that one may have at most two. Since partitions are inevitable, the practical choice is between consistency and availability. Eric Brewer, 2000.
ACID, Recalled
Two-Phase Commit
The CAP Theorem
BASE — the Alternative to ACID
Conflict Resolution
| Strategy | Behaviour |
|---|---|
| Last writer wins | Decide by timestamp. Simple; may discard real work. |
| Vector clocks | Surface genuine conflicts and defer to the application. |
| CRDTs | Conflict-free replicated data types — merge automatically. Examples: G-Counter, OR-Set. |
Quorum Arithmetic
With N replicas, W required for write, R required for read:
Protocols, Compared
| Protocol | Strength | Weakness | Use case |
|---|---|---|---|
| 2PC | Strict ACID | Blocking | Local transactions |
| Paxos · Raft | Strong consistency | Latency | Distributed ledger |
| Eventual | Massive scale | Temporary divergence | Feeds, carts |
| Quorum | Tunable | Complex to operate | NoSQL |
Distributed Intelligence at Scale
Training a model of a hundred billion parameters is a problem of distributed systems first, machine learning second. The compute fits on no single device; the cluster must agree, partition, replicate, and survive failure on the scale of a small data centre.
The Arithmetic of Memory
For N parameters in FP16 with an Adam optimizer, the per-parameter bookkeeping is rigid:
| Component | Bytes per parameter | 30 B model | 175 B model |
|---|---|---|---|
| Weights · FP16 | 2 | 60 GB | 350 GB |
| Gradients · FP16 | 2 | 60 GB | 350 GB |
| Adam states · FP32 (master + m + v) | 12 | 360 GB | 2.1 TB |
| Total | 16 | 480 GB | 2.8 TB |
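The table is a multiplication, and a few lines reproduce it for any model size. The sketch counts decimal gigabytes (10⁹ bytes) and excludes activations; the dictionary keys are illustrative names.

```python
BYTES_PER_PARAM = {"weights_fp16": 2, "grads_fp16": 2, "adam_fp32": 12}


def training_memory_gb(params_billion):
    """Memory in GB for weights, gradients and Adam state; activations excluded."""
    # 1 billion parameters x 1 byte = 1 GB, so the product is already in GB.
    return {name: b * params_billion for name, b in BYTES_PER_PARAM.items()}


for size in (30, 175):
    parts = training_memory_gb(size)
    print(size, parts, "total:", sum(parts.values()))
```

For 30 B parameters this gives 60 + 60 + 360 = 480 GB, and for 175 B it gives 2,800 GB, matching the rows above.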
Interconnect Hierarchy
| Link | Bandwidth | Scope |
|---|---|---|
| NVLink | 600 GB/s | Intra-node · 8 GPUs |
| InfiniBand | 50–100 GB/s | Inter-node |
| PCIe Gen 4/5 | 32–64 GB/s | GPU ↔ CPU |
Three-Dimensional Parallelism
NCCL Primitives
| Primitive | Effect |
|---|---|
| All-Reduce | Reduce across all workers, result to all. The core of DP. |
| Broadcast | Rank 0 to every worker. |
| Scatter / Gather | Split a tensor / collect tensors. |
| Reduce-Scatter | Reduce, then distribute shards. |
| All-Gather | Collect shards from all workers into all. |
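The semantics of three of these collectives can be modelled on plain Python lists, which makes one useful identity visible: an All-Reduce equals a Reduce-Scatter followed by an All-Gather. This is a model of the semantics only, not the NCCL API; the functions are hypothetical and assume one element per shard, so each buffer has as many elements as there are workers.

```python
def reduce_scatter(buffers):
    """Worker r ends with the sum of everyone's r-th shard."""
    n = len(buffers)
    return [sum(buf[r] for buf in buffers) for r in range(n)]


def all_gather(shards):
    """Every worker ends with the full list of shards."""
    return [list(shards) for _ in shards]


def all_reduce(buffers):
    """Every worker ends with the elementwise sum of all buffers."""
    total = [sum(column) for column in zip(*buffers)]
    return [list(total) for _ in buffers]


grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]  # 3 workers, 3 elements each
print(reduce_scatter(grads))                                    # [111, 222, 333]
print(all_gather(reduce_scatter(grads)) == all_reduce(grads))   # True
```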
FSDP & ZeRO
FSDP · Fully Sharded Data Parallel
Shards weights, gradients and optimizer state across GPUs. 1/N memory footprint relative to DDP. Forward: All-Gather weights for the current layer; compute; discard. Backward: All-Gather plus Reduce-Scatter for gradients.
ZeRO Stages · DeepSpeed
Stage 1: shard optimizer states (~4× saving).
Stage 2: shard optimizer + gradients (~8×).
Stage 3: shard everything (FSDP-equivalent).
Offload: push to CPU RAM or NVMe; ten times larger models, much slower.
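The stage-by-stage savings can be estimated from the 2 + 2 + 12 bytes-per-parameter budget by dividing each sharded component by the number of GPUs. A sketch under that assumption, ignoring activations and communication buffers; the function name is illustrative.

```python
def bytes_per_param_per_gpu(stage, n_gpus):
    """Per-GPU bytes per parameter under ZeRO stage 0 (plain DDP) to 3."""
    weights, grads, optim = 2, 2, 12  # FP16 weights, FP16 grads, FP32 Adam state
    if stage >= 1:
        optim /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        weights /= n_gpus
    return weights + grads + optim


for stage in (0, 1, 2, 3):
    b = bytes_per_param_per_gpu(stage, n_gpus=64)
    print(f"stage {stage}: {b:.2f} B/param, {16 / b:.1f}x saving")
```

The savings for stages 1 and 2 approach 4× and 8× as the GPU count grows, while stage 3 scales with the GPU count itself.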
Lesser Techniques That Buy A Lot
Industry Notes
Formulas at a Glance
Every numeric rule worth memorizing, gathered. The kind of facts that lose marks when forgotten — message counts, fault tolerance bounds, memory budgets, bandwidth tiers — laid out for one last pass before walking in.
Time & Logical Clocks
| Rule | Meaning |
|---|---|
| delay = ((T2−T1) + (T4−T3)) / 2 | Cristian's symmetric-path delay estimate |
| offset = T3 + delay − T4 | Correction applied to client clock |
| Cⱼ ← max(Cⱼ, ts(m)) + 1 | Lamport's receive rule |
| VCⱼ[k] ← max(VCⱼ[k], ts(m)[k]) ∀k | Vector clock receive · then VCⱼ[j]++ |
| a → b ⟺ VC(a) < VC(b) | Causality test · componentwise & strict |
Mutual Exclusion · Messages per CS
| Algorithm | Messages |
|---|---|
| Centralized coordinator | 3 |
| Lamport | 3 (N − 1) |
| Ricart & Agrawala | 2 (N − 1) |
| Token ring | 1 to ∞ |
| Suzuki–Kasami | 0 (have token) or N (broadcast) |
| Decentralized voting | 2m · N |
| Throughput | 1 / (SD + E) · SD = synchronization delay, E = critical-section execution time |
Election · Worst-Case Complexity
| Algorithm | Messages |
|---|---|
| Bully | O ( N² ) · worst case |
| Ring | 2 N messages · two passes |
| Chang–Roberts | O ( N² ) worst · O ( N log N ) avg |
| Hirschberg–Sinclair | O ( N log N ) · bidirectional |
Chord DHT
| Property | Value |
|---|---|
| Lookup complexity | O ( log N ) |
| Finger entry i | succ ( ( n + 2ⁱ⁻¹ ) mod 2ᵐ ) |
| Asymmetric distance | dist ( A, B ) = ( B − A ) mod 2ᵐ |
| Doubling network | +1 finger row |
| Finger table size | m rows for m-bit IDs |
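A finger table can be built directly from the entry rule, taking the modulus before the successor lookup. The sketch assumes global knowledge of the node identifiers, which a real node does not have; the ring is a hypothetical example with m = 5.

```python
from bisect import bisect_left


def finger_table(n, node_ids, m):
    """Return the m finger entries of node n: succ((n + 2**(i-1)) mod 2**m), i = 1..m."""
    ring = sorted(node_ids)
    table = []
    for i in range(1, m + 1):
        start = (n + 2 ** (i - 1)) % (2 ** m)
        j = bisect_left(ring, start)
        table.append(ring[j % len(ring)])  # wrap to the smallest id if needed
    return table


nodes = [1, 4, 9, 11, 14, 18, 20, 21, 28]  # hypothetical ring, m = 5
print(finger_table(1, nodes, 5))    # [4, 4, 9, 9, 18]
print(finger_table(28, nodes, 5))   # [1, 1, 1, 4, 14]
```

Each successive finger covers twice the distance of the last, which is what yields the O(log N) lookup.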
P2P · Gnutella Flooding
| Quantity | Formula |
|---|---|
| Max query messages | b + b(b−1) + b(b−1)² + … + b(b−1)ᵀᵀᴸ⁻¹ |
| For TTL = 7, b = 5 | 5 + 5·4 + 5·4² + … + 5·4⁶ · exponential |
| KaZaA leaves per SN | 60 – 150 · TTL typically 7 |
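The flooding bound is a geometric series and is quickly evaluated. The sketch assumes every peer has exactly b neighbours and no message reaches the same peer twice, which is the worst case the formula describes.

```python
def max_query_messages(b, ttl):
    """Upper bound on messages for one query: b first-hop sends, then b - 1 onward each."""
    return sum(b * (b - 1) ** hop for hop in range(ttl))


print(max_query_messages(5, 7))  # 27305
```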
Hadoop & HDFS
| Item | Value |
|---|---|
| Default replication factor | 3 |
| Block size | 64 MB (older) or 128 MB |
| Storage for a file F | 3 × size(F) |
| PageRank | (1 − d) / N + d · Σ [ PR(q) / L(q) ] |
| Damping factor d | ≈ 0.85 |
Paxos
| Rule | Value |
|---|---|
| Acceptors to tolerate m failures | N = 2m + 1 |
| Majority of N | ⌊N / 2⌋ + 1 |
| 3 acceptors | tolerates 1 failure |
| 5 acceptors | tolerates 2 failures |
| 7 acceptors | tolerates 3 failures |
| Proposal ID | counter . pid · unique & monotonic |
Fault Tolerance · Availability
| Item | Value |
|---|---|
| TMR tolerates | 1 faulty module · majority vote |
| 99 % | ~ 3.65 days / year |
| 99.9 % | ~ 8.76 hours / year |
| 99.99 % | ~ 52 minutes / year |
| 99.999 % | ~ 5.26 minutes / year |
| Availability | uptime / (uptime + downtime) |
CAP · Quorum Arithmetic
| Rule | Meaning |
|---|---|
| W + R > N | Strong consistency · read & write sets overlap |
| Max failures = N − max(W, R) | How many crashes the cluster survives |
| QUORUM = ⌊N / 2⌋ + 1 | Cassandra-style majority |
| N = 3, W = 2, R = 2 | 4 > 3 ✓ · tolerates 1 failure |
| N = 5, W = 3, R = 3 | 6 > 5 ✓ · tolerates 2 failures |
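The rules in this table can be checked mechanically for any configuration. The function name and result keys below are illustrative.

```python
def quorum_report(n, w, r):
    """Evaluate the quorum rules for n replicas, write quorum w, read quorum r."""
    return {
        "strongly_consistent": w + r > n,  # read and write sets must overlap
        "max_failures": n - max(w, r),     # crashes survivable for both reads and writes
        "majority": n // 2 + 1,
    }


print(quorum_report(3, 2, 2))
print(quorum_report(5, 3, 3))
print(quorum_report(3, 1, 1))  # fast, but reads may miss the latest write
```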
Distributed AI · The Numbers
| Quantity | Value |
|---|---|
| Total memory per parameter · FP16 + Adam | 16 B |
| · Weights · FP16 | 2 B |
| · Gradients · FP16 | 2 B |
| · Adam states · FP32 (master + m + v) | 12 B (4 + 4 + 4) |
| Training FLOPs | ≈ 6 · N · D · N parameters, D tokens · 2 forward + 4 backward per parameter per token |
| NVLink | 600 GB/s · intra-node |
| InfiniBand | 50 – 100 GB/s · inter-node |
| PCIe Gen 4 / 5 | 32 – 64 GB/s · GPU ↔ CPU |
| Activation checkpoint | +33 % compute · 5× memory |
| ZeRO-1 / 2 / 3 saving | ~ 4× / 8× / 64× · stage 3 scales with GPU count (64× on 64 GPUs) |
| 30 B model · total VRAM | ≈ 480 GB |
| 175 B model · total VRAM | ≈ 2.8 TB |