|
| 1 | +--- |
| 2 | +seoTitle: Conflict-free Replicated Data Types (CRDT) – System Design Guide |
| 3 | +description: "A comprehensive guide on CRDTs (Conflict-free Replicated Data Types). Learn about CvRDT vs CmRDT, state/operation propagation, join-semilattices, and Python/JavaScript implementations." |
| 4 | +keywords: "crdt, conflict-free replicated data types, cvrdt, cmrdt, replication, distributed systems, replication conflicts, state-based, operation-based, eventual consistency, yjs, automerge, figma, vr-rathod, code-note" |
| 5 | +displayTitle: System Design - CRDT |
| 6 | +--- |
| 7 | + |
| 8 | +> [!info] What is a CRDT? |
| 9 | +> A **Conflict-free Replicated Data Type (CRDT)** is a specialized data structure designed for distributed systems. It allows multiple replicas to be updated independently and concurrently without coordination (like locking or central authorization), with a mathematical guarantee that they will eventually converge to the identical state when all updates are propagated. |
| 10 | +
|
| 11 | +- # Explanation |
| 12 | + - In distributed databases or collaborative applications (like Google Docs or Figma), multiple users or servers can edit the same data concurrently. Traditional databases use locks or consensus protocols (like Paxos or Raft) to coordinate writes, which introduces high latency and dependency on network connectivity. |
| 13 | + - CRDTs enable **coordination-free eventual consistency** (AP in the CAP theorem). Replicas can accept updates offline and sync asynchronously. |
| 14 | + - |
| 15 | + - ## Real-World Analogy |
| 16 | + - **Git Merge (Automatic)**: Imagine editing a file where Git can resolve all merges automatically without conflicts. CRDTs are data structures pre-designed with merge rules so that conflict resolution is mathematically deterministic and handled entirely by the structure itself. |
| 17 | + - **Collaborative Whiteboard (Figma)**: Users drag shapes around a canvas. Instead of a server locking each shape, each user moves it locally. The movements are broadcast to others, and a CRDT ensures everyone sees the shapes in the same final positions. |
| 18 | +- |
| 19 | +- # How It Works |
| 20 | + collapsed:: true |
| 21 | + - ## The Two CRDT Approaches |
| 22 | + - There are two primary ways to design and propagate changes in a CRDT: |
| 23 | + - |
| 24 | + - ### 1. State-Based CRDTs (CvRDTs) |
| 25 | + - **Mechanism**: Replicas send their **entire state** to other replicas. |
| 26 | + - **Merge Operator**: Receivers merge their local state with the incoming state using a merge function ($\sqcup$). |
| 27 | + - **Requirements**: The merge operator must form a **Join-Semilattice**. |
| 28 | + - **Network**: Can tolerate message loss, duplication, and out-of-order delivery, since the merge operator is idempotent. |
| 29 | + - |
| 30 | + - ### 2. Operation-Based CRDTs (CmRDTs) |
| 31 | + - **Mechanism**: Replicas send only the **operations** (mutations) to other replicas. |
| 32 | + - **Requirements**: Operations must be commutative to ensure convergence regardless of delivery order. |
| 33 | + - **Network**: Requires a reliable broadcast channel that guarantees **exactly-once** or **at-least-once** causal delivery. |
| 34 | + - |
| 35 | + - ## Mathematical Foundations of CvRDTs |
| 36 | + - For a state-based CRDT (CvRDT) to guarantee convergence, its states must form a **partially ordered set (poset)**, and the merge operator ($\sqcup$, also called "join") must be a **Join-Semilattice**, which satisfies three properties: |
| 37 | + - |
| 38 | + - 1. **Commutativity**: $A \sqcup B = B \sqcup A$ |
| 39 | + - The order in which replicas merge states does not affect the final result. |
| 40 | + - 2. **Associativity**: $(A \sqcup B) \sqcup C = A \sqcup (B \sqcup C)$ |
| 41 | + - The grouping of merge operations does not affect the final result. |
| 42 | + - 3. **Idempotency**: $A \sqcup A = A$ |
| 43 | + - Merging the same state multiple times yields the same result (no double-counting). |
| 44 | + - |
| 45 | + - ## Visual Walkthrough: G-Counter (Grow-Only Counter) |
| 46 | + - A G-Counter is a state-based CRDT where the value can only increase. |
| 47 | + - |
| 48 | + - ### Initial State (3 replicas: A, B, C) |
| 49 | + - ``` |
| 50 | + Replica A: [0, 0, 0] (Value: 0) |
| 51 | + Replica B: [0, 0, 0] (Value: 0) |
| 52 | + Replica C: [0, 0, 0] (Value: 0) |
| 53 | + ``` |
| 54 | + - |
| 55 | + - ### Concurrent Increments |
| 56 | + - A increments twice; B increments once. |
| 57 | + - ``` |
| 58 | + Replica A: [2, 0, 0] (Value: 2) |
| 59 | + Replica B: [0, 1, 0] (Value: 1) |
| 60 | + Replica C: [0, 0, 0] (Value: 0) |
| 61 | + ``` |
| 62 | + - |
| 63 | + - ### Synchronization (A sends state to B; B sends state to C) |
| 64 | + - Replicas merge states by taking the **element-wise maximum** of their vectors: |
| 65 | + - $$Merge(V_1, V_2) = [\max(V_{1,0}, V_{2,0}), \max(V_{1,1}, V_{2,1}), \dots]$$ |
| 66 | + - |
| 67 | + - **B merges A's state**: $[\max(0, 2), \max(1, 0), \max(0, 0)] = [2, 1, 0]$ (Value: 3) |
| 68 | + - **C merges B's state**: $[\max(0, 0), \max(0, 1), \max(0, 0)] = [0, 1, 0]$ (Value: 1) |
| 69 | + - |
| 70 | + - ``` |
| 71 | + Replica A: [2, 0, 0] (Value: 2) |
| 72 | + Replica B: [2, 1, 0] (Value: 3) |
| 73 | + Replica C: [0, 1, 0] (Value: 1) |
| 74 | + ``` |
| 75 | + - |
| 76 | + - ### Full Convergence (All replicas sync) |
| 77 | + - ``` |
| 78 | + Replica A: [2, 1, 0] (Value: 3) |
| 79 | + Replica B: [2, 1, 0] (Value: 3) |
| 80 | + Replica C: [2, 1, 0] (Value: 3) |
| 81 | + ``` |
| 82 | +- |
| 83 | +- # Complexity & Trade-offs |
| 84 | + collapsed:: true |
| 85 | + - ## Complexity Table |
| 86 | + - | Operation / Aspect | State-Based (CvRDT) | Operation-Based (CmRDT) | |
| 87 | + |--------------------|---------------------|--------------------------| |
| 88 | + | **Local Update** | $O(1)$ | $O(1)$ | |
| 89 | + | **Merge/Apply** | $O(N)$ (where $N$ is replica count) | $O(1)$ | |
| 90 | + | **Message Size** | $O(N)$ (entire state size) | $O(1)$ (operation details only) | |
| 91 | + | **Network Cost** | Higher (bandwidth overhead) | Lower (small payloads) | |
| 92 | + | **Network Reliability**| Low requirements (works over UDP/Gossip) | High requirements (needs causal order) | |
| 93 | +- |
| 94 | +- # Implementations |
| 95 | + collapsed:: true |
| 96 | + - :::code-tabs |
| 97 | + |
| 98 | + ```python |
| 99 | + class PNCounter: |
| 100 | + """ |
| 101 | + Positive-Negative Counter CvRDT implementation. |
| 102 | + Allows both increments and decrements by maintaining two G-Counters: |
| 103 | + one for positive additions (P) and one for negative subtractions (N). |
| 104 | + """ |
| 105 | + def __init__(self, replica_id: str): |
| 106 | + self.replica_id = replica_id |
| 107 | + # Dictionary mapping replica_id -> count |
| 108 | + self.P = {self.replica_id: 0} |
| 109 | + self.N = {self.replica_id: 0} |
| 110 | + |
| 111 | + def increment(self, amount: int = 1): |
| 112 | + self.P[self.replica_id] += amount |
| 113 | + |
| 114 | + def decrement(self, amount: int = 1): |
| 115 | + self.N[self.replica_id] += amount |
| 116 | + |
| 117 | + def value(self) -> int: |
| 118 | + """Returns the current accumulated count.""" |
| 119 | + return sum(self.P.values()) - sum(self.N.values()) |
| 120 | + |
| 121 | + def merge(self, other: 'PNCounter'): |
| 122 | + """ |
| 123 | + Merges another PNCounter's state into this one. |
| 124 | + Takes the element-wise maximum for both P and N vectors. |
| 125 | + """ |
| 126 | + # Merge Positive vector |
| 127 | + all_p_keys = set(self.P.keys()).union(other.P.keys()) |
| 128 | + for key in all_p_keys: |
| 129 | + self.P[key] = max(self.P.get(key, 0), other.P.get(key, 0)) |
| 130 | + |
| 131 | + # Merge Negative vector |
| 132 | + all_n_keys = set(self.N.keys()).union(other.N.keys()) |
| 133 | + for key in all_n_keys: |
| 134 | + self.N[key] = max(self.N.get(key, 0), other.N.get(key, 0)) |
| 135 | + |
| 136 | + # Example Usage |
| 137 | + if __name__ == "__main__": |
| 138 | + nodeA = PNCounter("A") |
| 139 | + nodeB = PNCounter("B") |
| 140 | + |
| 141 | + nodeA.increment(5) |
| 142 | + nodeB.increment(3) |
| 143 | + nodeA.decrement(2) # Net value A: 3, B: 3 |
| 144 | + |
| 145 | + # Sync Node A -> Node B |
| 146 | + nodeB.merge(nodeA) |
| 147 | + print("Node B Value after sync:", nodeB.value()) # Expected: 6 (5 - 2 from A + 3 from B) |
| 148 | + ``` |
| 149 | + |
| 150 | + ```javascript |
| 151 | + class LWWElementSet { |
| 152 | + /** |
| 153 | + * Last-Write-Wins Element Set (LWW-Element-Set) CvRDT. |
| 154 | + * Maintains an add set and a remove set with timestamps. |
| 155 | + */ |
| 156 | + constructor() { |
| 157 | + this.addSet = new Map(); // element -> timestamp |
| 158 | + this.removeSet = new Map(); // element -> timestamp |
| 159 | + } |
| 160 | + |
| 161 | + add(element, timestamp = Date.now()) { |
| 162 | + const existing = this.addSet.get(element) || 0; |
| 163 | + this.addSet.set(element, Math.max(existing, timestamp)); |
| 164 | + } |
| 165 | + |
| 166 | + remove(element, timestamp = Date.now()) { |
| 167 | + const existing = this.removeSet.get(element) || 0; |
| 168 | + this.removeSet.set(element, Math.max(existing, timestamp)); |
| 169 | + } |
| 170 | + |
| 171 | + contains(element) { |
| 172 | + const addTime = this.addSet.get(element); |
| 173 | + if (addTime === undefined) return false; |
| 174 | + |
| 175 | + const removeTime = this.removeSet.get(element); |
| 176 | + if (removeTime === undefined) return true; |
| 177 | + |
| 178 | + // If added after removed, or added at the same time (bias towards add) |
| 179 | + return addTime >= removeTime; |
| 180 | + } |
| 181 | + |
| 182 | + merge(other) { |
| 183 | + // Merge addSet |
| 184 | + for (const [element, timestamp] of other.addSet.entries()) { |
| 185 | + const localTime = this.addSet.get(element) || 0; |
| 186 | + this.addSet.set(element, Math.max(localTime, timestamp)); |
| 187 | + } |
| 188 | + |
| 189 | + // Merge removeSet |
| 190 | + for (const [element, timestamp] of other.removeSet.entries()) { |
| 191 | + const localTime = this.removeSet.get(element) || 0; |
| 192 | + this.removeSet.set(element, Math.max(localTime, timestamp)); |
| 193 | + } |
| 194 | + } |
| 195 | + } |
| 196 | + |
| 197 | + // Example Usage |
| 198 | + const client1 = new LWWElementSet(); |
| 199 | + const client2 = new LWWElementSet(); |
| 200 | + |
| 201 | + client1.add("item1", 100); |
| 202 | + client2.remove("item1", 150); // client2 removes it with later timestamp |
| 203 | + |
| 204 | + client1.merge(client2); |
| 205 | + console.log("Contains item1?", client1.contains("item1")); // false |
| 206 | + |
| 207 | + client1.add("item1", 200); // client1 re-adds it later |
| 208 | + console.log("Contains item1 after re-add?", client1.contains("item1")); // true |
| 209 | + ``` |
| 210 | + |
| 211 | + ::: |
| 212 | +- |
| 213 | +- # CRDT vs Operational Transformation (OT) |
| 214 | + collapsed:: true |
| 215 | + - ## Architectural Differences |
| 216 | + - **Operational Transformation (OT)** is another conflict resolution paradigm used extensively in collaborative editors (e.g., Google Docs). |
| 217 | + - Instead of utilizing math in the data structure itself to resolve conflict, OT intercepts concurrent operations and intercepts/modifies their offsets dynamically depending on peer operations. |
| 218 | + - |
| 219 | + - | Feature | Operational Transformation (OT) | CRDT | |
| 220 | + |---------|---------------------------------|------| |
| 221 | + | **Topology** | Typically Server-Client (Centralized) | Peer-to-Peer or Server-Client (Decentralized) | |
| 222 | + | **Offline Support** | Poor (Requires synchronization locks) | Excellent (Offline updates merge natively) | |
| 223 | + | **Implementation Complexity** | Extremely high (Edge-cases in transform matrices) | Low to Medium (Relies on data structure rules) | |
| 224 | + | **Memory Overhead** | Low | High (Needs to store tombstones/metadata) | |
| 225 | + | **Use Cases** | Google Docs, Wave | Figma, Apple Notes, Yjs, Automerge, Redis | |
| 226 | +- |
| 227 | +- # Real-World Systems Utilizing CRDTs |
| 228 | + collapsed:: true |
| 229 | + - **Figma**: Uses a custom CRDT layout engine to synchronize multiplayer canvas edits. |
| 230 | + - **Redis Enterprise**: Uses CRDTs to provide Active-Active Multi-Master replication databases across geographically separated regions. |
| 231 | + - **Riak KV / Cassandra**: Uses CRDTs for resolving values in multi-master tables under eventual consistency. |
| 232 | + - **Yjs & Automerge**: Popular open-source JavaScript libraries for building offline-first, real-time collaborative text and rich-text editing applications. |
| 233 | +- |
| 234 | +- # Key Takeaways |
| 235 | + collapsed:: true |
| 236 | + - **Convergence Guarantee**: CRDTs ensure that any two nodes that have received the same set of updates are guaranteed to be in the same state. |
| 237 | + - **Mathematical Rigor**: CvRDT states must be partially ordered, and merging must be associative, commutative, and idempotent to prevent double-counting. |
| 238 | + - **Tombstones**: Removing data in sets or sequences usually requires storing a record of deletion (a "tombstone"), which can cause data structure size to grow indefinitely unless garbage collected. |
| 239 | +- |
| 240 | +- # More Learn |
| 241 | + collapsed:: true |
| 242 | + - ## Resources & Readings |
| 243 | + - [A comprehensive study of Convergent and Divergent Replicated Data Types](https://inria.hal.science/inria-00555588/document) — The original seminal paper by Marc Shapiro et al. |
| 244 | + - [Designing Data-Intensive Applications – Martin Kleppmann](https://dataintensive.net/) — Chapter 5 discusses replication and CRDT conflict resolution patterns. |
| 245 | + - [Yjs Documentation](https://docs.yjs.dev/) — In-depth look at production-grade text/rich-text CRDTs. |
| 246 | + - [Wikipedia -> CRDT](https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type) |
0 commit comments