Understanding CAP and PACELC Theorems: Ensuring Data Consistency and Availability in Distributed Systems
Introduction
When designing a distributed system, we as engineers strive for scalability, consistency and high availability. Distributed systems are inherently scalable since you can always add another computer to the cluster. However, the tradeoff we have to consider is Consistency, Availability or Partition Tolerance.
CAP
The idea behind the CAP theorem is to visualize these three tradeoffs, you can only have two of these!!
Consistency refers to the requirement that all nodes in a distributed system see the same data simultaneously. Therefore, updates to the system must be consistent across all replicas.
Availability implies that every request made to a distributed system, regardless of node failures or network partitions, should receive a timely response. High availability is crucial for systems that cannot afford downtime or interruptions.
Partition Tolerance addresses the system’s ability to handle network partitions, where nodes can become temporarily unreachable or network messages can be delayed or lost. In a distributed system, partition tolerance ensures that the system continues to function even in the presence of network failures.
Why Can’t I have all three?
A distributed system cannot have all three, for example assume our system first has P (meaning it can handle partitions). When faced with network partitions, the system must choose between maintaining consistency (having all nodes see the same data) or providing high availability (responding to requests even with potentially inconsistent data). If the network is partitioned, we can either lower availability and maintain consistency (by quarantining the separated parts of the network) OR lower consistency but maintain availability (by keeping those separated nodes live and risking data inconsistencies in the network).
Availability vs. Consistency
The real question is, do you want your system to be available OR consistent in the case of a network partition. Distributed systems have network partitions ALL the time so if you don’t want to support Partition Tolerance, you would have a single node system — which kind of defeats the purpose of being “distributed”.
You would want to build a CP system in cases where data integrity is paramount (like in a Financial Transactions or Healthcare System).
An AP system would be good for high traffic systems where eventual consistency is fine (like a social media app or a CDN).
PACELC
CAP has an extension called PACELC, meaning:
- In a partitioned state, do you prefer Availability or Consistency?
- In a non-partitioned state, do you want Low Latency or Consistency?
We have covered the first piece as part of CAP now we will address the second.
Low Latency emphasizes the need for timely responses in a distributed system. Low latency ensures that clients experience minimal delays when interacting with the system, enhancing user experience and responsiveness.
The reason we cannot have Low Latency and Consistency is because to achieve consistency, our nodes must sync up with each other to ensure they have the same data. This syncing will cause a latency increase for our requestor.
In general Low Latency systems are also the ones that favor Availability, and systems that opt for Consistency do so in both network partition states.
Conclusion
System designing is full of trade offs, and it is important to understand the theoretical limits of a system. Next time you want to build a large distributed system, I hope the first thing on your checklist is to consider “Does my system need to be Highly Consistent or Highly Available?” because unfortunately, we cannot have our cake and eat it too when it comes to networks.