Modern applications process and generate large volumes of data. The possibility of a single web server or data source failing, taking the application and priceless data down with it, is a common nightmare among software developers. You can, however, achieve high data availability if every server node holds an identical copy of the data; that way, no data is lost if a few nodes in the cluster fail. But what happens when the data begins to grow significantly? At that point, you must dial back replication and begin partitioning the data.
NCache, a distributed in-memory caching solution, provides high scalability, performance, and availability for data-intensive applications. It offers the POR (Partitioned Replica) topology, which divides the data into multiple chunks (buckets) and places them in different partitions. Data is partitioned across several nodes to spread the read and write load evenly. Dividing the data solves the scalability problem, but how exactly is the data partitioned equally? This blog explains how data partitioning takes place in NCache.
Hash-Based Partitioning for Equal Data Distribution
More often than not, applications employ a round-robin strategy to assign data to different partitions. While this approach guarantees even distribution, it makes locating specific data items a challenge: with no way to track where an item landed, search and retrieval become slow and inefficient.
To resolve this issue, NCache incorporates Hash-Based Partitioning. Data is split into many buckets that are subsequently dispersed among several partitions. The goal is to evenly distribute buckets across nodes in the cluster to optimize performance and ensure high availability. To achieve this, NCache employs a hashing technique that maps each data item to a specific bucket based on the item's key. To determine which bucket an item belongs to, the hash function is applied to the item's key and the result is taken modulo the total number of buckets, which is fixed at 1000.
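Conceptually, the key-to-bucket mapping looks like the sketch below. This is not NCache's internal code; the hash function (CRC32) and key format are assumptions chosen only to illustrate the "hash the key, mod by 1000" idea:

```python
import zlib

TOTAL_BUCKETS = 1000  # NCache uses a fixed pool of 1000 buckets

def bucket_for_key(key: str) -> int:
    # A deterministic hash (CRC32 here, purely illustrative) taken
    # modulo the bucket count yields the item's bucket address.
    return zlib.crc32(key.encode("utf-8")) % TOTAL_BUCKETS

# The same key always maps to the same bucket, so any node or client
# can compute an item's bucket address independently.
print(bucket_for_key("customer:42"))
```

Because the hash is deterministic and the bucket count never changes, every node and client computes the same bucket address for a given key without any coordination.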
What is a Distribution Map?
The coordinator server is essential in a distributed cache cluster because it oversees bucket distribution and ensures that each item is assigned to a particular bucket based on its key. To achieve this, the coordinator server creates a distribution map recording the bucket distribution and distributes it to all other partitions in the cluster, as well as to all connected clients.
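A distribution map can be pictured as a lookup table from bucket to owning partition. The sketch below is a simplified assumption (node names and the even-assignment rule are made up for illustration); it shows how a client holding the map can route a key straight to the node that owns it:

```python
import zlib

TOTAL_BUCKETS = 1000

def build_distribution_map(nodes):
    # Illustrative even spread: bucket i goes to node i % len(nodes).
    return {bucket: nodes[bucket % len(nodes)] for bucket in range(TOTAL_BUCKETS)}

def owner_of(key, dist_map):
    # A client hashes the key to its bucket, then looks up the
    # bucket's owner in the distribution map.
    bucket = zlib.crc32(key.encode("utf-8")) % TOTAL_BUCKETS
    return dist_map[bucket]

dist_map = build_distribution_map(["NodeA", "NodeB", "NodeC"])
print(owner_of("order:1001", dist_map))
```

With the map in hand, a client never needs to broadcast a lookup: one hash plus one table lookup identifies the owning partition.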
No matter how many servers are in the cluster, this scheme gives each item a consistent bucket address. Because the total number of buckets stays fixed, an item's key always hashes to the same bucket, even as the number of servers in the cluster changes. As a result, even if a bucket moves from one partition to another, the bucket address of an item remains the same. This guarantees that the data stays intact and no data is lost during bucket movements.
In the Partitioned topology, whenever a node leaves the cluster, the cluster experiences data loss: all the buckets owned by the leaving node disappear with it. In the POR topology, however, a replica of each partition is present on another node, and its buckets are redistributed on the basis of the distribution map, preventing data loss.
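The difference can be sketched with a toy two-node cluster. This is a deliberately simplified model (node names, bucket names, and the promotion logic are assumptions, not NCache internals): in POR, each node also holds a passive replica of its neighbor's buckets, so a failure promotes the replica instead of losing the data:

```python
# Active partitions: what each node serves. In plain Partitioned
# topology this is all the data there is.
partitions = {
    "Node1": {"b1": "v1", "b2": "v2"},
    "Node2": {"b3": "v3", "b4": "v4"},
}
# POR only: each node also keeps a passive replica of another
# node's partition.
replicas = {
    "Node1": {"b3": "v3", "b4": "v4"},  # replica of Node2's data
    "Node2": {"b1": "v1", "b2": "v2"},  # replica of Node1's data
}

def fail_node(node):
    # Partitioned topology: the failed node's buckets are simply gone.
    lost = partitions.pop(node)
    # POR: the survivor promotes its replica of those buckets into
    # its active partition, so nothing is lost.
    survivor = next(iter(partitions))
    partitions[survivor].update(replicas[survivor])
    return lost

fail_node("Node2")
# In the POR model, Node1 now serves b1 through b4.
```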
Data Distribution Based on a Distribution Map
NCache's dynamic bucket distribution strategy spreads data equally among all nodes in the cache cluster. When you start a cache cluster, all 1000 buckets are assigned to the first node, so all data is stored in a single partition. As more nodes are added to the cluster, the buckets are divided equally among the partitions to provide the best performance and load balancing.
For instance, when a second node is added to the cluster, the 1000 buckets are divided equally between the two partitions, giving each partition 500 buckets. Similarly, when a third node joins the cluster, the buckets are redistributed so the partitions hold 333, 333, and 334 buckets, respectively.
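The arithmetic behind these counts is simple integer division with the remainder spread over a few partitions. The helper below is an assumed sketch of that calculation, not NCache's actual assignment code:

```python
def bucket_counts(num_partitions, total=1000):
    # Each partition gets total // num_partitions buckets; the
    # remainder is absorbed one bucket at a time by the last few.
    base, rem = divmod(total, num_partitions)
    return [base] * (num_partitions - rem) + [base + 1] * rem

print(bucket_counts(2))  # [500, 500]
print(bucket_counts(3))  # [333, 333, 334]
```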
Bucket distribution changes again if a partition leaves the cluster. For instance, when a partition exits a three-node cluster, the 333 or 334 buckets that belonged to that partition are dispersed across the two remaining nodes to keep the data evenly distributed. Whenever the bucket distribution changes, NCache's state transfer mechanism kicks in to rebalance the data between nodes, ensuring that data is distributed optimally in accordance with the new bucket distribution. The clients also receive the updated distribution map, which informs them of the running server nodes and their hash-based bucket assignments.
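A minimal sketch of that redistribution step, under the same assumed map-as-dictionary model used above (the survivor-selection rule is an illustration, not NCache's state transfer algorithm): the departing node's buckets are reassigned across the surviving partitions, and every bucket keeps its address.

```python
def build_distribution_map(nodes, total=1000):
    # Illustrative even spread: bucket i goes to node i % len(nodes).
    return {bucket: nodes[bucket % len(nodes)] for bucket in range(total)}

def redistribute(dist_map, leaving, survivors):
    # Collect the buckets orphaned by the departing node and deal
    # them out across the survivors; bucket numbers never change.
    orphaned = [b for b, n in dist_map.items() if n == leaving]
    new_map = dict(dist_map)
    for i, bucket in enumerate(orphaned):
        new_map[bucket] = survivors[i % len(survivors)]
    return new_map

three_nodes = build_distribution_map(["NodeA", "NodeB", "NodeC"])
after_leave = redistribute(three_nodes, "NodeC", ["NodeA", "NodeB"])
```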
Data Load Balancing
While redistributing buckets among the cache cluster's nodes, NCache takes a data-centric approach to ensure that the amount of data each partition receives is balanced. To accomplish this, each partition in the cluster periodically exchanges statistics about the buckets it owns with the other partitions. This enables the creation of a balanced distribution map that accounts for the quantity of data each partition holds. NCache automatically balances the data so that each partition receives an equal share of data; it also allows you to balance data manually. You can read more about it here.
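To illustrate why bucket statistics matter, here is a hypothetical greedy sketch of data-aware assignment (NCache's real balancing algorithm is internal; this only conveys the idea): given per-bucket item counts, buckets are placed largest-first onto whichever partition currently holds the least data, so partitions end up with similar data volumes rather than merely similar bucket counts.

```python
def balance_by_size(bucket_sizes, partitions):
    # bucket_sizes: {bucket_id: item_count}, gathered from the
    # statistics each partition periodically exchanges.
    loads = {p: 0 for p in partitions}
    assignment = {}
    # Greedy heuristic: biggest buckets first, each onto the
    # currently least-loaded partition.
    for bucket, size in sorted(bucket_sizes.items(), key=lambda kv: -kv[1]):
        target = min(loads, key=loads.get)
        assignment[bucket] = target
        loads[target] += size
    return assignment, loads

sizes = {0: 400, 1: 50, 2: 300, 3: 250}  # hypothetical skewed buckets
assignment, loads = balance_by_size(sizes, ["Partition1", "Partition2"])
```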
Conclusion
In conclusion, using the POR topology to partition data in NCache is an effective technique for enhancing the speed and scalability of applications. By breaking data into smaller chunks and distributing them over numerous cache nodes, you can ensure that data is always available and lessen the possibility of performance bottlenecks.