Maintenance Mode
Note
This feature is only available in NCache Enterprise Edition.
NCache supports maintenance mode for the Partition-Replica topology, which is the most commonly used caching topology.
Maintenance mode has been introduced to accommodate patching or upgrading hardware/software on caching servers. A typical upgrade workflow involves stopping one cache node at a time, upgrading the server, and restarting the cache(s) on it. This procedure avoids application downtime. However, stopping a cache node triggers state transfer within the entire cache cluster, resulting in excessive use of resources like network and CPU.
NCache maintenance mode resolves this issue by halting state transfer when a node is stopped temporarily for maintenance. Once a node is explicitly stopped for maintenance, it informs the running cache cluster to halt state transfer for a given timeout period. While the cluster is in maintenance mode, client data requests for the stopped node are served from its replica node. Once the stopped node rejoins the cluster, it transfers data back from its replica node.
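The following sketch illustrates this idea of a maintenance window that defers state transfer. It is a simplified, hypothetical model (the `MaintenanceCoordinator` class and its method names are not part of the NCache API), showing how a cluster could suppress state transfer until either the maintenance timeout elapses or the node rejoins.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch: not NCache's actual API, only a model of the behavior.
public class MaintenanceCoordinator {

    private volatile Instant maintenanceDeadline = null; // null => not in maintenance
    private volatile boolean nodeRejoined = false;

    // Called when a node is stopped for maintenance with a user-configured timeout.
    public synchronized void enterMaintenance(Duration timeout) {
        maintenanceDeadline = Instant.now().plus(timeout);
        nodeRejoined = false;
    }

    // Called when the stopped node comes back before the deadline.
    public synchronized void markNodeRejoined() {
        nodeRejoined = true;
    }

    // The state-transfer thread checks this before redistributing data.
    public boolean shouldStartStateTransfer() {
        Instant deadline = maintenanceDeadline;
        if (deadline == null) {
            return true;                            // cluster not under maintenance
        }
        if (nodeRejoined) {
            return true;                            // node is back: transfer data back to it
        }
        return Instant.now().isAfter(deadline);     // timeout expired: rebalance remaining nodes
    }
}
```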
How Maintenance Mode Works
Consider a cluster of three nodes as shown below. If Node 3 is marked for maintenance, state transfer is halted for the specified timeout while any operations on its data are served from its replica on Node 1. During the maintenance period, the replica of the node under maintenance acts as its active partition. This happens without any client intervention and ensures that client operations continue smoothly even while a node is stopped for maintenance.
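As a rough illustration of this routing, the sketch below resolves a partition to its owning node and, if that owner is under maintenance, falls back to the node hosting its replica. The partition map and names such as `resolveNode` are hypothetical and only model the behavior described above for the three-node example.

```java
import java.util.Map;

// Hypothetical routing sketch, not NCache internals.
public class PartitionRouter {

    // Active partition owner per partition id.
    private final Map<Integer, String> activeOwner =
            Map.of(0, "Node1", 1, "Node2", 2, "Node3");

    // Node that hosts each partition's replica.
    private final Map<Integer, String> replicaOwner =
            Map.of(0, "Node2", 1, "Node3", 2, "Node1");

    private volatile String nodeUnderMaintenance = "Node3";

    // Decide which node should serve an operation for the given partition.
    public String resolveNode(int partitionId) {
        String owner = activeOwner.get(partitionId);
        if (owner.equals(nodeUnderMaintenance)) {
            // The replica temporarily acts as the active partition.
            return replicaOwner.get(partitionId);
        }
        return owner;
    }
}
```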
Timeout and State Transfer
The timeout, configured by the user, serves as a waiting period for the state transfer thread. This thread waits for the cluster to exit maintenance mode, either because the node rejoins or because the timeout elapses without it rejoining. There are two possible outcomes (a simplified sketch of this decision follows below):
Node 3 does not re-join the cluster within timeout:
In this case, the state transfer task starts between the remaining nodes (Node 1 and Node 2) and they resume their normal state.
Note that if Node 2 then leaves the cluster abruptly, data loss may occur because its replica existed on Node 3.
Node 3 re-joins the cluster within timeout:
If Node 3 rejoins within the timeout period, state transfer is initiated to restore the original state of the cluster. This state transfer consists of two stages:
- Node 1 (replica of Node 3) --> Node 3 (active partition of Node 3)
- Node 2 (active partition of Node 2) --> Node 3 (replica of Node 2)
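The decision between these two outcomes can be pictured as follows. This is a hedged, hypothetical sketch (the `StateTransferPlanner` type and the string-based plan descriptions are illustrative, not NCache code) of how the state transfer plan differs depending on whether the node rejoined within the timeout.

```java
import java.util.List;

// Hypothetical sketch of the two possible state-transfer plans.
public class StateTransferPlanner {

    // Returns a human-readable plan for the example three-node cluster.
    public List<String> planAfterMaintenance(boolean node3RejoinedWithinTimeout) {
        if (node3RejoinedWithinTimeout) {
            // Two-stage transfer restoring the original cluster state.
            return List.of(
                "Node 1 (replica of Node 3) --> Node 3 (active partition of Node 3)",
                "Node 2 (active partition of Node 2) --> Node 3 (replica of Node 2)");
        }
        // Timeout expired: redistribute data between the remaining nodes.
        return List.of(
            "Rebalance partitions and replicas across Node 1 and Node 2");
    }
}
```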
Behavior
A minimum of two nodes is required to mark a cluster for maintenance.
If a cluster has been marked for maintenance, no node can be added to or removed from the cluster through the Web Manager or tools.
If a cluster has been marked for maintenance, no node other than the one stopped for maintenance can be started. For example, if Node 3 was already stopped and Node 2 has been marked for maintenance, only Node 2 can be started.
On stopping/starting a node which has not been stopped for maintenance, the following exception is thrown: "Cluster is already under maintenance."
If state transfer is already in progress, whether due to a node leaving or joining, the cluster cannot be marked for maintenance. The following exception is thrown: "Cluster is not available for maintenance, state transfer or cluster state change in process."
Users can check whether a cluster is in state transfer through the state transfer counters or by looking at the cache log files in %NCHOME%\log-files (Windows) or /opt/ncache/log-files (Linux). "State transfer has been completed" is logged in the log files. Once a cluster is marked for maintenance, the cache logs contain the entry "Cluster marked under maintenance of node: [IP]:[Port] for xx:xx:xx (HH:MM:SS)."
A cluster can exit maintenance mode in the following scenarios:
- The node marked for maintenance starts again.
- The maintenance mode timeout elapses.
- "Exit Maintenance Mode" option is selected through Web Manager.
- A node leaves the cluster abruptly.
Once the cluster exits maintenance mode, state transfer is initiated.