Friday, February 9, 2018

Cluster 7. What is split-brain, quorum, DLM & fencing, totem protocol & CPG.

Split-brain

Split-brain is a state in which nodes lose contact with each other and then try to take control of shared resources or to provide clustered services simultaneously. This ends up corrupting and losing data. To avoid split-brain situations, quorum is used.

Quorum

The quorum algorithm used in the Red Hat Cluster is a simple majority, meaning that more than half of the hosts must be online and communicating in order to provide services: quorum = (nodes_count / 2 + 1), rounded down (see the sketch below):
  • If we have 3 nodes in a cluster, vote count = 3, quorum = 3 / 2 + 1 = 1.5 + 1 = 2.5 ~ 2, so at least 2 nodes are needed to form a new cluster after 1 node fails; if 2 nodes fail the cluster will hang
  • If we have 4 nodes in a cluster, vote count = 4, quorum = 4 / 2 + 1 = 2 + 1 = 3, so at least 3 nodes are needed to form a new cluster after 1 node fails; if 2 nodes fail the cluster will hang
  • If we have 5 nodes in a cluster, vote count = 5, quorum = 5 / 2 + 1 = 2.5 + 1 = 3.5 ~ 3, so at least 3 nodes are needed to form a new cluster after 1 node fails; if 2 nodes fail the remaining 3 still have quorum and can re-form the cluster
In a cluster with 2 nodes any failure causes a 50/50 split, hanging both nodes. To make a 2-node cluster fault-tolerant, fencing is used (corosync 2.4 has options to use quorum with a 2-node cluster, but fencing is still needed).
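
A minimal sketch of this arithmetic in plain Python (just an illustration assuming one vote per node - not part of any cluster software):

# majority quorum: floor(votes / 2 + 1), i.e. votes // 2 + 1 for whole votes
def quorum(votes: int) -> int:
    return votes // 2 + 1

for nodes in (2, 3, 4, 5):
    q = quorum(nodes)
    print(f"{nodes} nodes: quorum = {q}, survives losing {nodes - q} node(s)")

# 2 nodes: quorum = 2, survives losing 0 node(s)
# 3 nodes: quorum = 2, survives losing 1 node(s)
# 4 nodes: quorum = 3, survives losing 1 node(s)
# 5 nodes: quorum = 3, survives losing 2 node(s)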

If the cluster is split into two or more partitions, the group of machines that has quorum can form a new cluster.

PS: we can use qdisk to form quorum in a cluster of 2 nodes, but it does not work with DRBD, which we are going to use for HDD replication. Also we are going to use corosync 2.4, which has options like two_node & wait_for_all that do not work with qdisk.

Fencing aka STONITH

Fencing means putting the target node into a state where it cannot affect cluster resources or provide cluster services. This can be accomplished by powering it off (power fencing) or by disconnecting it from the SAN storage and/or the network (fabric fencing).
Fencing is an absolutely critical part of clustering. Without fully functional fencing your cluster will fail.
Linux-HA used the term STONITH ("Shoot The Other Node In The Head") and Red Hat used the term "fencing". Both terms can be used interchangeably.
When nodes fail or the cluster splits into partitions, the winning node or partition (winning here means "having quorum") will fence the losers. In a two-node cluster with corosync 2.4 one node will get quorum and try to fence the other node; with a network failure this can end in a fencing loop, both nodes fencing each other forever. To avoid that, set up a fencing delay for the preferred node.
If all (or the only) configured fence agents fail, the fence daemon starts over. It will wait and loop forever until a fence agent succeeds. During this time the cluster is effectively hung.
Once a fence agent succeeds, the fence daemon notifies DLM and the lost locks are recovered. This is how fencing & DLM cooperate.
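
A rough Python sketch of that retry loop (the FenceAgent class and notify_dlm function are invented for illustration; they are not the real fenced or dlm_controld interfaces):

import time

class FenceAgent:
    """Stand-in for a configured fence agent (IPMI power-off, SAN port disable, ...)."""
    def __init__(self, name, works):
        self.name, self.works = name, works

    def fence(self, victim):
        print(f"trying {self.name} against {victim}")
        return self.works

def notify_dlm(victim):
    # stand-in for "fencing succeeded, DLM may now recover the victim's locks"
    print(f"{victim} fenced, DLM recovers its locks")

def fence_node(victim, agents, retry_delay=1):
    while True:                        # the fence daemon never gives up
        for agent in agents:           # try the configured methods in order
            if agent.fence(victim):
                notify_dlm(victim)     # only now can lock recovery proceed
                return
        time.sleep(retry_delay)        # every agent failed: wait and start over;
                                       # the cluster stays effectively hung meanwhile

fence_node("node2", [FenceAgent("power", works=False), FenceAgent("fabric", works=True)])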

DLM (Distributed Lock Manager)

File system locking in Linux is done by POSIX or other types of locks available in the system. DLM is used by cluster storage and the resource manager to organize and serialize access (it manages locks). The dlm daemon runs in user space (kernel space runs OS-critical components, user space is where ordinary software runs); this daemon communicates with the DLM in the kernel. A lockspace (a lock on a particular resource) is granted to the requesting node; another node can request that lockspace only after the first node releases the lock.
PS - DLM is used only with cluster-aware file systems.
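
A toy, single-process illustration of that serialization (this is not the real libdlm API; it only mimics the "one holder at a time per resource" rule, and the resource name is made up):

import threading

class ToyLockManager:
    """One exclusive lock per named resource, granted in request order."""
    def __init__(self):
        self._locks = {}                 # resource name -> threading.Lock
        self._guard = threading.Lock()

    def acquire(self, node, resource):
        with self._guard:
            lock = self._locks.setdefault(resource, threading.Lock())
        lock.acquire()                   # blocks until the current holder releases
        print(f"{node} holds {resource}")
        return lock

def node_job(name, mgr):
    lock = mgr.acquire(name, "gfs2:journal0")
    # ... the node may now touch the shared resource ...
    lock.release()
    print(f"{name} released the lock")

mgr = ToyLockManager()
workers = [threading.Thread(target=node_job, args=(n, mgr)) for n in ("node1", "node2")]
for w in workers: w.start()
for w in workers: w.join()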

Totem protocol, CPG & virtual synchrony

The totem protocol is used to send token messages between cluster nodes. A token is passed around to each node, the node does some work and then passes the token on to the next node; this goes around and around all the time. Should a node not pass its token on within a short timeout period (defaults to 238 ms), the token is declared lost, an error count goes up (defaults to 4 losses) and a new token is sent. If too many tokens are lost in a row, the node is declared lost. The cluster then checks which members it still has and whether that provides enough votes for quorum.
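
The timeout/counter behaviour can be pictured with a small Python simulation (the 238 ms and 4-loss figures only echo the defaults mentioned above; everything else here is made up):

TOKEN_TIMEOUT_MS = 238      # how long to wait before declaring the token lost
MAX_TOKEN_LOSSES = 4        # consecutive losses before a node is declared lost

def ring_round(nodes, pass_time_ms, loss_counts):
    """One trip of the token around the ring; returns nodes declared lost."""
    declared_lost = []
    for n in nodes:
        if pass_time_ms[n] > TOKEN_TIMEOUT_MS:
            loss_counts[n] = loss_counts.get(n, 0) + 1   # token lost, send a new one
        else:
            loss_counts[n] = 0                           # token passed on in time, reset
        if loss_counts[n] >= MAX_TOKEN_LOSSES:
            declared_lost.append(n)                      # membership (and quorum) recheck follows
    return declared_lost

loss_counts = {}
for _ in range(4):   # node3 keeps timing out, so after 4 rounds it is declared lost
    lost = ring_round(["node1", "node2", "node3"],
                      {"node1": 5, "node2": 7, "node3": 500}, loss_counts)
print("declared lost:", lost)    # -> declared lost: ['node3']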

The closed process group (CPG) is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages among nodes in a consistent order. It adds PIDs and group names to the membership layer. Only members of the group get the messages, thus it is a "closed" group. So in other words - CPG is simply a private group of processes in a cluster.
The ordered delivery of messages among cluster nodes is referred to as "virtual synchrony". 

Virtual synchrony (DLM & CPG cooperation)

DLM messages are delivered in order because they use totem's CPG. When a node wants to start a clustered service (a cluster-aware file system), it can start that service only after obtaining a lock from DLM. After starting the clustered service, the node announces it to the other nodes - the members of the CPG. So whenever DLM is used (starting a clustered service, requesting a storage lock, etc.), each member node notifies the other CPG members.
Messages can only be sent to the members of the CPG while the node holds a totem token from corosync.
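
A toy model tying these pieces together (nothing here is the real corosync CPG API; the group name and messages are made up, it only shows "closed membership", "same order everywhere" and "send only while holding the token"):

class ToyCPG:
    """Toy closed process group: every joined member sees every message in the same order."""
    def __init__(self, name):
        self.name = name
        self.members = {}                    # (node, pid) -> list of delivered messages

    def join(self, node, pid):
        self.members[(node, pid)] = []

    def send(self, sender, msg, has_token):
        if not has_token:
            raise RuntimeError("can only send while holding the totem token")
        if sender not in self.members:
            raise RuntimeError("only members of the closed group may send")
        for inbox in self.members.values():  # delivered to group members only...
            inbox.append((sender, msg))      # ...appended in one total order (virtual synchrony)

grp = ToyCPG("dlm_lockspace")
grp.join("node1", 1234)
grp.join("node2", 5678)
grp.send(("node1", 1234), "lock gfs2:journal0 acquired", has_token=True)
grp.send(("node2", 5678), "lock gfs2:journal0 released", has_token=True)
print(grp.members[("node1", 1234)] == grp.members[("node2", 5678)])   # True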

This tutorial was used to understand and set up clustering: AN!Cluster

