Friday, March 9, 2018

Cluster 16. DRBD Setup.


Installation

On both nodes:
LINBIT provides a yum repo only with paid support, so we'll use ELRepo (Enterprise Linux Repository) to install DRBD:
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org # imports the public key
yum install drbd84-utils.x86_64 kmod-drbd84.x86_64 -y # DRBD 9.0 is also available in ELRepo but, as the official docs (https://docs.linbit.com/docs/users-guide-9.0/) state, running the current DRBD-9.0 version in dual-primary mode is not recommended (because of lack of testing); only in DRBD-9.1 will it be possible to have more than two primaries at the same time.
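
A quick sanity check that the packages installed cleanly and that the kernel module loads (output will vary with your kernel version):
rpm -q drbd84-utils kmod-drbd84 # both packages should be reported as installed
modprobe drbd # load the DRBD kernel module
lsmod | grep drbd # confirm the module is loaded
cat /proc/drbd # prints the DRBD version once the module is loaded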

systemctl disable drbd.service # Pacemaker will manage DRBD itself (see below), so the daemon must not start on boot
systemctl status drbd.service # verify that it is disabled

DRBD will not be able to run under the default SELinux security policies. If you are familiar with
SELinux, you can modify the policies in a more fine-grained manner, but here we will simply exempt
DRBD processes from SELinux control (must be done on both nodes):
semanage permissive -a drbd_t
reboot

Also, you can run (on both nodes):
sealert -a /var/log/audit/audit.log
reboot
Then perform the suggested actions to solve any reported problems.
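
To verify that the drbd_t domain is now permissive (a quick check; semanage comes from the policycoreutils-python package):
semanage permissive -l | grep drbd_t # drbd_t should be listed among the permissive types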

Note: This tutorial will create two DRBD resources. Each resource uses a different TCP port. By convention, they start at port 7788 and increment by one per resource. So we will be opening ports 7788 and 7789 on each node:

node1:
firewall-cmd --permanent --add-rich-rule='
    rule family="ipv4" 
    source address="10.10.52.2/32" 
    port protocol="tcp" 
    port="7788-7789" accept'
firewall-cmd --reload
firewall-cmd --list-all

node2:
firewall-cmd --permanent --add-rich-rule='
    rule family="ipv4" 
    source address="10.10.52.1/32" 
    port protocol="tcp" 
    port="7788-7789" accept'
firewall-cmd --reload
firewall-cmd --list-all
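
Once the resources are up (after the 'drbdadm up' step below), you can verify that the ports are open and reachable. A sketch, assuming bash's built-in /dev/tcp support, run from node1 against node2:
ss -tlnp | grep 778 # on each node: DRBD should be listening on its SN IP
timeout 2 bash -c '</dev/tcp/10.10.52.2/7788' && echo "7788 reachable"
timeout 2 bash -c '</dev/tcp/10.10.52.2/7789' && echo "7789 reachable"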

Setup

Backup existing configs:
rsync -av /etc/drbd.d /root/backups/

cat /etc/drbd.conf 
# You can find an example in  /usr/share/doc/drbd.../drbd.conf.example
include "drbd.d/global_common.conf";
include "drbd.d/*.res";

So we need to set up global_common.conf plus a *.res file for each of our resources.

Setup common DRBD options


vi /etc/drbd.d/global_common.conf # we will describe only the changed options
Verify the options below (a consolidated example file follows this list):

  • in the global section:
    • usage-count no; # This option controls whether DRBD reports this installation to LINBIT for statistical purposes. If you have privacy concerns, set it to 'no'.
  • in the handlers section:
    • fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; # sets a constraint (something like drbd-fence-by-handler-r0-ms_drbd_r0); the after-resync-target handler on the peer should later remove that constraint. Thus, if the DRBD replication link becomes disconnected, the crm-fence-peer.sh script contacts the cluster manager, determines the Pacemaker Master/Slave resource associated with this DRBD resource, and ensures that the Master/Slave resource no longer gets promoted on any node other than the currently active one. Conversely, when the connection is re-established and DRBD completes its synchronization process, the constraint is removed and the cluster manager is free to promote the resource on any node again. In a dual-primary setup, if it was a replication link failure only (on a node failure, Pacemaker calls the fence agent against the failed node) and cluster communication is still up, both nodes will call this handler, but only one will succeed in setting the constraint. The other will remain IO-blocked and can optionally "commit suicide" from inside the handler. But just because you were able to shoot the other node does not make your data any better.
    • after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh"; # removes constraint after sync
    • split-brain "/usr/lib/drbd/notify-split-brain.sh root"; # simply sends a message about the SB occurrence to the specified e-mail address
  • in the startup section:
    • become-primary-on both; # This tells DRBD to promote both nodes to Primary on start.
    • wait-for-connection timeouts (wfc) - at startup DRBD waits for the device to communicate with its partner and gives up after the configured number of seconds. If the peer was working before this node was rebooted, wfc-timeout is used. If the peer was already down before this node was rebooted, degr-wfc-timeout is used. If the peer was successfully outdated before this node was rebooted, outdated-wfc-timeout is used. The default value for all these timeouts is 0, which means wait forever. The unit is seconds. If the connection status drops to StandAlone because the peer appeared but the devices are in a split-brain situation, the default is to terminate:
      • wfc-timeout 300;  # This tells DRBD to wait five minutes for the other node to connect. This should be longer than it takes for corosync to timeout and fence the other node *plus* the amount of time it takes the other node to reboot. If you set this too short, you could corrupt your data. If you want to be extra safe, do not use this at all and DRBD will wait for the other node forever. 
      • degr-wfc-timeout 120; # This tells DRBD to wait two minutes for the other node if that node was degraded the last time it was seen by this node. This is a way to speed up the boot process when the other node is out of commission for an extended duration.
      • outdated-wfc-timeout 120; # Same as above, except this timeout is used if the peer was 'Outdated'.
  • in the disk section:
    • on-io-error detach; # on a lower-level I/O error, detach the backing device and continue in diskless mode, serving reads and writes from the peer over the network
    • fencing resource-and-stonith; # This tells DRBD to block IO and fence the remote node (using the 'fence-peer' helper) when the connection with the other node is unexpectedly lost. This is what helps prevent a split-brain condition, and it is incredibly important in dual-primary setups!
    • resync-rate 30M; # A running resync process should use about 30 MByte/second of IO bandwidth. This tells DRBD how fast to synchronize out-of-sync blocks. The higher this number, the faster an Inconsistent resource gets back to the UpToDate state, but the more normal application use of the DRBD resource will suffer.
  • in the net section:
    • protocol C; # tells DRBD not to tell the operating system that a write is complete until the data has reached persistent storage on both nodes. This is the slowest option, but it is also the only one that guarantees consistency between the nodes. It is also required for dual-primary, which we will be using.
    • allow-two-primaries; # This tells DRBD to allow two nodes to be Primary at the same time. It is needed when 'become-primary-on both' is set. You should only use this option with a shared-storage file system on top of DRBD; at the time of writing the only ones are OCFS2 and GFS. If you use this option with any other file system, you are going to crash your nodes and corrupt your data! (We are going to use GFS2 and CLVM, both with DLM.) This is needed to enable live migration of our servers. In our case we'll be running dual-primary, so we cannot safely recover automatically; the only safe option is for the nodes to disconnect from one another and let a human decide which node to invalidate. You can learn more about these options by reading the drbd.conf man page. NOTE! It is not possible to safely recover from a split brain where both nodes were Primary; this case requires human intervention, so 'disconnect' is the only safe policy. It doesn't matter what mode you are in now - it matters what happened during the time the nodes were split-brained (the time spent StandAlone/UpToDate). If both nodes were Secondary during the split brain, the 0pri policy is used. If one node was Primary and the other remained Secondary, the 1pri policy is used. If both nodes were Primary, even for a short time, the 2pri policy is used:
      • after-sb-0pri discard-zero-changes; # "after-sb-0pri" - Split brain has just been detected, but at this time the resource is not in the Primary role on any host - neither node is Primary. "discard-zero-changes" - If there is any host on which no changes occurred at all, simply apply all modifications (sync) made on the other and continue. In case none wrote anything this policy uses a random decision to perform a "resync" of 0 blocks. In case both have written something this policy disconnects the nodes.
      • after-sb-1pri discard-secondary; # "after-sb-1pri" - Split brain has just been detected, and at this time the resource is in the Primary role on one host. "discard-secondary" - discard the changes made on the Secondary and sync it from the Primary.
      • after-sb-2pri disconnect; # "after-sb-2pri" - This tells DRBD what to do in the case of a split-brain when both nodes are primary. "disconnect" - no automatic re-synchronization, simply disconnect.
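
Putting it all together, the changed parts of /etc/drbd.d/global_common.conf will look roughly like this (a sketch showing only the options discussed above; the stock file contains many more commented-out options):

global {
        usage-count no;
}

common {
        handlers {
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
                split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        }
        startup {
                become-primary-on both;
                wfc-timeout 300;
                degr-wfc-timeout 120;
                outdated-wfc-timeout 120;
        }
        disk {
                on-io-error detach;
                fencing resource-and-stonith;
                resync-rate 30M;
        }
        net {
                protocol C;
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }
}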
After setting up the file on one node (agrp-c01n01), copy it to the other node:
rsync -av /etc/drbd.d/global_common.conf root@agrp-c01n02:/etc/drbd.d/global_common.conf

Setup DRBD resource options

We are going to have two resources - r0 and r1 - so we must set up two files:

  1. resource r0, which will create the device /dev/drbd0, will be backed by each node's /dev/sda3 partition. It will provide disk space for VMs that will normally run on agrp-c01n01 and space for the /shared GFS2 partition (discussed further on).
  2. resource r1, which will create the device /dev/drbd1, will be backed by each node's /dev/sda4 partition. It will provide disk space for VMs that will normally run on agrp-c01n02 and space for the /shared GFS2 partition (discussed further on).
vi /etc/drbd.d/r0.res

# This is the resource used for the shared GFS2 partition and host VMs designed 
# to run on agrp-c01n01. 
resource r0 {
            # This is the block device path. 
            device /dev/drbd0; 
            # We'll use the normal internal meta-disk. This is where DRBD stores 
            # its state information about the resource. It takes about 32 MB per 
            # 1 TB of raw space. 
            meta-disk internal; 
            # This is the `uname -n` of the first node 
            on agrp-c01n01 { 
                           # The 'address' has to be the IP, not a host name. This is the 
                           # node's SN (sn_bond1) IP. The port number must be unique among 
                           # resources. 
                           address 10.10.52.1:7788; 
                           # This is the block device backing this resource on this node. 
                           disk /dev/sda3; 
            } 
           # Now the same information again for the second node. 
          on agrp-c01n02 { 
                           address 10.10.52.2:7788; 
                           disk /dev/sda3; 
          } 
}

Now copy this to r1.res and edit it for the agrp-c01n02 VM resource. The main differences are the resource name, r1, the block device, /dev/drbd1, the port, 7789, and the backing block devices, /dev/sda4:

vi /etc/drbd.d/r1.res

# This is the resource used for the shared GFS2 partition and host VMs designed 
# to run on agrp-c01n02. 
resource r1 {
            # This is the block device path. 
            device /dev/drbd1; 
            # We'll use the normal internal meta-disk. This is where DRBD stores 
            # its state information about the resource. It takes about 32 MB per 
            # 1 TB of raw space. 
            meta-disk internal; 
            # This is the `uname -n` of the first node 
            on agrp-c01n01 { 
                           # The 'address' has to be the IP, not a host name. This is the 
                           # node's SN (sn_bond1) IP. The port number must be unique among 
                           # resources. 
                           address 10.10.52.1:7789; 
                           # This is the block device backing this resource on this node. 
                           disk /dev/sda4; 
            } 
           # Now the same information again for the second node. 
          on agrp-c01n02 { 
                           address 10.10.52.2:7789; 
                           disk /dev/sda4; 
          } 
}

Now we will do an initial validation of the configuration - if some options are wrong, a descriptive warning message will appear. This is done by running the following command:
drbdadm dump

Now do the same process for node 2 or just use rsync:
rsync -av /etc/drbd.d root@agrp-c01n02:/etc/
After setting node 2 up, verify with drbdadm dump.
To see which options are default options:
drbdsetup /dev/drbd0 show --show-defaults
drbdsetup /dev/drbd1 show --show-defaults
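
For example, to check just the settings that matter for our dual-primary setup (a grep sketch over the output):
drbdsetup /dev/drbd0 show --show-defaults | grep -E 'allow-two-primaries|protocol|fencing'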

Create DRBD resources

Create DRBD resources (on both nodes):
drbdadm create-md r{0,1} # answer 'yes' 'yes' => "New drbd meta data block successfully created." This step must be completed only on initial device creation; it initializes DRBD's metadata. If create-md returns an "Operation refused" error, zero out the start of the affected backing disk (/dev/sda3 or /dev/sda4 in our case):
dd if=/dev/zero of=/dev/sda3 bs=1M count=128
drbdadm up r{0,1} # This step associates the resource with its backing device (or devices, in case of a multi-volume resource), sets replication parameters, and connects the resource to its peer.

DRBD’s virtual status file in the /proc filesystem, /proc/drbd, should now contain information similar to the following (The Inconsistent/Inconsistent disk state is expected at this point):
version: 8.4.10-1 (api:1/proto:86-101)
GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:488266148
 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:488192424

Description of the lines with a resource number (lines starting with "0:" or "1:"):
  1. cs - connection status (i.e. Connected, Waiting, see http://docs.linbit.com/docs/users-guide-8.4/#s-connection-states)
  2. ro - roles (i.e. Primary, Secondary, see http://docs.linbit.com/docs/users-guide-8.4/#s-roles)
  3. ds - disk status (i.e. Inconsistent, UpToDate, see http://docs.linbit.com/docs/users-guide-8.4/#s-disk-states)
  4. replication protocol mode (A, B or C, see http://docs.linbit.com/docs/users-guide-8.4/#s-replication-protocols)
  5. six flags reflecting the I/O status of this resource (i.e. r - running (the normal state), see http://docs.linbit.com/docs/users-guide-8.4/#s-io-flags). Normally the first flag must be "r" and the others "-", so the flags should look like: r-----
  6. the next line after described line is the line of "Performance indicators" (see http://docs.linbit.com/docs/users-guide-8.4/#s-performance-indicators)
  7.  oos (out of sync) -  amount of storage currently out of sync; in Kibibytes.

The good thing about DRBD is that we do not have to wait for the resources to be synchronized. As long as one side of a resource is UpToDate, both nodes will work: if the Inconsistent node needs to read data, it will simply read it from its peer. But the cluster cannot be considered redundant until both nodes are UpToDate. So, to make the disks consistent, issue the following on only one node (it's the first time we connect our disks, so we really don't need to synchronize the disks' data):
drbdadm new-current-uuid --clear-bitmap r{0,1} # You must not use this on pre-existing data! Even though it may appear to work at first glance, once you switch to the other node your data is toast, as it never got replicated. So do not leave out the mkfs (or equivalent) - you must still create a file system on top of the DRBD device.
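
For illustration, creating that file system later could look like this (a hypothetical sketch; the real GFS2 setup, including the cluster name "agrp-c01" and lock table "shared" assumed here, is covered in a later part):
mkfs.gfs2 -p lock_dlm -j 2 -t agrp-c01:shared /dev/drbd0 # -p: cluster lock protocol, -j: one journal per node, -t: clustername:fsname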

drbdadm primary --force r{0,1} # promote this node so the synchronization status can be checked

Verify (local node is primary because of the drbdadm primary --force r{0,1} command):
cat /proc/drbd

version: 8.4.10-1 (api:1/proto:86-101)
GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:2120 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:2120 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

As we can see, both nodes' data is UpToDate. Also we can use drbd-overview:
NOTE: drbd-overview will be deprecated soon.
Please consider using drbdtop.

 0:r0/0  Connected Primary/Secondary UpToDate/UpToDate 
 1:r1/0  Connected Primary/Secondary UpToDate/UpToDate 

Or we can use drbdadm status r{0,1}:
r0 role:Primary
  disk:UpToDate
  peer role:Secondary
    replication:Established peer-disk:UpToDate

r1 role:Primary
  disk:UpToDate
  peer role:Secondary
    replication:Established peer-disk:UpToDate

In order for a DRBD resource to be usable, it has to be "promoted". By default, DRBD resources start in the Secondary state. This means that the node will receive changes from the peer, but no changes can be made locally. You can't even look at the contents of a Secondary resource. Why this is so requires more time to discuss than we can go into here.
So the next step is to promote both resources on both nodes:
drbdadm primary r{0,1}
Verify:
cat /proc/drbd # if both nodes are Primary, we're done setting up DRBD.

Pacemaker DRBD resource setup 

Find appropriate type of resource:
pcs resource list | grep -i drbd
ocf:linbit:drbd - Manages a DRBD device as a Master/Slave resource

To view all available non-advanced options:
pcs resource describe drbd

So we need to create a Master/Slave resource of type ocf:linbit:drbd:

r0:
pcs resource create drbd_r0 ocf:linbit:drbd drbd_resource=r0 op monitor interval=60s op monitor interval="29s" role="Master" op monitor interval="31s" role="Slave"
pcs resource master ms_drbd_r0 drbd_r0 master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interleave=true
pcs resource cleanup drbd_r0
journalctl | grep -i error

Give it a couple of minutes to promote the resource to Master on both nodes. Initially, it will appear as Master on one node only.
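
To watch the promotion happen (a convenience sketch; once both nodes are promoted, the Masters line should list them both):
watch -n2 'pcs status resources' # wait for: Masters: [ agrp-c01n01 agrp-c01n02 ]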

r1:
pcs resource create drbd_r1 ocf:linbit:drbd drbd_resource=r1 op monitor interval=60s op monitor interval="29s" role="Master" op monitor interval="31s" role="Slave"
pcs resource master ms_drbd_r1 drbd_r1 master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
pcs resource cleanup drbd_r1
journalctl | grep -i error

Give it a couple of minutes to promote the resource to Master on both nodes. Initially, it will appear as Master on one node only.

Options description (master-* options are unique to multi-state resources, clone-* options are derived from clone resources; a verification sketch follows this list):
  • clone-max=2 - how many copies of the resource to start
  • clone-node-max=1 - how many copies of the resource can be started on a single node
  • master-max=2 - how many copies of the resource can be promoted to the master role
  • master-node-max=1 - how many copies of the resource can be promoted to the master role on a single node
  • notify=true - When stopping or starting a copy of the clone, tell all the other copies beforehand and again when the action was successful. 
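
To review the meta attributes actually stored for a Master/Slave resource (a sketch using the pcs 0.9 syntax used throughout this tutorial):
pcs resource show ms_drbd_r0 # prints the Master resource together with its meta attributes (master-max, notify, ...)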


We didn't enable the drbd daemon at all, so Pacemaker will start DRBD when the cluster comes up and stop it when the cluster stops.
To delete the resources:
pcs resource delete drbd_r1 # ms resource will be deleted automatically
pcs resource delete drbd_r0 # ms resource will be deleted automatically
pcs resource cleanup
pcs resource


Split Brain (SB) & Recovery From SB

Normal operation status:
0:r0/0  Connected Primary/Primary UpToDate/UpToDate
or
0:r0/0  Connected Primary/Secondary UpToDate/UpToDate

Master node WFConnection - the primary node can't connect to the secondary node. To resolve, see "Manually connecting slave to the master":
0:r0/0  WFConnection Primary/Unknown UpToDate/Unknown

Note: If the master node reports WFConnection while the slave node reports StandAlone, it indicates a DRBD split brain. See "Recovery from SB"

Slave node StandAlone - the secondary can't connect to the primary. To resolve, see "Manually connecting slave to the master":
0:r0/0  StandAlone Secondary/Unknown UpToDate/Unknown

Note: If the master node reports WFConnection while the slave node reports StandAlone, it indicates a DRBD split brain. See "Recovery from SB"

Both nodes Secondary/Secondary - the nodes are connected, but neither is primary. Usually this is due to a Pacemaker failure. To resolve, restart the entire cluster if DRBD itself is functioning properly (this judgment is up to you):
0:r0/0  Connected Secondary/Secondary UpToDate/UpToDate
0:r0/0  Connected Secondary/Secondary UpToDate/UpToDate

Both nodes StandAlone and Primary - SB has occurred and both nodes are operating independently. To resolve, see "Recovery from SB":
0:r0/0  StandAlone Primary/Unknown UpToDate/Unknown
0:r0/0  StandAlone Primary/Unknown UpToDate/Unknown

"Split-Brain detected, dropping connection!" - message appeared in logs. To resolve see "Recovery from SB".

Manually connecting slave to the master

First identify which node is the master (in such a situation Pacemaker and DRBD can see the master node differently; we need DRBD's point of view).
On the slave node:
drbdadm connect r0 # after that, reconnection must take place; verify with drbd-overview

Recovery from SB

When SB occurs, the nodes operate completely independently, because there is no connection and data replication is not taking place. After split brain has been detected, one node will always have the resource in a StandAlone connection state. The other might either also be in the StandAlone state (if both nodes detected the split brain simultaneously), or in WFConnection (if the peer tore down the connection before the other node had a chance to detect the split brain).
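
To see each node's own view of the situation, query DRBD directly on both nodes (standard drbdadm query commands):
drbdadm cstate r0 # connection state: StandAlone / WFConnection / Connected
drbdadm role r0 # roles as local/peer, e.g. Primary/Unknown
drbdadm dstate r0 # disk states, e.g. UpToDate/DUnknown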
To decide which node must be considered the master (the SB survivor), you need to verify the actual data on each node.
Once you have identified the master, run the following on the slave node (the SB victim):
drbdadm secondary r0
drbdadm connect --discard-my-data r0 # Discarding the data on the slave node does not result in a full re-synchronization from master to slave: the slave node has its local modifications rolled back, and modifications made on the master are propagated to the slave. If a "Failure: (102) Local address (port) already in use." message is shown, issue on the slave node: drbdadm disconnect r0
Verify with drbd-overview - now both nodes must show "Connected", proper role and "UpToDate/UpToDate".
On the master node (the SB survivor) - only if it is also StandAlone (if it's in WFConnection, this step is not needed):
drbdadm connect r0 # after that, reconnection must take place; verify with drbd-overview

After re-synchronization has completed, the split brain is considered resolved and the two nodes form a fully consistent, redundant replicated storage system again.

These tutorials were used to understand and set up clustering: 
AN!Cluster
LINBIT User's Guide 8.4.x
clusterlabs.org
avid.force.com
