Friday, March 30, 2018

Asterisk retransmission timeout

Sometimes the Asterisk log shows this message:
WARNING[9380] chan_sip.c: Retransmission timeout reached on transmission 1a1df1ca5a8cc898668aaabf73852da2@10.10.10.10:5060 for seqno 102 (Critical Request) -- See https://wiki.asterisk.org/wiki/display/AST/SIP+Retransmissions

This can happen for several reasons, and usually the easiest way to get rid of it is to run:
core restart now from the Asterisk console.

While this solves the problem, you may have many active calls you don't want to lose. In such a situation you can run:
core restart gracefully
Now new calls will not come into the system, and the restart will be performed once all existing (old) calls have ended.

But I found a workaround: issue core restart gracefully when the retransmission error appears and execute core abort shutdown when the error disappears. This approach keeps the system accepting phone calls and also eliminates the problems caused by the retransmission error (bad voice, lost calls, simultaneous calls to the same operator, etc.).

Save the script below into a file (e.g. /root/retransmission_timeout_resolving.py), make it executable, and set up cron to execute it every minute ( * * * * * /root/retransmission_timeout_resolving.py):

#!/usr/bin/python

# Check the last lines of the Asterisk log for SIP retransmission warnings.
# If one is found, schedule a graceful restart; if not, cancel any pending one.

from subprocess import PIPE, Popen
import time

LOG_FILE = '/var/log/asterisk/RETRANSMISSION_ERR.log'

tail = []
try:
    proc1 = Popen(["/usr/bin/tail", "-n", "10", "/var/log/asterisk/messages"], stdout=PIPE)
    tail = proc1.communicate()[0].split("\n")
except Exception:
    # Could not read the Asterisk log; record the failure and leave tail empty.
    with open(LOG_FILE, 'a') as f:
        f.write("77 ==> " + str(time.ctime()) + "\n")

# ERR = 1 if any of the last 10 log lines contains a retransmission warning.
ERR = 0
for x in tail:
    if 'SIP+Retransmissions' in x:
        ERR = 1
        break

if ERR == 1:
    # Error present: stop accepting new calls until it clears.
    Popen(["/usr/sbin/asterisk", "-rx", "core restart gracefully"])
else:
    # Error gone (or never present): cancel any pending graceful restart.
    Popen(["/usr/sbin/asterisk", "-rx", "core abort shutdown"])

with open(LOG_FILE, 'a') as f:
    f.write(str(ERR) + "  ==> " + str(time.ctime()) + "\n")
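To sanity-check the script before relying on cron, you can make it executable and run it by hand, then watch its log (paths as assumed above):
chmod +x /root/retransmission_timeout_resolving.py
/root/retransmission_timeout_resolving.py
tail -n 5 /var/log/asterisk/RETRANSMISSION_ERR.log # each run should append a "0 ==>" or "1 ==>" line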

Thursday, March 29, 2018

Cluster 19. Fencing level 2 with SNMP.

Please reread these blog posts => Cluster 7 / Cluster 11 / Cluster 12

We have level-1 fencing (iLO/IPMI fencing) set up. But what is the problem with this? The problem arises when a node vanishes and fencing fails (if a node loses power, IPMI fencing for that node will fail too). Then, not knowing what the other node might be doing, the only safe option is to block, otherwise you risk a split-brain. To verify that:
  1. pcs cluster start --all
  2. then pull power cable out of one of the nodes
  3. dlm_tool ls # you'll see that:
    1. new change    member 1 joined 0 remove 1 failed 1 seq 2,2
    2. new status    wait fencing

This means that dlm will wait until the powered-off node's status is cleared. You'll see the same messages for both the "shared" and "clvmd" lock-spaces, which means that the GFS2 file-system and CLVMD will hang (pcs status will still show these resources as "Started"). This is because GFS2 file systems freeze to ensure data integrity in the event of a failed fence.
This is why multiple-level fencing is so important.

Overview SNMP fencing

The logic behind this mechanism is very simple: once a node has been marked as dead, the agent uses the SNMP SET method to tell the managed switch to shut the ports down. For SNMP fencing we'll use fence_ifmib (see: pcs stonith list fence_ifmib and pcs stonith describe fence_ifmib). Only two OIDs are needed by the agent: ifDescr and ifAdminStatus. The first is used to match the interface name used on the Cisco device (fence_ifmib can be used with any vendor, as long as the device supports SNMP) with the one provided in the cluster configuration; the latter is used to get/set the port status.
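To illustrate the mechanism with plain net-snmp tools (an informal sketch, not literally what the agent executes; the community string, switch IP and ifIndex 5003 are the values configured later in this post, and the SNMP view must include these OIDs):
snmpget -v 2c -c agrp-c01-community 10.10.53.12 IF-MIB::ifDescr.5003 # should return the interface name, e.g. Port-channel3
snmpget -v 2c -c agrp-c01-community 10.10.53.12 IF-MIB::ifAdminStatus.5003 # 1 = up, 2 = down
snmpset -v 2c -c agrp-c01-community 10.10.53.12 IF-MIB::ifAdminStatus.5003 i 2 # administratively shut the port down (this is effectively what the fencing agent does)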

Setup Cisco stack switches to support SNMP

As we set up in Cluster 3, the switch IP for the 1st cluster is 10.10.53.12 and the 1st cluster uses ports gi1/0/1-4,17 & gi2/0/1-4,17. So:
  1. agrp-c01n01 uses ports
    1. 1/0/1, 1/0/3, 2/0/2, 2/0/4
    2. (config)# int ra gi1/0/1, gi1/0/3, gi2/0/2, gi2/0/4
    3. (config-if-range) # description agrp-c01n01
    4. these are all members of the channel-group 2
    5. (config)#int Port-channel 2
    6. (config-if)#description agrp-c01n01
  2. agrp-c01n02 uses ports
    1. 2/0/1, 2/0/3, 1/0/2, 1/0/4
    2. (config)# int ra gi2/0/1, gi2/0/3, gi1/0/2, gi1/0/4
    3. (config-if-range) # description agrp-c01n02
    4. these are all members of the channel-group 3
    5. (config)#int Port-channel 3
    6. (config-if)#description agrp-c01n02
  3. save and verify with show interface status | incl agrp-c01
Setup ACL for SNMP view:
ip access-list standard agrp-c01-acl
  permit 10.10.53.1
  permit 10.10.53.2
  deny any

Setup SNMP view and community for that view:
Setup community (this enables SNMP agent):
snmp-server community agrp-c01-community RW agrp-c01-acl
Test with:
snmpwalk -v 2c -c agrp-c01-community 10.10.53.12 # a huge list of OIDs will be shown
Find the ifIndex of the needed interfaces (we'll use them to restrict the community's access to only the needed values):
You can search only for Port-channel interfaces if you use LACP:
agrp-c01n01: show snmp mib ifmib ifindex | incl Port-channel2
agrp-c01n02: show snmp mib ifmib ifindex | incl Port-channel3
Or search for all connected interfaces if you don't use LACP:
agrp-c01n01: show snmp mib ifmib ifindex | incl net(1/0/[13]|2/0/[24]):
agrp-c01n02: show snmp mib ifmib ifindex | incl net(2/0/[13]|1/0/[24]):

Setup SNMP view:
If you use LACP:
#agrp-c01n01
#For Port-channel2: Ifindex = 5002
snmp-server view agrp-c01-view ifDescr.5002 included 
snmp-server view agrp-c01-view ifAdminStatus.5002 included 
#agrp-c01n02
#For Port-channel3: Ifindex = 5003
snmp-server view agrp-c01-view ifDescr.5003 included 
snmp-server view agrp-c01-view ifAdminStatus.5003 included 

If you don't use LACP:
#agrp-c01n01
#For GigabitEthernet2/0/4: Ifindex = 10604
snmp-server view agrp-c01-view ifDescr.10604 included 
snmp-server view agrp-c01-view ifAdminStatus.10604 included 
#For GigabitEthernet2/0/2: Ifindex = 10602
snmp-server view agrp-c01-view ifDescr.10602 included 
snmp-server view agrp-c01-view ifAdminStatus.10602 included 
#For GigabitEthernet1/0/3: Ifindex = 10103
snmp-server view agrp-c01-view ifDescr.10103 included 
snmp-server view agrp-c01-view ifAdminStatus.10103 included 
#For GigabitEthernet1/0/1: Ifindex = 10101
snmp-server view agrp-c01-view ifDescr.10101 included 
snmp-server view agrp-c01-view ifAdminStatus.10101 included 
#agrp-c01n02
#For GigabitEthernet2/0/3: Ifindex = 10603
snmp-server view agrp-c01-view ifDescr.10603 included 
snmp-server view agrp-c01-view ifAdminStatus.10603 included 
#For GigabitEthernet2/0/1: Ifindex = 10601
snmp-server view agrp-c01-view ifDescr.10601 included 
snmp-server view agrp-c01-view ifAdminStatus.10601 included 
#For GigabitEthernet1/0/4: Ifindex = 10104
snmp-server view agrp-c01-view ifDescr.10104 included 
snmp-server view agrp-c01-view ifAdminStatus.10104 included 
#For GigabitEthernet1/0/2: Ifindex = 10102
snmp-server view agrp-c01-view ifDescr.10102 included 
snmp-server view agrp-c01-view ifAdminStatus.10102 included

Modify the community to include the view set up above:
snmp-server community agrp-c01-community view agrp-c01-view RW agrp-c01-acl
Verify:
sh run | incl  snmp
Test with (from any of the cluster nodes):
snmpwalk -v 2c -c agrp-c01-community 10.10.53.12 # only configured OIDs will be shown
Also test fence_ifmib itself:
fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action list # only names of the needed port must be shown

Setup Pacemaker Stonith SNMP

Here I'll be using the LACP Port-channel. If you don't use LACP, just create one fence_ifmib fence device per port, each with a different name, e.g. fence_ifmib_n01_1-0-1 for GigabitEthernet1/0/1, fence_ifmib_n02_2-0-1 for GigabitEthernet2/0/1, etc.
Create fence device fence_ifmib_n01 for agrp-c01n01:
pcs stonith create fence_ifmib_n01 fence_ifmib pcmk_host_list="agrp-c01n01" ipaddr="agrp-stack01" snmp_version="2c" community="agrp-c01-community" inet4_only="1" port="Port-channel2"  power_wait=4 delay=15 op monitor interval=60s
Create fence device fence_ifmib_n02 for agrp-c01n02:
pcs stonith create fence_ifmib_n02 fence_ifmib pcmk_host_list="agrp-c01n02" ipaddr="agrp-stack01" snmp_version="2c" community="agrp-c01-community" inet4_only="1" port="Port-channel3"  power_wait=4 op monitor interval=60s
Setup constraints (fence_ifmib_n01 will start on agrp-c01n02 & fence_ifmib_n02 will start on agrp-c01n01):
pcs constraint location add lc_fence_ifmib_n01 fence_ifmib_n01 agrp-c01n01 -INFINITY
pcs constraint location add lc_fence_ifmib_n02 fence_ifmib_n02 agrp-c01n02 -INFINITY
Adding stonith level 2:
pcs stonith level add 2 agrp-c01n01 fence_ifmib_n01
pcs stonith level add 2 agrp-c01n02 fence_ifmib_n02
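To double-check the configured levels (a quick sanity check; pcs stonith level with no arguments lists the levels, and recent pcs versions also provide a verify subcommand):
pcs stonith level # should show level 1 (IPMI) and level 2 (fence_ifmib) for each node
pcs stonith level verify # if available, checks that all nodes and stonith devices referenced by the levels exist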

Review Cluster 13 to understand options.

Verify that GFS2 and CLVMD now work even if one node loses power

To verify that:

  1. pcs cluster start --all
  2. then pull power cable out of one of the nodes
  3. dlm_tool ls # you'll see that:
    1. new change    member 1 joined 0 remove 1 failed 1 seq 2,2
    2. new status    wait fencing

This means that the dlm fence was successful and the node has been removed (marked down). You'll see the same messages for both the "shared" and "clvmd" lock-spaces. This means that the GFS2 file-system and CLVMD will keep working.

How to rejoin the node to the cluster after the above test (I powered off agrp-c01n02):

fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action status
fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action on
ping agrp-c01n02
After successful ping:
pcs cluster start agrp-c01n02

These tutorials were used to understand and set up clustering: 
AN!Cluster
unixarena
redhat.com
Pierky's Blog

Cluster 18. GFS2 (Global File System 2).

GFS2 overview

With DRBD providing the cluster's raw storage space, and Clustered LVM providing the logical partitions, we can now look at the clustered file system. This is the role of GFS2.

It works much like a standard filesystem, with user-land tools like mkfs.gfs2, fsck.gfs2 and so on. The major difference is that it and clvmd use the cluster's DLM. Once formatted, the GFS2-formatted partition can be mounted and used by any node in the cluster's closed process group (CPG). All nodes can then safely read from and write to the data on the partition simultaneously.

The Red Hat Global File System (GFS) is Red Hat’s implementation of a concurrent-access shared storage file system. As any such filesystem, GFS allows multiple nodes to access the same storage device, in read/write fashion, simultaneously without risking data corruption. It does so by using a Distributed Lock Manager (DLM) which manages concurrent access from cluster members.

By default, the value of no-quorum-policy is set to stop, indicating that once quorum is lost, all the resources on the remaining partition will immediately be stopped. Typically this default is the safest and most optimal option, but unlike most resources, GFS2 requires quorum to function. When quorum is lost both the applications using the GFS2 mounts and the GFS2 mount itself cannot be correctly stopped. Any attempts to stop these resources without quorum will fail which will ultimately result in the entire cluster being fenced every time quorum is lost.
To address this situation, you can set the no-quorum-policy=freeze when GFS2 is in use. This means that when quorum is lost, the remaining partition will do nothing until quorum is regained:
pcs property set no-quorum-policy=freeze

  1. no-quorum-policy=freeze:
    1. If quorum is lost, the cluster partition freezes. Resource management is continued: running resources are not stopped (but possibly restarted in response to monitor events), but no further resources are started within the affected partition. This setting is recommended for clusters where certain resources depend on communication with other nodes (for example, OCFS2 mounts). In this case, the default setting no-quorum-policy=stop is not useful, as it would lead to the following scenario: Stopping those resources would not be possible while the peer nodes are unreachable. Instead, an attempt to stop them would eventually time out and cause a stop failure, triggering escalated recovery and fencing.
  2. no-quorum-policy=stop (default):
    1. If quorum is lost, all resources in the affected cluster partition are stopped in an orderly fashion.
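A quick way to confirm the property took effect (pcs property list shows explicitly configured cluster properties):
pcs property list | grep no-quorum-policy # should print: no-quorum-policy: freeze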

GFS2 setup


From both nodes:
yum install gfs2-utils -y
From one node:
Format the /dev/agrp-c01n01_vg0/shared:
mkfs.gfs2 -j 2 -p lock_dlm -t agrp-c01:shared /dev/agrp-c01n01_vg0/shared # "say" yes to all questions (cluster must be started at the moment of the LV formatting)

The following switches are used with our mkfs.gfs2 call:
  • -j 2 # This tells GFS2 to create two journals. This must match the number of nodes that will try to mount this partition at any one time.
  • -p lock_dlm # This tells GFS2 to use DLM for its clustered locking.
  • -t agrp-c01:shared # This is the lock space name, which must be in the format <cluster_name>:<file-system_name>. The cluster_name must match the one in pcs config | grep Name. The <file-system_name> has to be unique in the cluster, which is easy for us because we'll only have the one gfs2 file system.
From both nodes:
tunegfs2 -l /dev/agrp-c01n01_vg0/shared # both nodes must see this LV; in older versions of gfs2 the equivalent command is: gfs2_tool sb /dev/an-a05n01_vg0/shared all

From one node:
Configure mount point for GFS2 resource (view pcs resource describe Filesystem output to understand all available options):
pcs resource create sharedfs ocf:heartbeat:Filesystem device="/dev/agrp-c01n01_vg0/shared" directory="/shared" fstype="gfs2" 
pcs resource clone sharedfs clone-max=2 clone-node-max=1 interleave=true ordered=true

From both nodes:
df -h /shared # both nodes must show the /shared directory as mounted, with 259M shown as used space. 259M is the space consumed by the journals (number of journals * journal size) on disk

From one node:
pcs constraint order start clvmd-clone then sharedfs-clone
pcs constraint colocation add sharedfs-clone with clvmd-clone

Test /shared from any of the nodes (we'll use agrp-c01n02):
cd /shared
touch test{1..10}
ssh agrp-c01n01 ls -lh /shared

These tutorials were used to understand and set up clustering: 
AN!Cluster
unixarena
clusterlabs.org
redhat.com


Tuesday, March 27, 2018

Cluster 17. DLM & CLVM (Distributed Lock Manager Clustered Logical Volume Management). 


Clustered LVM


With DRBD providing the raw storage for the cluster, we must next consider partitions. This is where Clustered LVM, known as CLVM, comes into play.
CLVM is ideal in that, by using DLM (the distributed lock manager), it won't allow access to cluster members outside of corosync's closed process group, which, in turn, requires quorum.
It is ideal because it can take one or more raw devices, known as "physical volumes", or simply PVs, and combine their raw space into one or more "volume groups", known as VGs. These volume groups then act just like a typical hard drive and can be "partitioned" into one or more "logical volumes", known as LVs. These LVs are where KVM's virtual machine guests will exist and where we will create our GFS2 clustered file system (KVM and GFS2 will be set up in further posts).
LVM is particularly attractive because of how flexible it is. We can easily add new physical volumes later, and then grow an existing volume group to use the new space. This new space can then be given to existing logical volumes, or entirely new logical volumes can be created. This can all be done while the cluster is online, offering an upgrade path with no downtime.


Installation and initial setup

On both nodes (you can use the ssh agrp-c01n01 command to execute the same commands on the remote node):
yum install dlm lvm2-cluster -y
rsync -av /etc/lvm /root/backups/

Before creation of the clustered LVM, we need to first make some changes to the LVM configuration (vi /etc/lvm/lvm.conf):
  1. We need to filter out the DRBD backing devices so that LVM doesn't see the same signature a second time on the DRBD resource's backing device. Or in other words - limit the block devices that are used by LVM commands:
    1. filter = [ "a|/dev/drbd|", "a|/dev/sdb|", "r|.*|" ]
    2. pvs # should only show drbd and sdb devices
  2. Switch from local locking to clustered locking:
    1. lvmconf --enable-cluster # Set locking_type to the default clustered type on this system
    2. verify: cat /etc/lvm/lvm.conf |grep locking_type |grep -v "#" # must be locking_type = 3 (clustered locking using DLM)
    3. Other than this setup, creating LVM logical volumes in a clustered environment is identical to creating LVM logical volumes on a single node. There is no difference in the LVM commands themselves, or in the LVM GUI interface.
  3. Do this setting only if your OS itself doesn't use LVM. Don't use locking_type 1 (local) if locking_type 2 or 3 fail. (If an attempt to initialise type 2 or type 3 locking fails, perhaps because cluster components such as clvmd are not running, and this is enabled (set to 1), an attempt will be made to use local file-based locking (type 1). If this succeeds, only commands against local VGs will proceed. VGs marked as clustered will be ignored.)
    1. fallback_to_local_locking = 0
    2. verify: cat /etc/lvm/lvm.conf |grep fallback_to_local_locking |grep -v "#"
  4. Disable the writing of LVM cache and remove any existing cache:
    1. write_cache_state = 0 # default is "1"
    2. rm /etc/lvm/cache/*
  5. With releases of lvm2 that provide support for lvm2-lvmetad, clusters sharing access to LVM volumes must have lvm2-lvmetad disabled in the configuration and as a service to prevent problems resulting from inconsistent metadata caching throughout the cluster:
    1. use_lvmetad = 0
    2. verify: cat /etc/lvm/lvm.conf |grep use_lvmetad |grep -v "#"
    3. systemctl disable lvm2-lvmetad.service
    4. systemctl disable lvm2-lvmetad.socket
    5. systemctl stop lvm2-lvmetad.service
    6. systemctl status lvm2-lvmetad
    7. Remove lvmetad socket file (if exists): rm '/etc/systemd/system/sockets.target.wants/lvm2-lvmetad.socket'

Setup DLM and CLVM

Create the DLM and CLVMD clone cluster resources (the clone option allows a resource to run on both nodes; in other words, a clone is Active/Active mode):
pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence
pcs resource clone dlm clone-max=2 clone-node-max=1 interleave=true ordered=true
pcs resource create clvmd ocf:heartbeat:clvm op monitor interval=30s on-fail=fence
pcs resource clone clvmd clone-max=2 clone-node-max=1 interleave=true ordered=true
Verify:
pcs status

Almost every decision in a Pacemaker cluster, like choosing where a resource should run, is done by comparing scores. Scores are calculated per resource, and the cluster resource manager chooses the node with the highest score for a particular resource. (If a node has a negative score for a resource, the resource cannot run on that node.)

We can manipulate the decisions of the cluster with constraints. Constraints have a score. If a constraint has a score lower than INFINITY, it is only a recommendation. A score of INFINITY means it is a must. Default score is INFINITY. To view current constraints: pcs constraint
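To see the actual scores the cluster has calculated for each resource/node pair, crm_simulate (shipped with Pacemaker) is handy:
crm_simulate -sL # -s shows allocation scores, -L uses the live cluster state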

Each of our devices must be promoted before starting DLM on that node. Configure the resource order (we want drbd to start and promote to Master first and then we want dlm to start and if successful - clvmd will be started):
pcs constraint order promote ms_drbd_r0 then promote ms_drbd_r1 kind=Mandatory
pcs constraint order promote ms_drbd_r1 then start dlm-clone kind=Mandatory
pcs constraint order start dlm-clone then clvmd-clone kind=Mandatory 

Ordering constraints affect only the ordering of resources; they do not require that the resources be placed on the same node. If you want resources to be started on the same node and in a specific order, you need both an ordering constraint and a colocation constraint.
A colocation constraint determines that the location of one resource depends on the location of another resource. We need clvmd-clone to start on the same node where dlm-clone is started (these resources are cloned, so both nodes will have dlm and clvmd started):
pcs constraint colocation add ms_drbd_r1 with ms_drbd_r0
pcs constraint colocation add dlm-clone with ms_drbd_r1
pcs constraint colocation add clvmd-clone with dlm-clone
Verify:
pcs constraint

Options described:

  • These two settings are the defaults; we use them only for clarity:
    • clone-max=2 - how many copies of the resource to start (we'll have 2 copies of both dlm and clvmd)
    • clone-node-max=1 - how many copies of the resource can be started on a single node (because of previous setting we'll have 2 copies of each resource and because of this setting only one copy of each resource can be started on one node, so we'll have 4 resources, 2 on each node - dlm & clvmd) 
  • on-fail=fence - STONITH the node on which the resource failed.
  • interleave=true - If this clone depends on another clone via an ordering constraint, is it allowed to start after the local instance of the other clone starts, rather than wait for all instances of the other clone to start (so node1 will start dlm and then clvmd resource, but if we setup "interleave=false", then node1 will wait until dlm is started on node2 too, and will start clvmd only after that)
  • ordered=true - will start copies in serial rather than in parallel
  • kind=Mandatory - Always. If the first resource (dlm-clone) does not perform its first-action (default first-action is start), the then resource (clvmd-clone) will not be allowed to perform its then-action (the default then-action equals the value of the first-action, so the default is start). If the first resource (dlm-clone) is restarted, the then resource (clvmd-clone, if running) will be stopped beforehand and started afterward.

Check cluster status:
pcs status

Check that DLM is working properly:
dlm_tool ls # name clvmd / members 1 2 / seq 2,2 on one node and 1,1 on the other

Function clvmd_start() calls function clvmd_activate_all(), which is basically "ocf_run vgchange -ay", so if a clustered volume group is used, by default the clvm resource agent will activate it on all nodes.

Setup Clustered LV /shared

On one node: pvscan # scans all supported LVM block devices in the system for PVs - we must see only the OS's PVs
On one node: pvcreate /dev/drbd{0,1}
On both nodes: pvdisplay # verify on both nodes that something like '"/dev/drbd1" is a new physical volume of "<465.58 GiB"' appears
On both nodes: vgscan # scans all supported LVM block devices in the system for VGs - we must see only the OS's VGs

Resource r0 will provide disk space for VMs that will normally run on agrp-c01n01.
Resource r1 will provide disk space for VMs that will normally run on agrp-c01n02.
So we'll use appropriate names while creating VGs:

  1. r0 (drbd0) will be in VG agrp-c01n01_vg0
  2. r1 (drbd1) will be in VG agrp-c01n02_vg0

On one node: vgcreate -Ay -cy agrp-c01n01_vg0 /dev/drbd0 
On one node: vgcreate -Ay -cy agrp-c01n02_vg0 /dev/drbd1

Options:

  • -A - Specifies if metadata should be backed up automatically after a change.  Enabling this is strongly advised! See vgcfgbackup(8) for more information.
  • -c - Create a clustered VG using clvmd if LVM is compiled with cluster support.  This allows multiple hosts to share a VG on shared devices.  clvmd and a lock manager must be configured and running.  (A clustered VG using clvmd is different from a shared VG using lvmlockd.)  See clvmd(8) for more information about clustered VGs.


On both nodes: vgs and then pvs
On both nodes: lvscan # List all logical volumes in all volume groups - we must see only the OS's LVs
On one node: lvcreate -L 20G -n shared agrp-c01n01_vg0 # we'll use this shared LV to store OS images and other service information
On both nodes: lvdisplay # we must see newly created "shared" named LV

Now we are done with DLM & CLVM. 
Reboot both servers, then start the cluster and verify that the PVs, VGs and LVs are shown properly. Then stop the cluster on one node and verify that this node sees only local LVs. If everything works as expected, proceed to the next step.


These tutorials were used to understand and set up clustering: 
AN!Cluster
clusterlabs.org
redhat.com


Friday, March 9, 2018

Cluster 16. DRBD Setup.


Installation

On both nodes:
Linbit provides a yum repo only with paid support. So we'll use ELRepo (Enterprise Linux Repository) to install DRBD:
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org # imports the public key
yum install drbd84-utils.x86_64 kmod-drbd84.x86_64 -y # DRBD90 is also available in ELRepo but, as official docs state:
https://docs.linbit.com/docs/users-guide-9.0/: With current DRBD-9.0 version running in Dual-Primary mode is not recommended (because of lack of testing). In DRBD-9.1 it will be possible to have more than two primaries at the same time.

systemctl disable drbd.service
systemctl status drbd.service

DRBD will not be able to run under the default SELinux security policies. If you are familiar with SELinux, you can modify the policies in a more fine-grained manner, but here we will simply exempt DRBD processes from SELinux control (must be done on both nodes):
semanage permissive -a drbd_t
reboot

Also you can do (must be done on both nodes):
sealert -a /var/log/audit/audit.log
reboot
Then perform suggested actions to solve problems.

Note: This tutorial will create two DRBD resources. Each resource will use a different TCP port. By convention, they start at port 7788 and increment up per resource. So we will be opening ports 7788 and 7789 on each node:

node1:
firewall-cmd --permanent --add-rich-rule='
    rule family="ipv4" 
    source address="10.10.52.2/32" 
    port protocol="tcp" 
    port="7788-7789" accept'
firewall-cmd --reload
firewall-cmd --list-all

node2:
firewall-cmd --permanent --add-rich-rule='
    rule family="ipv4" 
    source address="10.10.52.1/32" 
    port protocol="tcp" 
    port="7788-7789" accept'
firewall-cmd --reload
firewall-cmd --list-all

Setup

Backup existing configs:
rsync -av /etc/drbd.d /root/backups/

cat /etc/drbd.conf 
# You can find an example in  /usr/share/doc/drbd.../drbd.conf.example
include "drbd.d/global_common.conf";
include "drbd.d/*.res";

So we need to setup global_common.conf and *.res file for each of our resources.

Setup common DRBD options


vi /etc/drbd.d/global_common.conf # we will describe only the options we change
Verify the options below (a consolidated sketch of the resulting file follows this list):

  • in the global section:
    • usage-count no; # When set to 'yes', this allows DRBD to report this installation to LINBIT for statistical purposes. If you have privacy concerns, leave this set to 'no'. 
  • in the handlers section:
    • fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; # sets a constraint (something like drbd-fence-by-handler-r0-ms_drbd_r0), and the after-resync-target handler on the peer should later remove that constraint again. Thus, if the DRBD replication link becomes disconnected, the crm-fence-peer.sh script contacts the cluster manager, determines the Pacemaker Master/Slave resource associated with this DRBD resource, and ensures that the Master/Slave resource no longer gets promoted on any node other than the currently active one. Conversely, when the connection is re-established and DRBD completes its synchronization process, that constraint is removed and the cluster manager is free to promote the resource on any node again. In a dual-primary setup, if it was a replication link failure only (if it was a node failure, Pacemaker will call the fence agent for the failed node) and cluster communication is still up, both nodes will call this handler, but only one will succeed in setting the constraint. The other will remain IO-blocked, and can optionally "commit suicide" from inside the handler. But just because you were able to shoot the other node does not make your data any better.
    • after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh"; # removes constraint after sync
    • split-brain "/usr/lib/drbd/notify-split-brain.sh root"; # simply sends message about SB occurrence to the specified e-mail address
  • in the startup section:
    • become-primary-on both; # This tells DRBD to promote both nodes to Primary on start.
    • wait for connection timeouts (wfc) - This command will fail if the device cannot communicate with its partner for timeout seconds. If the peer was working before this node was rebooted, the wfc_timeout is used. If the peer was already down before this node was rebooted, the degr_wfc_timeout is used. If the peer was successfully outdated before this node was rebooted the outdated_wfc_timeout is used. The default value for all those timeout values is 0 which means to wait forever. The unit is seconds. In case the connection status goes down to StandAlone because the peer appeared but the devices had a split brain situation, the default for the command is to terminate:
      • wfc-timeout 300;  # This tells DRBD to wait five minutes for the other node to connect. This should be longer than it takes for corosync to timeout and fence the other node *plus* the amount of time it takes the other node to reboot. If you set this too short, you could corrupt your data. If you want to be extra safe, do not use this at all and DRBD will wait for the other node forever. 
      • degr-wfc-timeout 120; # This tells DRBD to wait for the other node for two minutes if the other node was degraded the last time it was seen by this node. This is a way to speed up the boot process when the other node is out of commission for an extended duration.
      • outdated-wfc-timeout 120; #Same as above, except this time-out is used if the peer was 'Outdated'. 
  • in the disk section:
    • on-io-error detach; # On a lower-level I/O error, detach from the local backing device and continue in diskless mode, serving I/O from the peer.
    • fencing resource-and-stonith; # This tells DRBD to block IO and fence the remote node (using the 'fence-peer' helper) when connection with the other node is unexpectedly lost. This is what helps prevent split-brain conditions and it is incredibly important in dual-primary setups! 
    • resync-rate 30M; # An eventually running resync process should use about 30MByte/second of IO bandwidth. This tells DRBD how fast to synchronize out-of-sync blocks. The higher this number, the faster an Inconsistent resource will get back to UpToDate state. However, the faster this is, the more of an impact normal application use of the DRBD resource will suffer. We'll set this to 30 MB/sec.
  • in the net section:
    • protocol C; # tells DRBD not to tell the operating system that a write is complete until the data has reached persistent storage on both nodes. This is the slowest option, but it is also the only one that guarantees consistency between the nodes. It is also required for dual-primary, which we will be using.
    • allow-two-primaries; # This tells DRBD to allow two nodes to be Primary at the same time. It is needed when 'become-primary-on both' is set. You should only use this option if you use a shared-storage file system on top of DRBD. At the time of writing the only ones are OCFS2 and GFS. If you use this option with any other file system, you are going to crash your nodes and corrupt your data! (We are going to use GFS2 and CLVM, both with DLM.) This is needed to enable live-migration of our servers. In our case, we'll be running dual-primary, so we cannot safely recover automatically. The only safe option is for the nodes to disconnect from one another and let a human decide which node to invalidate. You can learn more about these options by reading the drbd.conf man page. NOTE! It is not possible to safely recover from a split brain where both nodes were primary. This case requires human intervention, so 'disconnect' is the only safe policy. It doesn't matter what mode you are in now; it matters what happened during the time that the nodes were split-brained (the time when each was StandAlone/UpToDate). If both nodes were Secondary during the split-brain, the 0pri policy is used. If one node was Primary and the other remained Secondary, the 1pri policy is used. If both nodes were Primary, even for a short time, 2pri is used:
      • after-sb-0pri discard-zero-changes; # "after-sb-0pri" - Split brain has just been detected, but at this time the resource is not in the Primary role on any host - neither node is Primary. "discard-zero-changes" - If there is any host on which no changes occurred at all, simply apply all modifications (sync) made on the other and continue. In case none wrote anything this policy uses a random decision to perform a "resync" of 0 blocks. In case both have written something this policy disconnects the nodes.
      • after-sb-1pri discard-secondary; # "after-sb-1pri" - Split brain has just been detected, and at this time the resource is in the Primary role on one host. "discard-secondary" - discard changes on the secondary and sync to the primary
      • after-sb-2pri disconnect; # "after-sb-2pri" - This tells DRBD what to do in the case of a split-brain when both nodes are primary. "disconnect" - no automatic re-synchronization, simply disconnect.
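Putting the pieces together, the relevant parts of /etc/drbd.d/global_common.conf end up looking roughly like this (a sketch covering only the options discussed above; keep the defaults your package ships for everything not shown):

global {
        usage-count no;
}
common {
        handlers {
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
                split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        }
        startup {
                become-primary-on both;
                wfc-timeout 300;
                degr-wfc-timeout 120;
                outdated-wfc-timeout 120;
        }
        disk {
                on-io-error detach;
                fencing resource-and-stonith;
                resync-rate 30M;
        }
        net {
                protocol C;
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }
}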
After setting up the file on one node (agrp-c01n01), copy it to the other node:
rsync -av /etc/drbd.d/global_common.conf root@agrp-c01n02:/etc/drbd.d/global_common.conf

Setup DRBD resource options

We are going to have 2 resources - r0 and r1, so we must set up 2 files:

  1. resource r0, which will create the device /dev/drbd0, will be backed by each node's /dev/sda3 partition. It will provide disk space for VMs that will normally run on agrp-c01n01 and provide space for the /shared GFS2 partition (discussed further below).
  2. resource r1, which will create the device /dev/drbd1, will be backed by each node's /dev/sda4 partition. It will provide disk space for VMs that will normally run on agrp-c01n02.
vi /etc/drbd.d/r0.res

# This is the resource used for the shared GFS2 partition and host VMs designed 
# to run on an-a05n01. 
resource r0 {
            # This is the block device path. 
            device /dev/drbd0; 
            # We'll use the normal internal meta-disk. This is where DRBD stores 
            # its state information about the resource. It takes about 32 MB per 
            # 1 TB of raw space. 
            meta-disk internal; 
            # This is the `uname -n` of the first node 
            on agrp-c01n01 { 
                           # The 'address' has to be the IP, not a host name. This is the 
                           # node's SN (sn_bond1) IP. The port number must be unique among 
                           # resources. 
                           address 10.10.52.1:7788; 
                           # This is the block device backing this resource on this node. 
                           disk /dev/sda3; 
            } 
           # Now the same information again for the second node. 
          on agrp-c01n02 { 
                           address 10.10.52.2:7788; 
                           disk /dev/sda3; 
          } 
}

Now copy this to r1.res and edit for the agrp-c01n01 VM resource. The main differences are the resource name, r1, the block device, /dev/drbd1, the port, 7789 and the backing block devices, /dev/sda4:

vi /etc/drbd.d/r1.res

# This is the resource used to host VMs designed 
# to run on an-a05n02. 
resource r1 {
            # This is the block device path. 
            device /dev/drbd1; 
            # We'll use the normal internal meta-disk. This is where DRBD stores 
            # its state information about the resource. It takes about 32 MB per 
            # 1 TB of raw space. 
            meta-disk internal; 
            # This is the `uname -n` of the first node 
            on agrp-c01n01 { 
                           # The 'address' has to be the IP, not a host name. This is the 
                           # node's SN (sn_bond1) IP. The port number must be unique among 
                           # resources. 
                           address 10.10.52.1:7789; 
                           # This is the block device backing this resource on this node. 
                           disk /dev/sda4; 
            } 
           # Now the same information again for the second node. 
          on agrp-c01n02 { 
                           address 10.10.52.2:7789; 
                           disk /dev/sda4; 
          } 
}

Now we will do an initial validation of the configuration - if some options are wrong, a descriptive warning message will appear. This is done by running the following command:
drbdadm dump

Now do the same process for node 2 or just use rsync:
rsync -av /etc/drbd.d root@agrp-c01n02:/etc/
After setting node 2 up - verify with drbdadm dump.
To see which options are default options:
drbdsetup /dev/drbd0 show --show-defaults
drbdsetup /dev/drbd1 show --show-defaults

Create DRBD resources

Create DRBD resources (on both nodes):
drbdadm create-md r{0,1} # 'yes' 'yes' => New drbd meta data block successfully created. This step must be completed only on initial device creation; it initializes DRBD's metadata. If create-md returns an "Operation refused" error, then for the needed disk (/dev/sda3 or /dev/sda4 in our case):
dd if=/dev/zero of=/dev/sda3 bs=1M count=128
drbdadm up r{0,1} # This step associates the resource with its backing device (or devices, in case of a multi-volume resource), sets replication parameters, and connects the resource to its peer.

DRBD’s virtual status file in the /proc filesystem, /proc/drbd, should now contain information similar to the following (The Inconsistent/Inconsistent disk state is expected at this point):
version: 8.4.10-1 (api:1/proto:86-101)
GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:488266148
 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:488192424

Description of line with resource number (lines starting with "0:" or "1:" ):
  1. cs - connection status (i.e. Connected, Waiting, see http://docs.linbit.com/docs/users-guide-8.4/#s-connection-states)
  2. ro - roles (i.e. Primary, Secondary, see http://docs.linbit.com/docs/users-guide-8.4/#s-roles)
  3. ds - disk status (i.e. Inconsistent, UpToDate, see http://docs.linbit.com/docs/users-guide-8.4/#s-disk-states)
  4. replication protocol mode (A, B or C, see http://docs.linbit.com/docs/users-guide-8.4/#s-replication-protocols)
  5. six flags reflecting the I/O status of this resource (i.e. r - running (is the normal state), see http://docs.linbit.com/docs/users-guide-8.4/#s-io-flags) Normally first flag must be "r" and others "-", so flags must be like: r-----
  6. the next line after described line is the line of "Performance indicators" (see http://docs.linbit.com/docs/users-guide-8.4/#s-performance-indicators)
  7.  oos (out of sync) -  amount of storage currently out of sync; in Kibibytes.
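For scripted monitoring (in the same spirit as the Asterisk watchdog earlier in this blog), a minimal Python sketch that extracts cs/ro/ds from /proc/drbd could look like the following; what you do with the values (alerting, logging, etc.) is left up to you:

#!/usr/bin/python
# Minimal /proc/drbd parser: prints connection state, roles and disk states
# for every resource line (the lines starting with "N:").
import re

with open("/proc/drbd") as f:
    for line in f:
        m = re.match(r"\s*(\d+):\s+cs:(\S+)\s+ro:(\S+)\s+ds:(\S+)", line)
        if m:
            minor, cs, ro, ds = m.groups()
            print("resource %s: cs=%s ro=%s ds=%s" % (minor, cs, ro, ds))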

The good thing about DRBD is that we do not have to wait for the resources to be synchronized. As long as one of the resources is UpToDate, both nodes will work. If the Inconsistent node needs to read data, it will simply read it from its peer. But the cluster cannot be considered redundant until both nodes are UpToDate. So to make the disks consistent, issue the following on only one node (it's the first time we connect our disks, so we really don't need to synchronize the disks' data):
drbdadm new-current-uuid --clear-bitmap r{0,1} # You must not use this on pre-existing data! Even though it may appear to work at first glance, once you switch to the other node, your data is toast, as it never got replicated. So do not leave out the mkfs (or equivalent) - this means that you must create a file-system on top of the DRBD device

drbdadm primary --force r{0,1} # just to be sure that synchronization status is checked

Verify (local node is primary because of the drbdadm primary --force r{0,1} command):
cat /proc/drbd

version: 8.4.10-1 (api:1/proto:86-101)
GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:2120 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:2120 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

As we see, both nodes' data is UpToDate. We can also use drbd-overview:
NOTE: drbd-overview will be deprecated soon.
Please consider using drbdtop.

 0:r0/0  Connected Primary/Secondary UpToDate/UpToDate 
 1:r1/0  Connected Primary/Secondary UpToDate/UpToDate 

Or we can use drbdadm status r{0,1}:
r0 role:Primary
  disk:UpToDate
  peer role:Secondary
    replication:Established peer-disk:UpToDate

r1 role:Primary
  disk:UpToDate
  peer role:Secondary
    replication:Established peer-disk:UpToDate

In order for a DRBD resource to be usable, it has to be "promoted". By default, DRBD resources start in the Secondary state. This means that they will receive changes from the peer, but no changes can be made. You can't even look at the contents of a Secondary resource. Why this is so requires more time to discuss than we can go into here.
So the next step is to promote both resources on both nodes:
drbdadm primary r{0,1}
Verify:
cat /proc/drbd # if both nodes are Primary, we're done setting up DRBD.

Pacemaker DRBD resource setup 

Find appropriate type of resource:
pcs resource list | grep -i drbd
ocf:linbit:drbd - Manages a DRBD device as a Master/Slave resource

To view all available non-advanced options:
pcs resource describe drbd

So we need to create Master/Slave resource of type ocf:linbit:drbd:

r0:
pcs resource create drbd_r0 ocf:linbit:drbd drbd_resource=r0 op monitor interval=60s op monitor interval="29s" role="Master" op monitor interval="31s" role="Slave"
pcs resource master ms_drbd_r0 drbd_r0 master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interleave=true
pcs resource cleanup drbd_r0
journalctl | grep -i error

Give it a couple of minutes to promote the resource to Master on both nodes. Initially, it will appear as Master on one node only.

r1:
pcs resource create drbd_r1 ocf:linbit:drbd drbd_resource=r1 op monitor interval=60s op monitor interval="29s" role="Master" op monitor interval="31s" role="Slave"
pcs resource master ms_drbd_r1 drbd_r1 master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
pcs resource cleanup drbd_r1
journalctl | grep -i error

Give it a couple of minutes to promote the resource to Master on both nodes. Initially, it will appear as Master on one node only.

Options description (master-* options are unique to multi-state resource, clone-* options are derived from clone resources):
  • clone-max=2 - how many copies of the resource to start
  • clone-node-max=1 - how many copies of the resource can be started on a single node
  • master-max=2 - how many copies of the resource can be promoted to the master role
  • master-node-max=1 - how many copies of the resource can be promoted to the master role on a single node
  • notify=true - When stopping or starting a copy of the clone, tell all the other copies beforehand and again when the action was successful. 


We didn't enable the drbd daemon at all, so Pacemaker will start DRBD when the cluster comes up, and shut it down when the cluster stops.
To delete the resources:
pcs resource delete drbd_r1 # ms resource will be deleted automatically
pcs resource delete drbd_r0 # ms resource will be deleted automatically
pcs resource cleanup
pcs resource


Split Brain (SB) & Recovery From SB

Normal operation status:
0:r0/0  Connected Primary/Primary UpToDate/UpToDate
or
0:r0/0  Connected Primary/Secondary UpToDate/UpToDate

Master Node WFConnection - primary node can't connect to the secondary node. To resolve see "Manually connecting slave to the master":
0:r0/0  WFConnection Primary/Unknown UpToDate/Unknown

Note: If the master node reports WFConnection while the slave node reports StandAlone, it indicates a DRBD split brain. See "Recovery from SB"

Slave node StandAlone - secondary can't connect to the primary. To resolve see "Manually connecting slave to the master":
0:r0/0  StandAlone Secondary/Unknown UpToDate/Unknown

Note: If the master node reports WFConnection while the slave node reports StandAlone, it indicates a DRBD split brain. See "Recovery from SB"

Both nodes Secondary/Secondary - the nodes are connected, but neither is primary. Usually this is due to a Pacemaker failure. To resolve, restart the entire cluster if DRBD itself is functioning properly (this judgement is up to you):
0:r0/0  Connected Secondary/Secondary UpToDate/UpToDate
0:r0/0  Connected Secondary/Secondary UpToDate/UpToDate

Both nodes StandAlone and Primary - SB has occurred. Both nodes are operating independently. To resolve see "Recovery from SB":
0:r0/0  StandAlone Primary/Unknown UpToDate/Unknown
0:r0/0  StandAlone Primary/Unknown UpToDate/Unknown

"Split-Brain detected, dropping connection!" - message appeared in logs. To resolve see "Recovery from SB".

Manually connecting slave to the master

First identify which node is the master (in such a situation Pacemaker and DRBD can see the master node differently; we need DRBD's point of view).
On the slave node:
drbdadm connect r0 # after that reconnection must take place, verify with drbd-overview

Recovery from SB

When SB occurs the nodes operate completely independently, because there is no connection and data replication is not taking place. After split brain has been detected, one node will always have the resource in a StandAlone connection state. The other might either also be in the StandAlone state (if both nodes detected the split brain simultaneously), or in WFConnection (if the peer tore down the connection before the other node had a chance to detect split brain).
To understand which node must be considered the master, you need to verify the actual data on each node.
First identify which node is to be considered the master (the SB survivor), then:
On the slave node:
drbdadm secondary r0
drbdadm connect --discard-my-data r0 # Discarding the data on the slave node does not result in a full re-synchronization from master to slave. The slave node has its local modifications rolled back, and modifications made to the master are propagated to the slave. If a "Failure: (102) Local address (port) already in use." message is shown, issue on the slave node: drbdadm disconnect r0
Verify with drbd-overview - now both nodes must show "Connected", proper role and "UpToDate/UpToDate".
On the master node (SB survivor) - only if it is also StandAlone (if it's WFConnection, this step is not needed):
drbdadm connect r0 # after that reconnection must take place, verify with drbd-overview

After re-synchronization has completed, the split brain is considered resolved and the two nodes form a fully consistent, redundant replicated storage system again.

These tutorials were used to understand and set up clustering: 
AN!Cluster
LINBIT User's Guide 8.4.x
clusterlabs.org
avid.force.com

Wednesday, March 7, 2018

Cluster 15. Pacemaker resources & resource constraints.

Pacemaker resources

A resource is a service made highly available by a cluster. The simplest type of resource, a primitive resource, is described in this section. More complex forms, such as groups and clones, are described in later sections.
Every primitive resource has a resource agent. A resource agent is an external program that abstracts the service it provides and presents a consistent view to the cluster. Typically, resource agents come in the form of shell scripts. However, they can be written using any technology (such as C, Python or Perl) that the author is comfortable with.

Pacemaker supports several classes of agents (Remember to make sure the computer is not configured to start any services at boot time — that should be controlled by the cluster):

  • LSB - LSB resource agents are those found in /etc/init.d (SysV initialization style). Many distributions claim LSB compliance but ship with broken init scripts. Common problematic violations of the LSB standard include:
    • Not implementing the status operation at all
    • Not observing the correct exit status codes for start/stop/status actions
    • Starting a started resource returns an error
    • Stopping a stopped resource returns an error
  • Systemd - Some newer distributions have replaced the old "SysV" style of initialization daemons and scripts with an alternative called Systemd . Pacemaker is able to manage these services if they are present. Instead of init scripts, systemd has unit files. Generally, the services (unit files) are provided by the OS distribution, but there are online guides for converting from init scripts.
  • Upstart - Some newer distributions have replaced the old "SysV" style of initialization daemons (and scripts) with an alternative called Upstart. Instead of init scripts, upstart has jobs. Generally, the services (jobs) are provided by the OS distribution.
  • Service - Since there are various types of system services (systemd, upstart, and lsb), Pacemaker supports a special service alias which intelligently figures out which one applies to a given cluster node. This is particularly useful when the cluster contains a mix of systemd, upstart, and lsb. In order, Pacemaker will try to find the named service as:
    • an LSB init script
    • a Systemd unit file
    • an Upstart job
  • OCF - the OCF standard is basically an extension of the LSB conventions for init scripts to support parameters, make them self-describing, and make them extensible. OCF specs have strict definitions of the exit codes that actions must return. The cluster follows these specifications exactly, and giving the wrong exit code will cause the cluster to behave in ways you will likely find puzzling and annoying. In particular, the cluster needs to distinguish a completely stopped resource from one which is in some erroneous and indeterminate state. Parameters are passed to the resource agent as environment variables, with the special prefix OCF_RESKEY_. So, a parameter which the user thinks of as ip will be passed to the resource agent as OCF_RESKEY_ip. The number and purpose of the parameters is left to the resource agent; however, the resource agent should use the meta-data command to advertise any that it supports. The OCF class is the most preferred as it is an industry standard, highly flexible (allowing parameters to be passed to agents in a non-positional manner) and self-describing. The pcs resource providers command shows which OCF providers are present on the cluster:
    • heartbeat
    • linbit
    • openstack
    • pacemaker
  • Fencing - this class is used exclusively for fencing-related resources (differently from all other resources which are managed by pcs resource create command, this class is managed by pcs stonith command)
  • Nagios Plugins - Nagios Plugins allow us to monitor services on remote hosts. Pacemaker is able to do remote monitoring with the plugins if they are present. A common use case is to configure them as resources belonging to a resource container (usually a virtual machine), and the container will be restarted if any of them has failed. Another use is to configure them as ordinary resources to be used for monitoring hosts or services via the network.
To find which Pacemaker resource classes/standards are supported:
pcs resource standards # List available resource agent standards supported by this installation
lsb
ocf
service
systemd

To view all resources agents:
pcs resource list # this will list all available resource agents with their standard/class names and with provider name (if one is available).

To view usage help for any agent:
pcs resource describe [<standard>:[<provider>:]]<type> [--full]
you can specify only the name of the agent (standard and provider are optional as long as the agent's name is unique). Without "--full", advanced options are not shown.

Pacemaker resource constraints

Pacemaker has below resource constraints:

  1. Location constraints - tell the cluster which nodes a resource can run on.
  2. Ordering Constraints - tell the cluster the order in which resources should start or stop. 
  3. Colocation Constraints - tell the cluster that the location of one resource depends on the location of another one.
  4. Ticket Constraints - tell the cluster how to coordinate multi-site (geographically-distributed/dispersed clusters). Apart from local clusters, Pacemaker also supports multi-site clusters. That means you can have multiple, geographically dispersed sites, each with a local cluster. Fail-over between these clusters can be coordinated manually by the administrator, or automatically by a higher-level entity called a Cluster Ticket Registry (CTR). A ticket grants the right to run certain resources on a specific cluster site.
What these all mean (a short pcs sketch follows this list):
  • If we want to start Apache on node1 we use location constraint
  • but if we can start Apache on any node but must start it only on the node where MySQL is currently started, then we'll use colocation constraint
  • if we want Apache to be started only after MySQL is started, then we'll use ordering constraint
  • if we have several (not only two) resources that must have applied some rules on them, then we use resources set
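A hedged sketch of what those three constraints would look like with pcs (Apache and MySQL here are hypothetical resource names, not resources configured anywhere in this cluster):
pcs constraint location Apache prefers node1=INFINITY # location: run Apache on node1
pcs constraint colocation add Apache with MySQL INFINITY # colocation: Apache only where MySQL runs
pcs constraint order MySQL then Apache # ordering: start MySQL first, then Apache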

Pacemaker non-primitive resources: groups, clones, multi-state, bundles


  • groups - One of the most common elements of a cluster is a set of resources that need to be located together, start sequentially, and stop in the reverse order. To simplify this configuration, we support the concept of groups. Resources are started in the order they appear in a group and are stopped in the reverse order to which they appear in the group. So a group is syntactic sugar packaging primitive resources, colocation constraints and ordering constraints under one name (configured by pcs resource group)
  • clones - Resources That Get Active on Multiple Hosts (A clone is basically a shortcut: instead of defining n identical, yet differently named resources, a single cloned resource suffices). Three types of cloned resources exist:
    • Anonymous - Anonymous clones are the simplest. These behave completely identically everywhere they are running. Because of this, there can be only one copy of an anonymous clone active per machine.
    • Globally unique - Globally unique clones are distinct entities. A copy of the clone running on one machine is not equivalent to another instance on another node, nor would any two copies on the same node be equivalent.
    • Stateful clones -  multi-state 
    • (configured by pcs resource clone)
  • multi-state - Resources That Have Multiple Modes - are a specialization of clone resources. Multi-state resources allow the instances to be in one of two operating modes (called roles). The roles are called master and slave, but can mean whatever you wish them to mean. The only limitation is that when an instance is started, it must come up in the slave role (configured by pcs resource master).
  • bundles - Pacemaker supports a special syntax for launching a container with any infrastructure it requires: the bundle (configured by pcs resource bundle).
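For example, grouping and cloning with pcs look roughly like this (resA, resB and myresource are placeholder resource names):
pcs resource group add mygroup resA resB # resA starts before resB and both stay on the same node
pcs resource clone myresource clone-max=2 clone-node-max=1 # one copy on each of two nodes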

This tutorial was used to understand and set up clustering: 
clusterlabs.org

Monday, March 5, 2018

Cluster 14. Setup partitions for DRBD.

Overview

DRBD stands for Distributed Replicated Block Device; it is a technology that takes raw storage from two nodes and keeps their data synchronized in real time. It is sometimes described as "network RAID Level 1", and that is conceptually accurate. In this tutorial cluster, DRBD will be used to provide that back-end storage as a cost-effective alternative to a traditional SAN device.

With traditional raid, you would take:
HDD1 + HDD2 -> /dev/sda

With DRBD, you have this:
node1:/dev/sda5 + node2:/dev/sda5 -> both:/dev/drbd0

In both cases, as soon as you create the new device, you pretend that the member devices no longer exist. You format a file-system onto the newly created device, use it as an LVM physical volume, and so on.
The main difference with DRBD is that /dev/drbd0 will always be the same on both nodes. If you write something to node 1, it's instantly available on node 2, and vice versa. Of course, this means that whatever you put on top of DRBD has to be "cluster aware". That is to say, the program or file system using the new /dev/drbd0 device has to understand that the contents of the disk might change because of another node.


Setting up partitions for DRBD

We're going to use a program called parted instead of fdisk. With fdisk, we would have to manually ensure that our partitions fall on 64 KiB block boundaries. With parted, we can use the -a optimal option to tell it to use optimal alignment, saving us a lot of work. This is important for decent performance on our servers, and it is true for both traditional platter and modern solid-state drives.
For performance reasons, we want to ensure that the file systems created within a VM match the block alignment of the underlying storage stack, clear down to the base partitions on /dev/sda (or whatever your lowest-level block device is). By making our partitions always start on 64 KiB boundaries, we keep the guest OS's file system in line with the DRBD backing device's blocks. Thus, all reads and writes in the guest OS touch a matching number of real blocks, maximizing disk I/O efficiency.

yum install parted -y # on both nodes

We will setup 2 DRBD resources:
  1. r0 - This resource will back the VMs designed to primarily run on agrp-c01n01
  2. r1 - This resource will back the VMs designed to primarily run on agrp-c01n02
In the case of a split brain (both nodes stay online and keep working, but each with its own copy of the data) we must know which node's data is more valid. We will consider each node to be authoritative for one group of VMs - these VMs will be the default resources for that node, so we can easily recover:
  1. The DRBD resource hosting agrp-c01n01's servers can invalidate any changes made on agrp-c01n02. We consider agrp-c01n01 to be more valid for the r0 resource.
  2. The DRBD resource hosting agrp-c01n02's servers can invalidate any changes made on agrp-c01n01. We consider agrp-c01n02 to be more valid for the r1 resource.
LVM (Logical Volume Manager) - we have physical volumes (execute pvs to view them); every PV is added to a volume group (in the pvs output you can see the VG name next to the PV name); a VG can then be divided into logical volumes (execute vgs to see how many PVs and LVs each group has, and lvs to see details about each LV).
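
A hedged sketch of how that PV -> VG -> LV hierarchy is built from scratch, purely for illustration (the device name, VG name and LV name below are made up, not part of our cluster):

pvcreate /dev/sdX1                       # turn a raw partition into an LVM physical volume
vgcreate vg_example /dev/sdX1            # create a volume group on top of that PV
lvcreate -L 10G -n lv_example vg_example # carve a 10 GiB logical volume out of the VG
pvs; vgs; lvs                            # inspect PVs, VGs and LVs respectively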

We are going to use raw partitions. In the Cluster 1 post I advised to leave the space intended for VMs as a raw partition/device. If your system has no spare raw partitions, the simplest way is to reinstall everything from scratch, or follow the steps below (they are not tested enough, so use them at your own risk; I assume that you have an additional partition with enough free space to move /root, /home and swap to). At first we will reduce the size of the /dev/mapper/centos-home partition (your name can be different - choose a non-root partition which is big enough):

First we will resize the /home partition.
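Note that lvremove below destroys everything stored in /home, so if anything there is still needed, copy it aside first and restore it once the new file-system is mounted again; a minimal sketch (the archive path is an assumption):

tar -czf /root/home_backup.tar.gz -C /home .   # run this before umount /home
# after mount -a below: tar -xzf /root/home_backup.tar.gz -C /home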
umount /home
df -h # verify that /home is unmounted
parted -l # to find the type of the home LV file-system (xfs in my case)
lvremove /dev/centos/home # remove LV from the VG
lvcreate -L 50G -n home centos # create new smaller partition for home in VG centos
mkfs.xfs /dev/mapper/centos-home # make home partition formatted as XFS file-system
mount -a # remount all partitions
df -h # verify that /home is mounted

Second, we will return the freed space from the PV back to a raw partition.
Now determine how much space you want for your DRBD resources; I'll use 500GB for r0 and 500GB for r1:
pvs # find name of the raw partition with enough free space /dev/sda3 in my case
pvdisplay /dev/sda3 # take PE Size (4 MiB in my case) and Allocated PE (27616 in my case). PE is Physical Extent. Calculate the space already used on /dev/sda3 (it is used for /root, /home and swap). You can use the bash bc calculator: 4*27616=110464 MiB (MiB = 1024 KiB, MB = 1000 KB), 110464/1024=107 GiB. Or you can use:
pvs -o +used /dev/sda3 #  this command shows 107.88g instead of our calculated 107GiBs
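For reference, the same arithmetic with bc:

echo "4*27616" | bc     # 110464 MiB allocated
echo "110464/1024" | bc # 107 GiB (integer division; pvs -o +used above is more precise)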
pvs -v --segments /dev/sda3 # free space must be only at the end of the PV; in my case there is some free space before the root partition, so I need to reallocate it:
pvmove --alloc anywhere /dev/sda3:940779-953578 # /dev/sda3:940779-953578 is the value from the PE ranges column of the pvs -v --segments /dev/sda3 output
pvs -v --segments /dev/sda3 # recheck after reallocating
After reallocating all partitions and having free space only at the end of the LVM PV, move the partitions to the additional disk (/dev/sdb2 in my case):
pvmove  /dev/sda3:14816-27615 /dev/sdb2 # moving /root
pvmove  /dev/sda3:2016-14815 /dev/sdb2 # moving /home
pvmove  /dev/sda3:0-2015 /dev/sdb2 # moving swap
If you renamed VG or want to rename - follow instructions in this post - vgrename

vgreduce centos00 /dev/sda3 # substitute your VG name
pvremove /dev/sda3

Let's start with parted:
parted -a optimal /dev/sda # access parted for /dev/sda - /dev/sda3 partition's disk
rm 3 # remove /dev/sda3
print free # to see how much space we have for new partitions
mkpart extended 1076MB 501.076GB # 1076MB is the start point of the free space, 501.076GB is the start point + 500GB
print free # check that new extended partition is created and it is 500GB
Now we'll create the second partition for the second DRBD resource:
mkpart extended 501GB 1001GB # 501GB is the start point of the remaining free space, 1001GB is the start point + 500GB
print free # check that the second partition is created and it is 500GB
We created two partitions; let's check that they are aligned optimally:
align-check opt 3 # must return "3 aligned"
align-check opt 4 # must return "4 aligned"
quit # to escape parted

This tutorial was used to understand and setup clustering: 
AN!Cluster

ATM / DSL OAM ping

The ATM OAM Ping feature sends an ATM Operation, Administration, and Maintenance (OAM) packet to confirm the connectivity of a specific permanent virtual circuit (PVC). The status of the PVC is displayed when a response to the OAM packet is received. The ATM OAM Ping feature allows the network administrator to verify PVC integrity and facilitates ATM network troubleshooting. Note that this is an OAM ping, not an IP ping, so the ping must succeed even without an IP address set up on the ATM interface (Cisco-specific command):

ping atm interface ATM0.1 7 34 # ATM0.1 is ATM subinterface, 7 is VPI, 34 is VCI

Thursday, March 1, 2018

Cluster 13. Setup IPMI fencing


STONITH (Shoot The Other Node In The Head aka. fencing) protects your data from being corrupted
by non-clustered (because of some problem) nodes or unintended concurrent access.
Just because a node is unresponsive doesn’t mean it has stopped accessing your data. The only
way to be 100% sure that your data is safe, is to use STONITH to ensure that the node is truly offline
before allowing the data to be accessed from another node.
STONITH also has a role to play in the event that a clustered service cannot be stopped. In this case,
the cluster uses STONITH to force the whole node offline, thereby making it safe to start the service
elsewhere.
In pacemaker STONITH is one of the primitive resources.

Testing fence agent

On each node find proper hostname to use with cluster:
crm_node -n displays the name used by a running cluster

For now we'll be using IPMI fencing - simply powering off the lost server.
Verify that we have needed fencing agent:
[root@agrp-c01n01 ~]# pcs stonith list fence_ipmilan
fence_ipmilan - Fence agent for IPMI

Verify that agent is working properly (from both nodes we test each other - detailed explanation of fence_ipmilan arguments can be found via - man fence_ipmilan):
[root@agrp-c01n01 ~]# fence_ipmilan --ip=agrp-c01n02.ipmi --username="Administrator" --password="our_pass_here" --action="status" --lanplus
[root@agrp-c01n02 ~]# fence_ipmilan --ip=agrp-c01n01.ipmi --username="Administrator" --password="our_pass_here" --action="status" --lanplus

If both tests are successful, proceed.

Setup fencing devices


[root@agrp-c01n01 ~]# pcs stonith describe fence_ipmilan # shows all available options

From one of the nodes (view pcs stonith create help):
Create fence device ipmi_n01 for agrp-c01n01:
pcs stonith create fence_ipmi_n01 fence_ipmilan pcmk_host_list="agrp-c01n01" ipaddr="agrp-c01n01.ipmi" login=Administrator passwd=our_pass_here lanplus=true power_wait=4  delay=15 op monitor interval=60s
Create fence device ipmi_n02 for agrp-c01n02:
pcs stonith create fence_ipmi_n02 fence_ipmilan pcmk_host_list="agrp-c01n02" ipaddr="agrp-c01n02.ipmi" login=Administrator passwd=our_pass_here lanplus=true power_wait=4 op monitor interval=60s


Described (view pcs stonith create help):
  • pcs stonith create fence_ipmi_n01 fence_ipmilan - create stonith device named "fence_ipmi_n01" and using "fence_ipmilan" fencing agent
  • pcmk_host_list - a list of machines controlled by this device
  • ipaddr - proper IPMI-hostname or IPMI IP address of the node
  • login & passwd - IPMI username and password
  • lanplus - enable enhanced security while accessing IPMI
  • power_wait - wait X seconds after issuing ON/OFF
  • delay - only needed on one of the two devices: if both nodes are alive, each will try to fence the other and in theory both could fence each other at the same time; adding a 15-second delay to the device that targets agrp-c01n01 means fencing of agrp-c01n01 starts 15 seconds later, so agrp-c01n01 effectively acts first in the fencing shoot-out.
Execute pcs config to verify that above commands are inserted properly:
Stonith Devices:
 Resource: fence_ipmi_n01 (class=stonith type=fence_ipmilan)
  Attributes: delay=15 ipaddr=agrp-c01n01.ipmi lanplus=true login=Administrator passwd=our_pass_here pcmk_host_list=agrp-c01n01 power_wait=4
  Operations: monitor interval=60s (fence_ipmi_n01-monitor-interval-60s)
 Resource: fence_ipmi_n02 (class=stonith type=fence_ipmilan)
  Attributes: ipaddr=agrp-c01n02.ipmi lanplus=true login=Administrator passwd=our_pass_here pcmk_host_list=agrp-c01n02 power_wait=4
  Operations: monitor interval=60s (fence_ipmi_n02-monitor-interval-60s)

Verify that both fencing devices are "Started":
pcs stonith 

Enable stonith in a cluster (Run 'man pengine' and 'man crmd' to get a description of the properties):
pcs property set stonith-enabled=true
pcs property # to check all set-up properties

To delete the fencing devices (if something went wrong):
pcs stonith delete fence_ipmi_n01
pcs stonith delete fence_ipmi_n02

To add stonith level 1 (stonith levels are tried in the order they are defined: 1st, then 2nd, up to the 9th level; fencing stops trying further levels after the first level that succeeds). Fencing levels are needed if you have more than one fencing device/resource in use (view pcs stonith level add help):
pcs stonith level add 1 agrp-c01n01 fence_ipmi_n01
pcs stonith level add 1 agrp-c01n02 fence_ipmi_n02
pcs stonith level

Fencing resource location:
To verify where each fencing resource is started, execute:
pcs stonith 
 fence_ipmi_n01 (stonith:fence_ipmilan): Started agrp-c01n01
 fence_ipmi_n02 (stonith:fence_ipmilan): Started agrp-c01n02
 Target: agrp-c01n01
   Level 1 - fence_ipmi_n01
 Target: agrp-c01n02
   Level 1 - fence_ipmi_n02

As you see, fence_ipmi_n01 is started on agrp-c01n01 and fence_ipmi_n02 is started on agrp-c01n02. We want agrp-c01n01 to run fence_ipmi_n02 and agrp-c01n02 to run fence_ipmi_n01; in other words, we never want fence_ipmi_n01 to run on agrp-c01n01 or fence_ipmi_n02 to run on agrp-c01n02, because a node should not be the one responsible for fencing itself. To do so we must first understand Pacemaker resource constraints.


Pacemaker resource constraints:

  1. Location constraints - tell the cluster which nodes a resource can run on.
  2. Ordering Constraints - tell the cluster the order in which resources should start or stop. 
  3. Colocation Constraints - tell the cluster that the location of one resource depends on the location of another one (hedged examples of ordering and colocation follow this list).
  4. Ticket Constraints - tell the cluster how to coordinate resources in multi-site (geographically distributed/dispersed) clusters.
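
Location constraints are used for the fencing devices right below; for completeness, here are hedged examples of ordering and colocation constraints (res_mysql and res_apache are made-up resource names, not resources in this cluster):

pcs constraint order start res_mysql then start res_apache       # ordering: start res_apache only after res_mysql has started
pcs constraint colocation add res_apache with res_mysql INFINITY # colocation: keep res_apache on the node where res_mysql runs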

To set up stonith placement preferences with location constraints:
There are two alternative strategies. One way is to say that, by default, resources can run anywhere,
and then the location constraints specify nodes that are not allowed (an opt-out cluster). The other way is to start with nothing able to run anywhere, and use location constraints to selectively enable allowed nodes (an opt-in cluster). We'll use the default, opt-out method:
pcs constraint location add lc_ipmi_n02 fence_ipmi_n02 agrp-c01n02 -INFINITY
pcs constraint location add lc_ipmi_n01 fence_ipmi_n01 agrp-c01n01 -INFINITY

Described:
  • pcs constraint location add - add new resource location constraint
  • lc_ipmi_n01 - location constraint name
  • fence_ipmi_n01 - name of the resource 
  • agrp-c01n01 - node name for which this location constraint is being set up
  • -INFINITY - score meaning "must never be used to run that resource"
So "pcs constraint location add lc_ipmi_n01 fence_ipmi_n01 agrp-c01n01 -INFINITY" means:
create location constraint named lc_ipmi_n01 which means never run fence_ipmi_n01 resource on agrp-c01n01.

To remove constraint:
pcs constraint --full # or use pcs config
then remove by id:
pcs constraint remove specified_id

Execute pcs config to verify that above commands are inserted properly:
Location Constraints:
  Resource: fence_ipmi_n01
    Disabled on: agrp-c01n01 (score:-INFINITY) (id:lc_ipmi_n01)
  Resource: fence_ipmi_n02
    Disabled on: agrp-c01n02 (score:-INFINITY) (id:lc_ipmi_n02)

pcs constraint location 
Location Constraints:
  Resource: fence_ipmi_n01
    Disabled on: agrp-c01n01 (score:-INFINITY)
  Resource: fence_ipmi_n02
    Disabled on: agrp-c01n02 (score:-INFINITY)

Verify that each fencing mechanism is started on the opposite node:
pcs status
 fence_ipmi_n01 (stonith:fence_ipmilan): Started agrp-c01n02
 fence_ipmi_n02 (stonith:fence_ipmilan): Started agrp-c01n01

To view current placement scores:
crm_simulate -sL
Allocation scores:
native_color: fence_ipmi_n01 allocation score on agrp-c01n01: -INFINITY
native_color: fence_ipmi_n01 allocation score on agrp-c01n02: 0
native_color: fence_ipmi_n02 allocation score on agrp-c01n01: 0
native_color: fence_ipmi_n02 allocation score on agrp-c01n02: -INFINITY


Test fencing

Make agrp-c01n02 kernel panic:
echo c > /proc/sysrq-trigger # immediately and completely hangs the kernel. This does not affect the IPMI BMC, so if we've configured everything properly, the surviving node should be able to use fence_ipmilan to reboot the crashed node.
After that agrp-c01n01 must fence agrp-c01n02:
grep "Operation reboot of agrp-c01n02 by agrp-c01n01" /var/log/messages # must show "OK" and agrp-c01n02 must reboot
pcs status:

  • first shows:
    • Node agrp-c01n02: UNCLEAN (offline)
    • fence_ipmi_n01 (stonith:fence_ipmilan): Started agrp-c01n02 (UNCLEAN)
  • then (after successful fencing) shows:
    • OFFLINE: [ agrp-c01n02 ]
    • fence_ipmi_n01 (stonith:fence_ipmilan): Stopped
stonith_admin --history=* # must show which node fenced the other and when (* means history from all nodes); this is shown on the fence initiator node, the fence victim will not show it.

We didn't enable the corosync and pacemaker services on our nodes, so after the fenced node comes back up we must start the cluster stack on it manually:
pcs cluster start # this command starts corosync and pacemaker on the local node; to start the cluster on a remote node, specify its host-name after the start word, to start the cluster on all nodes use the --all argument, to stop use the stop argument
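
The variants mentioned above, spelled out:

pcs cluster start                 # start corosync and pacemaker on the local node only
pcs cluster start agrp-c01n02     # start the cluster stack on a specific node
pcs cluster start --all           # start it on all cluster nodes
pcs cluster stop --all            # the same forms work with stop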

If the IPMI fence devices are stopped because of failures, you can (after fixing the cause of the errors):
pcs stonith cleanup fence_ipmi_n01
pcs stonith cleanup fence_ipmi_n02

Make agrp-c01n01 kernel panic:
echo c > /proc/sysrq-trigger 
After that agrp-c01n02 must fence agrp-c01n01:
grep "Operation reboot of agrp-c01n01 by agrp-c01n02" /var/log/messages # must show "OK" and agrp-c01n01 must reboot
pcs status:

  • first shows:
    • Node agrp-c01n01: UNCLEAN (offline)
    • fence_ipmi_n02 (stonith:fence_ipmilan): Started agrp-c01n01 (UNCLEAN)
  • then (after successful fencing) shows:
    • OFFLINE: [ agrp-c01n01 ]
    • fence_ipmi_n02 (stonith:fence_ipmilan): Stopped

stonith_admin --history=* # must show which node fenced the other and when (* means history from all nodes); this is shown on the fence initiator node, the fence victim will not show it.


This tutorial was used to understand and setup clustering: 
AN!Cluster