Cluster 25. Restoring failed node.
In case of a hardware failure that requires restoring one of the nodes (e.g. agrp-c01n02):
- go through all steps in Cluster 1 - Cluster 11 blog-posts (do only stuff related to the failed node)
- Cluster 12 blog-post - go through steps till "Login to any of the cluster node and authenticate hacluster user." part (do only stuff related to the failed node), then:
- passwd hacluster
- from an active node:
- pcs node maintenance agrp-c01n02
- pcs cluster auth agrp-c01n02
- from agrp-c01n02:
- pcs cluster auth
- pcs cluster start
- pcs cluster status # node must be in maintenance mode with many errors due to absence of drbd / virsh and other packages
- then go through Cluster 12, starting at "Check cluster is functioning properly (on both nodes)" till "Quorum:" part
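The maintenance-mode check above can be scripted instead of eyeballed; a minimal sketch, assuming `pcs status` prints a `Node <name>: maintenance` line for nodes in maintenance mode (the helper name is hypothetical, verify the line format against your pcs version):

```shell
# Hypothetical helper: confirm a node is listed as "maintenance" in pcs status text.
# $1: output of "pcs status", $2: node name.
node_in_maintenance() {
  echo "$1" | grep -q "Node $2: maintenance"
}

# Usage sketch on a live cluster (not run here):
#   node_in_maintenance "$(pcs status)" agrp-c01n02 && echo "in maintenance"
```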
- go through all steps in Cluster 14 blog-post (do only stuff related to the failed node)
- Cluster 16 blog-post - go through steps till "Setup common DRBD options" part (do only stuff related to the failed node), then:
- from agrp-c01n01:
- rsync -av /etc/drbd.d root@agrp-c01n02:/etc/
- from agrp-c01n02:
- drbdadm create-md r{0,1}
- drbdadm up r0; drbdadm secondary r0
- drbd-overview
- drbdadm up r1; drbdadm secondary r1
- drbd-overview
- wait until synchronisation is complete
- reboot failed node
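The synchronisation wait above can be polled rather than watched by hand; a sketch assuming the `ds:<local>/<peer>` disk-state format that /proc/drbd uses:

```shell
# Return success when a DRBD status line reports both disks UpToDate,
# i.e. synchronisation is finished.
is_synced() {
  echo "$1" | grep -q 'ds:UpToDate/UpToDate'
}

# Poll sketch on the restored node (commented out; needs a live DRBD resource):
#   until is_synced "$(grep 'ds:' /proc/drbd | head -1)"; do sleep 30; done
```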
- Cluster 17 blog-post - go through steps till "Setup DLM and CLVM" (do only stuff related to the failed node), then:
- drbdadm up all
- cat /proc/drbd
- Cluster 19 blog-post - only do check of the SNMP from the failed node:
- snmpwalk -v 2c -c agrp-c01-community 10.10.53.12
- fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action list
- fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel2 --action list
- Cluster 20 blog-post - go through steps till "Provision Planning" (do only stuff related to the failed node), then:
- rsync -av /etc/libvirt/qemu/networks/ovs-network.xml root@agrp-c01n02:/root
- systemctl start libvirtd
- virsh net-define /root/ovs-network.xml
- virsh net-list --all
- virsh net-start ovs-network
- virsh net-autostart ovs-network
- virsh net-list
- systemctl stop libvirtd
- rm /root/ovs-network.xml
- For each VM, add a constraint to ban the VM from starting on the failed node (assuming n02 is the failed one). The command below adds a -INFINITY location constraint for the specified resource and node:
- pcs resource ban vm01-rntp agrp-c01n02
- pcs resource ban vm02-rftp agrp-c01n02
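With more VMs, the bans are easier to apply in a loop; a dry-run sketch using the two VM resource names from this cluster (the `echo` prints each command instead of running it, drop it to execute for real):

```shell
# Dry run: print the ban command for every VM resource on the failed node.
FAILED_NODE=agrp-c01n02
for vm in vm01-rntp vm02-rftp; do
  echo pcs resource ban "$vm" "$FAILED_NODE"
done
```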
- Remove maintenance mode from the failed node (run from the surviving node), then start the cluster on the failed node:
- pcs node unmaintenance agrp-c01n02
- pcs cluster start
- pcs status
- wait until the r0 & r1 DRBD resources are masters on both nodes and all resources (besides the VMs) are started on both nodes
- Cluster 18 blog-post, do only:
- yum install gfs2-utils -y
- tunegfs2 -l /dev/agrp-c01n01_vg0/shared # to view shared LV
- dlm_tool ls # names: clvmd & shared / members: 1 2
- pvs # should only show drbd and sdb devices
- lvscan # List all logical volumes in all volume groups (3 OS LV, shared & 1 LV per VM)
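The expected LV count (3 OS LVs + shared + 1 per VM) can also be checked mechanically; a sketch assuming `lvscan` prints one `ACTIVE` line per volume (the helper name and the total of 6 for two VMs are assumptions):

```shell
# Count ACTIVE logical volumes in lvscan output; compare against the expected
# total (3 OS LVs + shared + 1 LV per VM = 6 with two VMs - an assumption).
count_active_lvs() {
  echo "$1" | grep -c 'ACTIVE'
}

# On a live node (not run here):
#   [ "$(count_active_lvs "$(lvscan)")" -eq 6 ] || echo "unexpected LV count"
```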
- Cluster 21 blog-post:
- do "Firewall setup to support KVM Live Migration" (do only stuff related to the failed node)
- crm_simulate -sL | grep " vm[0-9]"
- SELinux related:
- ls -laZ /shared # must show "virt_etc_t" in all lines except related to ".."
- if above line is not true, do stuff in "SELinux related issues" (do only stuff related to the failed node)
- One by one (for each VM):
- remove ban constraint for the first VM:
- pcs resource clear vm01-rntp
- verify that constraints are removed:
- pcs constraint location
- if this VM must be started on the restored node - wait until live migration completes
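The per-VM clear cycle above, sketched as a dry run (same two VM names as earlier; remove the `echo` to execute for real, and pause between iterations until any live migration finishes):

```shell
# Dry run of the per-VM unban cycle; drop "echo" to execute.
for vm in vm01-rntp vm02-rftp; do
  echo pcs resource clear "$vm"
  echo pcs constraint location   # verify the ban is gone before the next VM
done
```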
- Congratulations, your cluster is restored to normal operation