Cluster 25. Restoring failed node.
In case of a hardware failure that requires restoring one of the nodes (e.g. agrp-c01n02):
- go through all steps in Cluster 1 - Cluster 11 blog-posts (do only stuff related to the failed node)
- Cluster 12 blog-post - go through steps till "Login to any of the cluster node and authenticate hacluster user." part (do only stuff related to the failed node), then:
- passwd hacluster
- from an active node:
- pcs node maintenance agrp-c01n02
- pcs cluster auth agrp-c01n02
- from agrp-c01n02:
- pcs cluster auth
- pcs cluster start
- pcs cluster status # node must be in maintenance mode with many errors due to absence of drbd / virsh and other packages
- then go through Cluster 12, starting at "Check cluster is functioning properly (on both nodes)" till "Quorum:" part
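The maintenance-mode check above can be scripted instead of eyeballed; a minimal sketch, assuming `pcs status` prints a `Node <name>: maintenance` line for nodes in maintenance mode (the helper name is hypothetical, verify the line format against your pcs version):

```shell
# Hypothetical helper: confirm a node is listed as "maintenance" in pcs status text.
# $1: output of "pcs status", $2: node name.
node_in_maintenance() {
  echo "$1" | grep -q "Node $2: maintenance"
}

# Usage sketch on a live cluster (not run here):
#   node_in_maintenance "$(pcs status)" agrp-c01n02 && echo "in maintenance"
```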
- go through all steps in Cluster 14 blog-post (do only stuff related to the failed node)
- Cluster 16 blog-post - go through steps till "Setup common DRBD options" part (do only stuff related to the failed node), then:
- from agrp-c01n01:
- rsync -av /etc/drbd.d root@agrp-c01n02:/etc/
- from agrp-c01n02:
- drbdadm create-md r{0,1}
- drbdadm up r0; drbdadm secondary r0
- drbd-overview
- drbdadm up r1; drbdadm secondary r1
- drbd-overview
- wait until synchronisation is complete
- reboot failed node
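The synchronisation wait above can be polled rather than watched by hand; a sketch assuming the `ds:<local>/<peer>` disk-state format that /proc/drbd uses:

```shell
# Return success when a DRBD status line reports both disks UpToDate,
# i.e. synchronisation is finished.
is_synced() {
  echo "$1" | grep -q 'ds:UpToDate/UpToDate'
}

# Poll sketch on the restored node (commented out; needs a live DRBD resource):
#   until is_synced "$(grep 'ds:' /proc/drbd | head -1)"; do sleep 30; done
```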
- Cluster 17 blog-post - go through steps till "Setup DLM and CLVM" (do only stuff related to the failed node), then:
- drbdadm up all
- cat /proc/drbd
- Cluster 19 blog-post - only do check of the SNMP from the failed node:
- snmpwalk -v 2c -c agrp-c01-community 10.10.53.12
- fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action list
- fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel2 --action list
- Cluster 20 blog-post - go through steps till "Provision Planning" (do only stuff related to the failed node), then:
- rsync -av /etc/libvirt/qemu/networks/ovs-network.xml root@agrp-c01n02:/root
- systemctl start libvirtd
- virsh net-define /root/ovs-network.xml
- virsh net-list --all
- virsh net-start ovs-network
- virsh net-autostart ovs-network
- virsh net-list
- systemctl stop libvirtd
- rm /root/ovs-network.xml
- For each VM, add a constraint to ban the VM from starting on the failed node (assuming n02 is the failed one). The command below adds a -INFINITY location constraint for the specified resource and node:
- pcs resource ban vm01-rntp agrp-c01n02
- pcs resource ban vm02-rftp agrp-c01n02
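With more VMs, the bans are easier to apply in a loop; a dry-run sketch using the two VM resource names from this cluster (the `echo` prints each command instead of running it, drop it to execute for real):

```shell
# Dry run: print the ban command for every VM resource on the failed node.
FAILED_NODE=agrp-c01n02
for vm in vm01-rntp vm02-rftp; do
  echo pcs resource ban "$vm" "$FAILED_NODE"
done
```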
- Remove maintenance mode from the failed node (run from the surviving node), then start the cluster on the failed node:
- pcs node unmaintenance agrp-c01n02
- pcs cluster start
- pcs status
- wait until the r0 & r1 DRBD resources are masters on both nodes and all resources (besides the VMs) are started on both nodes
- Cluster 18 blog-post, do only:
- yum install gfs2-utils -y
- tunegfs2 -l /dev/agrp-c01n01_vg0/shared # to view shared LV
- dlm_tool ls # names: clvmd & shared / members: 1 2
- pvs # should only show drbd and sdb devices
- lvscan # List all logical volumes in all volume groups (3 OS LV, shared & 1 LV per VM)
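The expected LV count (3 OS LVs + shared + 1 per VM) can also be checked mechanically; a sketch assuming `lvscan` prints one `ACTIVE` line per volume (the helper name and the total of 6 for two VMs are assumptions):

```shell
# Count ACTIVE logical volumes in lvscan output; compare against the expected
# total (3 OS LVs + shared + 1 LV per VM = 6 with two VMs - an assumption).
count_active_lvs() {
  echo "$1" | grep -c 'ACTIVE'
}

# On a live node (not run here):
#   [ "$(count_active_lvs "$(lvscan)")" -eq 6 ] || echo "unexpected LV count"
```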
- Cluster 21 blog-post:
- do "Firewall setup to support KVM Live Migration" (do only stuff related to the failed node)
- crm_simulate -sL | grep " vm[0-9]"
- SELinux related:
- ls -laZ /shared # must show "virt_etc_t" in all lines except related to ".."
- if above line is not true, do stuff in "SELinux related issues" (do only stuff related to the failed node)
- One by one (for each VM):
- remove ban constraint for the first VM:
- pcs resource clear vm01-rntp
- verify that constraints are removed:
- pcs constraint location
- if this VM must be started on the restored node - wait until live migration completes
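The per-VM clear cycle above, sketched as a dry run (same two VM names as earlier; remove the `echo` to execute for real, and pause between iterations until any live migration finishes):

```shell
# Dry run of the per-VM unban cycle; drop "echo" to execute.
for vm in vm01-rntp vm02-rftp; do
  echo pcs resource clear "$vm"
  echo pcs constraint location   # verify the ban is gone before the next VM
done
```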
- Congratulations, your cluster is restored to normal operation