Thursday, November 8, 2018

Cluster 25. Restoring a failed node.

In case of hardware failure, when one of the nodes (e.g. agrp-c01n02) needs to be restored:
  1. go through all steps in Cluster 1 - Cluster 11 blog-posts (do only stuff related to the failed node)
  2. Cluster 12 blog-post - go through steps till "Login to any of the cluster node and authenticate hacluster user." part  (do only stuff related to the failed node), then:
    1. passwd hacluster
    2. from an active node:
      1. pcs node maintenance agrp-c01n02
      2. pcs cluster auth agrp-c01n02
    3. from agrp-c01n02:
      1. pcs cluster auth
      2. pcs cluster start
      3. pcs cluster status # the node will be in maintenance mode and show many errors due to the absence of drbd / virsh and other packages
    4. then go through Cluster 12, starting at "Check cluster is functioning properly (on both nodes)" till "Quorum:" part
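    A quick sanity check at this point (not from the original posts; a minimal sketch assuming the stock corosync/pcs tooling):
      corosync-quorumtool -s # expect "Quorate: Yes" with both nodes listed under membership information
      pcs status corosync # both corosync nodes should be listed with their nodeids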
  3. go through all steps in Cluster 14 blog-post (do only stuff related to the failed node)
  4. Cluster 16 blog-post - go through steps till "Setup common DRBD options" part  (do only stuff related to the failed node), then:
    1. from agrp-c01n01:
      1. rsync -av /etc/drbd.d root@agrp-c01n02:/etc/
    2. from agrp-c01n02:
      1. drbdadm create-md r{0,1}
      2. drbdadm up r0; drbdadm secondary r0
      3. drbd-overview
      4. drbdadm up r1; drbdadm secondary r1
      5. drbd-overview
      6. wait till full synchronisation completes (see the sketch after this step)
      7. reboot failed node
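    One way to watch the synchronisation in sub-step 6 from agrp-c01n02 (a sketch, not from the original post; resource names r0/r1 as above):
      watch -n10 'cat /proc/drbd' # sync progress is shown as a percentage per resource
      drbdadm dstate all # once finished, both resources should report UpToDate/UpToDate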
  5. Cluster 17 blog-post - go through steps till "Setup DLM and CLVM" (do only stuff related to the failed node), then:
    1. drbdadm up all
    2. cat /proc/drbd
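    Optionally confirm that both resources reconnected cleanly (a sketch, not from the original post):
      drbdadm cstate all # expect Connected for r0 and r1
      drbdadm role all # the restored node is typically still Secondary at this point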
  6. Cluster 19 blog-post - only do check of the SNMP from the failed node:
    1. snmpwalk -v 2c -c agrp-c01-community 10.10.53.12
    2. fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action list
    3. fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel2 --action list
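    The same fence agent can also be asked for the state of a single port, and the stonith resources can be checked from an active cluster node (a sketch, not from the original post):
      fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action status
      pcs stonith show # the stonith resources should be Started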
  7. Cluster 20 blog-post - go through steps till "Provision Planning" (do only stuff related to the failed node), then:
    1. rsync -av /etc/libvirt/qemu/networks/ovs-network.xml  root@agrp-c01n02:/root
    2. systemctl start libvirtd 
    3. virsh net-define /root/ovs-network.xml 
    4. virsh net-list --all 
    5. virsh net-start ovs-network 
    6. virsh net-autostart ovs-network 
    7. virsh net-list 
    8. systemctl stop libvirtd
    9. rm  /root/ovs-network.xml
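    Before stopping libvirtd in sub-step 8, the network definition can be double-checked (a sketch, not from the original post):
      virsh net-info ovs-network # expect "Active: yes" and "Autostart: yes"
      virsh net-dumpxml ovs-network # compare with the definition on agrp-c01n01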
  8. For each VM, add a constraint banning the VM from starting on the failed node (I assume n02 is the failed one). The command below adds a -INFINITY location constraint for the specified resource and node:
    1. pcs resource ban vm01-rntp agrp-c01n02
    2. pcs resource ban vm02-rftp agrp-c01n02
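    With more VMs the same can be done in a loop and then verified (a sketch, assuming only the two VM resource names above):
      for vm in vm01-rntp vm02-rftp; do pcs resource ban $vm agrp-c01n02; done
      pcs constraint location # each ban typically shows up as a cli-ban-* constraint with score -INFINITY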
  9. Remove maintenance mode from the failed node (run from the surviving one) and start the cluster on the failed node:
    1. pcs node unmaintenance agrp-c01n02
    2. pcs cluster start
    3. pcs status
    4. wait till the r0 & r1 DRBD resources are Masters on both nodes and all resources (besides the VMs) are started on both nodes (see the sketch below)
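    Progress can be followed from either node (a sketch, not from the original post; the DRBD master/slave resource names depend on the Cluster 17 setup):
      watch -n10 'pcs status' # wait for "Masters: [ agrp-c01n01 agrp-c01n02 ]" on both DRBD resources
      crm_mon -1 # one-shot view of the same information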
  10. Cluster 18 blog-post, do only:
    1. yum install gfs2-utils -y
    2. tunegfs2 -l /dev/agrp-c01n01_vg0/shared # to view shared LV
    3. dlm_tool ls # lockspace names: clvmd & shared, members: 1 2
    4. pvs # should only show drbd and sdb devices
    5. lvscan # list all logical volumes in all volume groups (3 OS LVs, the shared LV & 1 LV per VM)
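    To confirm that the shared GFS2 filesystem is actually mounted on the restored node (a sketch, assuming /shared is the mount point as in Cluster 18):
      mount -t gfs2 # the shared LV should be listed, mounted on /shared
      df -h /shared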
  11. Cluster 21 blog-post:
    1. do "Firewall setup to support KVM Live Migration" (do only stuff related to the failed node)
    2. crm_simulate -sL | grep " vm[0-9]"
    3. SELinux related:
      1. ls -laZ /shared # must show "virt_etc_t" in all lines except the one for ".."
      2. if the above is not true, do the steps in "SELinux related issues" (do only stuff related to the failed node; one possible fix is sketched below)
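      If relabeling is needed, one possible fix (a minimal sketch; the original "SELinux related issues" section may do it differently, virt_etc_t is taken from the check above):
        semanage fcontext -a -t virt_etc_t "/shared(/.*)?"
        restorecon -R -v /shared
        ls -laZ /shared # re-check the context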
  12. One by one (for each VM):
    1. remove ban constraint for the first VM:
      1. pcs resource clear vm01-rntp
    2. verify that constraints are removed:
      1. pcs constraint  location
    3. if this VM must run on the restored node, wait till live migration completes
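    For example, for the second VM (a sketch, reusing the resource names from step 8):
      pcs resource clear vm02-rftp
      pcs constraint location # the cli-ban constraint for vm02-rftp should be gone
      crm_mon -1 | grep vm02-rftp # wait until the VM settles (migrates if needed) before clearing the next one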
  13. Congratulations, your cluster is restored to normal operation.
