Friday, April 27, 2018

Cluster 22. Testing Live-Migration, Overall Recovery and Fail-over

Setup Described

We have installed 4 VMs:
  • agrp-c01n01 is the primary node for vm01 & vm02 so these VMs may start on agrp-c01n02 only when agrp-c01n01 is unavailable
  • agrp-c01n02 is the primary node for vm03 & vm04 so these VMs may start on agrp-c01n01 only when agrp-c01n02 is unavailable

VM Live Migration tests (manual, on cluster stop, manual withdrawal)

Whenever you use a migrate command (pcs resource move), Pacemaker creates a permanent location constraint pinning the resource to that node. Something like:
pcs constraint --full | grep -E "Enabled.+vm[0-9]"
    Enabled on: agrp-c01n01 (score:INFINITY) (role: Started) (id:cli-prefer-vm02-www)
This is usually undesirable. To revoke this constraint once the migration has completed, or when the node is restored, issue pcs resource clear vm02-www
Resource-stickiness would cause the resource to stay where it is anyway, but by default it is zero:
pcs property list --defaults | grep stick
 default-resource-stickiness: 0
In these posts I prefer VMs to migrate back to their primary node automatically when it becomes available. If that behaviour is pointless for you, use resource-stickiness:
pcs resource update vm02-www meta resource-stickiness=INFINITY
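If you prefer to set stickiness cluster-wide rather than per resource, a default can be defined instead (a minimal sketch; the value 100 is only an example, use INFINITY to pin resources firmly):
pcs resource defaults resource-stickiness=100 # applies to every resource that does not override it
pcs resource defaults # verify the current resource defaults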

Manual Live Migration

pcs status resources
virsh console vm02-www # from node agrp-c01n01
uptime # note the uptime
pcs resource move vm02-www agrp-c01n02 # moving vm02-www resource from agrp-c01n01 to the agrp-c01n02
virsh console vm02-www # from node agrp-c01n02
uptime # the uptime must be equal to or greater than the previous result, confirming the VM was not restarted
pcs constraint --full | grep -E "Enabled.+Started.+vm[0-9]"
pcs resource clear vm02-www # from any node - this should cause vm02-www to migrate back to agrp-c01n01

Automatic Live Migration on cluster stop (on one node)

pcs status resources
virsh console vm02-www # from node agrp-c01n01
uptime # note the uptime
pcs cluster stop # stopping cluster on agrp-c01n01
pcs status # now vm01, vm02, vm03 and vm04 must be started on agrp-c01n02
virsh console vm02-www # from node agrp-c01n02
uptime # the uptime must be equal to or greater than the previous result, confirming the VM was not restarted
pcs constraint --full | grep -E "Enabled.+vm[0-9]" # as you can see, no constraints were added, because the migration was triggered by the cluster itself when agrp-c01n01 went offline
pcs cluster start # starting cluster on agrp-c01n01
pcs status # now vm01 and vm02 should migrate back to agrp-c01n01 automatically

Controlled Migration and Node Withdrawal

These steps must be repeated once for each node (first agrp-c01n01 & then agrp-c01n02):
agrp-c01n01 withdrawal:
resources=$(pcs resource | grep -E  "VirtualDomain.+n01" | awk '{print $1}')
for resource in $resources; do pcs resource move $resource agrp-c01n02; done
pcs status
pcs cluster stop # on agrp-c01n01
systemctl poweroff # on agrp-c01n01
resources=$(pcs constraint --full | grep -E "Enabled.+Started.+vm[0-9]" | awk '{print $7}' | cut -d\- -f3,4 | cut -d\) -f 1)
for resource in $resources; do pcs resource clear $resource; done
power on the powered-off node
pcs cluster start # all VMs should migrate back to their original positions

agrp-c01n02 withdrawal:
resources=$(pcs resource | grep -E  "VirtualDomain.+n02" | awk '{print $1}')
for resource in $resources; do pcs resource move $resource agrp-c01n01; done
pcs status
pcs cluster stop # on agrp-c01n02
systemctl poweroff # on agrp-c01n02
resources=$(pcs constraint --full | grep -E "Enabled.+Started.+vm[0-9]" | awk '{print $7}' | cut -d\- -f3,4 | cut -d\) -f 1)
for resource in $resources; do pcs resource clear $resource; done
power on the powered-off node
pcs cluster start # all VMs should migrate back to their original positions
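The two withdrawal procedures above differ only in which node is drained and which node receives the VMs, so they can be wrapped in a small helper. This is a hedged sketch built from the same pcs commands; the function name drain_node is hypothetical, and the pcs resource clear and power-on steps still have to be run afterwards as shown above:
# usage: drain_node <node-to-withdraw> <target-node>, e.g. drain_node agrp-c01n01 agrp-c01n02
drain_node() {
  local from="$1" to="$2"
  # list every VirtualDomain resource currently started on $from
  local resources
  resources=$(pcs resource | grep -E "VirtualDomain.+${from}" | awk '{print $1}')
  for resource in $resources; do pcs resource move "$resource" "$to"; done
  pcs status
}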

VM resource restarting test

This process should be repeated for each VM on its primary node. Only the steps for vm01-nagios are shown here, but they are identical for the other VMs:
clear; tail -f -n 0 /var/log/messages
virsh console vm01-nagios
shutdown -h 0
In the output of /var/log/messages you'll see lines like these:
....
Apr 26 16:08:16 agrp-c01n01 systemd-machined: Machine qemu-3-vm01-nagios terminated
Apr 26 16:08:17 agrp-c01n01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=5 -- --if-exists del-port vnet0
Apr 26 16:08:24 agrp-c01n01 pengine[2795]: warning: Processing failed op monitor for vm01-nagios on agrp-c01n01: not running (7)
...
Apr 26 16:08:24 agrp-c01n01 crmd[2796]:  notice: Initiating start operation vm01-nagios_start_0 locally on agrp-c01n01
...
Apr 26 16:08:25 agrp-c01n01 crmd[2796]:  notice: Result of start operation for vm01-nagios on agrp-c01n01: 0 (ok)
pcs status # vm01-nagios should be "Started"

Nodes Crash-Test

Simulating Software Crash

Crashing agrp-c01n01:
clear; tail -f -n 0 /var/log/messages # on agrp-c01n02
echo c > /proc/sysrq-trigger # on agrp-c01n01
What is in the log (key points):
16:28:17 agrp-c01n02 corosync[2217]: [TOTEM ] A processor failed, forming new configuration.
16:28:18 agrp-c01n02 attrd[2228]:  notice: Node agrp-c01n01 state is now lost
16:28:18 agrp-c01n02 kernel: dlm: closing connection to node 1
agrp-c01n02 corosync[2217]: [MAIN  ] Completed service synchronization, ready to provide service.
16:28:18 agrp-c01n02 dlm_controld[3148]: 2203 fence request 1 pid 22241 nodedown time 1524745698 fence_all dlm_stonith
agrp-c01n02 stonith-ng[2226]:  notice: Requesting peer fencing (reboot) of agrp-c01n01
16:28:22 agrp-c01n02 kernel: drbd r0: PingAck did not arrive in time.
16:28:22 agrp-c01n02 kernel: drbd r0: helper command: /sbin/drbdadm fence-peer r0
16:28:23 agrp-c01n02 pengine[2229]: warning: Cluster node agrp-c01n01 will be fenced: peer is no longer part of the cluster
16:28:23 agrp-c01n02 pengine[2229]: warning: Node agrp-c01n01 is unclean
16:28:23 agrp-c01n02 crmd[2230]:  notice: Requesting fencing (reboot) of node agrp-c01n01
16:28:26 agrp-c01n02 kernel: drbd r1: PingAck did not arrive in time.
16:28:26 agrp-c01n02 kernel: drbd r1: helper command: /sbin/drbdadm fence-peer r1
16:28:47 agrp-c01n02 stonith-ng[2226]:  notice: Call to fence_ipmi_n01 for 'agrp-c01n01 reboot' on behalf of stonith-api.22241@agrp-c01n02: OK (0)
16:28:47 agrp-c01n02 crmd[2230]:  notice: Peer agrp-c01n01 was terminated (reboot) by agrp-c01n02 for agrp-c01n02: OK (ref=47de2c75-7e7c-49d3-9796-17779e46e0bd) by client crmd.2230
16:28:47 agrp-c01n02 pengine[2229]:  notice:  * Start      vm02-www          (       agrp-c01n02 )
16:28:47 agrp-c01n02 pengine[2229]:  notice:  * Start      vm01-nagios        (       agrp-c01n02 )
16:28:48 agrp-c01n02 crm-fence-peer.sh[22390]: INFO peer is fenced, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-r1-ms_drbd_r1'
16:28:48 agrp-c01n02 crm-fence-peer.sh[22271]: INFO peer is fenced, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-r0-ms_drbd_r0'
16:28:49 agrp-c01n02 kernel: GFS2: fsid=agrp-c01:shared.1: jid=0: Looking at journal...
16:28:49 agrp-c01n02 kernel: GFS2: fsid=agrp-c01:shared.1: recover generation 9 done
16:28:49 agrp-c01n02 crmd[2230]:  notice: Result of start operation for vm02-www on agrp-c01n02: 0 (ok)
16:28:49 agrp-c01n02 crmd[2230]:  notice: Result of start operation for vm01-nagios on agrp-c01n02: 0 (ok)

BE CAREFUL: on a production system it's better to start the cluster on the restored node, wait until r0 is UpToDate and only then delete the r0 constraint (r0 will be promoted to master on both nodes), then wait until r1 is UpToDate and delete the r1 constraint (r1 will be promoted to master on both nodes):
constraints=$(pcs constraint --full | grep -E "drbd-fence.+rule" | awk '{print $4}' | cut -d\: -f 2 | cut -d\) -f 1)
for constraint in $constraints; do pcs constraint remove $constraint; done
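A hedged sketch of that more careful production flow for r0 (run after the cluster has been started on the restored node; assumes DRBD 8.4's drbdadm dstate output and the constraint name shown in the log above, repeat the same for r1):
drbdadm dstate r0 # prints e.g. UpToDate/UpToDate once resync has finished
until drbdadm dstate r0 | grep -q 'UpToDate/UpToDate'; do sleep 10; done
pcs constraint remove drbd-fence-by-handler-r0-ms_drbd_r0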
Login to the agrp-c01n01:
pcs cluster start # vm01 & vm02 will migrate to the agrp-c01n01


Crashing agrp-c01n02:
clear; tail -f -n 0 /var/log/messages # on agrp-c01n01
echo c > /proc/sysrq-trigger # on agrp-c01n02
What is in the log (key points): 
mostly the same as for agrp-c01n01

BE CAREFUL: on a production system it's better to start the cluster on the restored node, wait until r0 is UpToDate and only then delete the r0 constraint (r0 will be promoted to master on both nodes), then wait until r1 is UpToDate and delete the r1 constraint (r1 will be promoted to master on both nodes):
constraints=$(pcs constraint --full | grep -E "drbd-fence.+rule" | awk '{print $4}' | cut -d\: -f 2 | cut -d\) -f 1)
for constraint in $constraints; do pcs constraint remove $constraint; done
Login to the agrp-c01n02:
pcs cluster start # vm03 & vm04 will migrate to the agrp-c01n02

Simulating Hardware Crash

Crashing agrp-c01n01:
clear; tail -f -n 0 /var/log/messages # on agrp-c01n02
power-off the node (pull power cord out of PSU)
What is in the log (key points): 
mostly the same as for agrp-c01n01 (while Simulating Software Crash), different points:
18:05:58 agrp-c01n02 corosync[4176]: [TOTEM ] A processor failed, forming new configuration.
18:06:34 agrp-c01n02 fence_ipmilan: Connection timed out
18:06:02 agrp-c01n02 crm-fence-peer.sh[6507]: No messages received in 3 seconds.. aborting
18:07:10 agrp-c01n02 stonith-ng[4190]:  notice: Call to fence_ipmi_n01 for 'agrp-c01n01 reboot' on behalf of stonith-api.6572@agrp-c01n02: Connection timed out (-110)
Apr 26 18:07:10 agrp-c01n02 stonith-ng[4190]: warning: Agent 'fence_ifmib' does not advertise support for 'reboot', performing 'off' action instead
18:07:28 agrp-c01n02 crm-fence-peer.sh[6507]: INFO peer is not reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-r1-ms_drbd_r1'
18:07:28 agrp-c01n02 kernel: drbd r1: helper command: /sbin/drbdadm fence-peer r1 exit code 5 (0x500)
18:07:28 agrp-c01n02 kernel: drbd r1: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
18:07:30 agrp-c01n02 stonith-ng[4190]:  notice: Call to fence_ifmib_n01 for 'agrp-c01n01 reboot' on behalf of stonith-api.6572@agrp-c01n02: OK (0)
18:07:30 agrp-c01n02 kernel: drbd r0: fence-peer helper returned 7 (peer was stonithed)
18:07:32 agrp-c01n02 crmd[4194]:  notice: Result of start operation for vm01-nagios on agrp-c01n02: 0 (ok)
18:07:32 agrp-c01n02 crmd[4194]:  notice: Result of start operation for vm02-www on agrp-c01n02: 0 (ok)
power-on agrp-c01n01
fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel2 --action status
fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel2 --action on
fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel2 --action status
Login to the agrp-c01n01 and verify it's operational
pcs stonith cleanup # on agrp-c01n02

BE CAREFUL: on a production system it's better to start the cluster on the restored node, wait until r0 is UpToDate and only then delete the r0 constraint (r0 will be promoted to master on both nodes), then wait until r1 is UpToDate and delete the r1 constraint (r1 will be promoted to master on both nodes):
constraints=$(pcs constraint --full | grep -E "drbd-fence.+rule" | awk '{print $4}' | cut -d\: -f 2 | cut -d\) -f 1)
for constraint in $constraints; do pcs constraint remove $constraint; done
Login to the agrp-c01n01:
pcs cluster start # vm01 & vm02 will migrate to the agrp-c01n01
pcs status


Crashing agrp-c01n02:
clear; tail -f -n 0 /var/log/messages # on agrp-c01n01
power-off the node (pull power cord out of PSU)
What is in the log (key points): 
mostly the same as for agrp-c01n01 (while Simulating Software & Hardware Crash)


power-on agrp-c01n02
fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action status
fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action on
fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action status
Login to the agrp-c01n02 and verify it's operational
pcs stonith cleanup # on agrp-c01n01

BE CAREFUL: on a production system it's better to start the cluster on the restored node, wait until r0 is UpToDate and only then delete the r0 constraint (r0 will be promoted to master on both nodes), then wait until r1 is UpToDate and delete the r1 constraint (r1 will be promoted to master on both nodes):
constraints=$(pcs constraint --full | grep -E "drbd-fence.+rule" | awk '{print $4}' | cut -d\: -f 2 | cut -d\) -f 1)
for constraint in $constraints; do pcs constraint remove $constraint; done
Login to the agrp-c01n02:
pcs cluster start # vm03 & vm04 will migrate to agrp-c01n02
pcs status


Administrative modes: standby, unmanaged, maintenance

Recurring monitor operations behave differently under various administrative settings:
  1. When a resource is unmanaged: No monitors will be stopped. If the unmanaged resource is stopped on a node where the cluster thinks it should be running, the cluster will detect and report that it is not, but it will not consider the monitor failed, and will not try to start the resource until it is managed again. Starting the unmanaged resource on a different node is strongly discouraged and will at least cause the cluster to consider the resource failed, and may require the resource’s target-role to be set to Stopped then Started to be recovered.
  2. When a node is put into standby: All resources will be moved away from the node, and all monitor operations will be stopped on the node, except those with role=Stopped. Monitor operations with role=Stopped will be started on the node if appropriate.
  3. When the cluster is put into maintenance mode: All resources will be marked as unmanaged. All monitor operations will be stopped, except those with role=Stopped. As with single unmanaged resources, starting a resource on a node other than where the cluster expects it to be will cause problems.
Maintenance mode and making resources unmanaged are the preferred methods if you are making online changes on the cluster nodes. Standby mode is preferred if you need to do hardware maintenance.
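For quick reference, these are the pcs commands behind the three modes (each is covered in detail below):
pcs resource unmanage <resource> # make a single resource unmanaged
pcs node standby <node> # node stops running resources but keeps voting
pcs property set maintenance-mode=true # put the whole cluster into maintenance (all resources unmanaged)
pcs node maintenance <node> # put a single node into maintenance mode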

Managed/unmanaged resources

To make a resource unmanaged:
pcs resource unmanage libvirtd # the same as: pcs resource update libvirtd meta is-managed=false
pcs status
 Clone Set: libvirtd-clone [libvirtd]
     libvirtd (systemd:libvirtd): Started agrp-c01n01 (unmanaged)
     libvirtd (systemd:libvirtd): Started agrp-c01n02 (unmanaged)
To make resource managed again:
pcs resource manage libvirtd # the same as: pcs resource update libvirtd meta is-managed=true
pcs status
 Clone Set: libvirtd-clone [libvirtd]
     Started: [ agrp-c01n01 agrp-c01n02 ]

Standby/Unstandby 


Standby means that the node is not permitted to run any resources but still participates in voting (if not shut down).

To move node to the standby mode:
pcs node standby agrp-c01n02
After issuing this command, all VMs are migrated to agrp-c01n01, then all resources on agrp-c01n02 are stopped and the node itself is shown as:
Node agrp-c01n02: standby
pcs quorum status # we'll see that standby node is also participating in voting
A node in standby mode can be rebooted or shut down; the status will then change to:
Node agrp-c01n02: OFFLINE (standby)
The total number of votes will also drop to "1":
pcs quorum status

To clear standby mode after reboot or shutdown:
pcs cluster start # on node agrp-c01n02
pcs node unstandby agrp-c01n02


Maintenance

In a Pacemaker cluster, as in a standalone system, operators must complete maintenance tasks such as software upgrades and configuration changes. Here's what you need to keep Pacemaker's built-in monitoring features from creating unwanted side effects.
With clone and master-slave resources the better way is to place the node into standby mode, because a node in maintenance mode can be fenced by the other node (e.g. a DRBD PingAck will not arrive in time and the node will be fenced).

Putting the entire cluster into maintenance:
pcs property list --defaults | grep mainte
pcs property set maintenance-mode=true
or
pcs node maintenance --all
pcs status
              *** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

Online: [ agrp-c01n01 agrp-c01n02 ]

Full list of resources:
All resources are shown with " (unmanaged)" added to the end of the resource status line. 

In maintenance mode, you can stop or restart cluster resources at will. Pacemaker will not attempt to restart them. All resources automatically become unmanaged, that is, Pacemaker will cease monitoring them and hence be oblivious about their status. You can even stop all Pacemaker services on a node, and all the daemons and processes originally started as Pacemaker managed cluster resources will continue to run.
You should know that when you start Pacemaker services on a node while the cluster in maintenance mode, Pacemaker will initiate a single one-shot monitor operation (a "probe") for every resource just so it has an understanding of what resources are currently running on that node. It will, however, take no further action other than determining the resources' status.
Maintenance mode is something you enable before running other maintenance actions, not when you're already half-way through them. And unless you're very well versed in the interdependencies of resources running on the cluster you're working on, it's usually the very safest option. In short: when doing maintenance on your Pacemaker cluster, by default, enable maintenance mode before you start, and disable it after you're done.
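A minimal sketch of that workflow (the yum update is only an example of a maintenance task):
pcs property set maintenance-mode=true
yum update -y # example maintenance task; resources keep running but are not monitored
pcs property set maintenance-mode=false
pcs status # probes run and the cluster re-learns the state of every resource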

To take the entire cluster out of maintenance:
pcs property set maintenance-mode=false
or
pcs node unmaintenance --all

Maintenance for a single node:
pcs node maintenance # will set local node into maintenance mode
pcs node maintenance agrp-c01n01 # will set agrp-c01n01 into maintenance mode


Resources disable/enable

If you need to disable (stop) a resource:
pcs resource disable vm01-nagios
pcs status
 vm01-nagios (ocf::heartbeat:VirtualDomain): Started agrp-c01n01 (disabled)
virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     vm02-www                       running

To enable resource:
pcs resource enable vm01-nagios
pcs status
 vm01-nagios (ocf::heartbeat:VirtualDomain): Started agrp-c01n01
virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     vm02-www                     running
 2     vm01-nagios                    running


These tutorials were used to understand and set up clustering: 
AN!Cluster
unixarena
clusterlabs.org
hastexo.com

Monday, April 23, 2018

Cluster 21. Making installed VM (Virtual Machine) a cluster resource.

Related to libvirtd

In order to start a VM as an HA resource, libvirtd must be up and running (on both nodes):
Create the resource, make it a clone and order it to start after sharedfs, because libvirtd uses storage pools (virsh pool-list, virsh pool-info) and the files pool is /shared/files:
pcs resource create libvirtd systemd:libvirtd
pcs resource clone libvirtd clone-max=2 clone-node-max=1 interleave=true
pcs constraint order start sharedfs-clone then start libvirtd-clone
pcs constraint colocation add libvirtd-clone with sharedfs-clone
some options can be found here: Cluster 17

Firewall setup to support KVM Live Migration

Setup firewall ports for KVM live-migration (on both nodes):
On node1:
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.10.53.2/32" port protocol="tcp" port="49152-49216" accept'
firewall-cmd --reload
firewall-cmd --list-all
On node2:
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.10.53.1/32" port protocol="tcp" port="49152-49216" accept'
firewall-cmd --reload
firewall-cmd --list-all


49152-49216 is a pool of TCP ports used randomly by virsh to perform live migration.
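One hedged way to confirm, while a migration is in flight, that the traffic really uses this range is to look at the qemu TCP connections on either node:
ss -tnp | grep -i qemu # look for an ESTABLISHED connection on a port in the 49152-49216 range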

Related to VM itself


In order for the cluster to manage a server, it must know where to find the "definition" file that describes the virtual machine and its hardware. When the server was created with virt-install, it saved this definition file in /etc/libvirt/qemu/
Normal libvirtd tools are not cluster-aware, so we don't want them to see our server except when it is running. We will achieve this by "undefining" our VM.

First we'll share definition:
virsh list --all # list running and power-off VMs
virsh dumpxml vm02-www # view VM definition xml dump
mkdir /shared/definitions
virsh shutdown vm02-www
virsh dumpxml vm02-www > /shared/definitions/vm02-www.xml # save dump to the shared directory, this file will be used to start, stop, recover and migrate the VM
verify that the xml is saved properly # because the next step will destroy the VM

Stop and destroy VM:
virsh destroy vm02-www
virsh undefine vm02-www
virsh list --all # be sure that needed VM is undefined

Setup VM cluster resource (this command is executed on VM primary node - vm02-www primary node is agrp-c01n01):
pcs resource create vm02-www ocf:heartbeat:VirtualDomain hypervisor="qemu:///system" config="/shared/definitions/vm02-www.xml" migration_transport=ssh meta allow-migrate=true op monitor interval="30" timeout="30s" op start interval="0" timeout="240s" op stop interval="0" timeout="120s"

Options described (for all options see pcs resource describe VirtualDomain or man ocf_heartbeat_VirtualDomain):
  1. hypervisor="qemu:///system" - you can find this uri by executing virsh --quiet uri
  2. migration_transport=ssh - use ssh while migrating VM
  3. meta allow-migrate=true - Resources have two types of options: meta-attributes and instance attributes. Meta-attributes apply to any type of resource, while instance attributes are specific to each resource agent. Visit clusterlabs.org/meta
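To review or change these attributes after the resource is created (a sketch; on the pcs version used here the show sub-command prints the resource configuration):
pcs resource show vm02-www # view instance and meta attributes of the resource
pcs resource update vm02-www meta allow-migrate=true # change a meta-attribute later if needed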

pcs constraint order start libvirtd-clone then vm02-www
pcs constraint colocation add vm02-www with libvirtd-clone 
The constraint below is needed because without it, after a node returns (after a failure or a manual cluster stop/start), Pacemaker will try to migrate the VM to the primary node without waiting for DRBD promotion:
pcs constraint colocation add vm02-www with master ms_drbd_r0

Scores are calculated per resource and node. Any node with a negative score for a resource can’t run
that resource. The cluster places a resource on the node with the highest score for it. Positive values indicate a preference for running the affected resource(s) on this node — the higher the value, the stronger the preference. Negative values indicate the resource(s) should avoid this node (a value of -INFINITY changes "should" to "must"):
pcs constraint location add lc_vm02_n01 vm02-www agrp-c01n01 1
pcs constraint location add lc_vm02_n02 vm02-www agrp-c01n02 0
The location constraints above are needed to automatically live-migrate the VM back to the node which is primary for it (for vm02-www the primary node is agrp-c01n01).

To view current score for the resource:
crm_simulate -sL | grep " vm[0-9]"
 vm02-www (ocf::heartbeat:VirtualDomain): Started agrp-c01n02
native_color: vm02-www allocation score on agrp-c01n01: -INFINITY
native_color: vm02-www allocation score on agrp-c01n02: 0
-INFINITY for agrp-c01n01 is actually not because of the constraint but because agrp-c01n01 is offline

SELinux related issues

SELinux is preventing /usr/bin/virsh from read access on the file vm01-nagios.xml

semodule -DB # enables complete logging of SELinux messages to audit.log (disables dontaudit rules)
open one more ssh to the node
tail -f -n0 /var/log/audit/audit.log
pcs resources cleanup
The message appeared:
type=AVC msg=audit(1524726228.964:514): avc:  denied  { read } for  pid=8711 comm="virsh" name="vm01-nagios.xml" dev="dm-5" ino=3477882 scontext=system_u:system_r:virsh_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file

It's complaining about access to the vm01-nagios.xml file, which is on device dm-5 and has inode 3477882. Let's find which device that is:
ls -lah /dev/mapper | grep dm-5 # It's agrp--c01n01_vg0-shared
Let's find what is inode 3477882:
find /shared -inum 3477882 # It's /shared/definitions/vm01-nagios.xml
Let's view SELinux context for /shared (we can also view context for only that file but we know that we can have many definitions in the /shared):
ls -laZ /shared # the context of . (the current directory, /shared) is system_u:object_r:unlabeled_t:s0; this context is not permissive enough, so we'll change it (only on one node, but verify on the other):
semanage fcontext -a -t virt_etc_t '/shared(/.*)?' 
restorecon -r /shared 
ls -laZ /shared
semodule -B # rebuild the policy with dontaudit rules re-enabled (reverses semodule -DB)

These tutorials were used to understand and set up clustering: 
AN!Cluster
unixarena
clusterlabs.org
rarforge.com

Restoring SQL queries from /var/log/asterisk/messages log-files.

One of our customers had a problem with Asterisk CEL due to insufficient HDD space left on the Asterisk server. To extract the SQL queries from the messages log, I used this sequence (a consolidated one-liner sketch follows the numbered steps):
  1. cd /var/log/asterisk/
  2. grep "cel_odbc.c: Insert failed on" messages | awk '{$1=$2=$3=$4=$5=$6=$7=$8=$9=$10=$11=""; print $0}'  > messages_last
  3. remove all sequences of more than one whitespace (the previous command left an 11-space run):
    1. tr -s " " < messages_last > messages_truncated
  4. remove one white space left before INSERT statement:
    1. sed -i 's/^ *//' messages_truncated
  5. append ";" to the end of each line:
    1. sed -i 's/$/;/' messages_truncated
  6. execute queries from the file:
    1. mysql -u root -p asterisk
    2. SOURCE /var/log/asterisk/messages_truncated
    3. or
    4. mysql -u root -p asterisk < /var/log/asterisk/messages_truncated
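The same extraction can be done as a single pipeline (a sketch equivalent to steps 2-5 above; review messages_truncated before feeding it to MySQL):
grep "cel_odbc.c: Insert failed on" messages \
  | awk '{$1=$2=$3=$4=$5=$6=$7=$8=$9=$10=$11=""; print $0}' \
  | tr -s " " \
  | sed -e 's/^ *//' -e 's/$/;/' > messages_truncated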



Tuesday, April 17, 2018

Cluster 20. Install & setup environment needed for clustered virtualization

KVM Installation & initial setup

Install packages needed for KVM:
yum install -y kvm virt-manager virt-install libvirt libvirt-python libguestfs-tools syslinux pciutils
Verify that the packages were installed correctly:
lsmod | grep kvm # to see if the kvm and kvm_intel modules are loaded

Packages described:

  • kvm - hypervisor
  • virt-manager - package contains several command-line utilities (also GUI tools) for building and installing new virtual machines, and virt-clone for cloning existing virtual machines
  • libvirt - is a C toolkit to interact with the virtualization capabilities of recent versions of Linux (and other OSes). The library aims at providing a long term stable C API for different virtualization mechanisms. It currently supports QEMU, KVM, XEN, OpenVZ, LXC, and VirtualBox.
  • libvirt-python - package provides a module that permits applications written in the Python programming language to call the interface supplied by the libvirt library, to manage the virtualization capabilities of recent versions of Linux (and other OSes).
  • libguestfs-tools - This package contains guestfish (a shell and command-line tool for examining and modifying virtual machine filesystems) and various virtualization tools, including virt-cat, virt-df, virt-edit, virt-filesystems, virt-inspector, virt-ls, virt-make-fs, virt-rescue, virt-resize, virt-tar, and virt-win-reg
  • syslinux - is a suite of bootloaders, currently supporting DOS FAT filesystems, Linux ext2/ext3 filesystems (EXTLINUX), PXE network boots (PXELINUX), or ISO 9660 CD-ROMs  (ISOLINUX). It also includes a tool, MEMDISK, which loads legacy operating systems from these media.
  • pciutils - The pciutils package contains various utilities for : inspecting and setting devices connected to the PCI bus.
Check and destroy the default libvirtd bridge. By default, VMs only have network access to other VMs on the same server (and to the host itself) via the private network 192.168.122.0; if you want the VMs to have access to your LAN, you must create a network bridge on the host:

systemctl start libvirtd
systemctl status libvirtd
ip route | grep virbr0
virsh net-destroy default 
virsh net-autostart default --disable 
virsh net-undefine default
ip route | grep virbr0

Check and disable libvirtd:
systemctl status libvirtd
systemctl stop libvirtd
systemctl disable libvirtd

Provision Planning

The servers I'm using to write this tutorial are a little modest in the RAM department with only 16 GiB of RAM. We need to subtract at least 2 GiB for the host nodes, leaving us with a total of 14 GiB. That needs to be divided up among all your servers. Now, nothing says you have to use it all, of course. It's perfectly fine to leave some RAM unallocated for future use. This is really up to you and your needs.

Let's put together a table with the RAM we plan to allocate, summarizing the LVs we're going to create for each server. The LVs will be named after the server they'll be assigned to, with the suffix _0. Later, if we add a second "hard drive" to a server, it will have the suffix _1 and so on.

Server          RAM (GiB)   Storage Pool (VG)   LV name            LV size
vm01-nagios     2           agrp-c01n01         vm01-nagios_0      150 GB
vm02-www        4           agrp-c01n01         vm02-www_0         150 GB
vm03-mysql      3           agrp-c01n02         vm03-mysql_0       100 GB
vm04-asterisk   4           agrp-c01n02         vm04-asterisk_0    100 GB
Total           13 GiB      ---                 ---                500 GB

As you can see, we'll use 13 GiB of RAM, so the remaining RAM will be 3 GiB (16-13=3). We'll also use 500 GB of storage, so the remaining VM-dedicated storage (DRBD r0+r1 = 1000 GB in total) will be 500 GB (1000-500=500).
The same approach can be used for CPU - read this blog-post - how-many-vCPU-per-pCPU

Provision Shared CentOS ISOs

Before we can install the OS, we need to copy the installation media (and our driver disk, if needed) into /shared/files.
For our needs we'll install CentOS 6 & CentOS 7 machines (for Windows machines, please visit: AN!Cluster_Tutorial - alteeve.com). So download both CentOS 6 & 7 Minimal images and then send them to the nodes (I'll be using one of our office machines):
pcs cluster start --all # if didn't start previously
rsync -av --progress CentOS-7-x86_64-Minimal-1708.iso root@172.16.51.1:/shared/files/
rsync -av --progress CentOS-6.9-x86_64-minimal.iso root@172.16.3.235:/shared/files

Creating Storage for VMs

Earlier, we used parted to examine our free space and create our DRBD partitions. Unfortunately, parted shows sizes in GB (base 10) whereas LVM uses GiB (base 2). If we used LVM's "xxG" size notation, it would use more space than we expect relative to our planning in the parted stage. LVM doesn't allow specifying new LV sizes in GB instead of GiB, so here we will specify sizes in MiB to help narrow the difference.
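To see the difference in numbers, plain shell arithmetic is enough (illustration only):
echo $((150 * 1000**3)) # 150 GB  (parted, base 10) = 150000000000 bytes
echo $((150 * 1024**3)) # 150 GiB (LVM, base 2) = 161061273600 bytes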
Storage creation is the same for all VMs, so I'll show only one LV creation:
lvcreate -L 150000M -n vm01-nagios_0 agrp-c01n01_vg0
or you can use a byte count (i.e. 150 GiB = 150*1024*1024*1024 bytes = 161061273600b):
lvcreate -L 161061273600b -n vm01-nagios_0 agrp-c01n01_vg0
lvdisplay /dev/agrp-c01n01_vg0/vm01-nagios_0
To remove lv:
lvremove /dev/agrp-c01n01_vg0/vm01-nagios_0 

Creating OpenVSwitch group for VMs

Find name of the bridge:
ovs-vsctl list Bridge | grep name
Add a port group to the file /shared/provision/ovs-network.xml (if more than one VLAN is needed, add a <portgroup>..</portgroup> for every VLAN; see the example after this snippet)

<network>
<name>ovs-network</name>
<forward mode='bridge'/>
<bridge name='ovs_kvm_bridge'/>
<virtualport type='openvswitch'/>
    <portgroup name='vlan-51'>
         <vlan>
            <tag id='51'/>
        </vlan>
   </portgroup>
</network>
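For example, if a second VLAN were needed (here a hypothetical vlan-52), another portgroup would be added inside the same <network> element:
    <portgroup name='vlan-52'>
         <vlan>
            <tag id='52'/>
        </vlan>
   </portgroup>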

To add the network to KVM (from both nodes):
systemctl start libvirtd
virsh net-define /shared/provision/ovs-network.xml
virsh net-list --all
virsh net-start ovs-network
virsh net-autostart ovs-network 
virsh net-list
systemctl stop libvirtd

To delete network from KVM (if needed - from both nodes):
virsh net-list
virsh net-destroy ovs-network
virsh net-autostart --disable ovs-network
virsh net-undefine ovs-network

Virtio

So-called "full virtualization" is a nice feature because it allows you to run any operating system virtualized. However, it's slow because the hypervisor has to emulate actual physical devices such as RTL8139 network cards . This emulation is both complicated and inefficient. 
Virtio is a virtualization standard for network and disk device drivers where just the guest's device driver "knows" it is running in a virtual environment, and cooperates with the hypervisor. This enables guests to get high performance network and disk operations, and gives most of the performance benefits of paravirtualization.
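A quick check from inside an installed guest that it is really using virtio devices (standard commands, shown only as a sanity check):
lspci | grep -i virtio # virtio network and block PCI devices should be listed
lsmod | grep virtio # virtio_net, virtio_blk and virtio_pci modules should be loaded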

Creating virt-install call

touch /shared/provision/vm01-nagios.sh 
chmod 755 /shared/provision/vm01-nagios.sh 
vim /shared/provision/vm01-nagios.sh
virt-install --connect qemu:///system \
--name=vm01-nagios \
--ram=2048 \
--arch=x86_64 \
--vcpus=2 \
--location=/shared/files/CentOS-6.9-x86_64-minimal.iso \
--os-variant=centos6.9 \
--network network=ovs-network,portgroup=vlan-51,model=virtio \
--disk path=/dev/agrp-c01n01_vg0/vm01-nagios_0,bus=virtio \
--graphics none \
--extra-args 'console=ttyS0'




Options Described:
  1. --connect qemu:///system - This tells virt-install to use the QEMU hardware emulator (as opposed to Xen, for example) and to install the server onto the local node.
  2. --name vm01-nagios - This sets the name of the server. It is the name we will use in the cluster configuration and whenever we use the libvirtd tools, like virsh.
  3. --ram 2048 - This sets the amount of RAM, in MiB, to allocate to this server. Here, we're allocating 2 GiB, which is 2048 MiB.
  4. --arch x86_64 - i386 – 32bit old CPUs, i686 – 32bit new CPUs, x86-64 – 64bit CPUs
  5. --vcpus 2 - This sets the number of CPU cores to allocate to this server. Here, we're allocating two CPUs.
  6. --location /shared/files/CentOS-6.9-x86_64-minimal.iso - Distribution tree installation source. virt-install can recognize certain distribution trees and fetches a bootable kernel/initrd pair to launch the install.
  7. --os-variant centos6.9 - This tweaks the virt-manager's initial method of running and tunes the hypervisor to try and get the best performance for the server. There are many possible values here for many, many different operating systems. If you run osinfo-query os on your node, you will get a full list of available operating systems. If you can't find your exact operating system, select the one that is the closest match.
  8. --network network=ovs-network,portgroup=vlan-51,model=virtio - This tells the hypervisor that we want to create a network card using the virtio "hardware" and that we want it plugged into the ovs-network bridge's  vlan-51 portgroup. We only need one network card, but if you wanted two or more, simply repeat this command. If you create two or more bridges, you can have different network devices connect to different bridges.
  9. --disk path=/dev/agrp-c01n01_vg0/vm01-nagios_0,bus=virtio - This tells the hypervisor what LV to use for the server's "hard drive". It also tells it to use the virtio emulated SCSI controller.
  10.  --graphics none - we'll use only CLI without any GUI (also for installation)
  11. --extra-args 'console=ttyS0' - this is needed to see installation process from console

Installing VM on the node

We can install any server from either node. However, we know that each server has a preferred node, so it's sensible to use that host for the installation stage. In the case of vm01-nagios, the preferred host is agrp-c01n01, so we'll use it to start the installation.
  • ssh to the agrp-c01n01
  • systemctl start libvirtd
  • /shared/provision/vm01-nagios.sh
  • Go through steps of text-mode installation
  • To exit installed VM hit Ctrl+5 (remote connect) or Ctrl+] (local connect)
  • To connect to the installed VM virsh console vm01-nagios
  • To list installed systems and their operating mode virsh list --all
  • To start a VM in the "shut off" state virsh start vm01-nagios
Steps to perform on a VM after installation (if you need them): 
For CentOS6:
  1. chkconfig ip6tables off
  2. service ip6tables stop
  3. cat /etc/sysconfig/network
    1. NETWORKING=yes
    2. NETWORKING_IPV6=no
  4. vi /etc/sysctl.conf
    1. net.ipv6.conf.all.disable_ipv6 = 1
    2. net.ipv6.conf.default.disable_ipv6 = 1
    3. kernel.panic = 5 # self-reboot in 5 seconds when panicking
  5. sysctl -p
  6. vi /etc/sysconfig/network-scripts/ifcfg-eth0
    1. NM_CONTROLLED=no
    2. ONBOOT=yes
  7. service network restart
  8. ip route
For CentOS7 (this version by default does self-restart on kernel panicking):
  1. systemctl stop NetworkManager
  2. systemctl disable NetworkManager
  3. chkconfig network on
  4. systemctl start network
  5. vi /etc/sysconfig/network-scripts/ifcfg-eth0
    1. ONBOOT=yes
  6. systemctl restart network
  7. ip route
To learn IP addresses and OVS port names of the VM (execute from the node where VM is situated): 
for name in $(virsh list | awk '{print $2}' | grep -v '^$\|Name'); do echo $name;virsh domiflist $name; echo""; done 

vm02-www 
Interface Type     Source           Model   MAC 
---------------------------------------------------------------------- 
vnet0      bridge   ovs-network   virtio    52:54:00:77:3a:a0

nagios
Interface Type     Source           Model   MAC
----------------------------------------------------------------------
vnet1      bridge   ovs-network   virtio    52:54:00:77:3d:19


VM shutdown test

To test whether the VM can be shut down:
virsh shutdown vm01-nagios
If the shutdown is not performed and the VM remains active (this is mostly a problem on CentOS 6):
virsh console vm01-nagios
yum -y install acpid
service acpid start
chkconfig --level 235 acpid on
chkconfig --list acpid
Test again:
virsh shutdown vm02-www

ACPI (Advanced Configuration and Power Interface) is an enhanced interface for power management. ACPI is a component of many modern computers; it gives the ability to manage power programmatically and also to query battery state and parameters.

These tutorials were used to understand and set up clustering: 



Tuesday, April 10, 2018

Cisco ASA how to find why packet is not going in or out through VPN

For example, we want to check access through the INSIDE interface from the outside client 10.10.100.100, TCP port 30000, to the internal server 10.20.100.100, TCP port 3389 (Windows RDP).
First you need to check packet "movement":
packet-tracer input INSIDE tcp 10.10.100.100 30000 10.20.100.100 3389 detailed

Correct all problems that appear in each phase. If the only "Drop" result is on the VPN phase, then:
sh run route | grep 10.20.100.100 # found gateway is 10.30.100.100
sh run group-policy | grep 10.30.100.100 # found group-policy name is GP_10.30.100.100
sh run group-policy GP_10.30.100.100 | grep vpn-filter # found ACL name is INSIDE.30.100.100.vpn.filter
Now you can verify this ACL and add the needed permissions.

Thursday, April 5, 2018

Install Gnome Desktop GUI to the CentOS7

  1. yum install yum-utils
  2. yum grouplist | grep -i desktop
  3. yum groupinstall "Gnome Desktop"
  4. systemctl set-default graphical.target
  5. reboot