Friday, April 27, 2018

Cluster 22. Testing Live-Migration, Overall Recovery and Fail-over

Setup Described

We have installed 4 VMs:
  • agrp-c01n01 is the primary node for vm01 & vm02 so these VMs may start on agrp-c01n02 only when agrp-c01n01 is unavailable
  • agrp-c01n02 is the primary node for vm03 & vm04 so these VMs may start on agrp-c01n01 only when agrp-c01n02 is unavailable

VM Live Migration tests (manual, on cluster stop, manual withdrawal)

Whenever you use a migrate command (pcs resource move), Pacemaker creates a permanent location constraint pinning the resource to that node. Something like:
pcs constraint --full | grep -E "Enabled.+vm[0-9]"
    Enabled on: agrp-c01n01 (score:INFINITY) (role: Started) (id:cli-prefer-vm02-www)
This is usually undesirable. To revoke this constraint once the migration has completed, or when the node is restored, issue pcs resource clear vm02-www
Resource-stickiness would cause the resource to stay where it is anyway, but by default it is zero:
pcs property list --defaults | grep stick
 default-resource-stickiness: 0
In these posts I prefer VMs to migrate back to their primary node automatically when it becomes available. If that behaviour is pointless for you, use resource-stickiness:
pcs resource update vm02-www meta resource-stickiness=INFINITY
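If you prefer to set stickiness cluster-wide rather than per resource, a default can be defined instead (a minimal sketch; the value 100 is only an example, use INFINITY to pin resources firmly):
pcs resource defaults resource-stickiness=100 # applies to every resource that does not override it
pcs resource defaults # verify the current resource defaults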

Manual Live Migration

pcs status resources
virsh console vm02-www # from node agrp-c01n01
uptime # note the uptime
pcs resource move vm02-www agrp-c01n02 # moving vm02-www resource from agrp-c01n01 to the agrp-c01n02
virsh console vm02-www # from node agrp-c01n02
uptime # the uptime must be equal to or greater than the previous result, confirming the VM was not restarted
pcs constraint --full | grep -E "Enabled.+Started.+vm[0-9]"
pcs resource clear vm02-www # from any node - this should cause vm02-www to migrate back to agrp-c01n01

Automatic Live Migration on cluster stop (on one node)

pcs status resources
virsh console vm02-www # from node agrp-c01n01
uptime # note the uptime
pcs cluster stop # stopping cluster on agrp-c01n01
pcs status # now vm01, vm02, vm03 and vm04 must be started on agrp-c01n02
virsh console vm02-www # from node agrp-c01n02
uptime # the uptime must be equal to or greater than the previous result, confirming the VM was not restarted
pcs constraint --full | grep -E "Enabled.+vm[0-9]" # as you can see, no constraints were added, because the migration was triggered by the cluster itself when agrp-c01n01 went offline
pcs cluster start # starting cluster on agrp-c01n01
pcs status # now vm01 and vm02 should migrate back to agrp-c01n01 automatically

Controlled Migration and Node Withdrawal

These steps must be repeated once for each node (first agrp-c01n01 & then agrp-c01n02):
agrp-c01n01 withdrawal:
resources=$(pcs resource | grep -E  "VirtualDomain.+n01" | awk '{print $1}')
for resource in $resources; do pcs resource move $resource agrp-c01n02; done
pcs status
pcs cluster stop # on agrp-c01n01
systemctl poweroff # on agrp-c01n01
resources=$(pcs constraint --full | grep -E "Enabled.+Started.+vm[0-9]" | awk '{print $7}' | cut -d\- -f3,4 | cut -d\) -f 1)
for resource in $resources; do pcs resource clear $resource; done
power on the powered-off node
pcs cluster start # all VMs should migrate back to their original positions

agrp-c01n02 withdrawal:
resources=$(pcs resource | grep -E  "VirtualDomain.+n02" | awk '{print $1}')
for resource in $resources; do pcs resource move $resource agrp-c01n01; done
pcs status
pcs cluster stop # on agrp-c01n02
systemctl poweroff # on agrp-c01n02
resources=$(pcs constraint --full | grep -E "Enabled.+Started.+vm[0-9]" | awk '{print $7}' | cut -d\- -f3,4 | cut -d\) -f 1)
for resource in $resources; do pcs resource clear $resource; done
power on the powered-off node
pcs cluster start # all VMs should migrate back to their original positions
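The two withdrawal procedures above differ only in which node is drained and which node receives the VMs, so they can be wrapped in a small helper. This is a hedged sketch built from the same pcs commands; the function name drain_node is hypothetical, and the pcs resource clear and power-on steps still have to be run afterwards as shown above:
# usage: drain_node <node-to-withdraw> <target-node>, e.g. drain_node agrp-c01n01 agrp-c01n02
drain_node() {
  local from="$1" to="$2"
  # list every VirtualDomain resource currently started on $from
  local resources
  resources=$(pcs resource | grep -E "VirtualDomain.+${from}" | awk '{print $1}')
  for resource in $resources; do pcs resource move "$resource" "$to"; done
  pcs status
}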

VM resource restarting test

This process should be repeated for each VM on its primary node. Only the steps for vm01-nagios are shown here, but they are identical for the other VMs:
clear; tail -f -n 0 /var/log/messages
virsh console vm01-nagios
shutdown -h 0
In the output of /var/log/messages you'll see lines like these:
....
Apr 26 16:08:16 agrp-c01n01 systemd-machined: Machine qemu-3-vm01-nagios terminated
Apr 26 16:08:17 agrp-c01n01 ovs-vsctl: ovs|00001|vsctl|INFO|Called as ovs-vsctl --timeout=5 -- --if-exists del-port vnet0
Apr 26 16:08:24 agrp-c01n01 pengine[2795]: warning: Processing failed op monitor for vm01-nagios on agrp-c01n01: not running (7)
...
Apr 26 16:08:24 agrp-c01n01 crmd[2796]:  notice: Initiating start operation vm01-nagios_start_0 locally on agrp-c01n01
...
Apr 26 16:08:25 agrp-c01n01 crmd[2796]:  notice: Result of start operation for vm01-nagios on agrp-c01n01: 0 (ok)
pcs status # vm01-nagios should be "Started"

Nodes Crash-Test

Simulating Software Crash

Crashing agrp-c01n01:
clear; tail -f -n 0 /var/log/messages # on agrp-c01n02
echo c > /proc/sysrq-trigger # on agrp-c01n01
What is in the log (key points):
16:28:17 agrp-c01n02 corosync[2217]: [TOTEM ] A processor failed, forming new configuration.
16:28:18 agrp-c01n02 attrd[2228]:  notice: Node agrp-c01n01 state is now lost
16:28:18 agrp-c01n02 kernel: dlm: closing connection to node 1
agrp-c01n02 corosync[2217]: [MAIN  ] Completed service synchronization, ready to provide service.
16:28:18 agrp-c01n02 dlm_controld[3148]: 2203 fence request 1 pid 22241 nodedown time 1524745698 fence_all dlm_stonith
agrp-c01n02 stonith-ng[2226]:  notice: Requesting peer fencing (reboot) of agrp-c01n01
16:28:22 agrp-c01n02 kernel: drbd r0: PingAck did not arrive in time.
16:28:22 agrp-c01n02 kernel: drbd r0: helper command: /sbin/drbdadm fence-peer r0
16:28:23 agrp-c01n02 pengine[2229]: warning: Cluster node agrp-c01n01 will be fenced: peer is no longer part of the cluster
16:28:23 agrp-c01n02 pengine[2229]: warning: Node agrp-c01n01 is unclean
16:28:23 agrp-c01n02 crmd[2230]:  notice: Requesting fencing (reboot) of node agrp-c01n01
16:28:26 agrp-c01n02 kernel: drbd r1: PingAck did not arrive in time.
16:28:26 agrp-c01n02 kernel: drbd r1: helper command: /sbin/drbdadm fence-peer r1
16:28:47 agrp-c01n02 stonith-ng[2226]:  notice: Call to fence_ipmi_n01 for 'agrp-c01n01 reboot' on behalf of stonith-api.22241@agrp-c01n02: OK (0)
16:28:47 agrp-c01n02 crmd[2230]:  notice: Peer agrp-c01n01 was terminated (reboot) by agrp-c01n02 for agrp-c01n02: OK (ref=47de2c75-7e7c-49d3-9796-17779e46e0bd) by client crmd.2230
16:28:47 agrp-c01n02 pengine[2229]:  notice:  * Start      vm02-www          (       agrp-c01n02 )
16:28:47 agrp-c01n02 pengine[2229]:  notice:  * Start      vm01-nagios        (       agrp-c01n02 )
16:28:48 agrp-c01n02 crm-fence-peer.sh[22390]: INFO peer is fenced, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-r1-ms_drbd_r1'
16:28:48 agrp-c01n02 crm-fence-peer.sh[22271]: INFO peer is fenced, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-r0-ms_drbd_r0'
16:28:49 agrp-c01n02 kernel: GFS2: fsid=agrp-c01:shared.1: jid=0: Looking at journal...
16:28:49 agrp-c01n02 kernel: GFS2: fsid=agrp-c01:shared.1: recover generation 9 done
16:28:49 agrp-c01n02 crmd[2230]:  notice: Result of start operation for vm02-www on agrp-c01n02: 0 (ok)
16:28:49 agrp-c01n02 crmd[2230]:  notice: Result of start operation for vm01-nagios on agrp-c01n02: 0 (ok)

BE CAREFUL: on a production system it's better to start the cluster on the restored node, wait until r0 is UpToDate and only then delete the r0 constraint (r0 will be promoted to master on both nodes), then wait until r1 is UpToDate and delete the r1 constraint (r1 will be promoted to master on both nodes):
constraints=$(pcs constraint --full | grep -E "drbd-fence.+rule" | awk '{print $4}' | cut -d\: -f 2 | cut -d\) -f 1)
for constraint in $constraints; do pcs constraint remove $constraint; done
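A hedged sketch of that more careful production flow for r0 (run after the cluster has been started on the restored node; assumes DRBD 8.4's drbdadm dstate output and the constraint name shown in the log above, repeat the same for r1):
drbdadm dstate r0 # prints e.g. UpToDate/UpToDate once resync has finished
until drbdadm dstate r0 | grep -q 'UpToDate/UpToDate'; do sleep 10; done
pcs constraint remove drbd-fence-by-handler-r0-ms_drbd_r0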
Login to the agrp-c01n01:
pcs cluster start # vm01 & vm02 will migrate to the agrp-c01n01


Crashing agrp-c01n02:
clear; tail -f -n 0 /var/log/messages # on agrp-c01n01
echo c > /proc/sysrq-trigger # on agrp-c01n02
What is in the log (key points): 
mostly the same as for agrp-c01n01

BE CAREFUL: on a production system it's better to start the cluster on the restored node, wait until r0 is UpToDate and only then delete the r0 constraint (r0 will be promoted to master on both nodes), then wait until r1 is UpToDate and delete the r1 constraint (r1 will be promoted to master on both nodes):
constraints=$(pcs constraint --full | grep -E "drbd-fence.+rule" | awk '{print $4}' | cut -d\: -f 2 | cut -d\) -f 1)
for constraint in $constraints; do pcs constraint remove $constraint; done
Login to the agrp-c01n02:
pcs cluster start # vm03 & vm04 will migrate to the agrp-c01n02

Simulating Hardware Crash

Crashing agrp-c01n01:
clear; tail -f -n 0 /var/log/messages # on agrp-c01n02
power-off the node (pull power cord out of PSU)
What is in the log (key points): 
mostly the same as for agrp-c01n01 (while Simulating Software Crash), different points:
18:05:58 agrp-c01n02 corosync[4176]: [TOTEM ] A processor failed, forming new configuration.
18:06:34 agrp-c01n02 fence_ipmilan: Connection timed out
18:06:02 agrp-c01n02 crm-fence-peer.sh[6507]: No messages received in 3 seconds.. aborting
18:07:10 agrp-c01n02 stonith-ng[4190]:  notice: Call to fence_ipmi_n01 for 'agrp-c01n01 reboot' on behalf of stonith-api.6572@agrp-c01n02: Connection timed out (-110)
Apr 26 18:07:10 agrp-c01n02 stonith-ng[4190]: warning: Agent 'fence_ifmib' does not advertise support for 'reboot', performing 'off' action instead
18:07:28 agrp-c01n02 crm-fence-peer.sh[6507]: INFO peer is not reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-r1-ms_drbd_r1'
18:07:28 agrp-c01n02 kernel: drbd r1: helper command: /sbin/drbdadm fence-peer r1 exit code 5 (0x500)
18:07:28 agrp-c01n02 kernel: drbd r1: fence-peer helper returned 5 (peer is unreachable, assumed to be dead)
18:07:30 agrp-c01n02 stonith-ng[4190]:  notice: Call to fence_ifmib_n01 for 'agrp-c01n01 reboot' on behalf of stonith-api.6572@agrp-c01n02: OK (0)
18:07:30 agrp-c01n02 kernel: drbd r0: fence-peer helper returned 7 (peer was stonithed)
18:07:32 agrp-c01n02 crmd[4194]:  notice: Result of start operation for vm01-nagios on agrp-c01n02: 0 (ok)
18:07:32 agrp-c01n02 crmd[4194]:  notice: Result of start operation for vm02-www on agrp-c01n02: 0 (ok)
power-on agrp-c01n01
fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel2 --action status
fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel2 --action on
fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel2 --action status
Login to the agrp-c01n01 and verify it's operational
pcs stonith cleanup # on agrp-c01n02

BE CAREFUL: on a production system it's better to start the cluster on the restored node, wait until r0 is UpToDate and only then delete the r0 constraint (r0 will be promoted to master on both nodes), then wait until r1 is UpToDate and delete the r1 constraint (r1 will be promoted to master on both nodes):
constraints=$(pcs constraint --full | grep -E "drbd-fence.+rule" | awk '{print $4}' | cut -d\: -f 2 | cut -d\) -f 1)
for constraint in $constraints; do pcs constraint remove $constraint; done
Login to the agrp-c01n01:
pcs cluster start # vm01 & vm02 will migrate to the agrp-c01n01
pcs status


Crashing agrp-c01n02:
clear; tail -f -n 0 /var/log/messages # on agrp-c01n01
power-off the node (pull power cord out of PSU)
What is in the log (key points): 
mostly the same as for agrp-c01n01 (while Simulating Software & Hardware Crash)


power-on agrp-c01n02
fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action status
fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action on
fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action status
Login to the agrp-c01n02 and verify it's operational
pcs stonith cleanup # on agrp-c01n01

BE CAREFUL: on a production system it's better to start the cluster on the restored node, wait until r0 is UpToDate and only then delete the r0 constraint (r0 will be promoted to master on both nodes), then wait until r1 is UpToDate and delete the r1 constraint (r1 will be promoted to master on both nodes):
constraints=$(pcs constraint --full | grep -E "drbd-fence.+rule" | awk '{print $4}' | cut -d\: -f 2 | cut -d\) -f 1)
for constraint in $constraints; do pcs constraint remove $constraint; done
Login to the agrp-c01n02:
pcs cluster start # vm03 & vm04 will migrate to agrp-c01n02
pcs status


Administrative modes: standby, unmanaged, maintenance

Recurring monitor operations behave differently under various administrative settings:
  1. When a resource is unmanaged: No monitors will be stopped. If the unmanaged resource is stopped on a node where the cluster thinks it should be running, the cluster will detect and report that it is not, but it will not consider the monitor failed, and will not try to start the resource until it is managed again. Starting the unmanaged resource on a different node is strongly discouraged and will at least cause the cluster to consider the resource failed, and may require the resource’s target-role to be set to Stopped then Started to be recovered.
  2. When a node is put into standby: All resources will be moved away from the node, and all monitor operations will be stopped on the node, except those with role=Stopped. Monitor operations with role=Stopped will be started on the node if appropriate.
  3. When the cluster is put into maintenance mode: All resources will be marked as unmanaged. All monitor operations will be stopped, except those with role=Stopped. As with single unmanaged resources, starting a resource on a node other than where the cluster expects it to be will cause problems.
Maintenance mode and making resources unmanaged are the preferred methods if you are making online changes on the cluster nodes. Standby mode is preferred if you need to do hardware maintenance.
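For quick reference, these are the pcs commands behind the three modes (each is covered in detail below):
pcs resource unmanage <resource> # make a single resource unmanaged
pcs node standby <node> # node stops running resources but keeps voting
pcs property set maintenance-mode=true # put the whole cluster into maintenance (all resources unmanaged)
pcs node maintenance <node> # put a single node into maintenance mode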

Managed/unmanaged resources

To make a resource unmanaged:
pcs resource unmanage libvirtd # the same as: pcs resource update libvirtd meta is-managed=false
pcs status
 Clone Set: libvirtd-clone [libvirtd]
     libvirtd (systemd:libvirtd): Started agrp-c01n01 (unmanaged)
     libvirtd (systemd:libvirtd): Started agrp-c01n02 (unmanaged)
To make resource managed again:
pcs resource manage libvirtd # the same as: pcs resource update libvirtd meta is-managed=true
pcs status
 Clone Set: libvirtd-clone [libvirtd]
     Started: [ agrp-c01n01 agrp-c01n02 ]

Standby/Unstandby 


Standby means that the node is not permitted to run any resources but still participates in voting (if not shut down).

To move node to the standby mode:
pcs node standby agrp-c01n02
After issuing this command, all VMs are migrated to agrp-c01n01, then all resources on agrp-c01n02 are stopped and the node itself is shown as:
Node agrp-c01n02: standby
pcs quorum status # we'll see that standby node is also participating in voting
A node in standby mode can be rebooted or shut down; the status will then change to:
Node agrp-c01n02: OFFLINE (standby)
The total number of votes will also drop to "1":
pcs quorum status

To clear standby mode after reboot or shutdown:
pcs cluster start # on node agrp-c01n02
pcs node unstandby agrp-c01n02


Maintenance

In a Pacemaker cluster, as in a standalone system, operators must complete maintenance tasks such as software upgrades and configuration changes. Here's what you need to keep Pacemaker's built-in monitoring features from creating unwanted side effects.
With clone and master-slave resources the better way is to place the node into standby mode, because a node in maintenance mode can be fenced by the other node (e.g. a DRBD PingAck will not arrive in time and the node will be fenced).

Putting the entire cluster into maintenance:
pcs property list --defaults | grep mainte
pcs property set maintenance-mode=true
or
pcs node maintenance --all
pcs status
              *** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

Online: [ agrp-c01n01 agrp-c01n02 ]

Full list of resources:
All resources are shown with " (unmanaged)" added to the end of the resource status line. 

In maintenance mode, you can stop or restart cluster resources at will. Pacemaker will not attempt to restart them. All resources automatically become unmanaged, that is, Pacemaker will cease monitoring them and hence be oblivious about their status. You can even stop all Pacemaker services on a node, and all the daemons and processes originally started as Pacemaker managed cluster resources will continue to run.
You should know that when you start Pacemaker services on a node while the cluster in maintenance mode, Pacemaker will initiate a single one-shot monitor operation (a "probe") for every resource just so it has an understanding of what resources are currently running on that node. It will, however, take no further action other than determining the resources' status.
Maintenance mode is something you enable before running other maintenance actions, not when you're already half-way through them. And unless you're very well versed in the interdependencies of resources running on the cluster you're working on, it's usually the very safest option. In short: when doing maintenance on your Pacemaker cluster, by default, enable maintenance mode before you start, and disable it after you're done.
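A minimal sketch of that workflow (the yum update is only an example of a maintenance task):
pcs property set maintenance-mode=true
yum update -y # example maintenance task; resources keep running but are not monitored
pcs property set maintenance-mode=false
pcs status # probes run and the cluster re-learns the state of every resource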

To take the entire cluster out of maintenance:
pcs property set maintenance-mode=false
or
pcs node unmaintenance --all

Maintenance for a single node:
pcs node maintenance # will set local node into maintenance mode
pcs node maintenance agrp-c01n01 # will set agrp-c01n01 into maintenance mode


Resources disable/enable

If you need to disable (stop) a resource:
pcs resource disable vm01-nagios
pcs status
 vm01-nagios (ocf::heartbeat:VirtualDomain): Started agrp-c01n01 (disabled)
virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     vm02-www                       running

To enable resource:
pcs resource enable vm01-nagios
pcs status
 vm01-nagios (ocf::heartbeat:VirtualDomain): Started agrp-c01n01
virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     vm02-www                     running
 2     vm01-nagios                    running


These tutorials were used to understand and set up clustering: 
AN!Cluster
unixarena
clusterlabs.org
hastexo.com

Monday, April 23, 2018

Cluster 21. Making installed VM (Virtual Machine) a cluster resource.

Related to libvirtd

In order to start a VM as an HA resource, libvirtd must be up and running (on both nodes):
Create the resource, make it a clone and order it to start after sharedfs, because libvirtd uses storage pools (virsh pool-list, virsh pool-info) and the files pool is /shared/files:
pcs resource create libvirtd systemd:libvirtd
pcs resource clone libvirtd clone-max=2 clone-node-max=1 interleave=true
pcs constraint order start sharedfs-clone then start libvirtd-clone
pcs constraint colocation add libvirtd-clone with sharedfs-clone
some options can be found here: Cluster 17

Firewall setup to support KVM Live Migration

Setup firewall ports for KVM live-migration (on both nodes):
On node1:
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.10.53.2/32" port protocol="tcp" port="49152-49216" accept'
firewall-cmd --reload
firewall-cmd --list-all
On node2:
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.10.53.1/32" port protocol="tcp" port="49152-49216" accept'
firewall-cmd --reload
firewall-cmd --list-all


49152-49216 is a pool of TCP ports used randomly by virsh to perform live migration.
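One hedged way to confirm, while a migration is in flight, that the traffic really uses this range is to look at the qemu TCP connections on either node:
ss -tnp | grep -i qemu # look for an ESTABLISHED connection on a port in the 49152-49216 range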

Related to VM itself


In order for the cluster to manage a server, it must know where to find the "definition" file that describes the virtual machine and its hardware. When the server was created with virt-install, it saved this definition file in /etc/libvirt/qemu/
Normal libvirtd tools are not cluster-aware, so we don't want them to see our server except when it is running. We will achieve this by "undefining" our VM.

First we'll share definition:
virsh list --all # list running and power-off VMs
virsh dumpxml vm02-www # view VM definition xml dump
mkdir /shared/definitions
virsh shutdown vm02-www
virsh dumpxml vm02-www > /shared/definitions/vm02-www.xml # save dump to the shared directory, this file will be used to start, stop, recover and migrate the VM
verify that the xml is saved properly # because the next step will destroy the VM

Stop and destroy VM:
virsh destroy vm02-www
virsh undefine vm02-www
virsh list --all # be sure that needed VM is undefined

Setup VM cluster resource (this command is executed on VM primary node - vm02-www primary node is agrp-c01n01):
pcs resource create vm02-www ocf:heartbeat:VirtualDomain hypervisor="qemu:///system" config="/shared/definitions/vm02-www.xml" migration_transport=ssh meta allow-migrate=true op monitor interval="30" timeout="30s" op start interval="0" timeout="240s" op stop interval="0" timeout="120s"

Options described (for all options see pcs resource describe VirtualDomain or man ocf_heartbeat_VirtualDomain):
  1. hypervisor="qemu:///system" - you can find this uri by executing virsh --quiet uri
  2. migration_transport=ssh - use ssh while migrating VM
  3. meta allow-migrate=true - Resources have two types of options: meta-attributes and instance attributes. Meta-attributes apply to any type of resource, while instance attributes are specific to each resource agent. Visit clusterlabs.org/meta
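To review or change these attributes after the resource is created (a sketch; on the pcs version used here the show sub-command prints the resource configuration):
pcs resource show vm02-www # view instance and meta attributes of the resource
pcs resource update vm02-www meta allow-migrate=true # change a meta-attribute later if needed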

pcs constraint order start libvirtd-clone then vm02-www
pcs constraint colocation add vm02-www with libvirtd-clone 
The constraint below is needed because without it, after a node returns (after a failure or a manual cluster stop/start), Pacemaker will try to migrate the VM to the primary node without waiting for DRBD promotion:
pcs constraint colocation add vm02-www with master ms_drbd_r0

Scores are calculated per resource and node. Any node with a negative score for a resource can’t run
that resource. The cluster places a resource on the node with the highest score for it. Positive values indicate a preference for running the affected resource(s) on this node — the higher the value, the stronger the preference. Negative values indicate the resource(s) should avoid this node (a value of -INFINITY changes "should" to "must"):
pcs constraint location add lc_vm02_n01 vm02-www agrp-c01n01 1
pcs constraint location add lc_vm02_n02 vm02-www agrp-c01n02 0
The location constraints above are needed to automatically live-migrate the VM back to the node which is primary for it (for vm02-www the primary node is agrp-c01n01).

To view current score for the resource:
crm_simulate -sL | grep " vm[0-9]"
 vm02-www (ocf::heartbeat:VirtualDomain): Started agrp-c01n02
native_color: vm02-www allocation score on agrp-c01n01: -INFINITY
native_color: vm02-www allocation score on agrp-c01n02: 0
-INFINITY for agrp-c01n01 is actually not because of the constraint but because agrp-c01n01 is offline

SELinux related issues

SELinux is preventing /usr/bin/virsh from read access on the file vm01-nagios.xml

semodule -DB # enables complete logging of SELinux messages to audit.log (disables dontaudit rules)
open one more ssh to the node
tail -f -n0 /var/log/audit/audit.log
pcs resources cleanup
The message appeared:
type=AVC msg=audit(1524726228.964:514): avc:  denied  { read } for  pid=8711 comm="virsh" name="vm01-nagios.xml" dev="dm-5" ino=3477882 scontext=system_u:system_r:virsh_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file

It's complaining about access to the vm01-nagios.xml file, which is on device dm-5 and has inode 3477882. Let's find which device that is:
ls -lah /dev/mapper | grep dm-5 # It's agrp--c01n01_vg0-shared
Let's find what is inode 3477882:
find /shared -inum 3477882 # It's /shared/definitions/vm01-nagios.xml
Let's view SELinux context for /shared (we can also view context for only that file but we know that we can have many definitions in the /shared):
ls -laZ /shared # the context of . (the current directory, /shared) is system_u:object_r:unlabeled_t:s0; this context is not permissive enough, so we'll change it (only on one node, but verify on the other):
semanage fcontext -a -t virt_etc_t '/shared(/.*)?' 
restorecon -r /shared 
ls -laZ /shared
semodule -B # rebuild the policy with dontaudit rules re-enabled (reverses semodule -DB)

These tutorials were used to understand and set up clustering: 
AN!Cluster
unixarena
clusterlabs.org
rarforge.com

Restoring SQL queries from /var/log/asterisk/messages log-files.

One of our customers had a problem with Asterisk CEL due to insufficient HDD space left on the Asterisk server. To extract the SQL queries from the messages log, I used this sequence (a consolidated one-liner sketch follows the numbered steps):
  1. cd /var/log/asterisk/
  2. grep "cel_odbc.c: Insert failed on" messages | awk '{$1=$2=$3=$4=$5=$6=$7=$8=$9=$10=$11=""; print $0}'  > messages_last
  3. remove all sequences of more than one whitespace (the previous command left an 11-space run):
    1. tr -s " " < messages_last > messages_truncated
  4. remove one white space left before INSERT statement:
    1. sed -i 's/^ *//' messages_truncated
  5. append ";" to the end of each line:
    1. sed -i 's/$/;/' messages_truncated
  6. execute queries from the file:
    1. mysql -u root -p asterisk
    2. SOURCE /var/log/asterisk/messages_truncated
    3. or
    4. mysql -u root -p asterisk < /var/log/asterisk/messages_truncated
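The same extraction can be done as a single pipeline (a sketch equivalent to steps 2-5 above; review messages_truncated before feeding it to MySQL):
grep "cel_odbc.c: Insert failed on" messages \
  | awk '{$1=$2=$3=$4=$5=$6=$7=$8=$9=$10=$11=""; print $0}' \
  | tr -s " " \
  | sed -e 's/^ *//' -e 's/$/;/' > messages_truncated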



Tuesday, April 17, 2018

Cluster 20. Install & setup environment needed for clustered virtualization

KVM Installation & initial setup

Install packages needed for KVM:
yum install -y kvm virt-manager virt-install libvirt libvirt-python libguestfs-tools syslinux pciutils
Verify that the packages were installed correctly:
lsmod | grep kvm # to see if the kvm and kvm_intel modules are loaded

Packages described:

  • kvm - hypervisor
  • virt-manager - package contains several command-line utilities (also GUI tools) for building and installing new virtual machines, and virt-clone for cloning existing virtual machines
  • libvirt - is a C toolkit to interact with the virtualization capabilities of recent versions of Linux (and other OSes). The library aims at providing a long term stable C API for different virtualization mechanisms. It currently supports QEMU, KVM, XEN, OpenVZ, LXC, and VirtualBox.
  • libvirt-python - package provides a module that permits applications written in the Python programming language to call the interface supplied by the libvirt library, to manage the virtualization capabilities of recent versions of Linux (and other OSes).
  • libguestfs-tools - This package contains guestfish (a shell and command-line tool for examining and modifying virtual machine filesystems) and various virtualization tools, including virt-cat, virt-df, virt-edit, virt-filesystems, virt-inspector, virt-ls, virt-make-fs, virt-rescue, virt-resize, virt-tar, and virt-win-reg
  • syslinux - is a suite of bootloaders, currently supporting DOS FAT filesystems, Linux ext2/ext3 filesystems (EXTLINUX), PXE network boots (PXELINUX), or ISO 9660 CD-ROMs  (ISOLINUX). It also includes a tool, MEMDISK, which loads legacy operating systems from these media.
  • pciutils - The pciutils package contains various utilities for : inspecting and setting devices connected to the PCI bus.
Check and destroy the default libvirtd bridge. By default, VMs only have network access to other VMs on the same server (and to the host itself) via the private network 192.168.122.0; if you want the VMs to have access to your LAN, you must create a network bridge on the host:

systemctl start libvirtd
systemctl status libvirtd
ip route | grep virbr0
virsh net-destroy default 
virsh net-autostart default --disable 
virsh net-undefine default
ip route | grep virbr0

Check and disable libvirtd:
systemctl status libvirtd
systemctl stop libvirtd
systemctl disable libvirtd

Provision Planning

The servers I'm using to write this tutorial are a little modest in the RAM department with only 16 GiB of RAM. We need to subtract at least 2 GiB for the host nodes, leaving us with a total of 14 GiB. That needs to be divided up among all your servers. Now, nothing says you have to use it all, of course. It's perfectly fine to leave some RAM unallocated for future use. This is really up to you and your needs.

Let's put together a table with the RAM we plan to allocate, summarizing the LVs we're going to create for each server. The LVs will be named after the server they'll be assigned to, with the suffix _0. Later, if we add a second "hard drive" to a server, it will have the suffix _1 and so on.

Server          RAM (GiB)   Storage Pool (VG)   LV name            LV size
vm01-nagios     2           agrp-c01n01         vm01-nagios_0      150 GB
vm02-www        4           agrp-c01n01         vm02-www_0         150 GB
vm03-mysql      3           agrp-c01n02         vm03-mysql_0       100 GB
vm04-asterisk   4           agrp-c01n02         vm04-asterisk_0    100 GB
Total           13 GiB      ---                 ---                500 GB

As you can see, we'll use 13 GiB of RAM, so the remaining RAM will be 3 GiB (16-13=3). We'll also use 500 GB of storage, so the remaining VM-dedicated storage (DRBD r0+r1 = 1000 GB in total) will be 500 GB (1000-500=500).
The same approach can be used for CPU - read this blog-post - how-many-vCPU-per-pCPU

Provision Shared CentOS ISOs

Before we can install the OS, we need to copy the installation media (and our driver disk, if needed) into /shared/files.
For our needs we'll install CentOS 6 & CentOS 7 machines (for Windows machines, please visit: AN!Cluster_Tutorial - alteeve.com). So download both CentOS 6 & 7 Minimal images and then send them to the nodes (I'll be using one of our office machines):
pcs cluster start --all # if didn't start previously
rsync -av --progress CentOS-7-x86_64-Minimal-1708.iso root@172.16.51.1:/shared/files/
rsync -av --progress CentOS-6.9-x86_64-minimal.iso root@172.16.3.235:/shared/files

Creating Storage for VMs

Earlier, we used parted to examine our free space and create our DRBD partitions. Unfortunately, parted shows sizes in GB (base 10) whereas LVM uses GiB (base 2). If we used LVM's "xxG" size notation, it would use more space than we expect relative to our planning in the parted stage. LVM doesn't allow specifying new LV sizes in GB instead of GiB, so here we will specify sizes in MiB to help narrow the difference.
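To see the difference in numbers, plain shell arithmetic is enough (illustration only):
echo $((150 * 1000**3)) # 150 GB  (parted, base 10) = 150000000000 bytes
echo $((150 * 1024**3)) # 150 GiB (LVM, base 2) = 161061273600 bytes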
Storage creation is the same for all VMs, so I'll show only one LV creation:
lvcreate -L 150000M -n vm01-nagios_0 agrp-c01n01_vg0
or you can use a byte count (i.e. 150 GiB = 150*1024*1024*1024 bytes = 161061273600b):
lvcreate -L 161061273600b -n vm01-nagios_0 agrp-c01n01_vg0
lvdisplay /dev/agrp-c01n01_vg0/vm01-nagios_0
To remove lv:
lvremove /dev/agrp-c01n01_vg0/vm01-nagios_0 

Creating OpenVSwitch group for VMs

Find name of the bridge:
ovs-vsctl list Bridge | grep name
Add a port group to the file /shared/provision/ovs-network.xml (if more than one VLAN is needed, add a <portgroup>..</portgroup> for every VLAN; see the example after this snippet)

<network>
<name>ovs-network</name>
<forward mode='bridge'/>
<bridge name='ovs_kvm_bridge'/>
<virtualport type='openvswitch'/>
    <portgroup name='vlan-51'>
         <vlan>
            <tag id='51'/>
        </vlan>
   </portgroup>
</network>
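For example, if a second VLAN were needed (here a hypothetical vlan-52), another portgroup would be added inside the same <network> element:
    <portgroup name='vlan-52'>
         <vlan>
            <tag id='52'/>
        </vlan>
   </portgroup>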

To add the network to KVM (from both nodes):
systemctl start libvirtd
virsh net-define /shared/provision/ovs-network.xml
virsh net-list --all
virsh net-start ovs-network
virsh net-autostart ovs-network 
virsh net-list
systemctl stop libvirtd

To delete network from KVM (if needed - from both nodes):
virsh net-list
virsh net-destroy ovs-network
virsh net-autostart --disable ovs-network
virsh net-undefine ovs-network

Virtio

So-called "full virtualization" is a nice feature because it allows you to run any operating system virtualized. However, it's slow because the hypervisor has to emulate actual physical devices such as RTL8139 network cards . This emulation is both complicated and inefficient. 
Virtio is a virtualization standard for network and disk device drivers where just the guest's device driver "knows" it is running in a virtual environment, and cooperates with the hypervisor. This enables guests to get high performance network and disk operations, and gives most of the performance benefits of paravirtualization.
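A quick check from inside an installed guest that it is really using virtio devices (standard commands, shown only as a sanity check):
lspci | grep -i virtio # virtio network and block PCI devices should be listed
lsmod | grep virtio # virtio_net, virtio_blk and virtio_pci modules should be loaded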

Creating virt-install call

touch /shared/provision/vm01-nagios.sh 
chmod 755 /shared/provision/vm01-nagios.sh 
vim /shared/provision/vm01-nagios.sh
virt-install --connect qemu:///system \
--name=vm01-nagios \
--ram=2048 \
--arch=x86_64 \
--vcpus=2 \
--location=/shared/files/CentOS-6.9-x86_64-minimal.iso \
--os-variant=centos6.9 \
--network network=ovs-network,portgroup=vlan-51,model=virtio \
--disk path=/dev/agrp-c01n01_vg0/vm01-nagios_0,bus=virtio \
--graphics none \
--extra-args 'console=ttyS0'




Options Described:
  1. --connect qemu:///system - This tells virt-install to use the QEMU hardware emulator (as opposed to Xen, for example) and to install the server onto the local node.
  2. --name vm01-nagios - This sets the name of the server. It is the name we will use in the cluster configuration and whenever we use the libvirtd tools, like virsh.
  3. --ram 2048 - This sets the amount of RAM, in MiB, to allocate to this server. Here, we're allocating 2 GiB, which is 2048 MiB.
  4. --arch x86_64 - i386 – 32bit old CPUs, i686 – 32bit new CPUs, x86-64 – 64bit CPUs
  5. --vcpus 2 - This sets the number of CPU cores to allocate to this server. Here, we're allocating two CPUs.
  6. --location /shared/files/CentOS-6.9-x86_64-minimal.iso - Distribution tree installation source. virt-install can recognize certain distribution trees and fetches a bootable kernel/initrd pair to launch the install.
  7. --os-variant centos6.9 - This tweaks the virt-manager's initial method of running and tunes the hypervisor to try and get the best performance for the server. There are many possible values here for many, many different operating systems. If you run osinfo-query os on your node, you will get a full list of available operating systems. If you can't find your exact operating system, select the one that is the closest match.
  8. --network network=ovs-network,portgroup=vlan-51,model=virtio - This tells the hypervisor that we want to create a network card using the virtio "hardware" and that we want it plugged into the ovs-network bridge's  vlan-51 portgroup. We only need one network card, but if you wanted two or more, simply repeat this command. If you create two or more bridges, you can have different network devices connect to different bridges.
  9. --disk path=/dev/agrp-c01n01_vg0/vm01-nagios_0,bus=virtio - This tells the hypervisor what LV to use for the server's "hard drive". It also tells it to use the virtio emulated SCSI controller.
  10.  --graphics none - we'll use only CLI without any GUI (also for installation)
  11. --extra-args 'console=ttyS0' - this is needed to see installation process from console

Installing VM on the node

We can install any server from either node. However, we know that each server has a preferred node, so it's sensible to use that host for the installation stage. In the case of vm01-nagios, the preferred host is agrp-c01n01, so we'll use it to start the installation.
  • ssh to the agrp-c01n01
  • systemctl start libvirtd
  • /shared/provision/vm01-nagios.sh
  • Go through steps of text-mode installation
  • To exit installed VM hit Ctrl+5 (remote connect) or Ctrl+] (local connect)
  • To connect to the installed VM virsh console vm01-nagios
  • To list installed systems and their operating mode virsh list --all
  • To start a VM in the "shut off" state virsh start vm01-nagios
Steps to perform on a VM after installation (if you need them): 
For CentOS6:
  1. chkconfig ip6tables off
  2. service ip6tables stop
  3. cat /etc/sysconfig/network
    1. NETWORKING=yes
    2. NETWORKING_IPV6=no
  4. vi /etc/sysctl.conf
    1. net.ipv6.conf.all.disable_ipv6 = 1
    2. net.ipv6.conf.default.disable_ipv6 = 1
    3. kernel.panic = 5 # self-reboot in 5 seconds when panicking
  5. sysctl -p
  6. vi /etc/sysconfig/network-scripts/ifcfg-eth0
    1. NM_CONTROLLED=no
    2. ONBOOT=yes
  7. service network restart
  8. ip route
For CentOS7 (this version by default does self-restart on kernel panicking):
  1. systemctl stop NetworkManager
  2. systemctl disable NetworkManager
  3. chkconfig network on
  4. systemctl start network
  5. vi /etc/sysconfig/network-scripts/ifcfg-eth0
    1. ONBOOT=yes
  6. systemctl restart network
  7. ip route
To learn IP addresses and OVS port names of the VM (execute from the node where VM is situated): 
for name in $(virsh list | awk '{print $2}' | grep -v '^$\|Name'); do echo $name;virsh domiflist $name; echo""; done 

vm02-www 
Interface Type     Source           Model   MAC 
---------------------------------------------------------------------- 
vnet0      bridge   ovs-network   virtio    52:54:00:77:3a:a0

nagios
Interface Type     Source           Model   MAC
----------------------------------------------------------------------
vnet1      bridge   ovs-network   virtio    52:54:00:77:3d:19


VM shutdown test

To test whether the VM can be shut down:
virsh shutdown vm01-nagios
If the shutdown is not performed and the VM remains active (this is mostly a problem on CentOS 6):
virsh console vm01-nagios
yum -y install acpid
service acpid start
chkconfig --level 235 acpid on
chkconfig --list acpid
Test again:
virsh shutdown vm02-www

ACPI (Advanced Configuration and Power Interface) is an enhanced interface for power management. ACPI is a component of many modern computers; it gives the ability to manage power programmatically and also to query battery state and parameters.

These tutorials were used to understand and set up clustering: 



Tuesday, April 10, 2018

Cisco ASA how to find why packet is not going in or out through VPN

For example, we want to check access through the INSIDE interface from the outside client 10.10.100.100, TCP port 30000, to the internal server 10.20.100.100, TCP port 3389 (Windows RDP).
First you need to check packet "movement":
packet-tracer input INSIDE tcp 10.10.100.100 30000 10.20.100.100 3389 detailed

Correct all problems that appear in each phase. If the only "Drop" result is on the VPN phase, then:
sh run route | grep 10.20.100.100 # found gateway is 10.30.100.100
sh run group-policy | grep 10.30.100.100 # found group-policy name is GP_10.30.100.100
sh run group-policy GP_10.30.100.100 | grep vpn-filter # found ACL name is INSIDE.30.100.100.vpn.filter
Now you can verify this ACL and add the needed permissions.

Thursday, April 5, 2018

Install Gnome Desktop GUI to the CentOS7

  1. yum install yum-utils
  2. yum grouplist | grep -i desktop
  3. yum groupinstall "Gnome Desktop"
  4. systemctl set-default graphical.target
  5. reboot