Monday, February 26, 2018

Cluster 12. Configuring the Cluster Foundation.

Time sync

First we need to keep time in sync between both nodes. Install the ntp daemon, set the time from pool.ntp.org, then start the ntp daemon:
yum install ntp -y
ntpdate pool.ntp.org
systemctl -l enable ntpd.service
systemctl -l start ntpd.service

Set proper timezone:
timedatectl list-timezones | grep Baku # select the proper city for your location
timedatectl set-timezone Asia/Baku
Verify:
timedatectl

Packages


Needed packages:
Corosync - manages cluster communication, quorum and membership; it uses the totem protocol for "heartbeating". Prior to CentOS 7, corosync itself only cared about who is a cluster member and making sure all members get all totem messages. What happens after the cluster reforms was up to the cluster manager (cman) and the resource group manager (rgmanager). In CentOS 7 the cman work (mainly quorum - assigning quorum votes and controlling them) is done by corosync and the rgmanager work is done by pacemaker. (To be frank, the cman work is now done by the votequorum part of corosync.)

Pacemaker - cluster resource manager (its own daemon is pacemakerd; pcsd is the pcs configuration daemon)
pcs - (ccs in CentOS 6) - CentOS 7 command line configuration utility
psmisc - contains utilities for managing processes on your system: pstree, killall, and fuser.
policycoreutils-python - contains the core utilities that are required for the basic operation of a Security-Enhanced Linux (SELinux) system and its policies.
fence-agents - provides various agents for fencing (ipmi fence, cisco fence etc.)
dlm - Distributed Lock Manager

Install needed packages:
I prefer to use bash autocomplete: bash-autocomplete-setup-link
yum install -y  corosync pacemaker pcs psmisc policycoreutils-python fence-agents dlm

Setup initial cluster 

Related to pcsd.service:
systemctl -l enable pcsd.service
systemctl -l start pcsd.service
systemctl -l status pcsd.service #pcsd service automatically starts corosync and pacemaker services when needed

Set a password for the cluster user (in CentOS 6 the cluster user was ricci):
passwd hacluster
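If you want to script this step on both nodes, the password can also be set non-interactively (a minimal sketch; --stdin is specific to the RHEL/CentOS passwd and the password below is only a placeholder):
echo "MyHaClusterPwd" | passwd --stdin hacluster # set the same password on both nodes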


Setup and verify firewall:
Verify if firewalld is active:
firewall-cmd --state
firewall-cmd --permanent --add-service=high-availability # view /usr/lib/firewalld/services/high-availability.xml to see which ports are in the high-availability service
firewall-cmd --reload
firewall-cmd --list-all

Ports listed in /usr/lib/firewalld/services/high-availability.xml :
tcp 2224 - PCSD Web UI (High Availability Web Management)
tcp 3121 - Pacemaker Remote
tcp 5403 - needed for corosync-qnetd
udp 5404 - totem protocol multicast
udp 5405 - totem protocol

tcp 21064 - DLM


Log in to either cluster node and authenticate the "hacluster" user. 
We must use names from /etc/hosts which resolve to our bcn-bond1 addresses (this step automatically sets up Corosync authentication):
pcs cluster auth agrp-c01n01 agrp-c01n02 #username is hacluster

To change hacluster password:
passwd hacluster
pcs cluster auth agrp-c01n01 agrp-c01n02 --force  # "--force" will force authentication even if the node is already authenticated 

To verify:
cat /var/lib/pcsd/tokens # both nodes info must be here

Create a new cluster named agrp-c01 (this step automatically sets up Corosync cluster membership & also synchronizes the configuration):
pcs cluster setup --name agrp-c01 agrp-c01n01 agrp-c01n02 --transport udpu # udpu transport - UDP Unicast

After all these steps a new file is created:
vi /etc/corosync/corosync.conf
Note the secauth: off attribute. This controls whether the cluster communications are encrypted or not. We can safely disable this because we're working on a known-private network, which yields two benefits: it's simpler to set up and it's a lot faster. If you must encrypt the cluster communications, then you can do so here.
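If you do decide to encrypt the cluster traffic, a minimal sketch of the relevant corosync 2.x settings (example values, not part of this setup) inside the totem { } section of /etc/corosync/corosync.conf:
crypto_cipher: aes256 # encrypt totem traffic (the older shorthand is secauth: on)
crypto_hash: sha256 # hash/authenticate totem traffic
pcs cluster sync # then push the edited corosync.conf to all nodes and restart the cluster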

pcs cluster start --all # --all option will start cluster on all nodes & also will start corosync.service and pacemaker.service.
If you want corosync and pacemaker to start automatically on boot:
pcs cluster enable --all # this tutorial doesn't use this setting

Check cluster is functioning properly (on both nodes):

systemctl status corosync # must be active/disabled without any error

Use corosync-cfgtool -s to check whether cluster communication is happy (output must have local node's proper IP in id and "no faults" in status):
Printing ring status.
Local node ID 2
RING ID 0
id 10.10.53.2
status ring 0 active with no faults

Next, check the membership and quorum APIs (both nodes must join the cluster):

corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.10.53.1)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.10.53.2)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined

Verify that corosync uses no multicast (udpu transport - UDP Unicast):
corosync-cmapctl | grep transport
totem.transport (str) = udpu

pcs status corosync
Membership information
----------------------
    Nodeid      Votes Name
         1          1 agrp-c01n01 (local)
         2          1 agrp-c01n02

systemctl status pacemaker # must be active/disabled without any error

Verify that all Pacemaker daemons (pacemaker itself + 6 daemons) are loaded:
ps axf | grep pacemaker 
7588 ?        Ss     0:00 /usr/sbin/pacemakerd -f
 7589 ?        Ss     0:00  \_ /usr/libexec/pacemaker/cib
 7590 ?        Ss     0:00  \_ /usr/libexec/pacemaker/stonithd
 7591 ?        Ss     0:00  \_ /usr/libexec/pacemaker/lrmd
 7592 ?        Ss     0:00  \_ /usr/libexec/pacemaker/attrd
 7593 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pengine
 7594 ?        Ss     0:00  \_ /usr/libexec/pacemaker/crmd

Finally check cluster overall status:
pcs status # both nodes must be online (it can take several minutes to become online)

Finally, ensure there are no startup errors (aside from messages relating to not having STONITH
configured, which are OK at this point):
journalctl | grep -i error


DC - this string is seen in the "pcs status" output.
Designated Coordinator (DC): one CRMd instance in the cluster is elected as the Designated Coordinator (DC). The DC is the only entity in the cluster that can decide that a cluster-wide change needs to be performed, such as fencing a node or moving resources around. The DC is also the node where the master copy of the CIB is kept. All other nodes get their configuration and resource allocation information from the current DC. The DC is elected from all nodes in the cluster after a membership change (lost nodes etc.).
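To quickly see which node currently holds the DC role (the exact output wording may differ slightly between Pacemaker versions):
pcs status | grep "Current DC" # the node named here is the current Designated Coordinator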

Quorum:
The votequorum service is part of the corosync project. This service can be optionally loaded into the nodes of a corosync cluster to avoid split-brain situations. It does this by having a number of votes assigned to each system in the cluster and ensuring that only when a majority of the votes are present, cluster operations are allowed to proceed. The service must be loaded into all nodes or none. If it is loaded into a subset of cluster nodes the results will be unpredictable. The following corosync.conf extract will enable votequorum service within corosync: 
quorum { provider: corosync_votequorum } # verify with pcs cluster corosync | grep provider

votequorum reads its configuration from corosync.conf. Some values can be changed at runtime, others are only read at corosync startup. It is very important that those values are consistent across all the nodes participating in the cluster or votequorum behavior will be unpredictable.

The "two node cluster" is a use case that requires special consideration. With a standard two node cluster, each node with a single vote, there are 2 votes in the cluster. Using the simple majority calculation (50% of the votes + 1) to calculate quorum, the quorum would be 2. This means that the both nodes would always have to be alive for the cluster to be quorate and operate. Enabling two_node: 1, quorum is set artificially to 1. So simply saying, with two_node=1 one node will continue when the other node fails. The way it works is that in the event of a network outage both nodes race in an attempt to fence each other and the first to succeed continues in the cluster. The system administrator can also associate a delay with a fencing agent so that one node can be given priority in this situation so that it always wins the race. Also this delay will help to escape fence-looping.
pcs cluster corosync  | grep two_node
and
pcs quorum | grep "Flags\|Quorum" # it's like "Flags: 2Node" and "Quorum: 1"

two_node=1 requires expected_votes to be set to 2 (pcs quorum status | grep "Expected votes") and it is set to this value (2) automatically when a two-node cluster is set up. Also, this setting (two_node=1) assumes that you have proper fencing set up.
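The fencing delay mentioned above is what gives the preferred node priority in the post-split fencing race. A minimal sketch, assuming an IPMI fence resource named fence_n01_ipmi already exists (the name and the 15 second value are only examples):
pcs stonith update fence_n01_ipmi delay=15 # fencing of the node targeted by this device is delayed 15s, so that node gets a head start and wins the race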

wait_for_all (pcs quorum | grep flags # it's like WaitForAll) - When enabled, the cluster will be quorate for the FIRST TIME only after all nodes have been visible at least once at the same time. The wait_for_all option is automatically enabled when a cluster has two nodes, does not use a quorum device, and auto_tie_breaker is disabled. You can override this by explicitly setting wait_for_all to 0 but in two-node cluster this is not recommended.
auto_tie_breaker When enabled, the cluster can suffer up to 50% of the nodes failing at the same time, in a deterministic fashion. auto_tie_breaker is not compatible with two_node as both are systems for determining what happens should there be an even split of nodes. If you have both enabled, then an error message will be issued and two_node will be disabled.
You can verify wait_for_all setting effect:
pcs cluster stop --all
pcs cluster start
pcs quorum status | grep "Qurum:" # you'll get "Quorum: Activity blocked"
Now:
pcs cluster start --all
pcs quorum status | grep "Qurum:" # you'll get "Quorum: 1"
pcs cluster stop here_name_of_the_other_node
pcs quorum status | grep "Qurum:" # you'll get "Quorum: 1"
So as you see - as expected, wait_for_all only needs all nodes to be online when starting the first time.

How do you start nodes when you know the cluster is inquorate, but you are confident that the cluster should proceed with resource management regardless? This can happen when one node is powered off and the other node didn't start the cluster before the first node was powered off. You must be sure that the other node doesn't have access to the resources:
pcs quorum unblock # this disables wait_for_all option and then re-enables it



Location

Overall cluster configuration in xml format can be retrieved:
pcs cluster cib scope=configuration # cluster XML dump file, where scope is one from the list: configuration, crm_config, nodes, resources, constraints, status

Note: By default a symmetric cluster is created, meaning all resources can run anywhere:
There are two alternative strategies. One way is to say that, by default, resources can run anywhere, and then the location constraints specify nodes that are not allowed (an opt-out cluster). The other way is to start with nothing able to run anywhere, and use location constraints to selectively enable allowed nodes (an opt-in cluster).
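A minimal sketch of both strategies with pcs (the resource name WebSite is hypothetical and used only for illustration):
pcs property set symmetric-cluster=false # opt-in cluster: no resource may run anywhere until explicitly allowed
pcs constraint location WebSite prefers agrp-c01n01=200 # opt-in: allow (and prefer) this resource on node 1
pcs constraint location WebSite avoids agrp-c01n02 # opt-out: with the default symmetric-cluster=true, simply ban node 2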

Destroying cluster

If something went wrong, you can destroy the cluster and all of its configuration:
pcs cluster destroy --all
rm -rf /var/lib/pacemaker
rm -rf /var/lib/pcsd
rm -rf /etc/corosync

These tutorials were used to understand and set up clustering: 
AN!Cluster
unixarena
people.redhat.com

vgrename properly.

To change the name of the volume group, we need to actually change the VG name, edit two files (/etc/fstab and /etc/grub2.cfg), create a new initramfs image and reboot the machine:
Make 2 variables:
old_name=vg_old
new_name=vg_new

Change VG name:
vgrename -v $old_name $new_name

Edit fstab:
sed -i "s/\/${old_name}-/\/${new_name}-/g" /etc/fstab

Edit grub2.cfg:
sed -i "s/\([/=]\)${old_name}\([-/]\)/\1${new_name}\2/g" /boot/grub2/grub.cfg

Create new initramfs-image
dracut -f -v /boot/initramfs-$(uname -r).img $(uname -r)

Reboot the system (-f is required; without this switch, your system will shut down and hang [see man systemctl]):
systemctl reboot -f

PS:

  1. you can use either dracut or mkinitrd
  2. /etc/grub2.cfg is a symlink to /boot/grub2/grub.cfg

Wednesday, February 21, 2018

Cluster 11. Pacemaker parts: CIB, PEngine, DC, CRMd, LRMd

Cluster simplified overview

We need to configure the cluster in two stages. This is because we have something of a chicken-and-egg problem:
  • We need clustered storage for our virtual machines.
  • Our clustered storage needs the cluster for fencing.
Conveniently, clustering has two logical parts:
  • Cluster communication and membership - cluster manager (which nodes are part of the cluster - managed by: cman in CentOS 6 & corosync.service in CentOS 7)
  • Cluster resource management (manages clustered services, storage, virtual servers - managed by rgmanager in CentOS 6 & pacemaker.service in CentOS 7) 
Right after a node fails, the cluster manager initiates a fence agent against the lost node.
After being told (by the cluster manager) that the node is lost, the resource manager looks to see what services might have been lost and decides what to do using the resource management configuration. The cluster manager and resource manager mostly work independently, so to start the cluster we need both services to be started.

Pacemaker


Pacemaker is the cluster resource manager (its own daemon is pacemakerd; pcsd.service is the pcs configuration daemon). It consists of five key components:

  1. CIB
  2. CRMd
  3. LRMd
  4. PEngine
  5. STONITHd

The CIB uses XML to represent both the cluster’s configuration and current state of all resources in the cluster. The contents of the CIB are automatically kept in sync across the entire cluster and are used by the PEngine (Policy Engine) to compute the ideal state of the cluster and how it should be achieved.
This list of instructions is then fed to the Designated Controller (DC; to find which node is currently selected as DC: pcs status | grep DC | awk '{print $3}'). Pacemaker centralizes all cluster decision making by electing one of the CRMd (Cluster Resource Management daemon) instances to act as a master. Should the elected CRMd process (or the node it is on) fail, a new one is quickly established. The DC carries out the PEngine's instructions in the required order by passing them to either the Local Resource Management daemon (LRMd) or CRMd peers on other nodes via the cluster messaging infrastructure (which in turn passes them on to their LRMd process).
The peer nodes all report the results of their operations back to the DC and, based on the expected and actual results, will either execute any actions that needed to wait for the previous one to complete, or abort processing and ask the PEngine to recalculate the ideal cluster state based on the unexpected results.
In some cases, it may be necessary to power off nodes in order to protect shared data or complete resource recovery. For this, Pacemaker comes with STONITHd (Fencing daemon).
In Pacemaker, STONITH devices are modeled as resources (and configured in the CIB) to enable them to be easily monitored for failure, however STONITHd takes care of understanding the STONITH topology such that its clients simply request a node be fenced, and it does the rest.
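For example, a STONITH resource for the IPMI interfaces configured earlier could be created roughly like this (a sketch only - the resource name, password and monitor interval are placeholders, and the device must be tested before relying on it):
pcs stonith create fence_n01_ipmi fence_ipmilan pcmk_host_list="agrp-c01n01" ipaddr="agrp-c01n01.ipmi" login="Administrator" passwd="p@$$w0rd" lanplus=1 op monitor interval=60s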

CIB

The cluster is defined by the CIB, which uses XML notation. The major sections that make up a CIB (pcs cluster cib which is a wrapper for cibadmin --query utility):

  • cib: The entire CIB is enclosed with a cib tag. Certain fundamental settings are defined as attributes of this tag.
    • configuration: This section — the primary focus of this document — contains traditional configuration information such as what resources the cluster serves and the relationships among them. Can be checked by pcs cluster cib scope=configuration 
    • crm_config: cluster-wide configuration options. Can be checked by pcs cluster cib scope=crm_config 
    • nodes: the machines that host the cluster. Can be checked by pcs cluster cib scope=nodes 
    • resources: the services run by the cluster. Can be checked by pcs cluster cib scope=resources 
    • constraints: indications of how resources should be placed. Can be checked by pcs cluster cib scope=constraints 
  • status: This section contains the history of each resource on each node. Based on this data, the cluster can construct the complete current state of the cluster. The authoritative source for this section is the local resource manager (lrmd process) on each cluster node, and the cluster will occasionally repopulate the entire section. For this reason, it is never written to disk, and administrators are advised against modifying it in any way.

Normally command line utilities are used to set up the cluster. These tools abstract the XML. But your overall understanding of how the cluster works is clearer when you understand not only the configuration commands, but also how these commands are translated into XML and propagated to all cluster nodes. To understand this abstraction:

  • Properties are XML attributes of an XML element.
  • Options are name-value pairs expressed as nvpair child elements of an XML element.
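For example, a cluster-wide option set with pcs ends up as an nvpair inside crm_config; roughly (the id attributes are generated by the tools):
pcs property set stonith-enabled=true
# becomes, inside <crm_config><cluster_property_set>:
# <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="true"/>
# verify with: pcs cluster cib scope=crm_config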


These tutorials were used to understand and set up clustering: 

CentOS 7 Command autocomplete.

In CentOS 7 minimal autocomplete packages are not installed by default:
yum install epel-release.noarch -y
yum install bash-completion bash-completion-extras -y
source /etc/profile.d/bash_completion.sh

Usage:
command [TAB][TAB]
command --[TAB][TAB]
man rsyn[TAB]
rpm -qi pacemak[TAB]

Monday, February 12, 2018

Cluster 10. Install HP SDR tools 

HP SDR is the Software Delivery Repository; it can be used to manage HP tools from the Linux CLI. To install SDR:
On both nodes
cd ~
yum install wget -y
wget https://downloads.linux.hpe.com/SDR/add_repo.sh

Available repos can be found at the address below (we need only the SPP (Service Pack for ProLiant) repo):
https://downloads.linux.hpe.com/

Find your server generation:
dmidecode | grep "Product Name:" #mine is Gen8

Find your RedHat release:
cat /etc/redhat-release #mine is 7.4.blablabla we need only 7.4

To find architecture:
uname -r # mine is blablabla.el7.x86_64 we need only x86_64

So our baseurl will be:
baseurl=http://downloads.linux.hpe.com/SDR/repo/spp-gen8/RedHat/7.4-Server/x86_64/current/

vi /etc/yum.repos.d/spp.repo
[spp]
name=HP Service Pack for Proliant
baseurl=http://downloads.linux.hpe.com/SDR/repo/spp-gen8/RedHat/7.4-Server/x86_64/current/
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/GPG-KEY-spp

Verify that repo is added:
yum clean all
yum repolist # spp must be among the other repos list

To list available packages:
yum --disablerepo="*" --enablerepo="spp" list available

We need only the hpssacli package to monitor RAID controllers:
yum install hpssacli -y

Find RAID controller on-board:
hpssacli ctrl all show

View arrays count on the found controller:
hpssacli ctrl slot=0 array all show

View physical HDD in desired array:
hpssacli ctrl slot=0 array A physicaldrive all show

View logicaldrives on the specified slot:
hpssacli ctrl slot=0 logicaldrive all show

This tutorial was used to understand and setup clustering: AN!Cluster

Cluster 9. SSH setup.

In this part of the cluster setup we will configure SSH so that we can access one node from the other without a password prompt. For that purpose we'll use public keys, generated on each node.

Create RSA keys

On both nodes:
ssh-keygen -t rsa -N "" -b 4095 -f ~/.ssh/id_rsa
-t specifies type of key to be created (rsa / dsa / rsa1 etc.)
-N specifies using an empty passphrase
-b specifies the number of bits in the key to create (for RSA minimum is 1024)
-f specifies the filename of the key file

Populate known hosts

ssh-keyscan agrp-c01n01 >> ~/.ssh/known_hosts
ssh-keyscan agrp-c01n01.bcn >> ~/.ssh/known_hosts
ssh-keyscan agrp-c01n01.sn >> ~/.ssh/known_hosts
ssh-keyscan agrp-c01n01.ifn >> ~/.ssh/known_hosts
ssh-keyscan agrp-c01n02 >> ~/.ssh/known_hosts
ssh-keyscan agrp-c01n02.bcn >> ~/.ssh/known_hosts
ssh-keyscan agrp-c01n02.sn >> ~/.ssh/known_hosts
ssh-keyscan agrp-c01n02.ifn >> ~/.ssh/known_hosts
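The same eight commands can be written as a short loop (equivalent to the lines above):
for host in agrp-c01n01 agrp-c01n02;
do
   for suffix in "" .bcn .sn .ifn;
   do
      ssh-keyscan ${host}${suffix} >> ~/.ssh/known_hosts;
   done;
done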

Copy Public Keys to Enable SSH Without a Password

In order to enable password-less login, we need to create a file called ~/.ssh/authorized_keys and put both nodes' public key in it. We will create the authorized_keys on agrp-c01n01 and then copy it over to agrp-c01n02.

Copy node 1 own RSA public key to the authorized_keys file:
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys

Remote copy node 2 RSA public key to the authorized_keys file:
ssh root@agrp-c01n02 "cat /root/.ssh/id_rsa.pub" >> ~/.ssh/authorized_keys

Verify file content:
In ~/.ssh/authorized_keys two entries must be, one for root@agrp-c01n01 and the other for root@agrp-c01n02

Copy authorized_keys from node 1 to the node 2:
rsync -av ~/.ssh/authorized_keys root@agrp-c01n02:/root/.ssh/

Verify password-less access: if you can access both nodes from each other, then everything is OK:
From node 1:
ssh root@agrp-c01n02
From node 2:
ssh root@agrp-c01n01

This tutorial was used to understand and setup clustering: AN!Cluster

Cluster 8. IPMI

IPMI is short for Intelligent Platform Management Interface - this can be used to read values from server sensors, monitor RAID etc.
IPMI uses a BMC (Baseboard Management Controller), which is the main controller of IPMI and manages the interface between the IPMI management software and the server hardware platform. The BMC is like a small computer inside the server.

Vendor names for IPMI:
  • Fujitsu calls theirs iRMC
  • HP calls theirs iLO
  • Dell calls theirs DRAC
  • IBM calls theirs RSA 
We will use this for fencing - if one node stops, fenced will use the IPMI fence agent to power off the peer node. If you remember, we used ports Gi 1/0/17 and Gi 2/0/17 to connect the cables going to each server's iLO port.

Install IPMI:
yum install ipmitool -y

Verify that ipmi device is seen:
ll /dev/ipmi*

Verify that ipmitool works:
ipmitool chassis status #if we have good output we're past 90% of the potential problems

ipmitool mc info #ipmi version, revision etc.

ipmitool fru print #field replaceable units - SN / CPU / RAM etc.

ipmitool sdr list #shows list of supported sensors

ipmitool sel elist #System Event Log (SEL) print errors

Finding IPMI LAN channel to assign it an IP:
ipmitool lan print NUM #start with NUM = 1 & increment until something different than "Invalid channel: 1" appears

Setup IPMI BMC:
ipmitool lan set 2 ipsrc static
ipmitool lan set 2 ipaddr 10.10.53.11 # node 2 IP will be 10.10.53.21
ipmitool lan set 2 netmask 255.255.255.0

Verify (IPMI IP, net-mask, default gateway, MAC):
ipmitool lan print 2

Finding the IPMI userID (we can skip this step because HP default iLO passwords are good enough):
ipmitool user list 2 #remember the userID of an admin user (will be used in the next step)
ipmitool user set password 1 p@$$w0rd #password in some IPMIs must be at least 8 characters long

From each node verify remote node:
From agrp-c01n01:
ipmitool -I lanplus -U Administrator -P p@$$w0rd -H agrp-c01n02.ipmi chassis power status
Response must be like:
Chassis Power is on
From agrp-c01n02:
ipmitool -I lanplus -U Administrator -P p@$$w0rd -H agrp-c01n01.ipmi chassis power status
Response must be like:
Chassis Power is on

Other useful commands:
Power-on:
ipmitool -I lanplus -U Administrator -P 8PVRREBK -H agrp-c01n01.ipmi chassis power on
Power-off gracefully (shutdown OS and power off the server to standby power mode):
ipmitool -I lanplus -U Administrator -P 8PVRREBK -H agrp-c01n01.ipmi chassis power soft
Reboot:
ipmitool -I lanplus -U Administrator -P 8PVRREBK -H agrp-c01n01.ipmi chassis power reset


This tutorial was used to understand and setup clustering: AN!Cluster


Friday, February 9, 2018

Cluster 7. What is split-brain, quorum, DLM & fencing, totem protocol & CPG.

Split-brain

split-brain is a state in which nodes lose contact with each other and then try to take control of shared resources or simultaneously provide clustered services. This leads to corrupting and losing data. To avoid split-brain situations quorum is used.

Quorum

Quorum algorithm used in the Red Hat Cluster is a simple majority meaning that more than half of the hosts must be online and communicating in order to provide services: (nodes_count / 2 + 1) rounding down:
  • If we have 3 nodes in a cluster, voices count = 3, quorum  = 3 / 2 + 1 = 1.5 + 1 = 2.5 ~ 2 , so at least 2 nodes are needed to form a new cluster after 1 node fails; if 2 nodes fail the cluster will hang
  • If we have 4 nodes in a cluster, voices count = 4, quorum = 4 / 2 + 1 = 2 + 1 = 3 , so at least 3 nodes needed to form new cluster after 1 node fails, if 2 nodes fail cluster will hang
  • If we have 5 nodes in a cluster, voices count = 5, quorum = 5 / 2 + 1 = 2.5 + 1 = 3.5 ~ 3 , so at least 3 nodes are needed to form a new cluster after 1 node fails; if 2 nodes fail the cluster will continue with the remaining 3 nodes
In cluster with 2 nodes any failure will cause 50/50 split, hanging both nodes. To make 2 nodes cluster fault-tolerant fencing is used (in corosync 2.4 we have options to use quorum with 2 node cluster but fencing also needed).

If cluster is split into two or more partitions, group of machines having quorum, can form new cluster.

PS: we could use qdisk to form quorum in a cluster of 2 nodes, but it does not work with DRBD, which we are going to use for HDD replication. Also we are going to use corosync 2.4, which has options like two_node & wait_for_all that do not work with qdisk.

Fencing aka STONITH

Fencing means putting the target node into a state where it can not affect cluster resources or provide cluster services. This can be accomplished by powering it off (power fencing) or disconnecting it from the SAN storage and/or network (fabric fencing).
Fencing is an absolutely critical part of clustering. Without fully functional fencing your cluster will fail. 
Linux-HA used STONITH ("Shoot The Other Node In The Head") term and Red Hat used the term - "fencing". Both terms can be used interchangeably.
When nodes fail or cluster split into partitions winning node or partition (winning here means - "having quorum") will fence losers (in two node cluster with corosync 2.4 one node will have quorum and try to fence the other node, with network failure this can end with fencing loop - both nodes fencing each other forever. To solve that - you need to setup delay in fencing for the preferred node).
If all (or the only) configured fence fails, fence daemon will start over. Fence daemon will wait and loop forever until a fence agent succeeds. During this time, the cluster is effectively hung.
Once a fence_agent succeeds, fence daemon notifies DLM and lost locks are recovered. This is how Fencing & DLM are cooperating.

DLM (Distributed Lock Manager)

File system locking in Linux is done by POSIX or other types of locks available in the system. DLM is used by the cluster storage and resource manager in order to organize and serialize access (it manages locks). The dlm daemon runs in user-space (kernel space is used to run OS-critical components and user-space is used for ordinary software); this daemon communicates with the DLM in the kernel. The lockspace (a lock on a definite resource) is given to the requesting node; the other node can request the lockspace only after the first node releases the lock.
PS - DLM is used only with cluster aware file-systems.

Totem protocol, CPG & virtual synchrony

Totem protocol is used to send token messages between cluster nodes. A token is passed around to each node, the node does some work, and then it passes the token on to the next node. This goes around and around all the time. Should a node not pass its token on after a short time-out period (defaults to 238ms), the token is declared lost, an error count (defaults to 4 losses) goes up and a new token is sent. If too many tokens are lost in a row, the node is declared lost. The cluster checks which members it still has, and if that provides enough votes for quorum.
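These timings can be tuned in the totem { } section of /etc/corosync/corosync.conf if your network needs more tolerance; a minimal sketch (the values are examples only, not the ones used in this setup):
token: 1000 # token loss timeout in milliseconds
token_retransmits_before_loss_const: 4 # how many retransmits are allowed before the node is declared lost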

The closed process group (CPG) is a small process layer on top of the totem protocol provided by corosync. It handles the sending and delivery of messages among nodes in a consistent order. It adds PIDs and group names to the membership layer. Only members of the group get the messages, thus it is a "closed" group. So in other words - CPG is simply a private group of processes in a cluster.
The ordered delivery of messages among cluster nodes is referred to as "virtual synchrony". 

Virtual synchrony (DLM & CPG cooperation)

DLM messages are delivered in order because they use totem's CPG. When a node wants to start a clustered service (a cluster-aware file system), it can start this service only after acquiring a lock from DLM. After starting this clustered service, the node announces it to the other nodes - the members of the CPG. So after working with DLM (when starting a clustered service, requesting a storage lock etc.) every member (node) notifies the other CPG members (nodes). 
Messages can only be sent to the members of the CPG while the node has a totem token from corosync.

This tutorial was used to understand and setup clustering: AN!Cluster


Wednesday, February 7, 2018

Cluster 6. Putting all devices hostnames and IP addresses to the /etc/hosts

Three different networks are used in our cluster:
  1. BCN (Back-Channel Network - 10.10.53.nodeIP/24) for cluster management traffic, IPMI, switches - node names will be as  agrp-c01n01.bcn
    1. IPMI IP will be - 10.clusterSerialNumber*10.53.nodeIP*10+1/24 (i.e. 10.10.53.11)
    2. switch stack Ip will be - 10.clusterSerialNumber*10.53.nodeIP*10+2/24 (i.e. 10.10.53.12):
      1. access stack and execute:
      2. int vlan 100
      3. ip address 10.10.53.12 255.255.255.0
      4. ip address 10.10.53.22 255.255.255.0 secondary
      5. do sh int vlan 100
  2. SN (Storage Network - 10.10.52.nodeIP/24) - node names will be as - agrp-c01n01.sn
  3. IFN (Internet-Facing Network - 172.16.51.ServerIP/24) - only for servers (virtual servers, hosted on a node) - node names will be as - agrp-c01n01.ifn
Connect the IPMI iLOs: agrp-c01n01 goes to 1/0/17 and agrp-c01n02 to 2/0/17 - Back-Channel Network - 10.clusterSerialNumber*10.53.nodeIP*10+1/24 => 10.10.53.11 and 10.10.53.21


Put below lines to the  /etc/hosts (it will be the same on both nodes):

### Nodes 
# agrp-c01n01
10.10.53.1    agrp-c01n01.bcn agrp-c01n01
10.10.53.11  agrp-c01n01.ipmi
10.10.52.1    agrp-c01n01.sn
172.16.51.1    agrp-c01n01.ifn

# agrp-c01n02
10.10.53.2    agrp-c01n02.bcn agrp-c01n02
10.10.53.21  agrp-c01n02.ipmi
10.10.52.2    agrp-c01n02.sn
172.16.51.2    agrp-c01n02.ifn

# Network Switches
10.10.53.12 agrp-stack01
10.10.53.22 agrp-stack01

Save and exit, then verify with the ping script below. This script will ping every host in the /etc/hosts file and then display the ping result, showing how many packets were sent, received and the packet loss:
for name in $(grep -E "^(172|10)" /etc/hosts | awk '{print $2}');
do
   echo "NAME=$name";
   ping $name -c 3 | grep "packet loss";
   echo "";
done

Only pings to agrp-c01n01.ipmi & agrp-c01n02.ipmi should respond with "Destination Host Unreachable", because we haven't yet set up IPMI

This tutorial was used to understand and setup clustering: AN!Cluster

Cluster 5. Nodes naming, Configuring Interfaces, Linux bonds and bridge or Open vSwitch.

Nodes naming convention:

  1. four letter code of the cluster owner name (i.e. AIST Group becomes agrp)
  2. plus c+cluster number (c01 - first cluster in a company)
  3. plus n+01 or 02 (node number in a cluster)
  4. so the name will be: agrp-c01n01 & agrp-c01n02
Change host-names on both nodes:
  • hostnamectl set-hostname agrp-c01n01 --static
  • hostnamectl status
  • logout
  • login
  • verify that hostname is displayed properly both on login screen and on CLI prompt:
    • agrp-c01n01 login:
    • [root@agrp-c01n01 ~]#
All commands below must be executed on both nodes (with proper IP addresses, here will be only agrp-c01n01 related commands).

Bond is the same as LAG (Link Aggregation) - RAID1 for network interfaces (if one goes down, the other will remain working).

Linux bonding and bridging 

Nodes naming and IP addresses:

Node        | IP & BCN dev            | IP & SN dev            | IP & IFN dev
agrp-c01n01 | 10.10.53.1 on bcn_bond1 | 10.10.52.1 on sn_bond1 | 172.16.51.1 on ifn_bridge1 (ifn_bond1 slaved)
agrp-c01n02 | 10.10.53.2 on bcn_bond1 | 10.10.52.2 on sn_bond1 | 172.16.51.2 on ifn_bridge1 (ifn_bond1 slaved)

In other articles the table was shown like this:

Subnet | VID | NIC Link 1     | NIC Link 2         | Bond     | Net IP
BCN    | 100 | eno1 bcn_link1 | eno4 back_link.100 | bcn_bond | 10.10.53.0/24
SN     | 200 | eno2 sn_link1  | eno4 back_link.200 | sn_bond  | 10.10.52.0/24
IFN    | 51  | eno3 ifn_link1 | eno4 back_link.51  | ifn_bond | 172.16.51.0/24

That was so for simplicity. We will be using VLAN on all physical interfaces, so the actual table must be:

Subnet | VID | NIC Link 1         | NIC Link 2         | Bond      | Net IP
BCN    | 100 | eno1 bcn_link1.100 | eno4 back_link.100 | bcn_bond1 | 10.10.53.0/24
SN     | 200 | eno2 sn_link1.200  | eno4 back_link.200 | sn_bond1  | 10.10.52.0/24
IFN    | 51  | eno3 ifn_link1.51  | eno4 back_link.51  | ifn_bond1 | 172.16.51.0/24

ifn_bridge1 will be used as a virtual switch for our servers (VMs) - it will give our VMs access to VLAN 51 (our IFN). ifn_bond1 will be slaved to ifn_bridge1 to connect it to the real world.
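Later a VM can be attached to this bridge; with libvirt, for example, something like the following should work (vm1 is a hypothetical guest name):
virsh attach-interface --domain vm1 --type bridge --source ifn_bridge1 --model virtio --config --live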

BCN setup

Setup back_link to be a member of  VLAN 100 (BCN):
back_link, only below lines must be in config file:

DEVICE=back_link
NAME=back_link
BOOTPROTO=none
ONBOOT=yes
HWADDR=proper_MAC_here

back_link.100

DEVICE=back_link.100
NAME=back_link.100
BOOTPROTO=none
ONBOOT=yes
VLAN=yes
SLAVE=yes
MASTER=bcn_bond1

Setup bcn_link1 to represent actual VLAN:
bcn_link1

DEVICE=bcn_link1
NAME=bcn_link1
BOOTPROTO=none
ONBOOT=yes
HWADDR=proper_MAC_here


bcn_link1.100

DEVICE=bcn_link1.100
NAME=bcn_link1.100
BOOTPROTO=none
ONBOOT=yes
VLAN=yes
SLAVE=yes
MASTER=bcn_bond1

Bonding options:
  1. mode=1 => Active/Passive
  2. miimon=100 => test interfaces every 100ms (MII (Media Independent Interface) means the media type can be anything - fiber, copper etc.; mon - monitoring) 
  3. downdelay=0 => when link goes down immediately switch to the other interface in bond
  4. updelay=120000 => switch back to the primary interface in 2 minutes
  5. use_carrier=1 => check the link state
Setup bcn_bond1:

vi /etc/sysconfig/network-scripts/ifcfg-bcn_bond1
DEVICE="bcn_bond1"
BOOTPROTO="none"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=bcn_link1.100"
IPADDR=10.10.53.1
NETMASK=255.255.255.0

systemctl restart network.service

After setting agrp-c01n02, verify ping between nodes:

agrp-c01n01# ping 10.10.53.2
agrp-c01n02# ping 10.10.53.1

SN setup

Setup back_link to be a member of  VLAN 200 (SN):

back_link.200

DEVICE=back_link.200
NAME=back_link.200
BOOTPROTO=none
ONBOOT=yes
VLAN=yes
SLAVE=yes
MASTER=sn_bond1

Setup sn_link1 to represent actual VLAN:
sn_link1

DEVICE=sn_link1
NAME=sn_link1
BOOTPROTO=none
ONBOOT=yes
HWADDR=proper_MAC_here


sn_link1.200

DEVICE=sn_link1.200
NAME=sn_link1.200
BOOTPROTO=none
ONBOOT=yes
VLAN=yes
SLAVE=yes
MASTER=sn_bond1

Setup sn_bond1:

vi /etc/sysconfig/network-scripts/ifcfg-sn_bond1
DEVICE="sn_bond1"
BOOTPROTO="none"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=sn_link1.100"
IPADDR=10.10.52.1
NETMASK=255.255.255.0

systemctl restart network.service

After setting agrp-c01n02, verify ping between nodes:

agrp-c01n01# ping 10.10.52.2
agrp-c01n02# ping 10.10.52.1

IFN setup

Setup back_link to be a member of  VLAN 51 (IFN):

back_link.51

DEVICE=back_link.51
NAME=back_link.51
BOOTPROTO=none
ONBOOT=yes
VLAN=yes
SLAVE=yes
MASTER=ifn_bond1

Setup ifn_link1 to represent actual VLAN:
ifn_link1

DEVICE=ifn_link1
NAME=ifn_link1
BOOTPROTO=none
ONBOOT=yes
HWADDR=proper_MAC_here


ifn_link1.51

DEVICE=ifn_link1.51
NAME=ifn_link1.51
BOOTPROTO=none
ONBOOT=yes
VLAN=yes
SLAVE=yes
MASTER=ifn_bond1

Setup ifn_bond1:

vi /etc/sysconfig/network-scripts/ifcfg-ifn_bond1
DEVICE="ifn_bond1"
BOOTPROTO="none"
ONBOOT="yes"
BONDING_OPTS="mode=1 miimon=100 use_carrier=1 updelay=120000 downdelay=0 primary=ifn_link1.51"
BRIDGE=ifn_bridge1

Setup ifn_bridge1:
DEFROUTE=yes allows this interface to be used as the default route (the window to the outer world)

vi /etc/sysconfig/network-scripts/ifcfg-ifn_bridge1
DEVICE=ifn_bridge1
TYPE=Bridge
BOOTPROTO=none
IPADDR=172.16.51.1
NETMASK=255.255.255.0
GATEWAY=172.16.51.254
DNS1=8.8.8.8
DNS2=8.8.4.4
DEFROUTE=yes

systemctl restart network.service

ping default gateway:
ping 172.16.51.254

After setting agrp-c01n02, verify ping between nodes:

agrp-c01n01# ping 172.16.51.2
agrp-c01n02# ping 172.16.51.1

Verifying

On both nodes verify master and slaves and interface states (Up/Down):
ip link | grep ifn
ip link | grep sn
ip link | grep bcn
ip link | grep back

Verify bonds (settings, slave status, failures count):
cat /proc/net/bonding/ifn_bond1
cat /proc/net/bonding/sn_bond1
cat /proc/net/bonding/bcn_bond1

Verify bridge (ifn_bridge1 must be shown, STP enabled must be no):
brctl show

PS if you encounter MAC flapping error on Cisco stack, like:
%SW_MATM-4-MACFLAP_NOTIF: Host aaaa.bbbb.cccc in vlan 51 is flapping between port Po1 and port Gi2/0/4
Then add the MACADDR parameter to all bond interfaces. This MACADDR must be equal to the MAC of the non-backup link, because back_link is in 3 VLANs and that can cause the bond to choose the back_link NIC's MAC for all bond and VLAN interfaces (by default a bond uses the first added slave's MAC as its own MAC).
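A minimal sketch of that fix, assuming bcn_link1's MAC is aa:bb:cc:dd:ee:01 (a placeholder value):
vi /etc/sysconfig/network-scripts/ifcfg-bcn_bond1
MACADDR=aa:bb:cc:dd:ee:01 # force the bond (and its VLAN interfaces) to use the primary link's MAC instead of back_link's
Repeat for sn_bond1 and ifn_bond1 with their primary links' MACs, then restart network.service.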

Open vSwitch

We will bond all 4 interfaces (from eno1 through eno4) to the OvS bond ovs_bond. And then we'll create OvS internal ports and assign them IP:

Subnet | VID | OvS internal port | Net IP
BCN    | 100 | bcn-bond1         | 10.10.53.0/24
SN     | 200 | sn-bond1          | 10.10.52.0/24
IFN    | 51  | ifn-bond1         | 172.16.51.0/24

Nodes naming and IP addresses:

Node        | IP & BCN dev            | IP & SN dev            | IP & IFN dev
agrp-c01n01 | 10.10.53.1 on bcn-bond1 | 10.10.52.1 on sn-bond1 | 172.16.51.1 on ifn-bond1
agrp-c01n02 | 10.10.53.2 on bcn-bond1 | 10.10.52.2 on sn-bond1 | 172.16.51.2 on ifn-bond1

Below commands must be executed on both nodes (with parameters appropriate to each node)


Create OvS bridge and bonds

Create OvS switch:
ovs-vsctl add-br ovs_kvm_bridge

Disable STP on this bridge:
ovs-vsctl set bridge ovs_kvm_bridge stp_enable=false

Add the bond to ovs_kvm_bridge:
ovs-vsctl add-bond ovs_kvm_bridge ovs_bond eno1 eno2 eno3 eno4 trunks=100,200,51
In future if you need to add new VLANs to the trunk, execute below command with proper VLANs list:
ovs-vsctl set port ovs_bond trunks=100,200,300,400 etc.

Enabling LACP LAG protocol:
ovs-vsctl set port ovs_bond lacp=active bond_mode=balance-slb  bond-updelay=120000 bond-downdelay=0 other_config:lacp-time=fast  other_config:lacp-fallback-ab=true # no space is allowed in the "other_config:lacp-..." parts of the configuration

To view bond interface configuration:
ovs-vsctl list Port ovs_bond

If you made mistake while configuring (i.e. wrote "lacp_time" instead of "lacp-time"):
ovs-vsctl remove port ovs_bond other_config lacp_time fast

lacp-time - either slow or fast - defines whether LACP packets are sent every 1 second or every 30 seconds.
lacp-fallback-ab - if LACP fails, Active-Backup bonding will be used
balance-slb - Source-load Balancing (this is default on Cisco LACP bonds - sh run all | incl load-balance will give you src-mac):

  1. The source MAC address is extracted, and a hashing algorithm is used to map it to a hash number 0-255. 
  2. Each hash is assigned to one of the NICs on the bond, which means packets with the same hash are always sent through the same NIC. 
  3. If a new hash is found, it is assigned to the NIC that currently has the lowest utilization. 
  4. In practice, this means that when virtual machines (VMs) are set up on a bond, packets from one VM (with the same source MAC) will always be sent through the same NIC.


To remove ports and bridge (if something went wrong):
ovs-vsctl del-port ovs_kvm_bridge ovs_bond
ovs-vsctl del-br ovs_kvm_bridge

To view bond configuration:
ovs-appctl bond/show ovs_bond # bond_mode must be active-backup / lacp_status = negotiated / all interfaces slave eno{1..4} : enabled
ovs-appctl lacp/show ovs_bond | head -n 7 # status: active negotiated / lacp_time: fast  


To view MAC address table:
ovs-appctl fdb/show ovs_kvm_bridge

Below command can be used to verify overall OvS bridge configuration (including STP status), -S option makes output scroll-able with keyboard left-right arrow keys:
ovsdb-client dump | less -S

Create OvS internal ports for node and assign them IP:

Setup IFN ifn-bond1, make it internal and assign VLAN ID 51:
ovs-vsctl add-port ovs_kvm_bridge ifn-bond1 -- set interface ifn-bond1 type=internal -- set port ifn-bond1 tag=51

Assign an IP to ifn-bond1:

vi /etc/sysconfig/network-scripts/ifcfg-ifn-bond1
DEVICE=ifn-bond1
NAME=ifn-bond1
ONBOOT=yes
BOOTPROTO=none
IPADDR=172.16.51.1
NETMASK=255.255.255.0
GATEWAY=172.16.51.254
DNS1=8.8.8.8
DNS2=8.8.4.4
DEFROUTE=yes

Setup BCN bcn-bond1, make it internal and assign VLAN ID 100:
ovs-vsctl add-port ovs_kvm_bridge bcn-bond1 -- set interface bcn-bond1 type=internal -- set port bcn-bond1 tag=100

vi /etc/sysconfig/network-scripts/ifcfg-bcn-bond1
DEVICE="bcn-bond1"
BOOTPROTO="none"
ONBOOT="yes"
IPADDR=10.10.53.1
NETMASK=255.255.255.0

Setup SN sn-bond1, make it internal and assign VLAN ID 200:
ovs-vsctl add-port ovs_kvm_bridge sn-bond1
ovs-vsctl set interface sn-bond1 type=internal
ovs-vsctl set port sn-bond1 tag=200

vi /etc/sysconfig/network-scripts/ifcfg-sn-bond1
DEVICE="sn-bond1"
BOOTPROTO="none"
ONBOOT="yes"
IPADDR=10.10.52.1
NETMASK=255.255.255.0

systemctl restart network.service

To list all ports which OpenvSwitch sees:
ovs-vsctl list-ports ovs_kvm_bridge # will show:
bcn-bond1
ifn-bond1
ovs_bond
sn-bond1

To listen to the ports traffic:
yum install tcpdump
tcpdump -i port_name # port name is one of the ports seen by OvS

Verifying

IFN test:
From agrp-c01n01 ping 172.16.51.2
From agrp-c01n01 ping 172.16.51.254
From agrp-c01n02 ping 172.16.51.1
From agrp-c01n02 ping 172.16.51.254
BCN test:
From agrp-c01n01 ping 10.10.53.2
From agrp-c01n02 ping 10.10.53.1
SN test:
From agrp-c01n01 ping 10.10.52.2
From agrp-c01n02 ping 10.10.52.1

After-setup steps

Back up the configs after the setup is done (whether using Linux bonding and bridging or Open vSwitch):
rsync -av /etc/sysconfig/network-scripts /root/backups

These tutorials were used to understand and set up clustering: 
AN!Cluster
brezular.com
citrix.com

Tuesday, February 6, 2018

Cluster 4. Mapping physical interfaces to device names

If you are going to use OvS networking, then skip this article and go to the "Cluster 5" article.

Subnet | VID | NIC Link 1     | NIC Link 2         | Bond     | Net IP
BCN    | 100 | eno1 bcn_link1 | eno4 back_link.100 | bcn_bond | 10.10.53.0/24
SN     | 200 | eno2 sn_link1  | eno4 back_link.200 | sn_bond  | 10.10.52.0/24
IFN    | 51  | eno3 ifn_link1 | eno4 back_link.51  | ifn_bond | 172.16.51.0/24

First we'll disable the new NIC naming which comes with CentOS 7 by default (biosdevname=0 is a kernel boot parameter, not a shell command; the symlink masks the udev naming rule):
biosdevname=0
ln -s /dev/null /etc/udev/rules.d/80-net-name-slot.rules

Lets add MAC addresses to all ifcfg files (on the both nodes):
cd /etc/sysconfig/network-scripts
for int in $(ls -1 ifcfg-eno* | cut -d'-' -f2);
do
   mac=$(ip link show $int | grep ether | awk '{print $2}');
   echo "HWADDR=\"$mac\"" >> ifcfg-$int;
done

Now rename device names according to the table above:
mv ifcfg-eno1 ifcfg-bcn_link1
mv ifcfg-eno2 ifcfg-sn_link1
mv ifcfg-eno3 ifcfg-ifn_link1
mv ifcfg-eno4 ifcfg-back_link

Now change NAME and DEVICE parameters in the appropriate files:
for name in bcn_link1 sn_link1 ifn_link1 back_link;
do
   sed -i "s/DEVICE=.*/DEVICE=$name/" ifcfg-$name;
   sed -i "s/NAME=.*/NAME=$name/" ifcfg-$name;
done

Now we can reboot each node and verify NIC names:
reboot
After reboot to verify proper naming, execute:
ip link

The last step will be manual test (on both nodes):

  1. tail -f -n 0 /var/log/messages # -f means "as file growth" , -n 0 means "initially show no lines"
  2. then unplug and plug-in cables
  3. if every cable becomes "Link is down" and then "Link is up", then everything is OK 
  4. backup new configs:
  5. rsync -av /etc/sysconfig/network-scripts /root/backups

This tutorial was used to understand and setup clustering: AN!Cluster

Cluster 3. Network Switches.

You can use any switches that support VLAN and multicasting groups (also it's possible to use unicast at least with corosync 2.4.0). I'll use Cisco Catalyst 2960-S stackable switches (they form a stack - which can be managed as one switch).
We will use 2 Cisco stack switches (you can use just one switch) in order to make links redundant - eno4 of nodes must be connected to different switches, eno1, eno2, eno3 must be connected to the switch where the other node's eno4 interface is connected.

To setup stack:
  1. Connect stack cables to the proper ports Stack1 to Stack1 and Stack2 to Stack2
  2. Connect console cable to the Cisco and to your PC/NB
  3. power on switches
  4. verify that the switch knows who is the master and who is the slave:
    1. do sh sw
    2. Master switch  Role must be - Master, current state - Ready
    3. Slave switch Role must be - Member, current state  - Ready
    4. Also verify that MSTR led on master switch is green
  5. Verify stack-ports:
    1. do sh sw stack-ports - all ports must be - OK
    2. power-off master switch and verify that slave becomes master:
      1. do sh sw
      2. Removed switch Role must change to - Member, State must change to - Removed
      3. Remaining switch Role must change to - Master, State must change to - Ready
Now you can setup stack from any switch:
  1. name stack like agrp-stack01 (agrp is 4 letter owner code, stack01 is simply serial number of the stack, it's our first stack and because of that stack serial number is 01):
    1. hostname agrp-stack01
  2. configure username and password (simple passwords are given for reference only):
    1. aaa new-model
    2. aaa authentication login default local
    3. username admin privilege 15 secret 123456
    4. enable secret 12345
  3. create VLANs:
    1. vlan 100
    2. name BCN
    3. vlan 200
    4. name SN
    5. vlan 51
    6. name IFN
    7. vlan 1000
    8. name Default
  4. disable all interfaces:
    1. int range gi 1/0/1 - 28, gi 2/0/1-28
    2. shut
    3. do wr
  5. enable needed interfaces and make them members of the needed VLAN, also disable STP on these VLANs (STP blocks traffic while converging to prevent loops; such behavior can cause a node to think the other node is dead while it is actually alive):
    1. disable STP for BCN and SN:
      1. no spanning-tree vlan 100
      2. no spanning-tree vlan 200
      3. no spanning-tree vlan 51
    2. port 1/0/24 & 2/0/24 will be IFN uplink ports - going to the other switch:
      1. int ra gi 1/0/24 , gi 2/0/24
        1. no shut
        2. create LAG/bonding; in my case the other switch is a single switch, not a stack:
        3. channel-group 1 mode on
      2. configure Port-Channel1:
        1.  int Po1
        2. sw mode access
        3. sw nonegotiate
        4. sw access vlan 51
        5. show etherchannel summary:
          1. "Ports" must be - Gi1/0/24(P) Gi2/0/24(P) - meaning that both ports are bundled in a LAG
      3. Configure iLO port - port 1/0/17 & 2/0/17 will be iLO interface connected ports (here we will insert node1 iLO to the switch1 and node2 iLO to the switch2):
        1. node1 iLO - 1/0/17 - label this cable c01n01_ipmi
        2. node2 iLO - 2/0/17 - label this cable c01n02_ipmi
        3. int ra gi 1/0/17 , gi 2/0/17
          1. no shut
          2. sw nonegotiate
          3. sw mode access
          4. sw access vlan 100
        4. int gi1/0/17
          1. description agrp-c01n01
        5. int gi2/0/17
          1. description agrp-c01n02
Switch ports will be used as 5 ports per cluster, so that we can use each switch stack for serving up to 4 clusters:

  • 1st cluster - gi1/0/1-4,17 & gi2/0/1-4,17
  • 2nd cluster - gi1/0/5-8,19 & gi2/0/5-8,19
  • 3rd cluster - gi1/0/9-12,21 & gi2/0/9-12,21
  • 4th cluster - gi1/0/13-16,23 & gi2/0/13-16,23

We have two options to further set up our stack: one is for Linux bonding and bridging and the other is for Open vSwitch.

Linux bonding and bridging

Subnet | VID | NIC Link 1     | NIC Link 2         | Bond     | Net IP
BCN    | 100 | eno1 bcn_link1 | eno4 back_link.100 | bcn_bond | 10.10.53.0/24
SN     | 200 | eno2 sn_link1  | eno4 back_link.200 | sn_bond  | 10.10.52.0/24
IFN    | 51  | eno3 ifn_link1 | eno4 back_link.51  | ifn_bond | 172.16.51.0/24
    1. port 1/0/1 & 2/0/1 will be BCN ports:
      1. node1 eno1 - 1/0/1
      2. node2 eno1 - 2/0/1
        1. int ra gi 1/0/1 , gi 2/0/1
          1. no shut
          2. sw mode trunk
          3. sw nonegotiate
          4. sw trunk allowed vlan 100
          5. sw trunk native vl 1000
      3. port 1/0/2 & 2/0/2 will be SN ports:
        1. node1 eno2 - 1/0/2
        2. node2 eno2 - 2/0/2
        3. int ra gi 1/0/2 , gi 2/0/2
          1. no shut
          2. sw mode trunk
          3. sw nonegotiate
          4. sw trunk allowed vlan 200
          5. sw trunk native vl 1000
      4. port 1/0/3 & 2/0/3 will be IFN ports:
        1. node1 eno3 - 1/0/3
        2. node2 eno3 - 2/0/3
        3. int ra gi 1/0/3 , gi 2/0/3
          1. no shut
          2. sw mode trunk
          3. sw nonegotiate
          4. sw trunk allowed vlan 51
          5. sw trunk native vl 1000
      5. port 1/0/4 & 2/0/4 will be backup ports (here we will insert node1 port eno4 to the switch2 and node2 eno4 to the switch1):
        1. node1 eno4 - 2/0/4
        2. node2 eno4 - 1/0/4
        3. int ra gi 1/0/4 , gi 2/0/4
          1. no shut
          2. sw nonegotiate
          3. sw mode trunk
          4. sw trunk allowed vlan 100,200,51
          5. sw trunk native vl 1000

    Open vSwitch

    We will bond all 4 interfaces (from eno1 through eno4) to the OvS bond - ovs_bond.
    And then we'll create OvS internal ports and assign them IP:

    Subnet | VID | OvS internal port | Net IP
    BCN    | 100 | bcn-bond1         | 10.10.53.0/24
    SN     | 200 | sn-bond1          | 10.10.52.0/24
    IFN    | 51  | ifn-bond1         | 172.16.51.0/24
      1. ports Gi1/0/1-1/0/4 and Gi2/0/1-2/0/4 will be trunk ports carrying all VLANs:
        1. node1 eno1 and eno3 - 1/0/1 & 1/0/3 - label this 2 cables eno1_c01n01_ovs_bond and eno3_c01n01_ovs_bond
        2. node1 eno2 and eno4 - 2/0/2 & 2/0/4 - label this 2 cables eno2_c01n01_ovs_bond and eno4_c01n01_ovs_bond
        3. node2 eno1 and eno3 - 2/0/1 & 2/0/3 - label this 2 cables eno1_c01n02_ovs_bond and eno3_c01n02_ovs_bond
        4. node2 eno2 and eno4 - 1/0/2 & 1/0/4 - label this 2 cables eno2_c01n02_ovs_bond and eno4_c01n02_ovs_bond
          1. int ra gi 1/0/1, gi 1/0/3, gi 2/0/2, gi 2/0/4
            1. description agrp-c01n01
            2. channel-group 2 mode active #enabling LACP use different channel-group numbers for nodes
            3. no shut
          2. int Po2
            1. description agrp-c01n01
            2. sw mode trunk
            3. sw nonegotiate
            4. sw trunk allowed vlan 100,200,51
            5. sw trunk native vl 1000
            6. no shut
          3. int ra gi 1/0/2, gi 1/0/4, gi 2/0/1, gi 2/0/3
            1. description agrp-c01n02
            2. channel-group 3 mode active #enabling LACP use different channel-group numbers for nodes
            3. no shut
          4. int Po3
            1. description agrp-c01n02
            2. sw mode trunk
            3. sw nonegotiate
            4. sw trunk allowed vlan 100,200,51
            5. sw trunk native vl 1000
            6. no shut
          5. sh int port-channel {1|2|3} # to view info about LAG interfaces (choose needed LAG number)
      LACP bandwidth - the maximum throughput of a single flow will remain equal to the throughput of a single link. In fact you get more lanes to move traffic, but the maximum speed per flow remains the same. By enabling LACP you increase the maximum overall bandwidth. This is achieved using load-balancing (the Cisco default LB mechanism is source-MAC balancing).

        This tutorial was used to understand and setup clustering: AN!Cluster 

        Cluster 2. Post Install steps, Networking initial setup.

        First and most important:
        YOU MUST HAVE PHYSICAL ACCESS TO BOTH NODES IN ORDER TO SETUP CLUSTER

        As of now we have 2 nodes (physical hardware servers, which are/will be members of a cluster) with the OS installed. In this article we will do some steps needed after the installation of the OS and iLO access.
        1. First of all connect both servers to the network and give them Internet access permissions (don't think about IP addressing scheme, for now we just need Internet access)
        2. on both nodes
          1. yum update -y
          2. NM makes many decisions itself which is not appropriate for cluster:
            1. yum remove NetworkManager -y
          3. verify that firewalld enabled and started:
            1. systemctl status firewalld
        Backup existing network configs (on both nodes):
        mkdir -p /root/backups/
        yum install rsync -y
        # -v - be verbose
        # -a archive-mode (recursive, copy links, preserve (permissions, timestamps, owners, groups, dev-files)
        rsync -av /etc/sysconfig/network-scripts /root/backups/

        Enabling all interfaces (on both nodes):
        cd /etc/sysconfig/network-scripts
        for int in $(ls -1 ifcfg-eno*);
        do
           sed -i 's/ONBOOT=.*/ONBOOT="yes"/' $int;
           sed -i 's/BOOTPROTO=.*/BOOTPROTO="none"/' $int;
        done
        to check changes:
        # -U0 will show only the changed lines in the diff
        # verify all files:
        for int in $(ls -1 ifcfg-eno*);
        do
           diff -U0 /root/backups/network-scripts/ifcfg-$int ifcfg-$int;
        done
        systemctl enable network.service
        systemctl start network.service
        systemctl status network.service
        # to verify that all interfaces are enabled:
        ip link

        Four different networks will be used in our cluster:
          1. BCN (Back-Channel Network - 10.clusterSerialNumber*10.53.nodeIP/24) - for cluster management 
          2. IPMIN (IPMI/iLO Network - 10.clusterSerialNumber*10.53.nodeIP*10+1/24) - for cluster management 
          3. SN (Storage Network - 10.clusterSerialNumber*10.52.nodeIP/24) - for nodes storage replication 
          4. IFN (Internet-Facing Network - 172.16.51.ServerIP/24) - for access to nodes and for servers (virtual servers, hosted on a node)
          Disable IPv6:
          vi /etc/sysctl.conf
          net.ipv6.conf.default.disable_ipv6 = 1
          net.ipv6.conf.all.disable_ipv6 = 1

          Disable zeroconf (Ip addresses starting with 169):
          vi /etc/sysconfig/network
          NOZEROCONF=true

          reboot both servers

          Two options in networking setup

          We have two options for the networking setup. One option is to use Linux bonding and bridging, provided by the kernel and the bridge-utils package; the other is to use OvS (Open vSwitch).
          Linux bridges don't "understand" VLANs; a bridge just connects the VM servers' virtual ports to the outer world. To support more than one VLAN with Linux bridges we need to set up as many bridges as the number of VLANs we want to serve.
          OvS is a broader approach, as in the future you can easily add more VLANs to your cluster (you can serve VMs in more than one VLAN). We can also say that OvS supports all (or many of) the features a normal hardware switch supports.

          Linux bonding and bridging

          yum install bridge-utils -y

          We will be using four interfaces, bonded into three pairs, each consisting of one physical NIC with a VLAN and one sub-NIC (of eno4) with a VLAN, in Active/Passive configuration (mode=1; other modes are not recommended for a reliable clustering environment).
          It's our first cluster, so clusterSerialNumber = 1. The IP address 2nd octet will be 1 * 10 = 10
          eno1 => bcn_link1
          eno2 => sn_link1
          eno3 => ifn_link1
          eno4 => back_link.100 / back_link.200 / back_link.51

          To find physical port corresponding to the CentOS links, you can use ethtool -p command, i.e.:
          ethtool -p eno4  # physical port corresponding to this interface will blink until you Ctrl+C

          Subnet | VID | NIC Link 1     | NIC Link 2         | Bond     | Net IP
          BCN    | 100 | eno1 bcn_link1 | eno4 back_link.100 | bcn_bond | 10.10.53.0/24
          SN     | 200 | eno2 sn_link1  | eno4 back_link.200 | sn_bond  | 10.10.52.0/24
          IFN    | 51  | eno3 ifn_link1 | eno4 back_link.51  | ifn_bond | 172.16.51.0/24

          Open vSwitch

          Proceed to this link to install OvS.

          We will be using four interfaces, bonded into two pairs in Active/Passive configuration.
          It's our first cluster, so clusterSerialNumber = 1. The IP address 2nd octet will be 1 * 10 = 10
          eno1 => ovs_bond1
          eno2 => ovs_bond2
          eno3 => ovs_bond1
          eno4 => ovs_bond2

          To find physical port corresponding to the CentOS links, you can use ethtool -p command, i.e.:
          ethtool -p eno4  # physical port corresponding to this interface will blink until you Ctrl+C

          We will bond all 4 interfaces (from eno1 through eno4) to the OvS bonds. And then we'll create OvS internal ports and assign them IPs:

          Subnet | VID | OvS internal port | Net IP
          BCN    | 100 | bcn-bond1         | 10.10.53.0/24
          SN     | 200 | sn-bond1          | 10.10.52.0/24
          IFN    | 51  | ifn-bond1         | 172.16.51.0/24

          This tutorial was used to understand and setup clustering: AN!Cluster