Thursday, December 20, 2018

Installing Asterisk on CentOS7 (step-by-step)

Shrink home volume size and extend root volume size (if needed): 
umount /home
df -h # verify that /home is unmounted
parted -l # to find the type of the home LV filesystem (xfs in my case)
lvremove /dev/centos/home # remove LV from the VG
lvcreate -L 5G -n home centos
mkfs.xfs /dev/mapper/centos-home
mount -a
lvs
lsblk
df -h
lvextend -l +100%FREE /dev/centos/root
xfs_growfs /dev/centos/root
df -h
lsblk
lvs

Setup networking:
systemctl stop NetworkManager
systemctl disable NetworkManager
chkconfig network on
systemctl start network
vi /etc/sysconfig/network-scripts/ifcfg-eth0 # assign IPADDR / PREFIX / GATEWAY / DNS1 / DNS2
vi /etc/sysconfig/network and add the two lines below:
NETWORKING=yes
NOZEROCONF=true
systemctl restart network

Disable SELinux:
vi /etc/sysconfig/selinux and change:
SELINUX=enforcing to SELINUX=disabled
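
If you prefer a one-liner (this mirrors the sed approach the PrestaShop post below uses; setenforce 0 switches SELinux to permissive immediately, while the disabled state only takes full effect after a reboot):
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config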

Setup NTP:
yum install -y ntp && ntpdate pool.ntp.org && chkconfig ntpd on && service ntpd start
systemctl enable ntpd.service
systemctl status ntpd.service
systemctl start ntpd.service
systemctl status ntpd.service

Preinstall steps:
adduser asteriskpbx && passwd asteriskpbx && yum install sudo && visudo # uncomment the "%wheel  ALL=(ALL)       ALL" line
vi /etc/group # add asteriskpbx to wheel group: wheel:x:10:root,asteriskpbx
usermod -L asteriskpbx # lock the asteriskpbx password (no direct login)

Before Asterisk installation, install all prerequisite packages:
yum -y install make gcc gcc-c++ subversion libxml2-devel ncurses-devel openssl-devel vim-enhanced man glibc-devel autoconf libnewt kernel-devel kernel-headers linux-headers zlib-devel libsrtp libsrtp-devel uuid libuuid-devel mariadb-server jansson-devel libsqlite3x libsqlite3x-devel epel-release.noarch bash-completion bash-completion-extras unixODBC unixODBC-devel libtool-ltdl libtool-ltdl-devel mysql-connector-odbc mlocate libiodbc

Initial MariaDB setup:
systemctl status mariadb
systemctl enable mariadb
systemctl start mariadb
systemctl status mariadb
/usr/bin/mysql_secure_installation

Asterisk Installation:
yum update -y
yum install wget -y
mkdir -p ~/src/asterisk/
cd ~/src/asterisk/
wget http://downloads.asterisk.org/pub/telephony/asterisk/asterisk-13-current.tar.gz
tar -xzvf asterisk-13-current.tar.gz 
cd asterisk-13.23.1/
If your system is 64 bit and you want PJSIP:
./configure --libdir=/usr/lib64 --with-pjproject-bundled
Otherwise:
./configure
menuselect/menuselect --list-options
make
make install
make samples
make config
safe_asterisk
systemctl status asterisk

After install steps:
chown -R asteriskpbx:asteriskpbx /usr/lib/asterisk/
chown -R asteriskpbx:asteriskpbx /usr/lib64/asterisk/
chown -R asteriskpbx:asteriskpbx /var/lib/asterisk
chown -R asteriskpbx:asteriskpbx /var/spool/asterisk/
chown -R asteriskpbx:asteriskpbx /var/log/asterisk/
chown -R asteriskpbx:asteriskpbx /var/run/asterisk/
chown -R asteriskpbx:asteriskpbx /usr/sbin/ast
chown -R asteriskpbx:asteriskpbx /usr/sbin/asterisk
asterisk -r
exit

If you want PJSIP, load the modules needed for it:

vi /etc/asterisk/modules.conf
[modules]
;AUTOLOAD
autoload=yes
;LOAD
preload => res_odbc.so
preload => res_config_odbc.so
load => res_pjsip.so
load => res_pjsip_pubsub.so
load => res_pjsip_session.so
load => chan_pjsip.so
load => res_pjsip_exten_state.so
load => res_pjsip_authenticator_digest.so
load => res_timing_timerfd.so

vi /etc/asterisk/asterisk.conf and uncomment and change these 2 lines:
runuser = asteriskpbx           ; The user to run as.
rungroup = asteriskpbx        ; The group to run as.

asterisk -rx "core restart now"

Setup CEL with ODBC

odbcinst -j # verify that odbcinst.ini & odbc.ini are created
mysql -u root -p mysql
> CREATE USER 'asterisk'@'%' IDENTIFIED BY 'some_secret_password';
> CREATE DATABASE asterisk;
> GRANT ALL PRIVILEGES ON asterisk.* TO 'asterisk'@'%';

mysql -u asterisk -p asterisk # check access with asterisk user
odbcinst -q -d # verify that [MySQL] driver is seen
vi /etc/odbc.ini and add the following:
[asterisk-connector]  # connector-name / DSN
Description =MySQL connection to 'asterisk' database
Driver =MySQL # driver name from the odbcinst.ini
Database =asterisk # database to connect
Server =localhost  # we'll connect to the server itself
charset =UTF8
UserName =asterisk # DB user-name for asterisk DB
Password =some_secret_password # asterisk DB-user password
Port =3306
Socket =/var/lib/mysql/mysql.sock

echo "select 1" | isql -v asterisk-connector asterisk 'some_secret_password' # check connection

Add the connector to /etc/asterisk/res_odbc.conf:
[asterisk]
enabled => yes
dsn => asterisk-connector ; DSN is Data Source Name (from odbc.ini)
username => asterisk
password => some_secret_password
;pooling => no ; old option, replaced by max_connections
;limit => 1  ; old option, replaced by max_connections
pre-connect => yes
max_connections => 1

asterisk -rx "core restart now"
asterisk -rx "odbc show" # must show "Number of active connections: 1 (out of 1)"

vi /etc/asterisk/cel.conf and add the following:
[general]
enable=yes
apps=all
events=all

vi /etc/asterisk/cel_odbc.conf and add the following:
[mycel]
connection=asterisk
table=cel

vi /etc/my.cnf and add the following under [mysqld]:
character_set_server=utf8

mysql -u asterisk -p asterisk
CREATE TABLE cel
(
id INT(20) NOT NULL AUTO_INCREMENT,
eventtype INT(11) NOT NULL,
eventtime TIMESTAMP NOT NULL,
userdeftype VARCHAR(30) COLLATE utf8_general_ci NULL,
cid_name VARCHAR(80) COLLATE utf8_general_ci NULL,
cid_num VARCHAR(80) COLLATE utf8_general_ci NULL,
cid_ani VARCHAR(80) COLLATE utf8_general_ci NULL,
cid_rdnis VARCHAR(80) COLLATE utf8_general_ci NULL,
cid_dnid VARCHAR(80) COLLATE utf8_general_ci NULL,
exten VARCHAR(30) COLLATE utf8_general_ci NULL,
context VARCHAR(30) COLLATE utf8_general_ci NULL,
channame VARCHAR(30) COLLATE utf8_general_ci NULL,
appname VARCHAR(30) COLLATE utf8_general_ci NULL,
appdata VARCHAR(150) COLLATE utf8_general_ci NULL,
accountcode VARCHAR(30) COLLATE utf8_general_ci NULL,
peeraccount VARCHAR(30) COLLATE utf8_general_ci NULL,
uniqueid VARCHAR(30) COLLATE utf8_general_ci NULL,
linkedid VARCHAR(30) COLLATE utf8_general_ci NULL,
amaflags INT(11) NULL,
userfield VARCHAR(30) COLLATE utf8_general_ci NULL,
peer VARCHAR(30) COLLATE utf8_general_ci NULL,
PRIMARY KEY (id),
KEY uniqueid (uniqueid)
)
ENGINE=INNODB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;

asterisk -r
*CLI> module reload res_odbc.so
*CLI> module reload cel_odbc.so
*CLI> cel show status




Monday, November 26, 2018

SSH Tunnel (Local, Remote)

ssh-host is the host where the "ssh" command is executed to create the ssh-tunnel
ssh-peer is the host to which ssh-host connects via SSH to form the ssh-tunnel
destination-host is the host we want to access over the ssh-tunnel

For both Local and Remote SSH tunnels:
  1. connection from ssh-host to ssh-peer must be allowed
  2. only traffic between ssh-host and ssh-peer is encrypted (this traffic is inside the ssh-tunnel); traffic after the ssh-tunnel (the connection to the destination-host itself) is not encrypted, and security there depends on the application protocol being used (HTTP, FTP will remain unencrypted / HTTPS, SSH will be encrypted)

Local

Generally: 
  1. created on ssh-host
  2. accessed from ssh-host 
  3. port is listened on ssh-host

Local - the tunnel is created and accessed on the <ssh-host>; <destination-host>:<destination-port> must be accessible from the <ssh-peer> (see the explanation below):
[root@ssh-host  ~] ssh -L <port-to-listen-on-ssh-host>:<destination-host>:<destination-port> <ssh-peer>

<ssh-host> connects to <ssh-peer> with the SSH protocol to form an ssh-tunnel. When something on <ssh-host> connects to <localhost>:<port-to-listen-on-ssh-host>, the traffic is sent over the tunnel and <ssh-peer> opens a connection to <destination-host>:<destination-port> on its behalf; <ssh-host> is the side listening on <port-to-listen-on-ssh-host>.
Or in other words:
<destination-host>:<destination-port> is accessed as <localhost>:<port-to-listen-on-ssh-host> from <ssh-host>

PS: if <destination-host>:<destination-port> is, for example, localhost:80, then the ssh-peer will connect to itself on port 80
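
A concrete example (the addresses 203.0.113.10 and 192.168.5.20 are made up for illustration): to reach a web server 192.168.5.20:80 that is only reachable from the ssh-peer 203.0.113.10:
[root@ssh-host  ~] ssh -L 8080:192.168.5.20:80 root@203.0.113.10
[root@ssh-host  ~] curl http://localhost:8080/ # the request goes through the tunnel; 203.0.113.10 connects to 192.168.5.20:80 on our behalf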

Remote

Generally: 
  1. created on ssh-host
  2. accessed from ssh-peer
  3. port is listened on ssh-peer

Remote - the tunnel is created on the <ssh-host> and accessed on the <ssh-peer>; <destination-host>:<destination-port> must be accessible from the <ssh-host> (see the explanation below):
[root@ssh-host  ~] ssh -R <port-to-listen-on-ssh-peer>:<destination-host>:<destination-port> <ssh-peer>

<ssh-host> connects to <ssh-peer> with the SSH protocol to form an ssh-tunnel. When something on <ssh-peer> connects to <localhost>:<port-to-listen-on-ssh-peer>, the traffic is sent back over the tunnel and <ssh-host> opens a connection to <destination-host>:<destination-port> on its behalf; <ssh-peer> is the side listening on <port-to-listen-on-ssh-peer>.
Or in other words:
<destination-host>:<destination-port> is accessed as <localhost>:<port-to-listen-on-ssh-peer> from <ssh-peer>

PS: if <destination-host>:<destination-port> is, for example, localhost:80, then the ssh-host will connect to itself on port 80
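
A concrete example (again, made-up addresses): to expose SSH of the ssh-host to users logged in on the ssh-peer 203.0.113.10:
[root@ssh-host  ~] ssh -R 2222:localhost:22 root@203.0.113.10
[root@ssh-peer  ~] ssh root@localhost -p 2222 # executed on the ssh-peer, lands on the ssh-host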

Check


To view existing SSH tunnels (IPv4 (option -i4), IPv6 (option -i6) or both IPv4 and IPv6 (option -i); don't resolve IPs - option -n; show numerical ports - option -P):
[admin@localhost ~]$ lsof -i4 -n -P | grep ssh

As background process


If you want to create tunnels in the background and don't want to run any commands over the SSH session, use the options "-f" (forces going to the background and sends stdin to /dev/null, but still asks for passwords) and "-N" (do not execute a remote command):
[root@ssh-host  ~] ssh -f -N -L  <port-to-listen-on-ssh-host>:<destination-host>:<destination-port> <ssh-peer>
[root@ssh-host  ~] ssh -f -N -R <port-to-listen-on-ssh-peer>:<destination-host>:<destination-port> <ssh-peer>

Address binding

If you want, you can use address binding to more easily identify SSH tunnels:
  1. Tunnel to 10.10.10.100:
    1. ssh -L 127.0.0.100:2100:localhost:22 root@10.10.10.100
  2. Tunnel to 10.11.11.200:
    1. ssh -L 127.0.0.200:2200:localhost:22 root@10.11.11.200
  3. Now you have two "links" to access these SSH tunnels:
    1. ssh root@127.0.0.100 -p 2100 for 10.10.10.100
    2. ssh root@127.0.0.200 -p 2200 for 10.11.11.200

Accessing one host over another

If you have 2 servers you want to interconnect (the servers can't access each other directly), have a third server (which can access both the 1st and the 2nd server) and want all traffic along the route to be encrypted:
  1. Setup
    1. 1st server IP 10.10.10.1
    2. 2nd server IP 11.11.11.2
    3. 3rd server IP 12.12.12.3
    4. you want to access 10.10.10.1 from 11.11.11.2
  2. Configure
    1. [admin@12.12.12.3 ~] ssh -L 2001:localhost:22 root@10.10.10.1
    2. [admin@12.12.12.3 ~] ssh -R 2003:localhost:2001 root@11.11.11.2
  3. Access 10.10.10.1 from 11.11.11.2
    1. Access with SSH
      1. ssh root@localhost -p 2003
    2. rsync
      1. rsync -av -e "ssh -p2003" /some-dir/some-file root@localhost:/sync-dest-dir/
    3. rsync which also deletes synchronized files on the source server (directories are not deleted):
      1.  rsync -av -e "ssh -p2003"  --remove-source-files /some-dir/some-file root@localhost:/sync-dest-dir/

Cisco ASA IPSec VPN (IKEv1 / IKEv2) with pre-shared key

Setup - we'll interconnect two branches:
  1. Peers use VLAN 56:
    1. one branch has an interface with IP address 10.10.10.1/24 (we'll call this side "their-side")
    2. the other branch has an interface with IP address 10.10.10.2/24 (we'll call this side "our-side")
  2. Encryption domains (network which we want to interconnect via VPN):
    1. their-side has LAN net  192.168.1.0/24
    2. our-side has LAN net 192.168.2.0/24
  3. IKEv1 or IKEv2 can be used
  4. Also assume that both branches use a dedicated interface for the VPN connection and this is not the interface facing the Internet (this is done for simplicity; you can use the same setup on already functioning interfaces)

Aggressive or main mode


Normally main mode is used, so check that aggressive mode is disabled globally:
sh run | grep crypto ikev1 am-disable

Phase1


Check whether the needed IKE Phase1 policy already exists (choose the IKE version you need, 1 or 2):
  1. IKEv1 Phase1 policy:
    1. sh run crypto ikev1 | grep crypto ikev1 policy|pre-share|aes-256| sha|group 5|86400
  2. IKEv2 Phase1 policy (for IKEv2 integrity=hash; prf (Pseudo-Random Function) must be equal to integrity):
    1. sh run crypto ikev2 | grep crypto ikev2 policy|aes-256| sha|group 5|sha|86400

If the needed policy is not found, create a Phase1 policy (choose the IKE version you need, 1 or 2):
  1. for IKEv1:
    1. crypto ikev1 policy 160
      1.  authentication pre-share
      2.  encryption aes-256
      3.  hash sha
      4.  group 5
      5.  lifetime 86400
  2. for IKEv2:
    1. crypto ikev2 policy 40
      1.  encryption aes-256
      2.  integrity sha
      3.  group 5 2
      4.  prf sha
      5.  lifetime seconds 86400

Phase2


Check whether the needed IKE Phase2 policy already exists (choose the IKE version you need, 1 or 2):
  1. IKEv1 Phase2 policy:
    1. sh run crypto ipsec | grep ikev1.+esp-aes-256.+sha
  2. IKEv2 Phase2 policy:
    1. sh run crypto ipsec | grep ikev2|aes-256|sha-1
If the needed policy is not found, create a Phase2 policy (choose the IKE version you need, 1 or 2):
  1. for IKEv1:
    1. crypto ipsec ikev1 transform-set ESP-AES-256-SHA esp-aes-256 esp-sha-hmac
  2. for IKEv2:
    1. crypto ipsec ikev2 ipsec-proposal AES256-SHA1
      1.  protocol esp encryption aes-256
      2.  protocol esp integrity sha-1

Interface & route


Set up the interface which will be used for IPSec VPN initiation (this interface is one peer of the VPN tunnel, the other side is the other peer); I assume a VLAN is used:
interface GigabitEthernet1/10.56
vlan 56
nameif TEST 
security-level 1 
ip address 10.10.10.2 255.255.255.0 # the other peer is 10.10.10.1/24

Enable reverse-path verification:
ip verify reverse-path interface TEST

Set fragment-chain length:
fragment chain 1 TEST

If you don't use proxy-ARP, disable it:
sysopt noproxyarp TEST

If you use an ASA cluster and don't want this interface link to be monitored:
no monitor-interface TEST

Create route to the other side (other side encryption domain):
route TEST 192.168.1.0 255.255.255.0 10.10.10.1 1

Group-Policy


VPN Group-Policy (peer IP address is used in naming):
group-policy GP_10.10.10.1 internal 
group-policy GP_10.10.10.1 attributes     
vpn-tunnel-protocol ikev1 
OR 
vpn-tunnel-protocol ikev2
OR
vpn-tunnel-protocol ikev1 ikev2

Tunnel-Group


VPN Tunnel-Group  (peer IP address is used in naming):
tunnel-group 10.10.10.1 type ipsec-l2l 
tunnel-group 10.10.10.1 general-attributes   
default-group-policy GP_10.10.10.1 
tunnel-group 10.10.10.1 ipsec-attributes

Then:
  1. for IKEv1:
    1. ikev1 pre-shared-key PSK-KEY-GOES-HERE
  2. for IKEv2:
    1. ikev2 local-authentication pre-shared-key PSK-KEY-GOES-HERE  
    2. ikev2 remote-authentication pre-shared-key PSK-KEY-GOES-HERE  
If keepalive is needed (normally this doesn't create a problem even if the peer doesn't use this option):
isakmp keepalive threshold 10 retry 2 

Objects


Object for the local encryption domain (our LAN network - the network which will be seen from the other side of the VPN):
object network TEST_VPN_our_ED  
subnet 192.168.2.0 255.255.255.0 

Object for the remote encryption domain (their LAN network - the network which will be seen by our side):
object network TEST_VPN_their_ED  
subnet 192.168.1.0 255.255.255.0

If you have another LAN network and want this network to access the VPN too (but don't want, or aren't allowed, to add this network to the VPN setup as another encryption domain), you can achieve this using NAT. For simplicity use the VLAN ID in the NAT address (it will help you to more easily identify NAT-ted traffic in log files):
object network TEST_VPN_our_NET57_NAT 
host 192.168.2.57

VPN ACL & enable protocol on an interface (note here we first write "our IP" and then "their IP")


VPN ACL:
access-list TEST-VPN line 1 extended permit ip object TEST_VPN_our_ED object TEST_VPN_their_ED 

Enable the IKE protocol (v1 or v2) on the interface (only once):
crypto ikev1 enable TEST
OR 
crypto ikev2 enable TEST

Crypto-Map & add map to the interface


TEST_map crypto-map creation:
crypto map TEST_map 1 match address TEST-VPN
crypto map TEST_map 1 set peer 10.10.10.1

Then:
  1. for IKEv1:
    1. crypto map TEST_map 1 set ikev1 transform-set ESP-AES-256-SHA
  2. for IKEv2:
    1. crypto map TEST_map 1 set ikev2 ipsec-proposal AES256-SHA1
crypto map TEST_map 1 set security-association lifetime seconds 28800
crypto map TEST_map 1 set security-association lifetime kilobytes unlimited

If PFS is needed:
crypto map TEST_map 1 set pfs group5

Add map to the interface (only once - when creating TEST_map):
crypto map TEST_map interface TEST

Interface ACL & access-group


Interface ACL:
access-list TEST_access_in extended permit ip object TEST_VPN_their_ED object TEST_VPN_our_ED 
access-list TEST_access_in extended permit icmp host 10.10.10.1 host 10.10.10.2
access-list TEST_access_in extended permit esp any4 interface TEST 
access-list TEST_access_in extended permit udp any4 interface TEST eq isakmp 
access-list TEST_access_in extended permit icmp any4 interface TEST 
access-list TEST_access_in extended deny ip any any 

access-group TEST_access_in in interface TEST

NAT & no-NAT (NAT exemption) examples/templates


Host 192.168.3.2 VPN-traffic NAT exemption (no-NAT):
nat (LAN57,TEST) source static lan57.srv.3.2 TEST_VPN_our_NET57_NAT destination static TEST_VPN_their_ED TEST_VPN_their_ED no-proxy-arp

Allowing host 192.168.3.2 in the interface ACL:
access-list TEST_access_in extended permit ip object TEST_VPN_their_ED object  lan57.srv.3.2

Group-Policy ACL (note here we first write "their IP" and then "our IP")

You can set up the VPN with simple rules like TEST-VPN above and then restrict ports, source IPs, etc.:

Create group-policy ACL (we'll permit access from their net IP 192.168.1.10 to our net IP 192.168.2.10 port 443 and deny access for all others):
access-list TEST-VPN_GP_FILTER extended permit tcp host 192.168.1.10 host 192.168.2.10 eq 443
access-list TEST-VPN_GP_FILTER extended deny ip any any

group-policy GP_10.10.10.1 attributes
 vpn-filter value TEST-VPN_GP_FILTER

Monday, November 19, 2018

Cluster 26. Renaming pcs resource.

The procedure below was found in:
https://bugzilla.redhat.com/show_bug.cgi?id=1126835 and was tested by changing the name of a resource of type "ocf:heartbeat:VirtualDomain":

First:

  1. make resource unmanaged: pcs resource unmanage resource-old-name
  2. If this is VirtualDomain resource:
    1. change old-name to the new-name in XML definition files of your VM.

Backup existing config:
pcs cluster cib /tmp/cib.xml

Globally (not only first occurrence) change old-name to the new-name: 
sed 's/resource-old-name/resource-new-name/g' -i /tmp/cib.xml

Verify changes:
vi /tmp/cib.xml 

Push changed config to the cluster:
pcs cluster cib-push /tmp/cib.xml

Verify name change:
pcs status

Verify name change in the config dump:
pcs config | grep resource-new-name

Make resource managed again:
pcs resource manage resource-new-name

Thursday, November 8, 2018

Cluster 25. Restoring failed node.

In case of hardware failure and the need to restore one of the nodes (e.g. agrp-c01n02):
  1. go through all steps in Cluster 1 - Cluster 11 blog-posts (do only stuff related to the failed node)
  2. Cluster 12 blog-post - go through steps till "Login to any of the cluster node and authenticate hacluster user." part  (do only stuff related to the failed node), then:
    1. passwd hacluster
    2. from an active node:
      1. pcs node maintenance agrp-c01n02
      2. pcs cluster auth agrp-c01n02
    3. from agrp-c01n02:
      1. pcs cluster auth
      2. pcs cluster start
      3. pcs cluster status # node must be in maintenance mode with many errors due to absence of drbd / virsh and other packages
    4. then go through Cluster 12, starting at "Check cluster is functioning properly (on both nodes)" till "Quorum:" part
  3. go through all steps in Cluster 14 blog-post (do only stuff related to the failed node)
  4. Cluster 16 blog-post - go through steps till "Setup common DRBD options" part  (do only stuff related to the failed node), then:
    1. from agrp-c01n01:
      1. rsync -av /etc/drbd.d root@agrp-c01n02:/etc/
    2. from agrp-c01n02:
      1. drbdadm create-md r{0,1}
      2. drbdadm up r0; drbdadm secondary r0
      3. drbd-overview
      4. drbdadm up r1; drbdadm secondary r1
      5. drbd-overview
      6. wait till full synchronisation
      7. reboot failed node
  5. Cluster 17 blog-post - go through steps till "Setup DLM and CLVM" (do only stuff related to the failed node), then:
    1. drbdadm up all
    2. cat /proc/drbd
  6. Cluster 19 blog-post - only do check of the SNMP from the failed node:
    1. snmpwalk -v 2c -c agrp-c01-community 10.10.53.12
    2. fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action list
    3. fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel2 --action list
  7. Cluster 20 blog-post - go through steps till "Provision Planning" (do only stuff related to the failed node), then:
    1. rsync -av /etc/libvirt/qemu/networks/ovs-network.xml  root@agrp-c01n02:/root
    2. systemctl start libvirtd 
    3. virsh net-define /root/ovs-network.xml 
    4. virsh net-list --all 
    5. virsh net-start ovs-network 
    6. virsh net-autostart ovs-network 
    7. virsh net-list 
    8. systemctl stop libvirtd
    9. rm  /root/ovs-network.xml
  8. For each VM add a constraint to ban VM start on the failed node (I assume n02 is the failed node). The command below adds a -INFINITY location constraint for the specified resource and node:
    1. pcs resource ban vm01-rntp agrp-c01n02
    2. pcs resource ban vm02-rftp agrp-c01n02
  9. Unmaintenance failed node from survived one and start cluster on the failed node:
    1. pcs node unmaintenance agrp-c01n02
    2. pcs cluster start
    3. pcs status
    4. wait till r0 & r1 DRBD resources are masters on both nodes and all resources (besides all VMs) are started on both nodes
  10. Cluster 18 blog-post, do only:
    1. yum install gfs2-utils -y
    2. tunegfs2 -l /dev/agrp-c01n01_vg0/shared # to view shared LV
    3. dlm_tool ls # names: clvmd & shared / members 1 2
    4. pvs # should only show drbd and sdb devices
    5. lvscan # List all logical volumes in all volume groups (3 OS LV, shared & 1 LV per VM)
  11. Cluster 21 blog-post:
    1. do "Firewall setup to support KVM Live Migration" (do only stuff related to the failed node)
    2. crm_simulate -sL | grep " vm[0-9]"
    3. SELunux related:
      1. ls -laZ /shared # must show "virt_etc_t" in all lines except related to ".."
      2. if above line is not true, do stuff in "SELinux related issues" (do only stuff related to the failed node)
  12. One by one (for each VM):
    1. remove ban constraint for the first VM:
      1. pcs resource clear vm01-rntp
    2. verify that constraints are removed:
      1. pcs constraint  location
    3. if this VM must be started on the restored node - wait till live migration is performed
  13. Congratulations, your cluster is restored to normal operation

Tuesday, October 9, 2018

CentOS 7 Apache, Nginx, PHP-FPM, PrestaShop

OS related

Install CentOS7
Create an admin user and make it an administrator (group=wheel)
Set a root password

I assume that your server IP address is 192.168.1.1

setenforce 0
getenforce
sed -i 's/enforcing/permissive/' /etc/sysconfig/selinux
sed -i 's/enforcing/permissive/' /etc/selinux/config
reboot
yum -y install wget unzip epel-release mlocate
updatedb
yum clean all
yum update -y
reboot

Install Apache & test default page

yum install httpd
sed -i 's/Listen 80/Listen 8080/' /etc/httpd/conf/httpd.conf
systemctl start httpd.service
systemctl enable httpd.service
systemctl status httpd
httpd -S
ss -tlpn | grep 8080
firewall-cmd --zone=public --permanent --add-port=8080/tcp
firewall-cmd --reload
192.168.1.1:8080 - test Apache welcome page
Disallow Apache to display directories and files within the web root directory /var/www/html:
sudo sed -i "s/Options Indexes FollowSymLinks/Options FollowSymLinks/" /etc/httpd/conf/httpd.conf
systemctl restart httpd

Install MariaDB

Choose a database name, user name and password for your PrestaShop DB

Install MariaDB and set it to automatically start after system reboot:
yum install mariadb mariadb-server -y
systemctl start mariadb.service
systemctl enable mariadb.service
Execute the secure MySQL installation process:
/usr/bin/mysql_secure_installation
Go through the process in accordance with the instructions below:
Enter current password for root (enter for none): Press the Enter key
Set root password? [Y/n]: Input Y, then press the Enter key
New password: Input a new root password, then press the Enter key
Re-enter new password: Input the same password again, then press the Enter key
Remove anonymous users? [Y/n]: Input Y, then press the Enter key
Disallow root login remotely? [Y/n]: Input Y, then press the Enter key
Remove test database and access to it? [Y/n]: Input Y, then press the Enter key
Reload privilege tables now? [Y/n]: Input Y, then press the Enter key
Now, log into the MySQL shell so that you can create a dedicated database for PrestaShop:
mysql -u root -p
CREATE DATABASE pshop-db-name;
GRANT ALL PRIVILEGES ON pshop-db-name.* TO 'pshop-db-username'@'localhost' IDENTIFIED BY 'pshop-db-password' WITH GRANT OPTION;
FLUSH PRIVILEGES;
EXIT;

Install PHP


Install PHP and required extensions using YUM:
yum -y install php php-fpm php-mysql php-gd php-ldap php-odbc php-pear php-xml php-xmlrpc php-mbstring php-snmp php-soap php-mcrypt php-curl php-cli curl zlib
Editing php.ini for optimal performance.
sed -i '/memory_limit/c\memory_limit = 128M' /etc/php.ini
sed -i '/upload_max_filesize/c\upload_max_filesize = 16M' /etc/php.ini 
sed -i '/max_execution_time/c\max_execution_time = 60' /etc/php.ini
vi /var/www/html/info.php add: <?php phpinfo(); ?>
systemctl restart httpd
192.168.1.1:8080/info.php review:
Server API Apache 2.0 Handler
_SERVER["SERVER_SOFTWARE"] Apache/2.4.6 (CentOS) PHP/5.4.16
grep -E "mod_proxy.so|mod_proxy_fcgi.so" /etc/httpd/conf.modules.d/* => if no result:
vi /etc/httpd/conf/httpd.conf => find LoadModule and add:
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_fcgi_module modules/mod_proxy_fcgi.so

Adding PHP-FPM (FastCGI Process Manager) support to Apache (all php scripts will be processed by PHP-FPM):
vi /etc/httpd/conf.d/php.conf find <FilesMatch \.php$> change:
#SetHandler application/x-httpd-php
 SetHandler "proxy:fcgi://127.0.0.1:9000" #PHP-FPM uses port 9000
systemctl start php-fpm.service
systemctl enable php-fpm.service
systemctl status php-fpm.service -l
systemctl restart httpd
192.168.1.1:8080/info.php review:
Server API FPM/FastCGI
rm /var/www/html/info.php
systemctl restart httpd

Creating PrestaShop Virtual Host for Apache

Disable (comment all lines) Apache's default welcome page:
sed -i 's/^/#&/g' /etc/httpd/conf.d/welcome.conf
Change "pshop-domain-name" to the name you bought:
mkdir -v /var/www/pshop-domain-name 
Create index.html test page:
echo "<h1 style='color: green;'>Presta Shop</h1>" | sudo tee /var/www/pshop-domain-name/index.html
Then create a phpinfo() file for each site so we can test PHP is configured properly:
echo "<?php phpinfo(); ?>" | sudo tee /var/www/pshop-domain-name/info.php
Make directory for available sites (sites configs will be here):
mkdir /etc/httpd/sites-available
This directory will contain links to the active sites (links to the files in sites-available):
mkdir /etc/httpd/sites-enabled
vi /etc/httpd/conf/httpd.conf
Add this line to the end of the file (this will allow us to quickly enable 
and disable sites by adding and removing links to their config files):
IncludeOptional sites-enabled/*.conf
vi /etc/httpd/sites-available/pshop-domain-name.conf
<VirtualHost *:8080>
    ServerName pshop-domain-name
    ServerAlias www.pshop-domain-name
    DocumentRoot /var/www/pshop-domain-name
    <Directory /var/www/pshop-domain-name>
        AllowOverride All
    </Directory>
</VirtualHost>
AllowOverride All enables .htaccess support
Make site available:
ln -s /etc/httpd/sites-available/pshop-domain-name.conf /etc/httpd/sites-enabled/pshop-domain-name.conf
Execute the command below to check that the httpd config files are OK (for now the AH00558 warning is OK):
apachectl -t
systemctl restart httpd
Check that green "Presta Shop" string is displayed (if you don't use Public DNS
add server IP with the corresponding site name to the /etc/hosts file): 
http://pshop-domain-name:8080/ 
Check that PHP uses FPM/FastCGI:
http://pshop-domain-name:8080/info.php

Installing and Configuring Nginx

yum install nginx
systemctl start nginx
systemctl status nginx -l
firewall-cmd --permanent --zone=public --add-service=http
firewall-cmd --reload
Test the nginx default page:
http://192.168.1.1/
systemctl enable nginx
vi /etc/nginx/nginx.conf and comment all lines between "server {" and closing "}"
systemctl restart nginx -l
Check that default site is unavailable:
http://192.168.1.1/

mkdir /etc/nginx/sites-available
mkdir /etc/nginx/sites-enabled
vi /etc/nginx/nginx.conf 
find "http {" block
Add these lines to the end of the http {} block, then save the file:
include /etc/nginx/sites-enabled/*.conf;
server_names_hash_bucket_size 64;

Create nginx test site:
mkdir -v /usr/share/nginx/sample.org
As we did with Apache's virtual hosts, we'll again create 
index and phpinfo() files for testing after setup is complete:
echo "<h1 style='color: red;'>Sample.org</h1>" | sudo tee /usr/share/nginx/sample.org/inde
echo "<?php phpinfo(); ?>" | sudo tee /usr/share/nginx/sample.org/info.php
Now create a virtual host file for the domain sample.org
Nginx calls server {. . .} areas of a configuration file server blocks. 
Create a server block for the primary virtual host, sample.org. 
The default_server configuration directive makes this the default
virtual host which processes HTTP requests that do not match any other virtual host:
vi /etc/nginx/sites-available/sample.org.conf
server {
    listen 80 default_server;

    root /usr/share/nginx/sample.org;
    index index.php index.html index.htm;

    server_name www.sample.org;
    location / {
        try_files $uri $uri/ /index.php;
    }

    location ~ \.php$ {
        # if the file is not there show an error : mynonexistingpage.php -> 404
        try_files $uri =404;
        
        # pass to the php-fpm server
        fastcgi_pass 127.0.0.1:9000;
        # also for fastcgi try index.php
        fastcgi_index index.php;
        # some tweaking
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_param SCRIPT_NAME $fastcgi_script_name;
        fastcgi_buffer_size 128k;
        fastcgi_buffers 256 16k;
        fastcgi_busy_buffers_size 256k;
        fastcgi_temp_file_write_size 256k;
        include fastcgi_params;
    }
}

Enable sample.org site:
ln -s /etc/nginx/sites-available/sample.org.conf /etc/nginx/sites-enabled/sample.org.conf
Check nginx config files for syntax:
nginx -t
systemctl reload nginx -l
Check that sample.org is working:
sample.org

Check nginx is working:
sample.org/info.php
_SERVER["SERVER_SOFTWARE"] nginx/1.12.2
_SERVER["DOCUMENT_ROOT"] /usr/share/nginx/sample.org
Disable sample.org :
rm /etc/nginx/sites-enabled/sample.org.conf
nginx -t
systemctl reload nginx -l

Configuring Nginx for Apache's Virtual Hosts (Proxy to Apache then to FPM)

Let's create an additional Nginx virtual host with multiple domain names
in the server_name directives. Requests for these domain names will be proxied to Apache:
vi /etc/nginx/sites-available/apache.conf
Add the code block below. The try_files directive makes Nginx look for files in the document root and directly serve them. If the file has a .php extension, the request is passed to Apache. Even if the file is not found in the document root, the request is passed on to Apache so that application features like permalinks work without problems:
server {
        listen   80; 

        root /var/www/pshop-domain-name; 
        index index.php index.html index.htm;

        server_name pshop-domain-name www.pshop-domain-name; 

        location / {
        try_files $uri $uri/ /index.php;
        }

        location ~ \.php$ {
        proxy_set_header X-Real-IP  $remote_addr;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_set_header Host $host;
        proxy_pass http://127.0.0.1:8080;
        }

         location ~ /\.ht {
                deny all;
        }
}
ln -s /etc/nginx/sites-available/apache.conf /etc/nginx/sites-enabled/apache.org.conf
nginx -t
systemctl -l reload nginx

Make Apache pshop-domain-name accessible only from localhost:
vi /etc/httpd/sites-enabled/pshop-domain-name.conf
<VirtualHost 127.0.0.1:8080>
    ServerName pshop-domain-name
    ServerAlias www.pshop-domain-name
    DocumentRoot /var/www/pshop-domain-name
    <Directory /var/www/pshop-domain-name>
        AllowOverride All
    </Directory>
</VirtualHost>
systemctl restart httpd

Configuring Nginx for the PrestaShop Virtual Host (Proxy to FPM, no Apache)

systemctl stop httpd
systemctl disable httpd
firewall-cmd --zone=public --remove-port=8080/tcp
firewall-cmd --reload

Warning: The location ~ /\. directive is very important; this prevents Nginx from printing the contents of files like .htaccess and .htpasswd which contain sensitive information.
vi /etc/nginx/sites-available/pshop-domain-name.conf
server {
    listen 80 default_server;

    root /var/www/pshop-domain-name;
    index index.php index.html index.htm;

    server_name www.pshop-domain-name pshop-domain-name;
    location / {
        try_files $uri $uri/ /index.php;
    }

    location ~ \.php$ {
        # if the file is not there show an error : mynonexistingpage.php -> 404
        try_files $uri =404;

        # pass to the php-fpm server
        fastcgi_pass 127.0.0.1:9000;
        # also for fastcgi try index.php
        fastcgi_index index.php;
        # some tweaking
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_param SCRIPT_NAME $fastcgi_script_name;
        fastcgi_buffer_size 128k;
        fastcgi_buffers 256 16k;
        fastcgi_busy_buffers_size 256k;
        fastcgi_temp_file_write_size 256k;
        include fastcgi_params;
    }

    location ~ /\. {
        deny all;
    }
}
rm /var/www/pshop-domain-name/info.php
systemctl restart nginx

Installing PHP7 after installing old PHP

yum install http://rpms.remirepo.net/enterprise/remi-release-7.rpm -y
yum install yum-utils -y
yum-config-manager --enable remi-php72
yum update php php-zip
yum update
reboot

Installing PrestaShop

Download the latest stable version of PrestaShop from prestashop.com:
mkdir prestashop
Extract all to the prestashop directory:
unzip prestashop_1.7.4.3.zip -d prestashop
mv prestashop/* /var/www/pshop-domain-name/

Check user name used in pshop-domain-name (it must be apache):
echo "<?php echo exec('whoami'); ?>" | sudo tee /var/www/pshop-domain-name/whoami.php
192.168.1.1/whoami.php
rm /var/www/pshop-domain-name/whoami.php
chown -R apache: /var/www/pshop-domain-name/prestashop/
systemctl restart nginx

I don't know why, but I couldn't install PrestaShop using Google Chrome, so use Mozilla Firefox (I didn't try other browsers) to install PrestaShop:
192.168.1.1/prestashop

If you have any trouble accessing 192.168.1.1/prestashop, you can use the file which comes with PrestaShop (docs/server_config/nginx.conf.dist) - change the content of your /etc/nginx/sites-available/pshop-domain-name.conf to the content of /var/www/pshop-domain-name/docs/server_config/nginx.conf.dist, making the changes appropriate to your shop:

  • server_name 
  • root
  • fastcgi_pass 127.0.0.1:9000;
  • #fastcgi_pass unix:/run/php/php7.0-fpm.sock;
systemctl restart nginx

After the progress bar goes from 0% to 100%, you'll see 192.168.1.1/install
Now if you want, you can switch to the Chrome:
192.168.1.1/install
chown -R apache: /var/www/pshop-domain-name/


systemctl restart nginx
After several steps you'll be prompted to install php-intl (PHP internationalization)
and a PHP accelerator:
yum install php-intl
Check that php-intl is enabled:
php --ri intl
systemctl restart php-fpm

To view all components needed by PrestaShop or suggested to be installed on the server:
wget https://github.com/PierreRambaud/phppsinfo/archive/master.zip
unzip master.zip
cp phppsinfo-master/phppsinfo.php /var/www/pshop-domain-name/
chown -R apache: /var/www/pshop-domain-name/
systemctl restart nginx -l
login/pass are the same - prestashop
http://192.168.1.1/phppsinfo.php
Change everything you find to the recommended values (vi /etc/php.ini) and install additional PHP extensions (yum install php-NeededExtensionName)

systemctl restart php-fpm
systemctl restart nginx -l

check all parameters again:
http://192.168.1.1/phppsinfo.php
If everything is ok:
rm /var/www/pshop-domain-name/phppsinfo.php
systemctl restart php-fpm
systemctl restart nginx -l

http://192.168.1.1/install/index.php :
Installation is very straightforward; the only note is to use the previously created database credentials when specifying the database-related settings.

rm -rf /var/www/pshop-domain-name/install/
rm -f /var/www/pshop-domain-name/Install_PrestaShop.html
rm -f /var/www/pshop-domain-name/INSTALL.txt
ll /var/www/pshop-domain-name/ | grep admin
Enter admin panel with the found name:
192.168.1.1/admin985ftb6s2

Monday, October 8, 2018

Cisco ASA logging to CentOS 7 rsyslog & logrotate

First of all install CentOS 7 and yum update it.
systemctl status rsyslog.service

If rsyslog is not installed:
yum install rsyslog

Edit rsyslog config (we'll use UDP for messages logging):
vi /etc/rsyslog.conf
search for imudp and uncomment:
$ModLoad imudp
$UDPServerRun 514

systemctl restart rsyslog
systemctl status rsyslog.service

For SELinux semanage packet:
yum install policycoreutils-python

To view which port are allowed by SELinux:
semanage port -l | grep syslog

See if rsyslog is listening to any ports:
ss -nlp | grep rsyslog

firewall-cmd --list-all # find zone name (mine is public)
Allow traffic for rsyslog in that zone:
firewall-cmd --permanent --zone=public --add-port=514/udp
systemctl restart firewalld.service
firewall-cmd --list-all

Creating files for ASA log:
cd /var/log
touch asa.log
vi /etc/rsyslog.conf

Log severity levels 
There are eight in total as per Cisco’s definitions below: 
  • 0 = Emergencies => Extremely critical “system unusable” messages 
  • 1 = Alerts => Messages that require immediate administrator action 
  • 2 = Critical => A critical condition 
  • 3 = Errors => An error message (also the level of many access list deny messages) 
  • 4 = Warnings => A warning message (also the level of many other access list deny messages) 
  • 5 = Notifications => A normal but significant condition (such as an interface coming online) 
  • 6 = Informational => An informational message (such as a session being created or torn down) 
  • 7 = Debugging => A debug message or detailed accounting message
Facility - a term used to identify a device's syslog messages. To find the ASA facility:
sh log set | grep Fac|fac

The default ASA facility is 20, which corresponds to the rsyslog local4 facility (facility 21 = syslog local5, facility 22 = syslog local6, etc.).

Add a new rule that fits your needs (the lines below must be inserted right after #### RULES #### in rsyslog.conf, otherwise all messages will also be duplicated into messages and boot.log):
# Logs sent from the ASA with IP 10.10.10.10 are saved to the /var/log/asa.log file; here we have 2 options:
# 1 Use the facility to identify messages (each device has a predefined log facility):
local4.info /var/log/asa.log
# 2 Use an IP address to identify messages:
if $fromhost-ip == '10.10.10.10' then
        {
         /var/log/asa.log
         stop
        }
# you can use either of the two, but only one of them, otherwise all messages will be written twice to the same file

In order for the changes to take effect we need to restart the syslog service. 
systemctl restart rsyslog

Configure clock on an ASA (NTP or manual):
clock timezone AZS 4
clock set 12:33:00 10 Sep 2018
show clock

ASA logging destinations (ASA CLI parameters to logging command): 
  • console – logs are viewed in realtime while connecting via the serial console 
  • asdm – logs can be viewed in the ASDM GUI. 
  • monitor – logs to a Telnet or SSH session.
  • buffered – this is the internal memory buffer 
  • host – a remote syslog server IP and interface
  • trap – severity for remote syslog
  • mail – send generated logs via SMTP 
  • flow-export-syslogs – send event messages via NetFlow v9

Configure ASA logging to remote rsyslog server (also configuring buffer):
  1. enabling logging:
    1. logging enable 
  2. enable timestamping of log messages:
    1. logging timestamp 
  3. configure the buffer (when the buffer fills up, the oldest messages are overwritten):
    1. logging buffer-size 128000 
  4. severity level for buffered logging:
    1. logging buffered warnings 
  5. using informational severity:
    1. logging trap informational 
  6. IP of the rsyslog server:
    1. logging host inside 10.10.10.20 
  7. Verify logging settings:
    1. show logging setting
  8. Set up message logging queue (default is 512 messages, max queue size on ASA-5505 is 1024, on ASA-5510 is 2048 and 8192 on all other platforms):
    1. logging queue 1024
    2. show logging queue
Configure logrotate for asa.log:
cat /etc/logrotate.d/rotate_asa_log.conf
 # name of the log-file :
/var/log/asa.log {
    # rotate log daily :
    daily
    # keep 400 old log-files :
    rotate 400
    # compress old log-file after postscript execution :
    compress
    # rotate if log-file size equals or larger than 2GB :
    size 2G
    # add %Y%m%d to the end of the old log-file :
    dateext
    # use -%d%m%Y instead of the default %Y%m%d :
    dateformat -%d%m%Y
    # create empty asa.log file  :
    create 0644 root root
    # don't issue an error if the log-file is missing :
    missingok
    # don't rotate if log-file is empty :
    notifempty
    # use one postrotate script for all log-files (if more than one) :
    sharedscripts
    # start of the postrotate script :
    postrotate
        # HUP signal to rsyslogd PID (read from syslogd.pid file)
        # (actually bug > must be rsyslogd.pid instead of syslogd.pid)
        # makes rsyslog close all open files and restart;
        # whether a HUP signal triggers a restart or just a config reload
        # depends on the daemon's own behaviour
        /bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
    # end of the postrotate script :
    endscript
}

Test logrotate script without actually rotating anything (-d is debug option and it implies -v verbose option):
logrotate -d /etc/logrotate.d/rotate_asa_log.conf

After testing you can force logrotate to rotate logs:
logrotate -f /etc/logrotate.d/rotate_asa_log.conf

To see last rotation of the log-file:
cat /var/lib/logrotate/logrotate.status | grep asa
"/var/log/asa.log" 2018-10-4-19:9:35

So the next rotation will be done at the time in logrotate.status + the specified rotation interval (in our case it's "daily").

Wednesday, August 29, 2018

Python 2. Iterator, Generator.

When you create a list, you can read list elements one-by-one - this is called iteration.
>>> test_list = [1,2,3]
>>> for element in test_list:
...   print element
... 
1
2
3

test_list is an iterable object. In other words, any object which can be used with "for ... in ..." is iterable. Iterable objects such as lists are fine until they become too big, because the whole object is kept in memory.

Generator objects are also iterable, but they can be read only once: they don't store their values, they generate them on the fly. So you can use a generator only one time, because its values are not saved in memory.
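
A small illustration (interpreter session; the function and variable names are arbitrary):
>>> def squares(n):
...     for i in range(n):
...         yield i * i
... 
>>> gen = squares(3)
>>> for element in gen:
...   print element
... 
0
1
4
>>> # a generator is exhausted after one pass - a second pass yields nothing
>>> list(gen)
[]
>>> # generator expressions are a shorthand for simple generators
>>> sum(x * x for x in range(3))
5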

Tuesday, August 28, 2018

Scikit-learn 6. Hands-on python scikit-learn: Cross-Validation.

train_test_split helps us to measure the quality of model predictions, but there is a better approach - cross_val_score, which gives a more reliable measure. How it works:
  1. the data is automatically split into parts (so you don't need to have separate train and test datasets)
  2. on each iteration (the iteration count equals the number of parts) all parts besides the current part are used for model training and the current part is used as the test set (for example, on the 3rd iteration the 3rd part will be used as the test set and all other parts will be used as train sets)

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown

>>> import pandas as pd
>>> test_file_path = "~/rest.scv"
>>> test_data = pd.read_csv(test_file_path)
>>> # drop row with NaN values for Price column
>>> test_data.dropna(axis=0, subset=['Price'], inplace=True)
>>> test_data
>>> y = test_data.Price
>>> X = test_data.select_dtypes(exclude='object').drop('Price',axis=1)
>>> from sklearn.preprocessing import Imputer
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.pipeline import make_pipeline
>>> test_pipeline = make_pipeline(Imputer(), RandomForestRegressor())
>>> # cross_val_score uses negative metrics (sklearn uses convention that the higher the metrics value the better)
>>> scores = cross_val_score(test_pipeline,X,y,scoring='neg_mean_absolute_error')
>>> scores
array([-116.66666667, -205.        ,  -75.        ])
>>> # to get positive values
>>> print("Mean Absolute Error: {}".format(-1 * scores.mean()))
Mean Absolute Error: 132.22222222222223

Scikit-learn 5. Hands-on python scikit-learn: using pipelines.

Pipeline is a way to shorten the code and make it simpler.

>>> from sklearn.model_selection import train_test_split
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.preprocessing import Imputer
>>> from sklearn.pipeline import make_pipeline
>>>
>>> test_pipeline = make_pipeline(Imputer(), RandomForestRegressor())
>>> # as you see imputation is done automatically
>>> test_pipeline.fit(train_X,train_y) 
>>> predictions = test_pipeline.predict(test_X)

Machine Learning 2. Partial Dependence Plots (PDP).

Sometimes it seems that ML models are something like a black box - you can't see how the model is working or how you could view and improve its logic. Partial dependence plots (PDP) are used for this. A PDP shows how each variable or predictor (feature) affects the model's predictions; they can be interpreted similarly to how coefficients are interpreted in simpler (e.g. linear regression) models.

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown

We'll use PDPs to understand the relationship between Price and the other variables. PDPs help to find insights in the data and also to check whether something you think is important really matters for model building and prediction. A PDP is calculated only after the model has been trained (fit).

>>> test_file_path = "~/test.csv"
>>> import pandas as pd
>>> test_data = pd.read_csv(test_file_path)
>>> test_data.dropna(axis=0,subset=['Price'],inplace=True)
>>> y = test_data.Price
>>> X = test_data.drop(['Price'],axis=1)
>>> X = X.select_dtypes(exclude=['object'])
>>> from sklearn.preprocessing import Imputer
>>> test_imputer = Imputer()
>>> X = test_imputer.fit_transform(X)
>>> # for now sklearn supports PDP only for GradientBoostingRegressor
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> test_model = GradientBoostingRegressor()
>>> test_model.fit(X,y)
>>> from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence
>>> test_plots = plot_partial_dependence(gbrt=test_model,X=X,features=[0,1,2],feature_names=['Rooms', 'Floors', 'Area'],grid_resolution=10)
Options described:

  • gbrt - which GBR model to use
  • X - which dataset used to train model specified in gbrt option
  • features - index of columns of the dataset specified in X option which will be used in plotting (each index/column will create 1 PDP)
  • feature_names - how to name columns selected in features option
  • grid_resolution - number of values to plot on x axis
Negative values mean that the predicted Price would be lower than the average Price for that value of the variable. 

Monday, August 27, 2018

XGBoost 1.

XGBoost is an implementation of Gradient Boosted Decision Trees. 
Gradient Boosting is an ML technique used for regression and classification problems which produces a prediction model in the form of an ensemble of weak prediction models - decision trees:

  • A weak model means that the model's predictions are only slightly better than guessing
  • After building each weak model we:
    • calculate errors
    • build model predicting errors
    • add last model to ensemble
  • To make a prediction - add up the predictions from all models in the ensemble
XGBoost models are the leaders when working with tabular data (data without images and videos, or in other words - data that can be saved in a Pandas DataFrame).


Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown

To install XGBoost:
pip install xgboost

Using XGBoost Regressor


>>> import pandas as pd
>>> test_file_path = "~/rest.scv"
>>> test_data = pd.read_csv(test_file_path)
>>> # drop row with NaN values for Price column
>>> test_data.dropna(axis=0, subset=['Price'], inplace=True)
>>> test_data
>>> y = test_data.Price
>>> X = test_data.drop(['Price'], axis=1).select_dtypes(exclude=['object'])
>>> from sklearn.model_selection import train_test_split
>>> # split tests and get result as array, not DataFrame
>>> X_train, X_test, y_train, y_test = train_test_split(X.values,y.values,random_state=0,test_size=0.25)
>>> from sklearn.preprocessing import Imputer
>>> test_imputer = Imputer()
>>> X_train = test_imputer.fit_transform(X_train)
>>> X_test = test_imputer.transform(X_test)
>>> from xgboost import XGBRegressor
>>> test_model = XGBRegressor()
>>> test_model.fit(X_train,y_train)
>>> predictions = test_model.predict(X_test)
>>> from sklearn.metrics import mean_absolute_error as mae
>>> print("MAE XGBR: " + str(mae(predictions,y_test)))
MAE XGBR: 10.416168212890625
>>> 

XGBoost Regressor parameters


>>> test_model
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

  • n_estimators - how many times to go through the XGBoost modelling cycle
    • too low a value causes underfitting, too high - overfitting
    • typical values are 100-1000, depending on the learning_rate
    • to find the optimal value, use the early_stopping_rounds option. It stops the iterations when the model stops improving. Occasionally iterations can stop after a single bad round, so to avoid such situations set "early_stopping_rounds=5"; this will stop iterating only after 5 consecutive deteriorations of the score
    • It's good to set a high n_estimators and also set early_stopping_rounds; this will help to find the optimal value (eval_set is a list of (X,y) tuple pairs used as the validation set for early stopping; see the combined sketch after this list):
      • model = XGBRegressor(n_estimators=1000)
      • model.fit(train_X, train_y, early_stopping_rounds=5, eval_set=[(test_X,test_y)])
      • after training the model, use that number of n_estimators to re-train your model (on the entire data):
        • for example, the found value = 97
        • model = XGBRegressor(n_estimators=97)
        • model.fit(X, y)
  • learning_rate - on each iteration we multiply the predictions from each component model by a small number before adding them to the ensemble. This means that each DT added to the ensemble helps us less, which (in practice) reduces the model's tendency to overfit. So you can use a higher value for n_estimators without overfitting:
    • model = XGBRegressor(n_estimators=1000,learning_rate=0.05)
    • model.fit(train_X, train_y, early_stopping_rounds=5, eval_set=[(test_X,test_y)])
  • n_jobs - on big datasets, set this to the number of CPU cores on your machine to use multi-threading and thus fit the model quicker. On small datasets this will not help
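
Putting these options together (a sketch only; it reuses X_train/y_train/X_test/y_test from the regressor example above, and 97 stands in for whatever round early stopping actually reports):
>>> from xgboost import XGBRegressor
>>> model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
>>> model.fit(X_train, y_train, early_stopping_rounds=5, eval_set=[(X_test, y_test)], verbose=False)
>>> # suppose early stopping reports 97 as the best round - re-train on the entire data with that value
>>> final_model = XGBRegressor(n_estimators=97, learning_rate=0.05, n_jobs=4)
>>> final_model.fit(X, y)  # XGBoost handles the NaN values left in X natively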

Tuesday, August 21, 2018

Scikit-learn 4. Hands-on python scikit-learn: using categorical data (encoding, one-hot encoding, hashing).

Categorical data is data that takes only a predefined number of values. Most ML models will give you an error if you try to use categorical data in your model without any changes. So to use categorical data, we first need to encode those values with corresponding numeric values. For example, if we have names of colors in our data then we can do:
  1. Encoding - give each color its own number: red will be 1, yellow will be 2, green will be 3, etc. This is simple, but the problem is that 3 (green) is bigger than 1 (red), although that doesn't mean green should be given more weight than red while training or predicting (see the short example right after this list).
  2. One-hot encoding - we have 3 colors (red, yellow, green) in our data set's "Color" column, so we create 3 additional columns (Color_red, Color_yellow, Color_green) to save the value of each color for that row, and then the original column with categorical data is removed. So a row with red color will have 1 in the first column, 0 in the second and 0 in the third; yellow > 010, green > 001. This approach avoids treating one categorical value as having more weight than another.
  3. Hashing (or the hashing trick) - one-hot encoding is good, but when you have a huge number of different values in your data set, or the training data doesn't contain all possible values of a categorical feature, or the data is changing and the categorical feature receives new values, one-hot encoding creates too many additional columns and makes your predictions slow or even impossible (when new values appear outside the training data). In such situations hashing is used (not reviewed here).
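
A quick self-contained illustration of approach 1 (plain encoding) with pandas - each color simply gets its own integer code:
>>> import pandas as pd
>>> colors = pd.Series(['red', 'green', 'blue', 'green'])
>>> colors.astype('category').cat.codes
0    2
1    1
2    0
3    1
dtype: int8
Approach 2 (one-hot encoding) is what the rest of this post demonstrates with pd.get_dummies.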

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown

Using one-hot encoding

>>> import pandas as pd
>>> test_file_path = "~/test.csv"
>>> test_data = pd.read_csv(test_file_path)
>>> test_data
>>> test_data.describe() # HouseColor is not present
>>> test_data.info() # because HouseColor type is object - non-numerical (categorical data)
>>> test_data.dtypes
>>> # create new data-set without NaN values (we'll use imputation)
>>> from sklearn.preprocessing import Imputer
>>> test_imputer = Imputer()
>>> # before imputation - fill dataset only with numerical data
>>> test_data_numerical = test_data.select_dtypes(exclude=['object'])
>>> test_data_imputed = test_imputer.fit_transform(test_data_numerical)
>>> test_data_imputed
>>> # convert imputed dataset into Pandas DataFrame
>>> test_data_imputed = pd.DataFrame(test_data_imputed)
>>> test_data_imputed
>>> test_data_imputed.columns = test_data.select_dtypes(exclude=['object']).columns
>>> test_data_imputed
>>> # add categorical data columns
>>> test_data_categorical = test_data.select_dtypes(include=['object'])
>>> test_data_imputed = test_data_imputed.join(test_data_categorical)
>>> test_data_imputed
>>> # use one-hot encoding
>>> test_data_one_hot = pd.get_dummies(test_data_imputed)
>>> test_data_one_hot
>>> # select non-categorical values
>>> test_data_wo_categoricals = test_data_imputed.select_dtypes(exclude=['object'])

Measuring dropping categoricals vs using one-hot encoding

>>> from sklearn.model_selection import train_test_split
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.metrics import mean_absolute_error
>>> def score_dataset(dataset):
...      y = dataset.Price
...      X = dataset.drop(['Price'], axis=1)
...      y_train, y_test = train_test_split(y,random_state=0,train_size=0.7,test_size=0.3)
...      X_train, X_test = train_test_split(X,random_state=0,train_size=0.7,test_size=0.3)
...      model = RandomForestRegressor()
...      model.fit(X_train, y_train)
...      predictions = model.predict(X_test)
...      return mean_absolute_error(y_test, predictions)
>>> print  "MAE when not using categoricals"
>>> score_dataset(test_data_wo_categoricals)
100.0
>>> print  "MAE when using categoricals with one-hot encoding"
>>> score_dataset(test_data_one_hot)
70.0
>>>
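One caveat: when one-hot encoding separate training and test sets, the resulting column sets can differ (a category may appear in only one of them). A minimal sketch of aligning them with DataFrame.align (train_df and test_df are hypothetical frames, not defined above):

>>> one_hot_train = pd.get_dummies(train_df)
>>> one_hot_test = pd.get_dummies(test_df)
>>> # keep only the columns seen in training; dummies missing from the test frame are filled with 0
>>> one_hot_train, one_hot_test = one_hot_train.align(one_hot_test, join='left', axis=1, fill_value=0)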

Friday, August 17, 2018

Scikit-learn 3. Hands-on python scikit-learn: handling missing values: dropping, imputing, imputing with state preservation.

Missing Values overview

There are many reasons why data can have missing values (a survey participant didn't answer all questions, the database has missing values etc.). Most ML libraries will give you an error if your data has missing values: for example, scikit-learn estimators (an estimator - something that fits and predicts from data - can be a model, classifier, regressor etc.) assume that all values in a data set are numerical and that all of them hold meaning.

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95

>>> # read data
>>> test_file_path = "~/test.csv"
>>> import pandas as pd
>>> test_data = pd.read_csv(test_file_path)
>>> # check if data has missing values
>>> test_data.isnull()
>>> # count missing values per column
>>> test_missing_count = test_data.isnull().sum()
>>> test_missing_count
>>> test_missing_count[test_missing_count > 0]

Dealing with missing values

There are several approaches to missing data (we'll use all of them and then compare the prediction results):
  1. Delete columns or rows with missing data:
    1. drop rows with missing data:
      1. >>> test_data_dropna_0 = test_data.dropna(axis=0)
      2. >>> test_data_dropna_0
    2. drop columns with missing data:
      1. >>> test_data_dropna_1 = test_data.dropna(axis=1)
      2. >>> test_data_dropna_1
    3. if you have both train and test data sets, then the columns must be deleted from both sets:
      1. >>> columns_with_missing = [col for col in test_data.columns if test_data[col].isnull().any()]
      2. any() - returns whether any element is True over the requested axis. Literally we ask whether any() element of the specified column isnull()
      3. >>> train_data_dropna_1 = train_data.drop(columns_with_missing, axis=1)
      4. >>> test_data_dropna_1 = test_data.drop(columns_with_missing, axis=1)
    4. Dropping is only a good option when the data in those columns is mostly missing
  2. Impute (in statistics, imputation is the process of replacing missing data with substituted values):
    1. pandas.DataFrame.fillna:
      1. >>> test_data_fillna = test_data.fillna(0)
      2. fillna fills NaN "cells" with 0 (you can use any value you want, see help(pd.DataFrame.fillna))
      3. >>> test_data_fillna
    2. sklearn.preprocessing.Imputer:
      1. >>> from sklearn.preprocessing import Imputer
      2. >>> test_imputer = Imputer()
      3. >>> test_data_imputed = test_imputer.fit_transform(test_data) 
      4. By default, missing (NaN) values are replaced by the mean along the axis; the default axis is 0 (impute along columns - run test_data.describe() to see the mean values). Other strategies are shown in the sketch after this list.
      5. >>> test_data_imputed
      6. After the imputed array is created, we'll convert it to a pandas DataFrame:
        1. >>> test_data_imputed = pd.DataFrame(test_data_imputed)
        2. >>> test_data_imputed.columns = test_data.columns
        3. >>> test_data_imputed
  3. Extended Imputation - before imputation, we'll create new columns indicating which values were changed:
    1. >>> test_data_ex_imputed = test_data.copy()
    2. >>> columns_with_missing = [col for col in test_data.columns if test_data[col].isnull().any()]
    3. >>> columns_with_missing
    4. Create a column corresponding to each column with missing values and fill that new column with Boolean values (whether the value was missing in the original data set or not):
      1. >>> for col in columns_with_missing:
        ...  test_data_ex_imputed[col + '_was_missing'] = test_data_ex_imputed[col].isnull()
      2. >>> test_data_ex_imputed
      3. >>> test_data_ex_imputed_columns = test_data_ex_imputed.columns
    5. impute:
      1. >>> test_imputer = Imputer()
      2. >>> test_data_ex_imputed = test_imputer.fit_transform(test_data_ex_imputed)
    6. Convert to DataFrame:
      1. >>> test_data_ex_imputed = pd.DataFrame(test_data_ex_imputed)
      2. >>> test_data_ex_imputed
      3. >>> test_data_ex_imputed.columns = test_data_ex_imputed_columns
      4. Now previously missing data in our "_was_missing" columns is 1, and previously present data is 0:
        1. >>> test_data_ex_imputed
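Imputer also supports strategies other than the default mean (as mentioned above); a minimal sketch, assuming the same test_data as in this post:

>>> from sklearn.preprocessing import Imputer
>>> # replace NaN with the column median instead of the mean
>>> median_imputer = Imputer(strategy='median')
>>> test_data_median_imputed = pd.DataFrame(median_imputer.fit_transform(test_data))
>>> test_data_median_imputed.columns = test_data.columns
>>> test_data_median_imputed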

Checking which method is the best

>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.metrics import mean_absolute_error
>>> from sklearn.model_selection import train_test_split
>>> def score_dataset(dataset):
...       y = dataset.Price
...       X = dataset.drop(['Price'], axis=1)
...       train_y, test_y = train_test_split(y,random_state=0,train_size=0.7,test_size=0.3)
...       train_X, test_X = train_test_split(X,random_state=0,train_size=0.7,test_size=0.3)
...       model = RandomForestRegressor()
...       model.fit(train_X, train_y)
...       predictions = model.predict(test_X)
...       return mean_absolute_error(test_y, predictions)
>>> print "MAE from dropping rows with missing values:"
>>> score_dataset(test_data_dropna_0)
70.0
>>> print "MAE from dropping columns with missing values:"
>>> score_dataset(test_data_dropna_1)
AttributeError: 'DataFrame' object has no attribute 'Price'
>>> # This is because the 'Price' column has missing values and we dropped it, so there is nothing left to train on and predict
>>> print "MAE from pandas fillna imputation:"
>>> score_dataset(test_data_fillna)
80.0
>>> print "MAE from sklearn Imputer() imputation:"
>>> score_dataset(test_data_imputed)
97.61904761904763
>>> print "MAE from sklearn Imputer() extended imputation:"
>>> score_dataset(test_data_ex_imputed)
67.6190476190476

As is common, imputation gives a better result than dropping missing values, and extended imputation can give a better result than simple imputation or no improvement at all.



Tuesday, August 14, 2018

Scikit-learn 2. Hands-on python scikit-learn: RandomForest.

To read about what Random Forest is, go to - it-tuff.blogspot.com/machine-learning-1

If you can't remember where scikit-learn models, metrics etc. are:

  1. locate sklearn | grep utils | cut -d"/" -f 1-6 | uniq
  2. cd to the found directory:
    1. cd /usr/lib64/python2.7/site-packages/sklearn
  3. to view packages:
    1. ll | grep ^d
>>> import pandas as pd
>>> test_file_path = "~/test.csv"
>>> test_data = pd.read_csv(test_file_path)
>>> # check if data has NaN values
>>> test_data.isnull()
>>> test_data = test_data.dropna(axis=0)
>>> test_data.columns.values
>>> test_data_features = ['Rooms','Floors','Area']
>>> X = test_data[test_data_features]
>>> y = test_data.Price
>>> from sklearn.model_selection import train_test_split
>>> train_X, val_X = train_test_split(X, random_state=0)
>>> train_y, val_y = train_test_split(y, random_state=0)
>>> from sklearn.ensemble import RandomForestRegressor
>>> # rfr stands for RandomForestRegressor (by default RFR creates 10 trees)
>>> test_rfr_model = RandomForestRegressor(random_state=1)
>>> test_rfr_model.fit(train_X, train_y)
>>> test_rfr_preds = test_rfr_model.predict(val_X)
>>> from sklearn.metrics import mean_absolute_error
>>> mean_absolute_error(val_y, test_rfr_preds)
40.0

As you see, even with default values Random Forest gives better results (in it-tuff.blogspot.com/scikit-learn-1 the MAE of the CART DT was 150.0).
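The number of trees can be tuned via the n_estimators parameter; a minimal sketch, reusing train_X/val_X/train_y/val_y from above (the values tried here are arbitrary):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# compare validation MAE for different forest sizes
for n_trees in [10, 50, 100]:
  model = RandomForestRegressor(n_estimators=n_trees, random_state=1)
  model.fit(train_X, train_y)
  preds = model.predict(val_X)
  print("n_estimators: %d \t MAE: %.1f" % (n_trees, mean_absolute_error(val_y, preds)))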

Monday, August 13, 2018

Scikit-learn 1. Hands-on python scikit-learn: intro, using Decision Tree Regression model, MAE, overfitting, cross-validation, underfitting.

Scikit-learn is Python machine-learning library

To install it:
pip install scipy
pip install sklearn

To use Scikit-learn, we must go through several steps:

  1. Prepare data - choose appropriate data to use in the model and prediction
  2. Define - choose an appropriate model (decision tree, random forest etc.)
  3. Fit - capture patterns from the provided data
  4. Predict
  5. Evaluate - decide how accurate the predictions are

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95

Prepare Data

To learn how to use pandas, go to it-tuff.blogspot.com/pandas-1

python
>>> test_file_path = "~/test.csv"
>>> import pandas as pd
>>> test_data = pd.read_csv(test_file_path)
>>> # our data has NaN values; the simplest approach is to remove all rows with NaN data
>>> test_data = test_data.dropna(axis=0)
>>> # we need to select prediction target, by convention called "y"
>>> test_data.columns.values
>>> y = test_data.Price
>>> # we need to choose the "prediction input" - features - columns (except the prediction target) which will be fed into our model and used to make predictions
>>> # by convention features called "X"
>>> test_data_features = ['Rooms','Floors','Area']
>>> X = test_data[test_data_features]
>>> # verify the data (maybe something weird is out there)
>>> X.head()
>>> X.describe()
>>> X.info()

Define

Our prediction target is price; price can theoretically be any real number, so we'll use the Decision Tree Regression model (to read more about Decision Trees, go to - it-tuff.blogspot.com/machine-learning-1).

>>> from sklearn.tree import DecisionTreeRegressor
>>> # make our test_model an instance of the DecisionTreeRegressor class
>>> # the DT learning algorithm is heuristic and makes locally optimal (sometimes random) decisions at each node, so the result can differ between runs; to get the same result on every run, a random_state seed must be used
>>> test_model = DecisionTreeRegressor(random_state=1)

Fit

>>> # Build a decision tree regressor from the training set (X,y) - in this step we make our Decision Tree to find patterns in the training set
>>> test_model.fit(X,y)

Predict

>>> # First we'll make predictions for our training set/data to check how good the model is
>>> # Making prediction for the features
>>> X
>>> # Real values are
>>> y
>>> # Model predictions are
>>> test_model.predict(X)
array([300., 400., 400., 200., 700.])
>>> 

Evaluate

If you want, you can view your decision tree model:

First export model in DOT format
>>> from sklearn.tree import export_graphviz
>>> export_graphviz(test_model,out_file="test_model.dot",feature_names=test_data_features)

Install Graphviz:
yum install graphviz
dot -Tpng test_model.dot -o test_model.png

Description of parameters in PNG file:
  1. samples - how many objects are in a node waiting for prediction (the root node has samples=5 because all 5 flat prices are waiting to be predicted)
  2. mse - several functions are available to measure the quality of a split; mse (mean squared error) is the default - it is always non-negative, and values closer to zero are better.
  3. value - the predicted price
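For reference, a minimal sketch of how mse is computed, assuming numpy is installed (the numbers are the in-sample predictions from above, so the result is 0.0):

>>> import numpy as np
>>> actual = np.array([300., 400., 400., 200., 700.])
>>> predicted = np.array([300., 400., 400., 200., 700.])
>>> # mean of squared differences
>>> np.mean((actual - predicted) ** 2)
0.0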
To evaluate our predictions, we can use many metrics; here we'll use MAE (Mean Absolute Error). To calculate MAE:
  1. Find the absolute error - the absolute difference of the price:
    1. |actual_price - predicted_price|
    2. This is done for every actual price and prediction pair in the training set
  2. Find the mean of all errors (sum up the absolute errors and divide by the number of errors)
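To make the definition concrete, a manual computation (a plain-Python sketch using y and test_model from above) should give the same number as sklearn's mean_absolute_error below:

>>> actual = list(y)
>>> predicted = list(test_model.predict(X))
>>> # sum of absolute differences divided by their count
>>> sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
0.0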
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = y
>>> y_predicted = test_model.predict(X)
>>> mean_absolute_error(y_true,y_predicted)
0.0

This measure is called an "in-sample" measure, because we used the same sample for both training and validation. This is bad because, for example, if all apartments with red door mats in our sample were expensive ones (had this parameter been in the data), the "door mat color" parameter would be used while predicting apartment rent price, which is incorrect (door mat color has no relation to the apartment rent price).
In-sample prediction and validation will show that our model is ideal or close to ideal. This is called overfitting - a model matches the training data almost perfectly but does poorly on new data. It happens because each next decision tree split has fewer and fewer corresponding values (apartments in our case). Leaves with only a few apartments make very accurate predictions close to the actual values, which makes the model perfect for the training data and unreliable for new data: every parameter in the training data is treated as a perfect indicator of the predicted value, which is not the case.

On the contrary, if we make only a few splits (low tree depth), our model will not catch important patterns in the data and will perform poorly even on the training data; this is called underfitting.

So to validate predictions correctly, we need to use different samples for training and validation. The simplest way to do that is to split the data into training and validation parts (the simplest form of cross-validation - a single hold-out split):
>>> from sklearn.model_selection import train_test_split
>>> # this function splits the sample data into training and validation portions (by default 25% of the sample goes to the validation part; mnemonic - this is TRAIN and TEST SPLIT)
>>> train_X, val_X = train_test_split(X, random_state=0)
>>> train_y, val_y = train_test_split(y, random_state=0)
>>> # now we'll use this split data to make training, prediction and validation
>>> test_model = DecisionTreeRegressor(random_state=1)
>>> test_model.fit(train_X,train_y)
>>> y_predicted = test_model.predict(val_X)
>>> mean_absolute_error(val_y,y_predicted)
150.0

As you see, the MAE for the in-sample data was 0.0 and for the out-of-sample data it is 150.0. In our data the average price is 400, so the error on new data (data not used during fitting) is about 37%.

So we need to find a compromise between overfitting and underfitting (the lowest MAE on the validation data). To do that we can experiment with the DecisionTreeRegressor max_leaf_nodes parameter (the maximum number of leaves in our model):
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):   
  model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)  
  model.fit(train_X, train_y)  
  preds_val = model.predict(val_X)  
  mae = mean_absolute_error(val_y, preds_val)  
  return(mae)  

for max_leaf_nodes in [2, 3, 4, 5]:  
  my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)  
  print("max_leaf_nodes: %d \t MAE: %d" %(max_leaf_nodes, my_mae))  

Our data will show the same result for all max_leaf_nodes values because our test data set is too small, but I think you understand the importance of the above code (get_mae and the for loop).
After finding the best value for max_leaf_nodes, train your model on all of the data (a sketch with the chosen parameter follows below):
>>> test_model.fit(X,y)
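A sketch of that final step, assuming max_leaf_nodes=4 turned out to be the best value (the exact number depends on your data):

>>> # re-create the model with the chosen parameter, then fit it on all available data
>>> final_model = DecisionTreeRegressor(max_leaf_nodes=4, random_state=1)
>>> final_model.fit(X, y)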