Thursday, December 20, 2018

Installing Asterisk on CentOS7 (step-by-step)

Shrink home volume size and extend root volume size (if needed): 
umount /home
df -h # verify that /home is unmounted
parted -l # to find the type of the home LV filesystem (xfs in my case)
lvremove /dev/centos/home # remove LV from the VG
lvcreate -L 5G -n home centos
mkfs.xfs /dev/mapper/centos-home
mount -a
lvs
lsblk
df -h
lvextend -l +100%FREE /dev/centos/root
xfs_growfs /dev/centos/root
df -h
lsblk
lvs

Setup networking:
systemctl stop NetworkManager
systemctl disable NetworkManager
chkconfig network on
systemctl start network
vi /etc/sysconfig/network-scripts/ifcfg-eth0 # assign IPADDR / PREFIX / GATEWAY / DNS1 / DNS2
vi /etc/sysconfig/network and add the two lines below:
NETWORKING=yes
NOZEROCONF=true
systemctl restart network

Disable SELinux:
vi /etc/sysconfig/selinux and change:
SELINUX=enforcing to SELINUX=disabled
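
If you prefer a one-liner (this mirrors the sed approach the PrestaShop post below uses; setenforce 0 switches SELinux to permissive immediately, while the disabled state only takes full effect after a reboot):
setenforce 0
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config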

Setup NTP:
yum install -y ntp && ntpdate pool.ntp.org && chkconfig ntpd on && service ntpd start
systemctl enable ntpd.service
systemctl status ntpd.service
systemctl start ntpd.service
systemctl status ntpd.service

Preinstall steps:
adduser asteriskpbx && passwd asteriskpbx && yum install sudo && visudo # uncomment the "%wheel  ALL=(ALL)       ALL" line
vi /etc/group # add asteriskpbx to wheel group: wheel:x:10:root,asteriskpbx
usermod -L asteriskpbx # lock the asteriskpbx password (no direct login)

Before Asterisk installation, install all prerequisite packages:
yum -y install make gcc gcc-c++ subversion libxml2-devel ncurses-devel openssl-devel vim-enhanced man glibc-devel autoconf libnewt kernel-devel kernel-headers linux-headers zlib-devel libsrtp libsrtp-devel uuid libuuid-devel mariadb-server jansson-devel libsqlite3x libsqlite3x-devel epel-release.noarch bash-completion bash-completion-extras unixODBC unixODBC-devel libtool-ltdl libtool-ltdl-devel mysql-connector-odbc mlocate libiodbc

Initial MariaDB setup:
systemctl status mariadb
systemctl enable mariadb
systemctl start mariadb
systemctl status mariadb
/usr/bin/mysql_secure_installation

Asterisk Installation:
yum update -y
yum install wget -y
mkdir -p ~/src/asterisk/
cd ~/src/asterisk/
wget http://downloads.asterisk.org/pub/telephony/asterisk/asterisk-13-current.tar.gz
tar -xzvf asterisk-13-current.tar.gz 
cd asterisk-13.23.1/
If your system is 64 bit and you want PJSIP:
./configure --libdir=/usr/lib64 --with-pjproject-bundled
Otherwise:
./configure
menuselect/menuselect --list-options
make
make install
make samples
make config
safe_asterisk
systemctl status asterisk

After install steps:
chown -R asteriskpbx:asteriskpbx /usr/lib/asterisk/
chown -R asteriskpbx:asteriskpbx /usr/lib64/asterisk/
chown -R asteriskpbx:asteriskpbx /var/lib/asterisk
chown -R asteriskpbx:asteriskpbx /var/spool/asterisk/
chown -R asteriskpbx:asteriskpbx /var/log/asterisk/
chown -R asteriskpbx:asteriskpbx /var/run/asterisk/
chown -R asteriskpbx:asteriskpbx /usr/sbin/ast
chown -R asteriskpbx:asteriskpbx /usr/sbin/asterisk
asterisk -r
exit

If you want PJSIP, load the modules needed for it:

vi /etc/asterisk/modules.conf
[modules]
;AUTOLOAD
autoload=yes
;LOAD
preload => res_odbc.so
preload => res_config_odbc.so
load => res_pjsip.so
load => res_pjsip_pubsub.so
load => res_pjsip_session.so
load => chan_pjsip.so
load => res_pjsip_exten_state.so
load => res_pjsip_authenticator_digest.so
load => res_timing_timerfd.so

vi /etc/asterisk/asterisk.conf and uncomment and change these 2 lines:
runuser = asteriskpbx           ; The user to run as.
rungroup = asteriskpbx        ; The group to run as.

asterisk -rx "core restart now"

Setup CEL with ODBC

odbcinst -j # verify that odbcinst.ini & odbc.ini are created
mysql -u root -p mysql
> CREATE USER 'asterisk'@'%' IDENTIFIED BY 'some_secret_password';
> CREATE DATABASE asterisk;
> GRANT ALL PRIVILEGES ON asterisk.* TO 'asterisk'@'%';

mysql -u asterisk -p asterisk # check access with asterisk user
odbcinst -q -d # verify that [MySQL] driver is seen
vi /etc/odbc.ini and add the following:
[asterisk-connector]  # connector-name / DSN
Description =MySQL connection to 'asterisk' database
Driver =MySQL # driver name from the odbcinst.ini
Database =asterisk # database to connect
Server =localhost  # we'll connect to the server itself
charset =UTF8
UserName =asterisk # DB user-name for asterisk DB
Password =some_secret_password # asterisk DB-user password
Port =3306
Socket =/var/lib/mysql/mysql.sock

echo "select 1" | isql -v asterisk-connector asterisk 'some_secret_password' # check connection

Add the connector to /etc/asterisk/res_odbc.conf:
[asterisk]
enabled => yes
dsn => asterisk-connector ; DSN is Data Source Name (from odbc.ini)
username => asterisk
password => some_secret_password
;pooling => no ; old option, replaced by max_connections
;limit => 1  ; old option, replaced by max_connections
pre-connect => yes
max_connections => 1

asterisk -rx "core restart now"
asterisk -rx "odbc show" # must show "Number of active connections: 1 (out of 1)"

vi /etc/asterisk/cel.conf and add the following:
[general]
enable=yes
apps=all
events=all

vi /etc/asterisk/cel_odbc.conf and add the following:
[mycel]
connection=asterisk
table=cel

vi /etc/my.cnf and add the following under [mysqld]:
character_set_server=utf8

mysql -u asterisk -p asterisk
CREATE TABLE cel
(
id INT(20) NOT NULL AUTO_INCREMENT,
eventtype INT(11) NOT NULL,
eventtime TIMESTAMP NOT NULL,
userdeftype VARCHAR(30) COLLATE utf8_general_ci NULL,
cid_name VARCHAR(80) COLLATE utf8_general_ci NULL,
cid_num VARCHAR(80) COLLATE utf8_general_ci NULL,
cid_ani VARCHAR(80) COLLATE utf8_general_ci NULL,
cid_rdnis VARCHAR(80) COLLATE utf8_general_ci NULL,
cid_dnid VARCHAR(80) COLLATE utf8_general_ci NULL,
exten VARCHAR(30) COLLATE utf8_general_ci NULL,
context VARCHAR(30) COLLATE utf8_general_ci NULL,
channame VARCHAR(30) COLLATE utf8_general_ci NULL,
appname VARCHAR(30) COLLATE utf8_general_ci NULL,
appdata VARCHAR(150) COLLATE utf8_general_ci NULL,
accountcode VARCHAR(30) COLLATE utf8_general_ci NULL,
peeraccount VARCHAR(30) COLLATE utf8_general_ci NULL,
uniqueid VARCHAR(30) COLLATE utf8_general_ci NULL,
linkedid VARCHAR(30) COLLATE utf8_general_ci NULL,
amaflags INT(11) NULL,
userfield VARCHAR(30) COLLATE utf8_general_ci NULL,
peer VARCHAR(30) COLLATE utf8_general_ci NULL,
PRIMARY KEY (id),
KEY uniqueid (uniqueid)
)
ENGINE=INNODB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;

asterisk -r
*CLI> module reload res_odbc.so
*CLI> module reload cel_odbc.so
*CLI> cel show status




Monday, November 26, 2018

SSH Tunnel (Local, Remote)

ssh-host is the host where the "ssh" command is executed to create the ssh-tunnel
ssh-peer is the host to which ssh-host connects via SSH to form the ssh-tunnel
destination-host is the host we want to access over the ssh-tunnel

For both Local and Remote SSH tunnels:
  1. connection from ssh-host to ssh-peer must be allowed
  2. only traffic between ssh-host and ssh-peer is encrypted (this traffic is inside the ssh-tunnel); traffic after the ssh-tunnel (the connection to the destination-host itself) is not encrypted, and security there depends on the application protocol being used (HTTP, FTP will remain unencrypted / HTTPS, SSH will be encrypted)

Local

Generally: 
  1. created on ssh-host
  2. accessed from ssh-host 
  3. port is listened on ssh-host

Local - the tunnel is created and accessed on the <ssh-host>; <destination-host>:<destination-port> must be accessible from the <ssh-peer> (see the explanation below):
[root@ssh-host  ~] ssh -L <port-to-listen-on-ssh-host>:<destination-host>:<destination-port> <ssh-peer>

<ssh-host> connects to <ssh-peer> with the SSH protocol to form an ssh-tunnel. When something on <ssh-host> connects to <localhost>:<port-to-listen-on-ssh-host>, the traffic is sent over the tunnel and <ssh-peer> opens a connection to <destination-host>:<destination-port> on its behalf; <ssh-host> is the side listening on <port-to-listen-on-ssh-host>.
Or in other words:
<destination-host>:<destination-port> is accessed as <localhost>:<port-to-listen-on-ssh-host> from <ssh-host>

PS: if <destination-host>:<destination-port> is, for example, localhost:80, then the ssh-peer will connect to itself on port 80
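
A concrete example (the addresses 203.0.113.10 and 192.168.5.20 are made up for illustration): to reach a web server 192.168.5.20:80 that is only reachable from the ssh-peer 203.0.113.10:
[root@ssh-host  ~] ssh -L 8080:192.168.5.20:80 root@203.0.113.10
[root@ssh-host  ~] curl http://localhost:8080/ # the request goes through the tunnel; 203.0.113.10 connects to 192.168.5.20:80 on our behalf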

Remote

Generally: 
  1. created on ssh-host
  2. accessed from ssh-peer
  3. port is listened on ssh-peer

Remote - the tunnel is created on the <ssh-host> and accessed on the <ssh-peer>; <destination-host>:<destination-port> must be accessible from the <ssh-host> (see the explanation below):
[root@ssh-host  ~] ssh -R <port-to-listen-on-ssh-peer>:<destination-host>:<destination-port> <ssh-peer>

<ssh-host> connects to <ssh-peer> with the SSH protocol to form an ssh-tunnel. When something on <ssh-peer> connects to <localhost>:<port-to-listen-on-ssh-peer>, the traffic is sent back over the tunnel and <ssh-host> opens a connection to <destination-host>:<destination-port> on its behalf; <ssh-peer> is the side listening on <port-to-listen-on-ssh-peer>.
Or in other words:
<destination-host>:<destination-port> is accessed as <localhost>:<port-to-listen-on-ssh-peer> from <ssh-peer>

PS: if <destination-host>:<destination-port> is, for example, localhost:80, then the ssh-host will connect to itself on port 80
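
A concrete example (again, made-up addresses): to expose SSH of the ssh-host to users logged in on the ssh-peer 203.0.113.10:
[root@ssh-host  ~] ssh -R 2222:localhost:22 root@203.0.113.10
[root@ssh-peer  ~] ssh root@localhost -p 2222 # executed on the ssh-peer, lands on the ssh-host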

Check


To view existing SSH tunnels (IPv4 (option -i4), IPv6 (option -i6) or both IPv4 and IPv6 (option -i); don't resolve IPs - option -n; show numerical ports - option -P):
[admin@localhost ~]$ lsof -i4 -n -P | grep ssh

As background process


If you want to create tunnels in the background and don't want to run any commands over the SSH session, use the options "-f" (forces going to the background and sends stdin to /dev/null, but still asks for passwords) and "-N" (do not execute a remote command):
[root@ssh-host  ~] ssh -f -N -L  <port-to-listen-on-ssh-host>:<destination-host>:<destination-port> <ssh-peer>
[root@ssh-host  ~] ssh -f -N -R <port-to-listen-on-ssh-peer>:<destination-host>:<destination-port> <ssh-peer>

Address binding

If you want, you can use address binding to more easily identify SSH tunnels:
  1. Tunnel to 10.10.10.100:
    1. ssh -L 127.0.0.100:2100:localhost:22 root@10.10.10.100
  2. Tunnel to 10.11.11.200:
    1. ssh -L 127.0.0.200:2200:localhost:22 root@10.11.11.200
  3. Now you have two "links" to access these SSH tunnels:
    1. ssh root@127.0.0.100 -p 2100 for 10.10.10.100
    2. ssh root@127.0.0.200 -p 2200 for 10.11.11.200

Accessing one host over another

If you have 2 servers you want to interconnect (the servers can't access each other directly), have a third server (which can access both the 1st and the 2nd server) and want all traffic along the route to be encrypted:
  1. Setup
    1. 1st server IP 10.10.10.1
    2. 2nd server IP 11.11.11.2
    3. 3rd server IP 12.12.12.3
    4. you want to access 10.10.10.1 from 11.11.11.2
  2. Configure
    1. [admin@12.12.12.3 ~] ssh -L 2001:localhost:22 root@10.10.10.1
    2. [admin@12.12.12.3 ~] ssh -R 2003:localhost:2001 root@11.11.11.2
  3. Access 10.10.10.1 from 11.11.11.2
    1. Access with SSH
      1. ssh root@localhost -p 2003
    2. rsync
      1. rsync -av -e "ssh -p2003" /some-dir/some-file root@localhost:/sync-dest-dir/
    3. rsync which also deletes synchronized files on the source server (directories are not deleted):
      1.  rsync -av -e "ssh -p2003"  --remove-source-files /some-dir/some-file root@localhost:/sync-dest-dir/

Cisco ASA IPSec VPN (IKEv1 / IKEv2) with pre-shared key

Setup - we'll interconnect two branches:
  1. Peers use VLAN 56:
    1. one branch has an interface with IP address 10.10.10.1/24 (we'll call this side "their-side")
    2. the other branch has an interface with IP address 10.10.10.2/24 (we'll call this side "our-side")
  2. Encryption domains (network which we want to interconnect via VPN):
    1. their-side has LAN net  192.168.1.0/24
    2. our-side has LAN net 192.168.2.0/24
  3. IKEv1 or IKEv2 can be used
  4. Also assume that both branches use a dedicated interface for the VPN connection and this is not the interface facing the Internet (this is done for simplicity; you can use the same setup on already functioning interfaces)

Aggressive or main mode


Normally main mode is used, so check that aggressive mode is disabled globally:
sh run | grep crypto ikev1 am-disable

Phase1


Check whether the needed IKE Phase1 policy already exists (choose the IKE version you need, 1 or 2):
  1. IKEv1 Phase1 policy:
    1. sh run crypto ikev1 | grep crypto ikev1 policy|pre-share|aes-256| sha|group 5|86400
  2. IKEv2 Phase1 policy (for IKEv2 integrity=hash; prf (Pseudo-Random Function) must be equal to integrity):
    1. sh run crypto ikev2 | grep crypto ikev2 policy|aes-256| sha|group 5|sha|86400

If the needed policy is not found, create a Phase1 policy (choose the IKE version you need, 1 or 2):
  1. for IKEv1:
    1. crypto ikev1 policy 160
      1.  authentication pre-share
      2.  encryption aes-256
      3.  hash sha
      4.  group 5
      5.  lifetime 86400
  2. for IKEv2:
    1. crypto ikev2 policy 40
      1.  encryption aes-256
      2.  integrity sha
      3.  group 5 2
      4.  prf sha
      5.  lifetime seconds 86400

Phase2


Check whether the needed IKE Phase2 policy already exists (choose the IKE version you need, 1 or 2):
  1. IKEv1 Phase2 policy:
    1. sh run crypto ipsec | grep ikev1.+esp-aes-256.+sha
  2. IKEv2 Phase2 policy:
    1. sh run crypto ipsec | grep ikev2|aes-256|sha-1
If the needed policy is not found, create a Phase2 policy (choose the IKE version you need, 1 or 2):
  1. for IKEv1:
    1. crypto ipsec ikev1 transform-set ESP-AES-256-SHA esp-aes-256 esp-sha-hmac
  2. for IKEv2:
    1. crypto ipsec ikev2 ipsec-proposal AES256-SHA1
      1.  protocol esp encryption aes-256
      2.  protocol esp integrity sha-1

Interface & route


Set up the interface which will be used for IPSec VPN initiation (this interface is one peer of the VPN tunnel, the other side is the other peer); I assume a VLAN is used:
interface GigabitEthernet1/10.56
vlan 56
nameif TEST 
security-level 1 
ip address 10.10.10.2 255.255.255.0 # the other peer is 10.10.10.1/24

Enable reverse-path verification:
ip verify reverse-path interface TEST

Set fragment-chain length:
fragment chain 1 TEST

If you don't use proxy-ARP, disable it:
sysopt noproxyarp TEST

If you use an ASA cluster and don't want this interface link to be monitored:
no monitor-interface TEST

Create route to the other side (other side encryption domain):
route TEST 192.168.1.0 255.255.255.0 10.10.10.1 1

Group-Policy


VPN Group-Policy (peer IP address is used in naming):
group-policy GP_10.10.10.1 internal 
group-policy GP_10.10.10.1 attributes     
vpn-tunnel-protocol ikev1 
OR 
vpn-tunnel-protocol ikev2
OR
vpn-tunnel-protocol ikev1 ikev2

Tunnel-Group


VPN Tunnel-Group  (peer IP address is used in naming):
tunnel-group 10.10.10.1 type ipsec-l2l 
tunnel-group 10.10.10.1 general-attributes   
default-group-policy GP_10.10.10.1 
tunnel-group 10.10.10.1 ipsec-attributes

Then:
  1. for IKEv1:
    1. ikev1 pre-shared-key PSK-KEY-GOES-HERE
  2. for IKEv2:
    1. ikev2 local-authentication pre-shared-key PSK-KEY-GOES-HERE  
    2. ikev2 remote-authentication pre-shared-key PSK-KEY-GOES-HERE  
If keepalive is needed (normally this doesn't create a problem even if the peer doesn't use this option):
isakmp keepalive threshold 10 retry 2 

Objects


Object for the local encryption domain (our LAN network - the network which will be seen from the other side of the VPN):
object network TEST_VPN_our_ED  
subnet 192.168.2.0 255.255.255.0 

Object for the remote encryption domain (their LAN network - the network which will be seen by our side):
object network TEST_VPN_their_ED  
subnet 192.168.1.0 255.255.255.0

If you have another LAN network and want this network to access the VPN too (but don't want, or aren't allowed, to add this network to the VPN setup as another encryption domain), you can achieve this using NAT. For simplicity use the VLAN ID in the NAT address (it will help you to more easily identify NAT-ted traffic in log files):
object network TEST_VPN_our_NET57_NAT 
host 192.168.2.57

VPN ACL & enable protocol on an interface (note here we first write "our IP" and then "their IP")


VPN ACL:
access-list TEST-VPN line 1 extended permit ip object TEST_VPN_our_ED object TEST_VPN_their_ED 

Enable the IKE protocol (v1 or v2) on the interface (only once):
crypto ikev1 enable TEST
OR 
crypto ikev2 enable TEST

Crypto-Map & add map to the interface


TEST_map crypto-map creation:
crypto map TEST_map 1 match address TEST-VPN
crypto map TEST_map 1 set peer 10.10.10.1

Then:
  1. for IKEv1:
    1. crypto map TEST_map 1 set ikev1 transform-set ESP-AES-256-SHA
  2. for IKEv2:
    1. crypto map TEST_map 1 set ikev2 ipsec-proposal AES256-SHA1
crypto map TEST_map 1 set security-association lifetime seconds 28800
crypto map TEST_map 1 set security-association lifetime kilobytes unlimited

If PFS is needed:
crypto map TEST_map 1 set pfs group5

Add map to the interface (only once - when creating TEST_map):
crypto map TEST_map interface TEST

Interface ACL & access-group


Interface ACL:
access-list TEST_access_in extended permit ip object TEST_VPN_their_ED object TEST_VPN_our_ED 
access-list TEST_access_in extended permit icmp host 10.10.10.1 host 10.10.10.2
access-list TEST_access_in extended permit esp any4 interface TEST 
access-list TEST_access_in extended permit udp any4 interface TEST eq isakmp 
access-list TEST_access_in extended permit icmp any4 interface TEST 
access-list TEST_access_in extended deny ip any any 

access-group TEST_access_in in interface TEST

NAT & no-NAT (NAT exemption) examples/templates


Host 192.168.3.2 VPN-traffic NAT exemption (no-NAT):
nat (LAN57,TEST) source static lan57.srv.3.2 TEST_VPN_our_NET57_NAT destination static TEST_VPN_their_ED TEST_VPN_their_ED no-proxy-arp

Allowing host 192.168.3.2 in the interface ACL:
access-list TEST_access_in extended permit ip object TEST_VPN_their_ED object  lan57.srv.3.2

Group-Policy ACL (note here we first write "their IP" and then "our IP")

You can set up the VPN with simple rules like TEST-VPN above and then restrict ports, source IPs, etc.:

Create group-policy ACL (we'll permit access from their net IP 192.168.1.10 to our net IP 192.168.2.10 port 443 and deny access for all others):
access-list TEST-VPN_GP_FILTER extended permit tcp host 192.168.1.10 host 192.168.2.10 eq 443
access-list TEST-VPN_GP_FILTER extended deny ip any any

group-policy GP_10.10.10.1 attributes
 vpn-filter value TEST-VPN_GP_FILTER

Monday, November 19, 2018

Cluster 26. Renaming pcs resource.

The procedure below was found in:
https://bugzilla.redhat.com/show_bug.cgi?id=1126835 and was tested by changing the name of a resource of type "ocf:heartbeat:VirtualDomain":

First:

  1. make resource unmanaged: pcs resource unmanage resource-old-name
  2. If this is VirtualDomain resource:
    1. change old-name to the new-name in XML definition files of your VM.

Backup existing config:
pcs cluster cib /tmp/cib.xml

Globally (not only first occurrence) change old-name to the new-name: 
sed 's/resource-old-name/resource-new-name/g' -i /tmp/cib.xml

Verify changes:
vi /tmp/cib.xml 

Push changed config to the cluster:
pcs cluster cib-push /tmp/cib.xml

Verify name change:
pcs status

Verify name change in the config dump:
pcs config | grep resource-new-name

Make resource managed again:
pcs resource manage resource-new-name

Thursday, November 8, 2018

Cluster 25. Restoring failed node.

In case of hardware failure and the need to restore one of the nodes (e.g. agrp-c01n02):
  1. go through all steps in Cluster 1 - Cluster 11 blog-posts (do only stuff related to the failed node)
  2. Cluster 12 blog-post - go through steps till "Login to any of the cluster node and authenticate hacluster user." part  (do only stuff related to the failed node), then:
    1. passwd hacluster
    2. from an active node:
      1. pcs node maintenance agrp-c01n02
      2. pcs cluster auth agrp-c01n02
    3. from agrp-c01n02:
      1. pcs cluster auth
      2. pcs cluster start
      3. pcs cluster status # node must be in maintenance mode with many errors due to absence of drbd / virsh and other packages
    4. then go through Cluster 12, starting at "Check cluster is functioning properly (on both nodes)" till "Quorum:" part
  3. go through all steps in Cluster 14 blog-post (do only stuff related to the failed node)
  4. Cluster 16 blog-post - go through steps till "Setup common DRBD options" part  (do only stuff related to the failed node), then:
    1. from agrp-c01n01:
      1. rsync -av /etc/drbd.d root@agrp-c01n02:/etc/
    2. from agrp-c01n02:
      1. drbdadm create-md r{0,1}
      2. drbdadm up r0; drbdadm secondary r0
      3. drbd-overview
      4. drbdadm up r1; drbdadm secondary r1
      5. drbd-overview
      6. wait till full synchronisation
      7. reboot failed node
  5. Cluster 17 blog-post - go through steps till "Setup DLM and CLVM" (do only stuff related to the failed node), then:
    1. drbdadm up all
    2. cat /proc/drbd
  6. Cluster 19 blog-post - only do check of the SNMP from the failed node:
    1. snmpwalk -v 2c -c agrp-c01-community 10.10.53.12
    2. fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel3 --action list
    3. fence_ifmib --ip agrp-stack01 --community agrp-c01-community --plug Port-channel2 --action list
  7. Cluster 20 blog-post - go through steps till "Provision Planning" (do only stuff related to the failed node), then:
    1. rsync -av /etc/libvirt/qemu/networks/ovs-network.xml  root@agrp-c01n02:/root
    2. systemctl start libvirtd 
    3. virsh net-define /root/ovs-network.xml 
    4. virsh net-list --all 
    5. virsh net-start ovs-network 
    6. virsh net-autostart ovs-network 
    7. virsh net-list 
    8. systemctl stop libvirtd
    9. rm  /root/ovs-network.xml
  8. For each VM add a constraint to ban VM start on the failed node (I assume n02 is the failed node). The command below adds a -INFINITY location constraint for the specified resource and node:
    1. pcs resource ban vm01-rntp agrp-c01n02
    2. pcs resource ban vm02-rftp agrp-c01n02
  9. Unmaintenance failed node from survived one and start cluster on the failed node:
    1. pcs node unmaintenance agrp-c01n02
    2. pcs cluster start
    3. pcs status
    4. wait till r0 & r1 DRBD resources are masters on both nodes and all resources (besides all VMs) are started on both nodes
  10. Cluster 18 blog-post, do only:
    1. yum install gfs2-utils -y
    2. tunegfs2 -l /dev/agrp-c01n01_vg0/shared # to view shared LV
    3. dlm_tool ls # names: clvmd & shared / members 1 2
    4. pvs # should only show drbd and sdb devices
    5. lvscan # List all logical volumes in all volume groups (3 OS LV, shared & 1 LV per VM)
  11. Cluster 21 blog-post:
    1. do "Firewall setup to support KVM Live Migration" (do only stuff related to the failed node)
    2. crm_simulate -sL | grep " vm[0-9]"
    3. SELunux related:
      1. ls -laZ /shared # must show "virt_etc_t" in all lines except related to ".."
      2. if above line is not true, do stuff in "SELinux related issues" (do only stuff related to the failed node)
  12. One by one (for each VM):
    1. remove ban constraint for the first VM:
      1. pcs resource clear vm01-rntp
    2. verify that constraints are removed:
      1. pcs constraint  location
    3. if this VM must be started on the restored node - wait till live migration is performed
  13. Congratulations, your cluster is restored to normal operation

Tuesday, October 9, 2018

CentOS 7 Apache, Nginx, PHP-FPM, PrestaShop

OS related

Install CentOS7
Create an admin user and make it an administrator (group=wheel)
Set a root password

I assume that your server IP address is 192.168.1.1

setenforce 0
getenforce
sed -i 's/enforcing/permissive/' /etc/sysconfig/selinux
sed -i 's/enforcing/permissive/' /etc/selinux/config
reboot
yum -y install wget unzip epel-release mlocate
updatedb
yum clean all
yum update -y
reboot

Install Apache & test default page

yum install httpd
sed -i 's/Listen 80/Listen 8080/' /etc/httpd/conf/httpd.conf
systemctl start httpd.service
systemctl enable httpd.service
systemctl status httpd
httpd -S
ss -tlpn | grep 8080
firewall-cmd --zone=public --permanent --add-port=8080/tcp
firewall-cmd --reload
192.168.1.1:8080 - test Apache welcome page
Disallow Apache to display directories and files within the web root directory /var/www/html:
sudo sed -i "s/Options Indexes FollowSymLinks/Options FollowSymLinks/" /etc/httpd/conf/httpd.conf
systemctl restart httpd

Install MariaDB

Choose a database name, user name and password for your PrestaShop DB

Install MariaDB and set it to automatically start after system reboot:
yum install mariadb mariadb-server -y
systemctl start mariadb.service
systemctl enable mariadb.service
Execute the secure MySQL installation process:
/usr/bin/mysql_secure_installation
Go through the process in accordance with the instructions below:
Enter current password for root (enter for none): Press the Enter key
Set root password? [Y/n]: Input Y, then press the Enter key
New password: Input a new root password, then press the Enter key
Re-enter new password: Input the same password again, then press the Enter key
Remove anonymous users? [Y/n]: Input Y, then press the Enter key
Disallow root login remotely? [Y/n]: Input Y, then press the Enter key
Remove test database and access to it? [Y/n]: Input Y, then press the Enter key
Reload privilege tables now? [Y/n]: Input Y, then press the Enter key
Now, log into the MySQL shell so that you can create a dedicated database for PrestaShop:
mysql -u root -p
CREATE DATABASE pshop-db-name;
GRANT ALL PRIVILEGES ON pshop-db-name.* TO 'pshop-db-username'@'localhost' IDENTIFIED BY 'pshop-db-password' WITH GRANT OPTION;
FLUSH PRIVILEGES;
EXIT;

Install PHP


Install PHP and required extensions using YUM:
yum -y install php php-fpm php-mysql php-gd php-ldap php-odbc php-pear php-xml php-xmlrpc php-mbstring php-snmp php-soap php-mcrypt php-curl php-cli curl zlib
Editing php.ini for optimal performance.
sed -i '/memory_limit/c\memory_limit = 128M' /etc/php.ini
sed -i '/upload_max_filesize/c\upload_max_filesize = 16M' /etc/php.ini 
sed -i '/max_execution_time/c\max_execution_time = 60' /etc/php.ini
vi /var/www/html/info.php add: <?php phpinfo(); ?>
systemctl restart httpd
192.168.1.1:8080/info.php review:
Server API Apache 2.0 Handler
_SERVER["SERVER_SOFTWARE"] Apache/2.4.6 (CentOS) PHP/5.4.16
grep -E "mod_proxy.so|mod_proxy_fcgi.so" /etc/httpd/conf.modules.d/* => if no result:
vi /etc/httpd/conf/httpd.conf => find LoadModule and add:
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_fcgi_module modules/mod_proxy_fcgi.so

Adding PHP-FPM (FastCGI Process Manager) support to Apache (all php scripts will be processed by PHP-FPM):
vi /etc/httpd/conf.d/php.conf find <FilesMatch \.php$> change:
#SetHandler application/x-httpd-php
 SetHandler "proxy:fcgi://127.0.0.1:9000" #PHP-FPM uses port 9000
systemctl start php-fpm.service
systemctl enable php-fpm.service
systemctl status php-fpm.service -l
systemctl restart httpd
192.168.1.1:8080/info.php review:
Server API FPM/FastCGI
rm /var/www/html/info.php
systemctl restart httpd

Creating PrestaShop Virtual Host for Apache

Disable (comment all lines) Apache's default welcome page:
sed -i 's/^/#&/g' /etc/httpd/conf.d/welcome.conf
Change "pshop-domain-name" to the name you bought:
mkdir -v /var/www/pshop-domain-name 
Create index.html test page:
echo "<h1 style='color: green;'>Presta Shop</h1>" | sudo tee /var/www/pshop-domain-name/index.html
Then create a phpinfo() file for each site so we can test PHP is configured properly:
echo "<?php phpinfo(); ?>" | sudo tee /var/www/pshop-domain-name/info.php
Make directory for available sites (sites configs will be here):
mkdir /etc/httpd/sites-available
This directory will contain links to the active sites (links to the files in sites-available):
mkdir /etc/httpd/sites-enabled
vi /etc/httpd/conf/httpd.conf
Add this line to the end of the file (this will allow us to quickly enable 
and disable sites by adding and removing links to their config files):
IncludeOptional sites-enabled/*.conf
vi /etc/httpd/sites-available/pshop-domain-name.conf
<VirtualHost *:8080>
    ServerName pshop-domain-name
    ServerAlias www.pshop-domain-name
    DocumentRoot /var/www/pshop-domain-name
    <Directory /var/www/pshop-domain-name>
        AllowOverride All
    </Directory>
</VirtualHost>
AllowOverride All enables .htaccess support
Make site available:
ln -s /etc/httpd/sites-available/pshop-domain-name.conf /etc/httpd/sites-enabled/pshop-domain-name.conf
Execute the command below to check that the httpd config files are OK (for now the AH00558 warning is OK):
apachectl -t
systemctl restart httpd
Check that green "Presta Shop" string is displayed (if you don't use Public DNS
add server IP with the corresponding site name to the /etc/hosts file): 
http://pshop-domain-name:8080/ 
Check that PHP uses FPM/FastCGI:
http://pshop-domain-name:8080/info.php

Installing and Configuring Nginx

yum install nginx
systemctl start nginx
systemctl status nginx -l
firewall-cmd --permanent --zone=public --add-service=http
firewall-cmd --reload
Test the nginx default page:
http://192.168.1.1/
systemctl enable nginx
vi /etc/nginx/nginx.conf and comment all lines between "server {" and closing "}"
systemctl restart nginx -l
Check that default site is unavailable:
http://192.168.1.1/

mkdir /etc/nginx/sites-available
mkdir /etc/nginx/sites-enabled
vi /etc/nginx/nginx.conf 
find "http {" block
Add these lines to the end of the http {} block, then save the file:
include /etc/nginx/sites-enabled/*.conf;
server_names_hash_bucket_size 64;

Create nginx test site:
mkdir -v /usr/share/nginx/sample.org
As we did with Apache's virtual hosts, we'll again create 
index and phpinfo() files for testing after setup is complete:
echo "<h1 style='color: red;'>Sample.org</h1>" | sudo tee /usr/share/nginx/sample.org/inde
echo "<?php phpinfo(); ?>" | sudo tee /usr/share/nginx/sample.org/info.php
Now create a virtual host file for the domain sample.org
Nginx calls server {. . .} areas of a configuration file server blocks. 
Create a server block for the primary virtual host, sample.org. 
The default_server configuration directive makes this the default
virtual host which processes HTTP requests that do not match any other virtual host:
vi /etc/nginx/sites-available/sample.org.conf
server {
    listen 80 default_server;

    root /usr/share/nginx/sample.org;
    index index.php index.html index.htm;

    server_name www.sample.org;
    location / {
        try_files $uri $uri/ /index.php;
    }

    location ~ \.php$ {
        # if the file is not there show an error : mynonexistingpage.php -> 404
        try_files $uri =404;
        
        # pass to the php-fpm server
        fastcgi_pass 127.0.0.1:9000;
        # also for fastcgi try index.php
        fastcgi_index index.php;
        # some tweaking
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_param SCRIPT_NAME $fastcgi_script_name;
        fastcgi_buffer_size 128k;
        fastcgi_buffers 256 16k;
        fastcgi_busy_buffers_size 256k;
        fastcgi_temp_file_write_size 256k;
        include fastcgi_params;
    }
}

Enable sample.org site:
ln -s /etc/nginx/sites-available/sample.org.conf /etc/nginx/sites-enabled/sample.org.conf
Check nginx config files for syntax:
nginx -t
systemctl reload nginx -l
Check that sample.org is working:
sample.org

Check nginx is working:
sample.org/info.php
_SERVER["SERVER_SOFTWARE"] nginx/1.12.2
_SERVER["DOCUMENT_ROOT"] /usr/share/nginx/sample.org
Disable sample.org :
rm /etc/nginx/sites-enabled/sample.org.conf
nginx -t
systemctl reload nginx -l

Configuring Nginx for Apache's Virtual Hosts (Proxy to Apache then to FPM)

Let's create an additional Nginx virtual host with multiple domain names
in the server_name directives. Requests for these domain names will be proxied to Apache:
vi /etc/nginx/sites-available/apache.conf
Add the code block below. The try_files directive makes Nginx look for files in the document root and directly serve them. If the file has a .php extension, the request is passed to Apache. Even if the file is not found in the document root, the request is passed on to Apache so that application features like permalinks work without problems:
server {
        listen   80; 

        root /var/www/pshop-domain-name; 
        index index.php index.html index.htm;

        server_name pshop-domain-name www.pshop-domain-name; 

        location / {
        try_files $uri $uri/ /index.php;
        }

        location ~ \.php$ {
        proxy_set_header X-Real-IP  $remote_addr;
        proxy_set_header X-Forwarded-For $remote_addr;
        proxy_set_header Host $host;
        proxy_pass http://127.0.0.1:8080;
        }

         location ~ /\.ht {
                deny all;
        }
}
ln -s /etc/nginx/sites-available/apache.conf /etc/nginx/sites-enabled/apache.org.conf
nginx -t
systemctl -l reload nginx

Make Apache pshop-domain-name accessible only from localhost:
vi /etc/httpd/sites-enabled/pshop-domain-name.conf
<VirtualHost 127.0.0.1:8080>
    ServerName pshop-domain-name
    ServerAlias www.pshop-domain-name
    DocumentRoot /var/www/pshop-domain-name
    <Directory /var/www/pshop-domain-name>
        AllowOverride All
    </Directory>
</VirtualHost>
systemctl restart httpd

Configuring Nginx for the PrestaShop Virtual Host (Proxy to FPM, no Apache)

systemctl stop httpd
systemctl disable httpd
firewall-cmd --zone=public --remove-port=8080/tcp
firewall-cmd --reload

Warning: The location ~ /\. directive is very important; this prevents Nginx from printing the contents of files like .htaccess and .htpasswd which contain sensitive information.
vi /etc/nginx/sites-available/pshop-domain-name.conf
server {
    listen 80 default_server;

    root /var/www/pshop-domain-name;
    index index.php index.html index.htm;

    server_name www.pshop-domain-name pshop-domain-name;
    location / {
        try_files $uri $uri/ /index.php;
    }

    location ~ \.php$ {
        # if the file is not there show an error : mynonexistingpage.php -> 404
        try_files $uri =404;

        # pass to the php-fpm server
        fastcgi_pass 127.0.0.1:9000;
        # also for fastcgi try index.php
        fastcgi_index index.php;
        # some tweaking
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_param SCRIPT_NAME $fastcgi_script_name;
        fastcgi_buffer_size 128k;
        fastcgi_buffers 256 16k;
        fastcgi_busy_buffers_size 256k;
        fastcgi_temp_file_write_size 256k;
        include fastcgi_params;
    }

    location ~ /\. {
        deny all;
    }
}
rm /var/www/pshop-domain-name/info.php
systemctl restart nginx

Installing PHP7 after installing old PHP

yum install http://rpms.remirepo.net/enterprise/remi-release-7.rpm -y
yum install yum-utils -y
yum-config-manager --enable remi-php72
yum update php php-zip
yum update
reboot

Installing PrestaShop

Download the latest stable version of PrestaShop from prestashop.com:
mkdir prestashop
Extract all to the prestashop directory:
unzip prestashop_1.7.4.3.zip -d prestashop
mv prestashop/* /var/www/pshop-domain-name/

Check user name used in pshop-domain-name (it must be apache):
echo "<?php echo exec('whoami'); ?>" | sudo tee /var/www/pshop-domain-name/whoami.php
192.168.1.1/whoami.php
rm /var/www/pshop-domain-name/whoami.php
chown -R apache: /var/www/pshop-domain-name/prestashop/
systemctl restart nginx

I don't know why, but I couldn't install PrestaShop using Google Chrome, so use Mozilla Firefox (I didn't try other browsers) to install PrestaShop:
192.168.1.1/prestashop

If you have any trouble accessing 192.168.1.1/prestashop, you can use the file which comes with PrestaShop (docs/server_config/nginx.conf.dist) - change the content of your /etc/nginx/sites-available/pshop-domain-name.conf to the content of /var/www/pshop-domain-name/docs/server_config/nginx.conf.dist, making the changes appropriate to your shop:

  • server_name 
  • root
  • fastcgi_pass 127.0.0.1:9000;
  • #fastcgi_pass unix:/run/php/php7.0-fpm.sock;
systemctl restart nginx

After the progress bar goes from 0% to 100%, you'll see 192.168.1.1/install
Now if you want, you can switch to the Chrome:
192.168.1.1/install
chown -R apache: /var/www/pshop-domain-name/


systemctl restart nginx
After several steps you'll be prompted to install php-intl (PHP internationalization)
and a PHP accelerator:
yum install php-intl
Check that php-intl is enabled:
php --ri intl
systemctl restart php-fpm

To view all components needed by PrestaShop or suggested to be installed on the server:
wget https://github.com/PierreRambaud/phppsinfo/archive/master.zip
unzip master.zip
cp phppsinfo-master/phppsinfo.php /var/www/pshop-domain-name/
chown -R apache: /var/www/pshop-domain-name/
systemctl restart nginx -l
login/pass are the same - prestashop
http://192.168.1.1/phppsinfo.php
Change everything you find to the recommended values (vi /etc/php.ini) and install additional PHP extensions (yum install php-NeededExtensionName)

systemctl restart php-fpm
systemctl restart nginx -l

check all parameters again:
http://192.168.1.1/phppsinfo.php
If everything is ok:
rm /var/www/pshop-domain-name/phppsinfo.php
systemctl restart php-fpm
systemctl restart nginx -l

http://192.168.1.1/install/index.php :
Installation is very straightforward; the only note is to use the previously created database credentials when specifying the database-related settings.

rm -rf /var/www/pshop-domain-name/install/
rm -f /var/www/pshop-domain-name/Install_PrestaShop.html
rm -f /var/www/pshop-domain-name/INSTALL.txt
ll /var/www/pshop-domain-name/ | grep admin
Enter admin panel with the found name:
192.168.1.1/admin985ftb6s2

Monday, October 8, 2018

Cisco ASA logging to CentOS 7 rsyslog & logrotate

First of all install CentOS 7 and yum update it.
systemctl status rsyslog.service

If rsyslog is not installed:
yum install rsyslog

Edit rsyslog config (we'll use UDP for messages logging):
vi /etc/rsyslog.conf
search for imudp and uncomment:
$ModLoad imudp
$UDPServerRun 514

systemctl restart rsyslog
systemctl status rsyslog.service

For SELinux semanage packet:
yum install policycoreutils-python

To view which port are allowed by SELinux:
semanage port -l | grep syslog

See if rsyslog is listening to any ports:
ss -nlp | grep rsyslog

firewall-cmd --list-all # find zone name (mine is public)
Allow traffic for rsyslog in that zone:
firewall-cmd --permanent --zone=public --add-port=514/udp
systemctl restart firewalld.service
firewall-cmd --list-all

Creating files for ASA log:
cd /var/log
touch asa.log
vi /etc/rsyslog.conf

Log severity levels 
There are eight in total as per Cisco’s definitions below: 
  • 0 = Emergencies => Extremely critical “system unusable” messages 
  • 1 = Alerts => Messages that require immediate administrator action 
  • 2 = Critical => A critical condition 
  • 3 = Errors => An error message (also the level of many access list deny messages) 
  • 4 = Warnings => A warning message (also the level of many other access list deny messages) 
  • 5 = Notifications => A normal but significant condition (such as an interface coming online) 
  • 6 = Informational => An informational message (such as a session being created or torn down) 
  • 7 = Debugging => A debug message or detailed accounting message
Facility - a term used to identify a device's syslog messages. To find the ASA facility:
sh log set | grep Fac|fac

The default ASA facility is 20, which corresponds to the rsyslog local4 facility (facility 21 = syslog local5, facility 22 = syslog local6, etc.).

Add a new rule that fits your needs (the lines below must be inserted right after #### RULES #### in rsyslog.conf, otherwise all messages will also be duplicated into messages and boot.log):
# Logs sent from the ASA with IP 10.10.10.10 are saved to the /var/log/asa.log file; here we have 2 options:
# 1 Use the facility to identify messages (each device has a predefined log facility):
local4.info /var/log/asa.log
# 2 Use an IP address to identify messages:
if $fromhost-ip == '10.10.10.10' then
        {
         /var/log/asa.log
         stop
        }
# you can use either of the two, but only one of them, otherwise all messages will be written twice to the same file

In order for the changes to take effect we need to restart the syslog service. 
systemctl restart rsyslog

Configure clock on an ASA (NTP or manual):
clock timezone AZS 4
clock set 12:33:00 10 Sep 2018
show clock

ASA logging destinations (ASA CLI parameters to logging command): 
  • console – logs are viewed in realtime while connecting via the serial console 
  • asdm – logs can be viewed in the ASDM GUI. 
  • monitor – logs to a Telnet or SSH session.
  • buffered – this is the internal memory buffer 
  • host – a remote syslog server IP and interface
  • trap – severity for remote syslog
  • mail – send generated logs via SMTP 
  • flow-export-syslogs – send event messages via NetFlow v9

Configure ASA logging to remote rsyslog server (also configuring buffer):
  1. enabling logging:
    1. logging enable 
  2. enable timestamping of log messages:
    1. logging timestamp 
  3. configure the buffer (when the buffer fills up, the oldest messages are overwritten):
    1. logging buffer-size 128000 
  4. severity level for buffered logging:
    1. logging buffered warnings 
  5. using informational severity:
    1. logging trap informational 
  6. IP of the rsyslog server:
    1. logging host inside 10.10.10.20 
  7. Verify logging settings:
    1. show logging setting
  8. Set up message logging queue (default is 512 messages, max queue size on ASA-5505 is 1024, on ASA-5510 is 2048 and 8192 on all other platforms):
    1. logging queue 1024
    2. show logging queue
Configure logrotate for asa.log:
cat /etc/logrotate.d/rotate_asa_log.conf
 # name of the log-file :
/var/log/asa.log {
    # rotate log daily :
    daily
    # keep 400 old log-files :
    rotate 400
    # compress old log-file after postscript execution :
    compress
    # rotate if log-file size equals or larger than 2GB :
    size 2G
    # add %Y%m%d to the end of the old log-file :
    dateext
    # use -%d%m%Y instead of the default %Y%m%d :
    dateformat -%d%m%Y
    # create empty asa.log file  :
    create 0644 root root
    # don't issue an error if the log-file is missing :
    missingok
    # don't rotate if log-file is empty :
    notifempty
    # use one postrotate script for all log-files (if more than one) :
    sharedscripts
    # start of the postrotate script :
    postrotate
        # HUP signal to rsyslogd PID (read from syslogd.pid file)
        # (actually bug > must be rsyslogd.pid instead of syslogd.pid)
        # makes rsyslog close all open files and restart;
        # whether a HUP signal triggers a restart or just a config reload
        # depends on the daemon's own behaviour
        /bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
    # end of the postrotate script :
    endscript
}

Test logrotate script without actually rotating anything (-d is debug option and it implies -v verbose option):
logrotate -d /etc/logrotate.d/rotate_asa_log.conf

After testing you can force logrotate to rotate logs:
logrotate -f /etc/logrotate.d/rotate_asa_log.conf

To see last rotation of the log-file:
cat /var/lib/logrotate/logrotate.status | grep asa
"/var/log/asa.log" 2018-10-4-19:9:35

So the next rotation will be done at the time in logrotate.status + the specified rotation interval (in our case it's "daily").

Wednesday, August 29, 2018

Python 2. Iterator, Generator.

When you create a list, you can read list elements one-by-one - this is called iteration.
>>> test_list = [1,2,3]
>>> for element in test_list:
...   print element
... 
1
2
3

test_list is an iterable object. In other words, any object which can be used with "for ... in ..." is iterable. Iterable objects such as lists are fine until they become too big, because the whole object is kept in memory.

Generator objects are also iterable, but they can be read only once: they don't store their values, they generate them on the fly. So you can use a generator only one time, because its values are not saved in memory.
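
A small illustration (interpreter session; the function and variable names are arbitrary):
>>> def squares(n):
...     for i in range(n):
...         yield i * i
... 
>>> gen = squares(3)
>>> for element in gen:
...   print element
... 
0
1
4
>>> # a generator is exhausted after one pass - a second pass yields nothing
>>> list(gen)
[]
>>> # generator expressions are a shorthand for simple generators
>>> sum(x * x for x in range(3))
5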

Tuesday, August 28, 2018

Scikit-learn 6. Hands-on python scikit-learn: Cross-Validation.

train_test_split helps us to measure the quality of model predictions, but there is a better approach - cross_val_score, which gives a more reliable measure. How it works:
  1. the data is automatically split into parts (so you don't need to have separate train and test datasets)
  2. on each iteration (the iteration count equals the number of parts) all parts besides the current part are used for model training and the current part is used as the test set (for example, on the 3rd iteration the 3rd part will be used as the test set and all other parts will be used as train sets)

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown

>>> import pandas as pd
>>> test_file_path = "~/rest.scv"
>>> test_data = pd.read_csv(test_file_path)
>>> # drop row with NaN values for Price column
>>> test_data.dropna(axis=0, subset=['Price'], inplace=True)
>>> test_data
>>> y = test_data.Price
>>> X = test_data.select_dtypes(exclude='object').drop('Price',axis=1)
>>> from sklearn.preprocessing import Imputer
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.pipeline import make_pipeline
>>> test_pipeline = make_pipeline(Imputer(), RandomForestRegressor())
>>> # cross_val_score uses negative metrics (sklearn uses convention that the higher the metrics value the better)
>>> scores = cross_val_score(test_pipeline,X,y,scoring='neg_mean_absolute_error')
>>> scores
array([-116.66666667, -205.        ,  -75.        ])
>>> # to get positive values
>>> print("Mean Absolute Error: {}".format(-1 * scores.mean()))
Mean Absolute Error: 132.22222222222223

Scikit-learn 5. Hands-on python scikit-learn: using pipelines.

Pipeline is a way to shorten the code and make it simpler.

>>> from sklearn.model_selection import train_test_split
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.preprocessing import Imputer
>>> from sklearn.pipeline import make_pipeline
>>>
>>> test_pipeline = make_pipeline(Imputer(), RandomForestRegressor())
>>> # as you see imputation is done automatically
>>> test_pipeline.fit(train_X,train_y) 
>>> predictions = test_pipeline.predict(test_X)

Machine Learning 2. Partial Dependence Plots (PDP).

Sometimes it seems that ML models are something like a black box - you can't see how the model is working or how you could view and improve its logic. Partial dependence plots (PDP) are used for this. A PDP shows how each variable or predictor (feature) affects the model's predictions; they can be interpreted similarly to how coefficients are interpreted in simpler (e.g. linear regression) models.

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown

We'll use PDPs to understand the relationship between Price and the other variables. PDPs help to find insights in the data and also to check whether something you think is important really matters for model building and prediction. A PDP is calculated only after the model has been trained (fit).

>>> test_file_path = "~/test.csv"
>>> import pandas as pd
>>> test_data = pd.read_csv(test_file_path)
>>> test_data.dropna(axis=0,subset=['Price'],inplace=True)
>>> y = test_data.Price
>>> X = test_data.drop(['Price'],axis=1)
>>> X = X.select_dtypes(exclude=['object'])
>>> from sklearn.preprocessing import Imputer
>>> test_imputer = Imputer()
>>> X = test_imputer.fit_transform(X)
>>> # for now sklearn supports PDP only for GradientBoostingRegressor
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> test_model = GradientBoostingRegressor()
>>> test_model.fit(X,y)
>>> from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence
>>> test_plots = plot_partial_dependence(gbrt=test_model,X=X,features=[0,1,2],feature_names=['Rooms', 'Floors', 'Area'],grid_resolution=10)
Options described:

  • gbrt - which GBR model to use
  • X - which dataset used to train model specified in gbrt option
  • features - index of columns of the dataset specified in X option which will be used in plotting (each index/column will create 1 PDP)
  • feature_names - how to name columns selected in features option
  • grid_resolution - number of values to plot on x axis
Negative values mean that the predicted Price would be lower than the average Price for that value of the variable. 

Monday, August 27, 2018

XGBoost 1.

XGBoost is an implementation of Gradient Boosted Decision Trees. 
Gradient Boosting is an ML technique used for regression and classification problems which produces a prediction model in the form of an ensemble of weak prediction models - decision trees:

  • A weak model means that the model's predictions are only slightly better than guessing
  • After building each weak model we:
    • calculate errors
    • build model predicting errors
    • add last model to ensemble
  • To make a prediction - add up the predictions from all models in the ensemble
XGBoost models are the leaders when working with tabular data (data without images and videos, or in other words - data that can be saved in a Pandas DataFrame).


Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown

To install XGBoost:
pip install xgboost

Using XGBoost Regressor


>>> import pandas as pd
>>> test_file_path = "~/rest.scv"
>>> test_data = pd.read_csv(test_file_path)
>>> # drop row with NaN values for Price column
>>> test_data.dropna(axis=0, subset=['Price'], inplace=True)
>>> test_data
>>> y = test_data.Price
>>> X = test_data.drop(['Price'], axis=1).select_dtypes(exclude=['object'])
>>> from sklearn.model_selection import train_test_split
>>> # split tests and get result as array, not DataFrame
>>> X_train, X_test, y_train, y_test = train_test_split(X.values,y.values,random_state=0,test_size=0.25)
>>> from sklearn.preprocessing import Imputer
>>> test_imputer = Imputer()
>>> X_train = test_imputer.fit_transform(X_train)
>>> X_test = test_imputer.transform(X_test)
>>> from xgboost import XGBRegressor
>>> test_model = XGBRegressor()
>>> test_model.fit(X_train,y_train)
>>> predictions = test_model.predict(X_test)
>>> from sklearn.metrics import mean_absolute_error as mae
>>> print("MAE XGBR: " + str(mae(predictions,y_test)))
MAE XGBR: 10.416168212890625
>>> 

XGBoost Regressor parameters


>>> test_model
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

  • n_estimators - how many times to go through the XGBoost modelling cycle
    • too low a value causes underfitting, too high - overfitting
    • typical values are 100-1000, depending on the learning_rate
    • to find the optimal value, use the early_stopping_rounds option. It stops the iterations when the model stops improving. Occasionally iterations can stop after a single bad round, so to avoid such situations set "early_stopping_rounds=5"; this will stop iterating only after 5 consecutive deteriorations of the score
    • It's good to set a high n_estimators and also set early_stopping_rounds; this will help to find the optimal value (eval_set is a list of (X,y) tuple pairs used as the validation set for early stopping; see the combined sketch after this list):
      • model = XGBRegressor(n_estimators=1000)
      • model.fit(train_X, train_y, early_stopping_rounds=5, eval_set=[(test_X,test_y)])
      • after training the model, use that number of n_estimators to re-train your model (on the entire data):
        • for example, the found value = 97
        • model = XGBRegressor(n_estimators=97)
        • model.fit(X, y)
  • learning_rate - on each iteration we multiply the predictions from each component model by a small number before adding them to the ensemble. This means that each DT added to the ensemble helps us less, which (in practice) reduces the model's tendency to overfit. So you can use a higher value for n_estimators without overfitting:
    • model = XGBRegressor(n_estimators=1000,learning_rate=0.05)
    • model.fit(train_X, train_y, early_stopping_rounds=5, eval_set=[(test_X,test_y)])
  • n_jobs - on big datasets, set this to the number of CPU cores on your machine to use multi-threading and thus fit the model quicker. On small datasets this will not help
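
Putting these options together (a sketch only; it reuses X_train/y_train/X_test/y_test from the regressor example above, and 97 stands in for whatever round early stopping actually reports):
>>> from xgboost import XGBRegressor
>>> model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
>>> model.fit(X_train, y_train, early_stopping_rounds=5, eval_set=[(X_test, y_test)], verbose=False)
>>> # suppose early stopping reports 97 as the best round - re-train on the entire data with that value
>>> final_model = XGBRegressor(n_estimators=97, learning_rate=0.05, n_jobs=4)
>>> final_model.fit(X, y)  # XGBoost handles the NaN values left in X natively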

Tuesday, August 21, 2018

Scikit-learn 4. Hands-on python scikit-learn: using categorical data (encoding, one-hot encoding, hashing).

Categorical data is data that takes only a predefined number of values. Most ML models will give you an error if you try to use categorical data in your model without any changes. So to use categorical data, we first need to encode those values with corresponding numeric values. For example, if we have names of colors in our data then we can do:
  1. Encoding - give each color its own number: red will be 1, yellow will be 2, green will be 3, etc. This is simple, but the problem is that 3 (green) is bigger than 1 (red), although that doesn't mean green should be given more weight than red while training or predicting (see the short example right after this list).
  2. One-hot encoding - we have 3 colors (red, yellow, green) in our data set's "Color" column, so we create 3 additional columns (Color_red, Color_yellow, Color_green) to save the value of each color for that row, and then the original column with categorical data is removed. So a row with red color will have 1 in the first column, 0 in the second and 0 in the third; yellow > 010, green > 001. This approach avoids treating one categorical value as having more weight than another.
  3. Hashing (or the hashing trick) - one-hot encoding is good, but when you have a huge number of different values in your data set, or the training data doesn't contain all possible values of a categorical feature, or the data is changing and the categorical feature receives new values, one-hot encoding creates too many additional columns and makes your predictions slow or even impossible (when new values appear outside the training data). In such situations hashing is used (not reviewed here).
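
A quick self-contained illustration of approach 1 (plain encoding) with pandas - each color simply gets its own integer code:
>>> import pandas as pd
>>> colors = pd.Series(['red', 'green', 'blue', 'green'])
>>> colors.astype('category').cat.codes
0    2
1    1
2    0
3    1
dtype: int8
Approach 2 (one-hot encoding) is what the rest of this post demonstrates with pd.get_dummies.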

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown

Using one-hot encoding

>>> import pandas as pd
>>> test_file_path = "~/test.csv"
>>> test_data = pd.read_csv(test_file_path)
>>> test_data
>>> test_data.describe() # HouseColor is not present
>>> test_data.info() # because HouseColor type is object - non-numerical (categorical data)
>>> test_data.dtypes
>>> # create new data-set without NaN values (we'll use imputation)
>>> from sklearn.preprocessing import Imputer
>>> test_imputer = Imputer()
>>> # before imputation - fill dataset only with numerical data
>>> test_data_numerical = test_data.select_dtypes(exclude=['object'])
>>> test_data_imputed = test_imputer.fit_transform(test_data_numerical)
>>> test_data_imputed
>>> # convert imputed dataset into Pandas DataFrame
>>> test_data_imputed = pd.DataFrame(test_data_imputed)
>>> test_data_imputed
>>> test_data_imputed.columns = test_data.select_dtypes(exclude=['object']).columns
>>> test_data_imputed
>>> # add categorical data columns
>>> test_data_categorical = test_data.select_dtypes(include=['object'])
>>> test_data_imputed = test_data_imputed.join(test_data_categorical)
>>> test_data_imputed
>>> # use one-hot encoding
>>> test_data_one_hot = pd.get_dummies(test_data_imputed)
>>> test_data_one_hot
>>> # select non-categorical values
>>> test_data_wo_categoricals = test_data_imputed.select_dtypes(exclude=['object'])

Measuring dropping categoricals vs using one-hot encoding

>>> from sklearn.model_selection import train_test_split
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.metrics import mean_absolute_error
>>> def score_dataset(dataset):
...      y = dataset.Price
...      X = dataset.drop(['Price'], axis=1)
...      y_train, y_test = train_test_split(y,random_state=0,train_size=0.7,test_size=0.3)
...      X_train, X_test = train_test_split(X,random_state=0,train_size=0.7,test_size=0.3)
...      model = RandomForestRegressor()
...      model.fit(X_train, y_train)
...      predictions = model.predict(X_test)
...      return mean_absolute_error(y_test, predictions)
>>> print  "MAE when not using categoricals"
>>> score_dataset(test_data_wo_categoricals)
100.0
>>> print  "MAE when using categoricals with one-hot encoding"
>>> score_dataset(test_data_one_hot)
70.0
>>>
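One caveat: when one-hot encoding separate training and test sets, the resulting column sets can differ (a category may appear in only one of them). A minimal sketch of aligning them with DataFrame.align (train_df and test_df are hypothetical frames, not defined above):

>>> one_hot_train = pd.get_dummies(train_df)
>>> one_hot_test = pd.get_dummies(test_df)
>>> # keep only the columns seen in training; dummies missing from the test frame are filled with 0
>>> one_hot_train, one_hot_test = one_hot_train.align(one_hot_test, join='left', axis=1, fill_value=0)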

Friday, August 17, 2018

Scikit-learn 3. Hands-on python scikit-learn: handling missing values: dropping, imputing, imputing with state preservation.

Missing Values overview

There are many reasons why data can have missing values (a survey participant didn't answer all questions, the database has missing values etc.). Most ML libraries will give you an error if your data has missing values: for example, scikit-learn estimators (an estimator - something that fits and predicts from data - can be a model, classifier, regressor etc.) assume that all values in a data set are numerical and that all of them hold meaning.

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95

>>> # read data
>>> test_file_path = "~/test.csv"
>>> import pandas as pd
>>> test_data = pd.read_csv(test_file_path)
>>> # check if data has missing values
>>> test_data.isnull()
>>> # count missing values per column
>>> test_missing_count = test_data.isnull().sum()
>>> test_missing_count
>>> test_missing_count[test_missing_count > 0]

Dealing with missing values

There are several approaches to missing data (we'll use all of them and then compare the prediction results):
  1. Delete columns or rows with missing data:
    1. drop rows with missing data:
      1. >>> test_data_dropna_0 = test_data.dropna(axis=0)
      2. >>> test_data_dropna_0
    2. drop columns with missing data:
      1. >>> test_data_dropna_1 = test_data.dropna(axis=1)
      2. >>> test_data_dropna_1
    3. if you have both train and test data sets, then the columns must be deleted from both sets:
      1. >>> columns_with_missing = [col for col in test_data.columns if test_data[col].isnull().any()]
      2. any() - returns whether any element is True over the requested axis. Literally we ask whether any() element of the specified column isnull()
      3. >>> train_data_dropna_1 = train_data.drop(columns_with_missing, axis=1)
      4. >>> test_data_dropna_1 = test_data.drop(columns_with_missing, axis=1)
    4. Dropping is only a good option when the data in those columns is mostly missing
  2. Impute (in statistics, imputation is the process of replacing missing data with substituted values):
    1. pandas.DataFrame.fillna:
      1. >>> test_data_fillna = test_data.fillna(0)
      2. fillna fills NaN "cells" with 0 (you can use any value you want, see help(pd.DataFrame.fillna))
      3. >>> test_data_fillna
    2. sklearn.preprocessing.Imputer:
      1. >>> from sklearn.preprocessing import Imputer
      2. >>> test_imputer = Imputer()
      3. >>> test_data_imputed = test_imputer.fit_transform(test_data) 
      4. By default, missing (NaN) values are replaced by the mean along the axis; the default axis is 0 (impute along columns - run test_data.describe() to see the mean values). Other strategies are shown in the sketch after this list.
      5. >>> test_data_imputed
      6. After the imputed array is created, we'll convert it to a pandas DataFrame:
        1. >>> test_data_imputed = pd.DataFrame(test_data_imputed)
        2. >>> test_data_imputed.columns = test_data.columns
        3. >>> test_data_imputed
  3. Extended Imputation - before imputation, we'll create new columns indicating which values were changed:
    1. >>> test_data_ex_imputed = test_data.copy()
    2. >>> columns_with_missing = [col for col in test_data.columns if test_data[col].isnull().any()]
    3. >>> columns_with_missing
    4. Create a column corresponding to each column with missing values and fill that new column with Boolean values (whether the value was missing in the original data set or not):
      1. >>> for col in columns_with_missing:
        ...  test_data_ex_imputed[col + '_was_missing'] = test_data_ex_imputed[col].isnull()
      2. >>> test_data_ex_imputed
      3. >>> test_data_ex_imputed_columns = test_data_ex_imputed.columns
    5. impute:
      1. >>> test_imputer = Imputer()
      2. >>> test_data_ex_imputed = test_imputer.fit_transform(test_data_ex_imputed)
    6. Convert to DataFrame:
      1. >>> test_data_ex_imputed = pd.DataFrame(test_data_ex_imputed)
      2. >>> test_data_ex_imputed
      3. >>> test_data_ex_imputed.columns = test_data_ex_imputed_columns
      4. Now previously missing data in our "_was_missing" columns is 1, and previously present data is 0:
        1. >>> test_data_ex_imputed
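Imputer also supports strategies other than the default mean (as mentioned above); a minimal sketch, assuming the same test_data as in this post:

>>> from sklearn.preprocessing import Imputer
>>> # replace NaN with the column median instead of the mean
>>> median_imputer = Imputer(strategy='median')
>>> test_data_median_imputed = pd.DataFrame(median_imputer.fit_transform(test_data))
>>> test_data_median_imputed.columns = test_data.columns
>>> test_data_median_imputed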

Checking which method is the best

>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.metrics import mean_absolute_error
>>> from sklearn.model_selection import train_test_split
>>> def score_dataset(dataset):
...       y = dataset.Price
...       X = dataset.drop(['Price'], axis=1)
...       train_y, test_y = train_test_split(y,random_state=0,train_size=0.7,test_size=0.3)
...       train_X, test_X = train_test_split(X,random_state=0,train_size=0.7,test_size=0.3)
...       model = RandomForestRegressor()
...       model.fit(train_X, train_y)
...       predictions = model.predict(test_X)
...       return mean_absolute_error(test_y, predictions)
>>> print "MAE from dropping rows with missing values:"
>>> score_dataset(test_data_dropna_0)
70.0
>>> print "MAE from dropping columns with missing values:"
>>> score_dataset(test_data_dropna_1)
AttributeError: 'DataFrame' object has no attribute 'Price'
>>> # This is because the 'Price' column has missing values and we dropped it, so there is nothing left to train on and predict
>>> print "MAE from pandas fillna imputation:"
>>> score_dataset(test_data_fillna)
80.0
>>> print "MAE from sklearn Imputer() imputation:"
>>> score_dataset(test_data_imputed)
97.61904761904763
>>> print "MAE from sklearn Imputer() extended imputation:"
>>> score_dataset(test_data_ex_imputed)
67.6190476190476

As is common, imputation gives a better result than dropping missing values, and extended imputation can give a better result than simple imputation or no improvement at all.



Tuesday, August 14, 2018

Scikit-learn 2. Hands-on python scikit-learn: RandomForest.

To read about what Random Forest is, go to - it-tuff.blogspot.com/machine-learning-1

If you can't remember where scikit-learn models, metrics etc. are:

  1. locate sklearn | grep utils | cut -d"/" -f 1-6 | uniq
  2. cd to the found directory:
    1. cd /usr/lib64/python2.7/site-packages/sklearn
  3. to view packages:
    1. ll | grep ^d
>>> import pandas as pd
>>> test_file_path = "~/test.csv"
>>> test_data = pd.read_csv(test_file_path)
>>> # check if data has NaN values
>>> test_data.isnull()
>>> test_data = test_data.dropna(axis=0)
>>> test_data.columns.values
>>> test_data_features = ['Rooms','Floors','Area']
>>> X = test_data[test_data_features]
>>> y = test_data.Price
>>> from sklearn.model_selection import train_test_split
>>> train_X, val_X = train_test_split(X, random_state=0)
>>> train_y, val_y = train_test_split(y, random_state=0)
>>> from sklearn.ensemble import RandomForestRegressor
>>> # rfr stands for RandomForestRegressor (by default RFR creates 10 trees)
>>> test_rfr_model = RandomForestRegressor(random_state=1)
>>> test_rfr_model.fit(train_X, train_y)
>>> test_rfr_preds = test_rfr_model.predict(val_X)
>>> from sklearn.metrics import mean_absolute_error
>>> mean_absolute_error(val_y, test_rfr_preds)
40.0

As you see, even with default values Random Forest gives better results (in it-tuff.blogspot.com/scikit-learn-1 the MAE of the CART DT was 150.0).
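The number of trees can be tuned via the n_estimators parameter; a minimal sketch, reusing train_X/val_X/train_y/val_y from above (the values tried here are arbitrary):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# compare validation MAE for different forest sizes
for n_trees in [10, 50, 100]:
  model = RandomForestRegressor(n_estimators=n_trees, random_state=1)
  model.fit(train_X, train_y)
  preds = model.predict(val_X)
  print("n_estimators: %d \t MAE: %.1f" % (n_trees, mean_absolute_error(val_y, preds)))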

Monday, August 13, 2018

Scikit-learn 1. Hands-on python scikit-learn: intro, using Decision Tree Regression model, MAE, overfitting, cross-validation, underfitting.

Scikit-learn is Python machine-learning library

To install it:
pip install scipy
pip install sklearn

To use Scikit-learn, we must go through several steps:

  1. Prepare data - choose appropriate data to use in the model and prediction
  2. Define - choose an appropriate model (decision tree, random forest etc.)
  3. Fit - capture patterns from the provided data
  4. Predict
  5. Evaluate - decide how accurate the predictions are

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95

Prepare Data

To learn how to use pandas, go to it-tuff.blogspot.com/pandas-1

python
>>> test_file_path = "~/test.csv"
>>> import pandas as pd
>>> test_data = pd.read_csv(test_file_path)
>>> # our data has NaN values; the simplest approach is to remove all rows with NaN data
>>> test_data = test_data.dropna(axis=0)
>>> # we need to select prediction target, by convention called "y"
>>> test_data.columns.values
>>> y = test_data.Price
>>> # we need to choose the "prediction input" - features - columns (except the prediction target) which will be fed into our model and used to make predictions
>>> # by convention features called "X"
>>> test_data_features = ['Rooms','Floors','Area']
>>> X = test_data[test_data_features]
>>> # verify the data (maybe something weird is out there)
>>> X.head()
>>> X.describe()
>>> X.info()

Define

Our prediction target is price; price can theoretically be any real number, so we'll use the Decision Tree Regression model (to read more about Decision Trees, go to - it-tuff.blogspot.com/machine-learning-1).

>>> from sklearn.tree import DecisionTreeRegressor
>>> # make our test_model an instance of the DecisionTreeRegressor class
>>> # the DT learning algorithm is heuristic and makes locally optimal (sometimes random) decisions at each node, so the result can differ between runs; to get the same result on every run, a random_state seed must be used
>>> test_model = DecisionTreeRegressor(random_state=1)

Fit

>>> # Build a decision tree regressor from the training set (X,y) - in this step we make our Decision Tree to find patterns in the training set
>>> test_model.fit(X,y)

Predict

>>> # First we'll make predictions for our training set/data to check how good the model is
>>> # Making prediction for the features
>>> X
>>> # Real values are
>>> y
>>> # Model predictions are
>>> test_model.predict(X)
array([300., 400., 400., 200., 700.])
>>> 

Evaluate

If you want, you can view your decision tree model:

First export model in DOT format
>>> from sklearn.tree import export_graphviz
>>> export_graphviz(test_model,out_file="test_model.dot",feature_names=test_data_features)

Install Graphviz:
yum install graphviz
dot -Tpng test_model.dot -o test_model.png

Description of parameters in PNG file:
  1. samples - how many objects are in a node waiting for prediction (the root node has samples=5 because all 5 flat prices are waiting to be predicted)
  2. mse - several functions are available to measure the quality of a split; mse (mean squared error) is the default - it is always non-negative, and values closer to zero are better.
  3. value - the predicted price
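For reference, a minimal sketch of how mse is computed, assuming numpy is installed (the numbers are the in-sample predictions from above, so the result is 0.0):

>>> import numpy as np
>>> actual = np.array([300., 400., 400., 200., 700.])
>>> predicted = np.array([300., 400., 400., 200., 700.])
>>> # mean of squared differences
>>> np.mean((actual - predicted) ** 2)
0.0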
To evaluate our predictions, we can use many metrics; here we'll use MAE (Mean Absolute Error). To calculate MAE:
  1. Find the absolute error - the absolute difference of the price:
    1. |actual_price - predicted_price|
    2. This is done for every actual price and prediction pair in the training set
  2. Find the mean of all errors (sum up the absolute errors and divide by the number of errors)
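To make the definition concrete, a manual computation (a plain-Python sketch using y and test_model from above) should give the same number as sklearn's mean_absolute_error below:

>>> actual = list(y)
>>> predicted = list(test_model.predict(X))
>>> # sum of absolute differences divided by their count
>>> sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
0.0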
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = y
>>> y_predicted = test_model.predict(X)
>>> mean_absolute_error(y_true,y_predicted)
0.0

This measure is called an "in-sample" measure, because we used the same sample for both training and validation. This is bad because, for example, if all apartments with red door mats in our sample were expensive ones (had this parameter been in the data), the "door mat color" parameter would be used while predicting apartment rent price, which is incorrect (door mat color has no relation to the apartment rent price).
In-sample prediction and validation will show that our model is ideal or close to ideal. This is called overfitting - a model matches the training data almost perfectly but does poorly on new data. It happens because each next decision tree split has fewer and fewer corresponding values (apartments in our case). Leaves with only a few apartments make very accurate predictions close to the actual values, which makes the model perfect for the training data and unreliable for new data: every parameter in the training data is treated as a perfect indicator of the predicted value, which is not the case.

On the contrary, if we make only a few splits (low tree depth), our model will not catch important patterns in the data and will perform poorly even on the training data; this is called underfitting.

So to validate predictions correctly, we need to use different samples for training and validation. The simplest way to do that is to split the data into training and validation parts (the simplest form of cross-validation - a single hold-out split):
>>> from sklearn.model_selection import train_test_split
>>> # this function splits the sample data into training and validation portions (by default 25% of the sample goes to the validation part; mnemonic - this is TRAIN and TEST SPLIT)
>>> train_X, val_X = train_test_split(X, random_state=0)
>>> train_y, val_y = train_test_split(y, random_state=0)
>>> # now we'll use this split data to make training, prediction and validation
>>> test_model = DecisionTreeRegressor(random_state=1)
>>> test_model.fit(train_X,train_y)
>>> y_predicted = test_model.predict(val_X)
>>> mean_absolute_error(val_y,y_predicted)
150.0

As you see, the MAE for the in-sample data was 0.0 and for the out-of-sample data it is 150.0. In our data the average price is 400, so the error on new data (data not used during fitting) is about 37%.

So we need to find a compromise between overfitting and underfitting (the lowest MAE on the validation data). To do that we can experiment with the DecisionTreeRegressor max_leaf_nodes parameter (the maximum number of leaves in our model):
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):   
  model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)  
  model.fit(train_X, train_y)  
  preds_val = model.predict(val_X)  
  mae = mean_absolute_error(val_y, preds_val)  
  return(mae)  

for max_leaf_nodes in [2, 3, 4, 5]:  
  my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)  
  print("max_leaf_nodes: %d \t MAE: %d" %(max_leaf_nodes, my_mae))  

Our data will show the same result for all max_leaf_nodes values because our test data set is too small, but I think you understand the importance of the above code (get_mae and the for loop).
After finding the best value for max_leaf_nodes, train your model on all of the data (a sketch with the chosen parameter follows below):
>>> test_model.fit(X,y)
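A sketch of that final step, assuming max_leaf_nodes=4 turned out to be the best value (the exact number depends on your data):

>>> # re-create the model with the chosen parameter, then fit it on all available data
>>> final_model = DecisionTreeRegressor(max_leaf_nodes=4, random_state=1)
>>> final_model.fit(X, y)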