Wednesday, August 29, 2018

Python 2. Iterator, Generator.

When you create a list, you can read list elements one-by-one - this is called iteration.
>>> test_list = [1,2,3]
>>> for element in test_list:
...   print element
... 
1
2
3

test_list is an iterable object. In other words, any object that can be used with "for ... in ..." is iterable. Iterables are fine until they get too big, because a list-like iterable keeps all of its elements in memory.

Generators are also iterable, but they can be read only once: they don't store their values, they generate them on the fly. So a generator can be consumed only one time, because the values are not kept in memory.
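A minimal sketch of both flavours of generator (a generator expression and a generator function with yield); test_generator and count_up_to are made-up names for illustration:

>>> # generator expression - note the parentheses instead of square brackets
>>> test_generator = (x * x for x in range(3))
>>> for element in test_generator:
...   print element
... 
0
1
4
>>> # a second pass produces nothing - the generator is already exhausted
>>> for element in test_generator:
...   print element
... 
>>> # generator function - yield returns a value and pauses until the next value is requested
>>> def count_up_to(n):
...   i = 1
...   while i <= n:
...     yield i
...     i += 1
... 
>>> list(count_up_to(3))
[1, 2, 3]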

Tuesday, August 28, 2018

Scikit-learn 6. Hands-on python scikit-learn: Cross-Validation.

train_test_split helps us measure the quality of model predictions, but there is a better approach - cross_val_score, which gives a more reliable measure. How it works (a conceptual sketch follows right after this list):
  1. the data is automatically split into parts (so you don't need to keep separate train and test datasets)
  2. on each iteration (the iteration count equals the number of parts) all parts besides the current one are used for model training and the current part is used as the test set (for example, on the 3rd iteration the 3rd part is the test set and all other parts will be used as the train set)
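Conceptually, cross_val_score runs a loop like the sketch below (simplified; the real implementation differs, manual_cross_val_score is a made-up name, and X, y are assumed to be numpy arrays):

>>> from sklearn.model_selection import KFold
>>> def manual_cross_val_score(model, X, y, n_splits=3):
...     scores = []
...     for train_index, test_index in KFold(n_splits=n_splits).split(X):
...         model.fit(X[train_index], y[train_index])
...         predictions = model.predict(X[test_index])
...         # negative mean absolute error, following the sklearn "higher is better" convention
...         scores.append(-abs(y[test_index] - predictions).mean())
...     return scores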

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown

>>> import pandas as pd
>>> test_file_path = "~/test.csv"
>>> test_data = pd.read_csv(test_file_path)
>>> # drop row with NaN values for Price column
>>> test_data.dropna(axis=0, subset=['Price'], inplace=True)
>>> test_data
>>> y = test_data.Price
>>> X = test_data.select_dtypes(exclude='object').drop('Price',axis=1)
>>> from sklearn.preprocessing import Imputer
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.pipeline import make_pipeline
>>> test_pipeline = make_pipeline(Imputer(), RandomForestRegressor())
>>> # cross_val_score returns negative metric values (sklearn's convention is that the higher the metric value, the better)
>>> scores = cross_val_score(test_pipeline,X,y,scoring='neg_mean_absolute_error')
>>> scores
array([-116.66666667, -205.        ,  -75.        ])
>>> # to get positive values
>>> print("Mean Absolute Error: {}".format(-1 * scores.mean()))
Mean Absolute Error: 132.22222222222223

Scikit-learn 5. Hands-on python scikit-learn: using pipelines.

A pipeline bundles preprocessing and modelling steps together, which shortens the code and makes it simpler.

>>> from sklearn.model_selection import train_test_split
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.preprocessing import Imputer
>>> from sklearn.pipeline import make_pipeline
>>>
>>> test_pipeline = make_pipeline(Imputer(), RandomForestRegressor())
>>> # as you see imputation is done automatically
>>> test_pipeline.fit(train_X,train_y) 
>>> predictions = test_pipeline.predict(test_X)
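For comparison, here is a sketch of the same steps written without a pipeline (reusing the train_X/train_y/test_X variables from above):

>>> test_imputer = Imputer()
>>> test_model = RandomForestRegressor()
>>> imputed_train_X = test_imputer.fit_transform(train_X)
>>> # reuse the imputation statistics learned on the train set
>>> imputed_test_X = test_imputer.transform(test_X)
>>> test_model.fit(imputed_train_X, train_y)
>>> predictions = test_model.predict(imputed_test_X)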

Machine Learning 2. Partial Dependence Plots (PDP).

Sometimes ML models seem like a black box - you can't see how the model works or how to inspect and improve its logic. Partial dependence plots are used for that. A PDP shows how each variable or predictor (feature) affects the model's predictions; the plots can be interpreted similarly to coefficients in linear regression models.

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown

We'll use PDPs to understand the relationship between Price and the other variables. PDPs help you find insights in the data and also check whether something you believe is important really matters for model building and prediction. A PDP can only be calculated after the model has been trained (fit).

>>> test_file_path = "~/test.csv"
>>> import pandas as pd
>>> test_data = pd.read_csv(test_file_path)
>>> test_data.dropna(axis=0,subset=['Price'],inplace=True)
>>> y = test_data.Price
>>> X = test_data.drop(['Price'],axis=1)
>>> X = X.select_dtypes(exclude=['object'])
>>> from sklearn.preprocessing import Imputer
>>> test_imputer = Imputer()
>>> X = test_imputer.fit_transform(X)
>>> # for now sklearn supports PDP only for GradientBoostingRegressor
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> test_model = GradientBoostingRegressor()
>>> test_model.fit(X,y)
>>> from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence
>>> test_plots = plot_partial_dependence(gbrt=test_model,X=X,features=[0,1,2],feature_names=['Rooms', 'Floors', 'Area'],grid_resolution=10)
Options described:

  • gbrt - which GBR model to use
  • X - which dataset used to train model specified in gbrt option
  • features - indexes of the columns of the dataset specified in the X option that will be used for plotting (each index/column creates one PDP)
  • feature_names - how to name columns selected in features option
  • grid_resolution - number of values to plot on x axis
Negative values mean that, for that value of the variable, the predicted Price is below the average Price. 
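To actually look at the plots you can save them to a file; a small follow-up sketch (assuming matplotlib is installed and that this older plot_partial_dependence API returns a figure plus a list of axes):

>>> # unpack the figure returned above and write it to a PNG (assumption: old-style (fig, axs) return value)
>>> fig, axs = test_plots
>>> fig.savefig("test_pdp.png")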

Monday, August 27, 2018

XGBoost 1.

XGBoost stands for eXtreme Gradient Boosting and implements Gradient Boosted Decision Trees. 
Gradient Boosting is an ML technique used for regression and classification problems; it produces a prediction model in the form of an ensemble of weak prediction models - decision trees:

  • A weak model means the model's predictions are only slightly better than guessing
  • After building each weak model we:
    • calculate the errors of the current ensemble
    • build a new model that predicts those errors
    • add that model to the ensemble
  • To make a prediction - add up the predictions from all models in the ensemble (see the toy sketch after this list)
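A toy sketch of this cycle for regression, only to illustrate the idea above (this is not how XGBoost is implemented internally; toy_gradient_boosting and toy_predict are made-up names, and X, y are assumed to be numpy arrays):

>>> import numpy as np
>>> from sklearn.tree import DecisionTreeRegressor
>>> def toy_gradient_boosting(X, y, n_rounds=100, learning_rate=0.1):
...     ensemble = []
...     current_predictions = np.zeros(len(y))
...     for _ in range(n_rounds):
...         errors = y - current_predictions              # calculate errors
...         weak_model = DecisionTreeRegressor(max_depth=3)
...         weak_model.fit(X, errors)                     # build model predicting errors
...         ensemble.append(weak_model)                   # add model to ensemble
...         current_predictions += learning_rate * weak_model.predict(X)
...     return ensemble
>>> # to predict: add up the (scaled) predictions from all models in the ensemble
>>> def toy_predict(ensemble, X, learning_rate=0.1):
...     return sum(learning_rate * model.predict(X) for model in ensemble)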
XGBoost models are the leading choice when working with tabular data (data without images and videos - in other words, data that fits in a Pandas DataFrame).


Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown

To install XGBoost:
pip install xgboost

Using XGBoost Regressor


>>> import pandas as pd
>>> test_file_path = "~/test.csv"
>>> test_data = pd.read_csv(test_file_path)
>>> # drop row with NaN values for Price column
>>> test_data.dropna(axis=0, subset=['Price'], inplace=True)
>>> test_data
>>> y = test_data.Price
>>> X = test_data.drop(['Price'], axis=1).select_dtypes(exclude=['object'])
>>> from sklearn.model_selection import train_test_split
>>> # split tests and get result as array, not DataFrame
>>> X_train, X_test, y_train, y_test = train_test_split(X.values,y.values,random_state=0,test_size=0.25)
>>> from sklearn.preprocessing import Imputer
>>> test_imputer = Imputer()
>>> X_train = test_imputer.fit_transform(X_train)
>>> # reuse the imputation statistics learned on the train set
>>> X_test = test_imputer.transform(X_test)
>>> from xgboost import XGBRegressor
>>> test_model = XGBRegressor()
>>> test_model.fit(X_train,y_train)
>>> predictions = test_model.predict(X_test)
>>> from sklearn.metrics import mean_absolute_error as mae
>>> print("MAE XGBR: " + str(mae(predictions,y_test)))
MAE XGBR: 10.416168212890625
>>> 

XGBoost Regressor parameters


>>> test_model
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

  • n_estimators - how many times to go through the XGBoost modelling cycle
    • too low a value causes underfitting, too high - overfitting
    • typical values are 100-1000, depending on the learning_rate
    • to find the optimal value, use the early_stopping_rounds option. It stops the iterations when the model stops improving. A single chance deterioration can stop training after 1 round, so set "early_stopping_rounds=5" to stop only after 5 consecutive rounds with deteriorating results
    • It's good to set a high n_estimators and also set early_stopping_rounds - this helps to find the optimal value (eval_set is a list of (X, y) tuple pairs used as the validation set for early stopping):
      • model = XGBRegressor(n_estimators=1000)
      • model.fit(train_X, train_y, early_stopping_rounds=5, eval_set=[(test_X,test_y)])
      • after training, re-train your model (on the entire data) with the number of estimators that was found:
        • for example, the found value is 97
        • model = XGBRegressor(n_estimators=97)
        • model.fit(X, y)
  • learning_rate - on each iteration we multiply the predictions of each component model by a small number before adding them to the ensemble. Each DT added to the ensemble then contributes less, which in practice reduces the model's tendency to overfit. So you can use a higher n_estimators value without overfitting:
    • model = XGBRegressor(n_estimators=1000,learning_rate=0.05)
    • model.fit(train_X, train_y, early_stopping_rounds=5, eval_set=[(test_X,test_y)])
  • n_jobs - on big datasets, set this equal to the number of CPU cores on your machine to use multi-threading and fit the model quicker. On small datasets this will not help (a combined sketch follows below)
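A minimal sketch combining the options above, reusing X_train/X_test/y_train/y_test from the earlier example (n_jobs=4 is just an assumed core count):

>>> tuned_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
>>> tuned_model.fit(X_train, y_train, early_stopping_rounds=5,
...                 eval_set=[(X_test, y_test)], verbose=False)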

Tuesday, August 21, 2018

Scikit-learn 4. Hands-on python scikit-learn: using categorical data (encoding, one-hot encoding, hashing).

Categorical data is data that takes only a predefined set of values. Most ML models will give you an error if you try to use categorical data without any changes. So to use categorical data we first need to encode those values as numbers. For example, if we have color names in our data, we can do:
  1. Encoding - give each color its own number: red is 1, yellow is 2, green is 3 etc. (see the sketch after this list). This is simple, but the problem is that 3 (green) is bigger than 1 (red), while it doesn't mean that 3 must carry more weight than 1 during training or prediction.
  2. One-hot encoding - we have 3 colors (red, yellow, green) in our data set's "Color" column, so we create 3 additional columns (Color_red, Color_yellow, Color_green) to store the value of each color for that row, and the original categorical column is removed. So a row with red color will have 1 in the first column, 0 in the second and 0 in the third; yellow becomes 0,1,0 and green becomes 0,0,1. This approach avoids treating one categorical value as having more weight than another.
  3. Hashing (the "hashing trick") - one-hot encoding is good, but when there is a huge number of distinct values in your data set, or the training data doesn't contain all possible values of a categorical feature, or the data changes and the categorical feature receives new values, one-hot encoding creates too many additional columns and makes predictions slow or even impossible (when new values appear at prediction time). In such situations hashing is used (not reviewed here) 
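A minimal sketch of approach 1 (simple encoding) using pandas categorical codes - one of several ways to do it; one-hot encoding with pd.get_dummies is demonstrated in the code below:

>>> import pandas as pd
>>> colors = pd.Series(['red', 'green', 'blue', 'green'])
>>> # each distinct value gets its own integer code (categories are sorted alphabetically)
>>> colors.astype('category').cat.codes
0    2
1    1
2    0
3    1
dtype: int8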

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown

Using one-hot encoding

>>> import pandas as pd
>>> test_file_path = "~/test.csv"
>>> test_data = pd.read_csv(test_file_path)
>>> test_data
>>> test_data.describe() # HouseColor is not present
>>> test_data.info() # because HouseColor type is object - non-numerical (categorical data)
>>> test_data.dtypes
>>> # create new data-set without NaN values (we'll use imputation)
>>> from sklearn.preprocessing import Imputer
>>> test_imputer = Imputer()
>>> # before imputation - fill dataset only with numerical data
>>> test_data_numerical = test_data.select_dtypes(exclude=['object'])
>>> test_data_imputed = test_imputer.fit_transform(test_data_numerical)
>>> test_data_imputed
>>> # convert imputed dataset into Pandas DataFrame
>>> test_data_imputed = pd.DataFrame(test_data_imputed)
>>> test_data_imputed
>>> test_data_imputed.columns = test_data.select_dtypes(exclude=['object']).columns
>>> test_data_imputed
>>> # add categorical data columns
>>> test_data_categorical = test_data.select_dtypes(include=['object'])
>>> test_data_imputed = test_data_imputed.join(test_data_categorical)
>>> test_data_imputed
>>> # use one-hot encoding
>>> test_data_one_hot = pd.get_dummies(test_data_imputed)
>>> test_data_one_hot
>>> # select non-categorical values
>>> test_data_wo_categoricals = test_data_imputed.select_dtypes(exclude=['object'])

Measuring dropping categoricals  vs using one-hot encoding

>>> from sklearn.model_selection import train_test_split
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.metrics import mean_absolute_error
>>> def score_dataset(dataset):
...      y = dataset.Price
...      X = dataset.drop(['Price'], axis=1)
...      y_train, y_test = train_test_split(y,random_state=0,train_size=0.7,test_size=0.3)
...      X_train, X_test = train_test_split(X,random_state=0,train_size=0.7,test_size=0.3)
...      model = RandomForestRegressor()
...      model.fit(X_train, y_train)
...      predictions = model.predict(X_test)
...      return mean_absolute_error(y_test, predictions)
>>> print  "MAE when not using categoricals"
>>> score_dataset(test_data_wo_categoricals)
100.0
>>> print  "MAE when using categoricals with one-hot encoding"
>>> score_dataset(test_data_one_hot)
70.0
>>>

Friday, August 17, 2018

Scikit-learn 3. Hands-on python scikit-learn: handling missing values: dropping, imputing, imputing with state preservation.

Missing Values overview

There are many reasons why data can have missing values (a survey participant didn't answer all questions, the database has missing values, etc.). Most ML libraries will give you an error if your data has missing values: for example, scikit-learn estimators (an estimator - a model, classifier, regressor etc. - is an object that learns to estimate values from data) assume that all values in a data set are numerical and that all of them have and hold meaning.

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95

>>> # read data
>>> test_file_path = "~/test.csv"
>>> import pandas as pd
>>> test_data = pd.read_csv(test_file_path)
>>> # check if data has missing values
>>> test_data.isnull()
>>> # count missing values per column
>>> test_missing_count = test_data.isnull().sum()
>>> test_missing_count
>>> test_missing_count[test_missing_count > 0]

Dealing with missing values

There are several approaches to missing data (will use all of them and then will compare prediction results):
  1. Delete columns or rows with missing data:
    1. drop rows with missing data:
      1. >>> test_data_dropna_0 = test_data.dropna(axis=0)
      2. >>> test_data_dropna_0
    2. drop columns with missing data:
      1. >>> test_data_dropna_1 = test_data.dropna(axis=1)
      2. >>> test_data_dropna_1
    3. if you have both train and test data sets, then the columns must be deleted from both sets:
      1. >>> columns_with_missing = [col for col in test_data.columns if test_data[col].isnull().any()]
      2. any() - returns whether any element is True over the requested axis. Literally we ask whether any() element of the specified column isnull()
      3. >>> train_data_dropna_1 = train_data.drop(columns_with_missing, axis=1)
      4. >>> test_data_dropna_1 = test_data.drop(columns_with_missing, axis=1)
    4. Dropping a column is only a good approach when most of its values are missing
  2. Impute (in statistics, imputation is the process of replacing missing data with substituted values):
    1. pandas.DataFrame.fillna:
      1. >>> test_data_fillna = test_data.fillna(0)
      2. fillna fills NaN "cells" with 0 (you can use any value you want, see help(pd.DataFrame.fillna)) 
      3. >>> test_data_fillna
    2. sklearn.preprocessing.Imputer:
      1. >>> from sklearn.preprocessing import Imputer
      2. >>> test_imputer = Imputer()
      3. >>> test_data_imputed = test_imputer.fit_transform(test_data) 
      4. By default, missing (NaN) values are replaced by the mean along the axis; the default axis is 0 (impute along columns - run test_data.describe() to see the mean values)
      5. >>> test_data_imputed
      6. After imputation array is created, we'll convert this array to the pandas DataFrame:
        1. >>> test_data_imputed = pd.DataFrame(test_data_imputed)
        2. >>> test_data_imputed.columns = test_data.columns
        3. >>> test_data_imputed
  3. Extended Imputation - before imputation, we'll create new columns indicating which values were missing:
    1. >>> test_data_ex_imputed = test_data.copy()
    2. >>> columns_with_missing = [col for col in test_data.columns if test_data[col].isnull().any()]
    3. >>> columns_with_missing
    4. Create column corresponding to each column with missing values and fill that new column with Boolean values (was missing in original data set or wasn't missing):
      1. >>> for col in columns_with_missing:
        ...  test_data_ex_imputed[col + '_was_missing'] = test_data_ex_imputed[col].isnull()
      2. >>> test_data_ex_imputed
      3. >>> test_data_ex_imputed_columns = test_data_ex_imputed.columns
    5. impute:
      1. >>> test_imputer = Imputer()
      2. >>> test_data_ex_imputed = test_imputer.fit_transform(test_data_ex_imputed)
    6. Convert to DataFrame:
      1. >>> test_data_ex_imputed = pd.DataFrame(test_data_ex_imputed)
      2. >>> test_data_ex_imputed
      3. >>> test_data_ex_imputed.columns = test_data_ex_imputed_columns
      4. Now the previously missing data in our "_was_missing" columns is 1, and previously present data is 0:
        1. >>> test_data_ex_imputed

Checking which method is the best

>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.metrics import mean_absolute_error
>>> from sklearn.model_selection import train_test_split
>>> def score_dataset(dataset):
...       y = dataset.Price
...       X = dataset.drop(['Price'], axis=1)
...       train_y, test_y = train_test_split(y,random_state=0,train_size=0.7,test_size=0.3)
...       train_X, test_X = train_test_split(X,random_state=0,train_size=0.7,test_size=0.3)
...       model = RandomForestRegressor()
...       model.fit(train_X, train_y)
...       predictions = model.predict(test_X)
...       return mean_absolute_error(test_y, predictions)
>>> print "MAE from dropping rows with missing values:"
>>> score_dataset(test_data_dropna_0)
70.0
>>> print "MAE from dropping columns with missing values:"
>>> score_dataset(test_data_dropna_1)
AttributeError: 'DataFrame' object has no attribute 'Price'
>>> #This is because 'Price' column has missing values and we dropped it, so nothing left to train and predict
>>> print "MAE from pandas fillna imputation:"
>>> score_dataset(test_data_fillna)
80.0
>>> print "MAE from sklearn Imputer() imputation:"
>>> score_dataset(test_data_imputed)
97.61904761904763
>>> print "MAE from sklearn Imputer() extended imputation:"
>>> score_dataset(test_data_ex_imputed)
67.6190476190476

As is common, imputation gives a better result than dropping missing values, and extended imputation may improve on simple imputation or may give no improvement at all.



Tuesday, August 14, 2018

Scikit-learn 2. Hands-on python scikit-learn: RandomForest.

To read about what Random Forest is, go to it-tuff.blogspot.com/machine-learning-1

If you can't remember where the scikit-learn models, metrics etc. are located:

  1. locate sklearn | grep utils | cut -d"/" -f 1-6 | uniq
  2. cd to the found directory:
    1. cd /usr/lib64/python2.7/site-packages/sklearn
  3. to view packages:
    1. ll | grep ^d
>>> import pandas as pd
>>> test_file_path = "~/test.csv"
>>> test_data = pd.read_csv(test_file_path)
>>> # check if data has NaN values
>>> test_data.isnull()
>>> test_data = test_data.dropna(axis=0)
>>> test_data.columns.values
>>> test_data_features = ['Rooms','Floors','Area']
>>> X = test_data[test_data_features]
>>> y = test_data.Price
>>> from sklearn.model_selection import train_test_split
>>> train_X, val_X = train_test_split(X, random_state=0)
>>> train_y, val_y = train_test_split(y, random_state=0)
>>> from sklearn.ensemble import RandomForestRegressor
>>> # rfr stands for RandomForestRegressor (by default RFR creates 10 trees)
>>> test_rfr_model = RandomForestRegressor(random_state=1)
>>> test_rfr_model.fit(train_X, train_y)
>>> test_rfr_preds = test_rfr_model.predict(val_X)
>>> from sklearn.metrics import mean_absolute_error
>>> mean_absolute_error(val_y, test_rfr_preds)
40.0

As you see, even with default values Random Forest gives better results (in it-tuff.blogspot.com/scikit-learn-1 the MAE of the CART DT was 150.0).

Monday, August 13, 2018

Scikit-learn 1. Hands-on python scikit-learn: intro, using Decision Tree Regression model, MAE, overfitting, cross-validation, underfitting.

Scikit-learn is Python machine-learning library

To install it:
pip install scipy
pip install sklearn

To use Scikit-learn, we must go through several steps:

  1. Prepare data - choose appropriate data to use in the model and prediction 
  2. Define - choose an appropriate model (decision tree, random forest etc.)
  3. Fit - capture patterns from the provided data 
  4. Predict 
  5. Evaluate - determine how accurate the predictions are

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95

Prepare Data

To learn how to use pandas, go to it-tuff.blogspot.com/pandas-1

python
>>> test_file_path = "~/test.csv"
>>> import pandas as pd
>>> test_data = pd.read_csv(test_file_path)
>>> # our data has NaN values; the simplest approach is to remove all rows with NaN data
>>> test_data = test_data.dropna(axis=0)
>>> # we need to select prediction target, by convention called "y"
>>> test_data.columns.values
>>> y = test_data.Price
>>> # we need to choose "prediction input" - features -  columns (except prediction target) which will be inputted in our model and used to make predictions
>>> # by convention features called "X"
>>> test_data_features = ['Rooms','Floors','Area']
>>> X = test_data[test_data_features]
>>> # verify the data (maybe something weird is in there)
>>> X.head()
>>> X.describe()
>>> X.info()

Define

Our prediction target is price; price can theoretically be any real number, so we'll use the Decision Tree Regression model (to read more about Decision Trees, go to it-tuff.blogspot.com/machine-learning-1).

>>> from sklearn.tree import DecisionTreeRegressor
>>> # make our test_model to be of DecisionTreeRegressor class
>>> # the DT heuristic algorithm makes random choices at each leaf (node), so the result can differ between runs; to get the same result on every run, a random_state seed must be set
>>> test_model = DecisionTreeRegressor(random_state=1)

Fit

>>> # Build a decision tree regressor from the training set (X,y) - in this step we make our Decision Tree to find patterns in the training set
>>> test_model.fit(X,y)

Predict

>>> # First we'll make predictions for our training set/data to check how good the model is
>>> # Making prediction for the features
>>> X
>>> # Real values are
>>> y
>>> # Model predictions are
>>> test_model.predict(X)
array([300., 400., 400., 200., 700.])
>>> 

Evaluate

If you want, you can view your decision tree model:

First export model in DOT format
>>> from sklearn.tree import export_graphviz
>>> export_graphviz(test_model,out_file="test_model.dot",feature_names=test_data_features)

Install Graphviz:
yum install graphviz
dot -Tpng test_model.dot -o test_model.png

Description of parameters in PNG file:
  1. samples - how many objects are in a leaf waiting for prediction (the first leaf has samples=5 because all 5 flat prices are waiting to be predicted)
  2. mse - several functions are available to measure the quality of a split; mse (mean squared error) is the default - it is always non-negative, and values closer to zero are better
  3. value - the predicted price
To evaluate our predictions, we can use many metrics; here we'll use MAE (Mean Absolute Error). To calculate MAE (see the small worked example after this list):
  1. Find the Absolute Accuracy Error - the absolute difference in price: 
    1. |actual_price - predicted_price| 
    2. This is done for every actual price / prediction pair in the training set
  2. Find the mean of all errors (sum up the absolute accuracy errors and divide by the count of errors)
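A tiny worked example of the MAE formula with made-up actual and predicted values:

>>> actual = [300, 400, 200]
>>> predicted = [320, 380, 230]
>>> # absolute errors are 20, 20 and 30; their mean is 70 / 3
>>> sum(abs(a - p) for a, p in zip(actual, predicted)) / float(len(actual))
23.333333333333332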
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = y
>>> y_predicted = test_model.predict(X)
>>> mean_absolute_error(y_true,y_predicted)
0.0

This measure is called an "in-sample" measure, because we used the same sample for both training and validation. That is bad because, for example, if all apartments with red door mats (if this parameter were in the data) in our sample happened to be expensive, the "door mat color" parameter would be taken into account when predicting apartment rent price, which is incorrect (door mat color has no relation to the rent price). 
In-sample prediction and validation will show that our model is ideal or close to ideal. This is called overfitting - a model matches the training data almost perfectly but does poorly on new data. It happens because each subsequent decision tree split has fewer and fewer corresponding rows (apartments in our case). Leaves with only a few apartments make very accurate predictions close to the actual values, and this makes the model perfect for the training data and unreliable for new data: every parameter in the training data is treated as a perfect indicator of the predicted value, which is not the case.

On the contrary, if we make only a few splits (a shallow tree), our model will not capture important patterns in the data, so it performs poorly even on the training data; this is called underfitting.

So to validate predictions correctly, we need to use different samples for training and validation. The simplest way to do that is to split the data into training and validation parts (hold-out validation, the simplest form of cross-validation):
>>> from sklearn.model_selection import train_test_split
>>> # this function splits the sample data into training and validating portions (by default 25% of the sample goes to validation) (mnemonic - this is TRAIN and TEST SPLIT)
>>> train_X, val_X = train_test_split(X, random_state=0)
>>> train_y, val_y = train_test_split(y, random_state=0)
>>> # now we'll use this split data to make training, prediction and validation
>>> test_model = DecisionTreeRegressor(random_state=1)
>>> test_model.fit(train_X,train_y)
>>> y_predicted = test_model.predict(val_X)
>>> mean_absolute_error(val_y,y_predicted)
150.0

As you see, MAE for the in-sample data was 0.0 and for the out-of-sample data it is 150.0. In our data the average price is 400, so the error on new data (data not used during fitting) is about 37%.

So we need to find a compromise between overfitting and underfitting (the lowest MAE on validation data). To do that we can experiment with the DecisionTreeRegressor max_leaf_nodes parameter (the maximum number of leaves in our model):
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):   
  model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)  
  model.fit(train_X, train_y)  
  preds_val = model.predict(val_X)  
  mae = mean_absolute_error(val_y, preds_val)  
  return(mae)  

for max_leaf_nodes in [2, 3, 4, 5]:  
  my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)  
  print("max_leaf_nodes: %d \t MAE: %d" %(max_leaf_nodes, my_mae))  

Our data will show the same result for all max_leaf_nodes values because our test data set is too small, but I think you understand the importance of the above code (get_mae and the for loop).
After finding the best value for max_leaf_nodes, train your model on the entire data (in-sample):
>>> test_model.fit(X,y)

Thursday, August 9, 2018

Machine Learning 1. Decision Tree (also Classification Tree or Regression Tree).

Decision Tree (DT) - is a decision support tool which consists of "leaves" (also called nodes) and "branches". Two branches form a "split"; each branch of a split ends in a leaf, and each leaf holds one of the possible values for that split.
As you can see, a DT uses a heuristic algorithm, which means the algorithm uses a practical method that is not guaranteed to be accurate or optimal but is sufficient to solve the problem.
Decision Tree Classification - helps to find the class of an object using the available characteristics of that object, i.e. we know which classes exist and we know the parameters used to classify objects. For example: 
  1. to predict whether an incoming e-mail is spam or not, we use a DT classification model, because we have multiple characteristics (object parameters), want to learn the class of the object, and have two classes - spam and not-spam. 
  2. to predict which number is on a sign (for simplicity assume each sign can show only the numbers 0 to 9), we'll use a DT classification model, because we have the characteristics of each object (a pixel matrix, each pixel having its own position and color). So we try to predict which of the 10 classes (the numbers 0 to 9) our sign belongs to.


Decision Tree Regression - helps to find a parameter of an object using known object characteristics. In contrast to classification, the predicted value is not from a finite set of classes but from the set of real numbers. For example:
  1. to predict a house price we have a set of parameters, such as house size, floor count, placement etc. Our prediction is not a predefined class but a number, so we'll use a DT regression model. 
There are several methods (algorithms) used to build ("grow") a decision tree:
  1. C&RT (also CART - Classification and Regression Tree):
    1. this method builds only binary trees (each split has exactly two branches)
    2. on each iteration, for the selected parameter (set):
      1. the root of the DT is the set of all members of the model (all homes)
      2. select the rule which forms the leaf (e.g. - the home has one room or more than one room)
      3. we get a right branch (true - matches the rule - "has one room") and a left branch (false - doesn't match the rule - "has more than one room")
    3. iterations continue until only one branch remains in a split or until a given depth is reached
    4. this algorithm is good for initial data analysis
  2. Random Forest - the built forest consists of CART trees; training uses Bootstrap Aggregation, or bagging (a combination of learning models is put into the "bag", which improves the overall result). In statistics, a bootstrap is any test or metric that relies on random sampling with replacement (an element may appear multiple times in a sample - this helps to estimate each element's weight). Random Forest can be used both for classification and regression problems. The logic of Random Forest:
    1. build many CART trees, using a different set of parameters for each DT
    2. choose the most often predicted value
    3. Random Forest also automatically measures the importance of the parameters (assigns a score to each parameter); the sum of all scores equals 1 (see the sketch after this list)
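A minimal sketch of reading those importance scores from a fitted sklearn RandomForestRegressor, using the Rooms/Floors/Area rows from the test.csv used elsewhere in this blog:

>>> import numpy as np
>>> from sklearn.ensemble import RandomForestRegressor
>>> X = np.array([[1, 1, 30], [1, 1, 50], [3, 1, 65], [2, 1, 45], [5, 3, 120]])  # Rooms, Floors, Area
>>> y = np.array([300, 400, 400, 200, 700])  # Price
>>> model = RandomForestRegressor(random_state=1)
>>> model.fit(X, y)
>>> # one score per feature, all scores sum to 1
>>> model.feature_importances_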

Wednesday, August 8, 2018

Pandas 1. Hands-on python pandas intro.

To install pandas:
pip install pandas

[admin@localhost ~]$ cat >  test.csv
Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95

>>> import pandas as pd
>>> test_file_path = "~/test.csv"
>>> test_data = pd.read_csv(test_file_path)
>>> type(test_data)
<class 'pandas.core.frame.DataFrame'>
>>> help(pd.DataFrame)
class DataFrame(pandas.core.generic.NDFrame)
 |  Two-dimensional size-mutable, potentially heterogeneous tabular data
 |  structure with labeled axes (rows and columns). Arithmetic operations
 |  align on both row and column labels. Can be thought of as a dict-like
 |  container for Series objects. The primary pandas data structure.
>>> test_data
<output omitted - NaN means missing value (Not a Number)>
>>> test_data.describe()
<output omitted - to read about mean, std, percentile go to it-tuff.blogspot.com/math-1 >
>>>  test_data.columns
Index([u'Rooms', u'Price', u'Floors', u'Area'], dtype='object')
>>> test_data.columns.values
array(['Rooms', 'Price', 'Floors', 'Area'], dtype=object)
>>> help(test_data.dropna)
dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
    Remove missing values.
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.
        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.
>>> test_data = test_data.dropna(axis=0)
>>> test_data
<output omitted - rows with NaN values are removed>
>>> test_data.describe()
<output omitted>
>>> test_data.columns.values
array(['Rooms', 'Price', 'Floors', 'Area'], dtype=object)
>>> price = test_data.Price
>>> type(price)
<class 'pandas.core.series.Series'>
>>> help(pd.Series)
class Series(pandas.core.base.IndexOpsMixin, pandas.core.generic.NDFrame)
 |  One-dimensional ndarray with axis labels (including time series).
>>> price
<output omitted>
>>> price.describe()
<output omitted - only copy of the Price column is shown>
>>> test_data.columns.values
array(['Rooms', 'Price', 'Floors', 'Area'], dtype=object)
>>> test_data_features=['Rooms','Price']
>>> features = test_data[test_data_features]
>>> type(features)
<class 'pandas.core.frame.DataFrame'>
>>> features
<output omitted - only copies of Rooms and Price columns are shown>
>>> features.describe()
<output omitted>
>>> features.head(n=3)
<output omitted - only first 3 rows are shown>
>>> help(features.head)
head(self, n=5) method of pandas.core.frame.DataFrame instance
    Return the first `n` rows.
>>> test_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 4 columns):
Rooms     5 non-null float64
Price     5 non-null float64
Floors    5 non-null int64
Area      5 non-null int64
dtypes: float64(2), int64(2)
memory usage: 200.0 bytes
>>> 





Math 1.  Mean, Sigma Notation, Standard Deviation and Variance, Percentile.

Mean

Mean is the average - to find it, just sum the numbers and then divide by how many numbers you summed.
We want to find the average number of days per month in a leap year:
First we find the sum (add the numbers):
31 + 29 + 31 + 30 + 31 + 30 + 31 + 31 + 30 + 31 + 30 + 31 = 366
We know that we added 12 numbers, so now divide the sum by the count of numbers:
366 / 12 = 30.5
So the average month in a leap year has 30.5 days (check: 30.5 * 12 = 366).

Find the mean lowest temperature in Celsius in 2017 in Ganja city, Azerbaijan:
Add up the lowest monthly temperatures:
-2 - 1 + 2 + 7 + 12 + 17 + 20 + 19 + 15 + 10 + 4 + 0 = 103
Mean:
103 / 12 = 8.583 Celsius (check: 8.583 * 12 = 102.996 ≈ 103)

As you see, we found the mean of numbers of the same nature (days in a month in the first example and monthly temperature in the second).

Sigma Notation

Σ this is sigma and it means - sum up what goes after sigma:

Σn - sum up all the n's.
OK, and where are the n values? Here they are:

   5
   Σ n
  n=1

This means: sum the n's, where n takes the values from n=1 to n=5, so:

   5
   Σ n = 1 + 2 + 3 + 4 + 5 = 15
  n=1

Standard Deviation

Standard Deviation (STD) is a measure of the spread between numbers (how far the numbers are from the mean). 
To find the STD (e.g. we have 5 flats in our building, we know the number of humans living in each flat (7,3,1,5,6) and we want to find the STD; a numpy version follows after the list):
  1. find mean: (7 + 3 + 1 + 5 + 6) / 5 = 4.4
  2. find the differences: for each number - subtract the mean. This shows how far the number is from the mean and also whether it is lower or higher than the mean:
    • 7 - 4.4 = 2.6
    • 3 - 4.4 =  -1.4
    • 1 - 4.4 = -3.4
    • 5 - 4.4 = 0.6
    • 6 - 4.4 = 1.6
  3. find the squared differences: square each difference. Without this step, equal negative and positive values (if any) would cancel each other out and the overall measure would be wrong:
    • 2.6 * 2.6 = 6.76
    • -1.4 * -1.4 = 1.96
    • -3.4 * -3.4 = 11.56
    • 0.6 * 0.6 = 0.36
    • 1.6 * 1.6 = 2.56
  4. find variance (mean of the squared differences):
    1. (6.76 + 1.96 + 11.56 + 0.36 + 2.56)/5 = 4.64
  5. find standard deviation: square root of variance:
    1. √4.64 =2.154065923 ~ 2.1541
  6. STD gives us a measure to decide which number is normal (between mean-STD and mean+STD), which is low (lower than mean-STD) and which is high (higher than mean+STD):
    • mean + STD = 4.4 + 2.1541 = 6.5541
    • mean - STD =  4.4 - 2.1541  = 2.2459
    • 2.2459 < 6.5541 < 7   => 7 is higher than normal for that building
    • 2.2459 < 3 < 6.5541   => 3 is normal for that building
    • 1 < 2.2459 < 6.5541   => 1 is lower than normal for that building
    • 2.2459 < 5 < 6.5541   => 5 is normal for that building
    • 2.2459 < 6 < 6.5541   => 6 is normal for that building
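The same mean, variance and standard deviation with numpy (a sketch; note that numpy's default std() is the population standard deviation used above):

>>> import numpy as np
>>> flats = np.array([7, 3, 1, 5, 6])
>>> flats.mean()   # 4.4
>>> flats.var()    # variance, ~4.64
>>> flats.std()    # standard deviation, ~2.1541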

Percentile

Percentile - indicates the value below which a given percentage of the data falls, where the data itself is ordered from lowest to highest. So if the 95th percentile for men's height is 187 cm (a statistical measure), this means that 95% of men are shorter than 187 cm and 5% of men are taller than 187 cm.

To find percentiles and the corresponding values using the nearest-rank method (a numpy sketch follows after the list):

  1. order the list of values, e.g. the list of the number of humans living in 5 flats (7,3,1,5,6): 
    • Ordered list: 1, 3, 5, 6, 7
    • Number of values N = 5
  2. find the minimum, it will be the 1st percentile: the 1st percentile is 1
  3. find the maximum, it will be the 100th percentile: the 100th percentile is 7
  4. to find the n-th percentile: n / 100 * N, and if the result is not an integer, round it up to the next whole number:
    • 25th = 25 / 100 * 5 = 1.25 ~ 2 
      • means 2nd number in list
      • so 25th percentile value is 3
    • 50th = 50 / 100 * 5 = 2.5 ~ 3 
      • means 3rd number in list
      • so 50th percentile value is 5
    • 75th = 75 / 100 * 5 = 3.75 ~ 4
      • means 4th number in list
      • so 75th percentile value is 6
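
And the same percentiles with numpy (a sketch; np.percentile interpolates by default instead of using the nearest-rank method, but for this list the results agree):

>>> import numpy as np
>>> flats = np.array([7, 3, 1, 5, 6])
>>> np.percentile(flats, 25)   # 3.0
>>> np.percentile(flats, 50)   # 5.0
>>> np.percentile(flats, 75)   # 6.0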