Friday, August 17, 2018

Scikit-learn 3. Hands-on Python scikit-learn: handling missing values: dropping, imputing, imputing with state preservation.

Missing Values overview

There are many reasons why data can have missing values (a survey participant didn't answer all questions, a database has missing records, etc.). Most ML libraries will give you an error if your data contains missing values: for example, scikit-learn estimators (an estimator is any object that learns from data; it can be a model, classifier, regressor, etc.) assume that all values in a data set are numerical, and that all of them have and hold meaning.

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95

>>> # read data
>>> test_file_path = "~/test.csv"
>>> import pandas as pd
>>> test_data = pd.read_csv(test_file_path)
>>> # check if data has missing values
>>> test_data.isnull()
>>> # count missing values per column
>>> test_missing_count = test_data.isnull().sum()
>>> test_missing_count
>>> test_missing_count[test_missing_count > 0]
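The snippet above reads test.csv from disk; as a self-contained sketch, the same check can be run with the data built inline (np.nan standing in for the empty cells):

```python
import numpy as np
import pandas as pd

# Same rows as test.csv, with np.nan for the empty cells
test_data = pd.DataFrame({
    "Rooms":  [1, 1, 3, 2, 5, np.nan, np.nan, 4],
    "Price":  [300, 400, 400, 200, 700, 400, 300, np.nan],
    "Floors": [1, 1, 1, 1, 3, 2, 1, 2],
    "Area":   [30, 50, 65, 45, 120, 70, 40, 95],
})

# isnull() marks each cell True/False; sum() counts the Trues per column
test_missing_count = test_data.isnull().sum()

# keep only the columns that actually have missing values
print(test_missing_count[test_missing_count > 0])  # Rooms: 2, Price: 1
```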

Dealing with missing values

There are several approaches to missing data (we will use all of them and then compare the prediction results):
  1. Delete columns or rows with missing data:
    1. drop rows with missing data:
      1. >>> test_data_dropna_0 = test_data.dropna(axis=0)
      2. >>> test_data_dropna_0
    2. drop columns with missing data:
      1. >>> test_data_dropna_1 = test_data.dropna(axis=1)
      2. >>> test_data_dropna_1
    3. if you have both train and test data sets, then the columns must be deleted from both sets:
      1. >>> columns_with_missing = [col for col in test_data.columns if test_data[col].isnull().any()]
      2. any() - returns whether any element is True over the requested axis. Literally, we ask whether any() element of the specified column isnull()
      3. >>> train_data_dropna_1 = train_data.drop(columns_with_missing,axis=1)
      4. >>> test_data_dropna_1 = test_data.drop(columns_with_missing,axis=1)
      5. note that drop() (not dropna()) is used here, because we are deleting columns by name
    4. Dropping missing values is only a good option when the data in those columns is mostly missing
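The two dropna directions above can be compared side by side; a minimal sketch with the data built inline:

```python
import numpy as np
import pandas as pd

# Same rows as test.csv, with np.nan for the empty cells
test_data = pd.DataFrame({
    "Rooms":  [1, 1, 3, 2, 5, np.nan, np.nan, 4],
    "Price":  [300, 400, 400, 200, 700, 400, 300, np.nan],
    "Floors": [1, 1, 1, 1, 3, 2, 1, 2],
    "Area":   [30, 50, 65, 45, 120, 70, 40, 95],
})

# axis=0 drops the 3 rows that contain a NaN, keeping all 4 columns
dropped_rows = test_data.dropna(axis=0)
# axis=1 drops the 2 columns that contain a NaN, keeping all 8 rows
dropped_cols = test_data.dropna(axis=1)
print(dropped_rows.shape, dropped_cols.shape)  # (5, 4) (8, 2)
```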
  2. Impute (in statistics, imputation is the process of replacing missing data with substituted values):
    1. pandas.DataFrame.fillna:
      1. >>> test_data_fillna = test_data.fillna(0)
      2. fillna fills NaN "cells" with 0 (you can use any value you want, see help(pd.DataFrame.fillna))
      3. >>> test_data_fillna
    2. sklearn.preprocessing.Imputer:
      1. >>> from sklearn.preprocessing import Imputer
      2. >>> test_imputer = Imputer()
      3. >>> test_data_imputed = test_imputer.fit_transform(test_data) 
      4. By default, missing (NaN) values are replaced by the mean along the axis; the default axis is 0 (impute along columns; run test_data.describe() to see the mean values)
      5. >>> test_data_imputed
      6. Imputation produces a NumPy array, so we'll convert it back to a pandas DataFrame:
        1. >>> test_data_imputed = pd.DataFrame(test_data_imputed)
        2. >>> test_data_imputed.columns = test_data.columns
        3. >>> test_data_imputed
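Note: in scikit-learn 0.20 and later, Imputer was deprecated and then removed in favor of sklearn.impute.SimpleImputer. A sketch of the same mean imputation with the newer class, on the data built inline:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same rows as test.csv, with np.nan for the empty cells
test_data = pd.DataFrame({
    "Rooms":  [1, 1, 3, 2, 5, np.nan, np.nan, 4],
    "Price":  [300, 400, 400, 200, 700, 400, 300, np.nan],
    "Floors": [1, 1, 1, 1, 3, 2, 1, 2],
    "Area":   [30, 50, 65, 45, 120, 70, 40, 95],
})

# SimpleImputer fills NaN with the column mean by default (strategy="mean")
imputer = SimpleImputer()
imputed = pd.DataFrame(imputer.fit_transform(test_data),
                       columns=test_data.columns)
print(imputed)
```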
  3. Extended Imputation - before imputing, we'll create new columns indicating which values were imputed:
    1. >>> test_data_ex_imputed = test_data.copy()
    2. >>> columns_with_missing = [col for col in test_data.columns if test_data[col].isnull().any()]
    3. >>> columns_with_missing
    4. Create a column corresponding to each column with missing values and fill that new column with Boolean values (whether the value was missing in the original data set or not):
      1. >>> for col in columns_with_missing:
        ...  test_data_ex_imputed[col + '_was_missing'] = test_data_ex_imputed[col].isnull()
      2. >>> test_data_ex_imputed
      3. >>> test_data_ex_imputed_columns = test_data_ex_imputed.columns
    5. impute:
      1. >>> test_imputer = Imputer()
      2. >>> test_data_ex_imputed = test_imputer.fit_transform(test_data_ex_imputed)
    6. Convert to DataFrame:
      1. >>> test_data_ex_imputed = pd.DataFrame(test_data_ex_imputed)
      2. >>> test_data_ex_imputed
      3. >>> test_data_ex_imputed.columns = test_data_ex_imputed_columns
      4. Now previously missing data in our "_was_missing" columns is 1, and previously present data is 0 (the imputer converts the Booleans to floats):
        1. >>> test_data_ex_imputed
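The manual "_was_missing" columns above can also be produced directly by newer scikit-learn versions: SimpleImputer(add_indicator=True) appends one 0/1 indicator column per feature that had missing values. A sketch, with the data built inline:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same rows as test.csv, with np.nan for the empty cells
test_data = pd.DataFrame({
    "Rooms":  [1, 1, 3, 2, 5, np.nan, np.nan, 4],
    "Price":  [300, 400, 400, 200, 700, 400, 300, np.nan],
    "Floors": [1, 1, 1, 1, 3, 2, 1, 2],
    "Area":   [30, 50, 65, 45, 120, 70, 40, 95],
})

# add_indicator=True appends the missingness mask after the imputed features:
# 4 imputed columns + 2 indicator columns (for Rooms and Price)
imputer = SimpleImputer(add_indicator=True)
ex_imputed = pd.DataFrame(imputer.fit_transform(test_data))
print(ex_imputed.shape)  # (8, 6)
```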

Checking which method is the best

>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.metrics import mean_absolute_error
>>> from sklearn.model_selection import train_test_split
>>> def score_dataset(dataset):
...       y = dataset.Price
...       X = dataset.drop(['Price'], axis=1)
...       train_X, test_X, train_y, test_y = train_test_split(X,y,random_state=0,train_size=0.7,test_size=0.3)
...       model = RandomForestRegressor()
...       model.fit(train_X, train_y)
...       predictions = model.predict(test_X)
...       return mean_absolute_error(test_y, predictions)
>>> print("MAE from dropping rows with missing values:")
>>> score_dataset(test_data_dropna_0)
70.0
>>> print("MAE from dropping columns with missing values:")
>>> score_dataset(test_data_dropna_1)
AttributeError: 'DataFrame' object has no attribute 'Price'
>>> # This is because the 'Price' column has missing values and we dropped it, so there is nothing left to train on and predict
>>> print("MAE from pandas fillna imputation:")
>>> score_dataset(test_data_fillna)
80.0
>>> print("MAE from sklearn Imputer() imputation:")
>>> score_dataset(test_data_imputed)
97.61904761904763
>>> print("MAE from sklearn Imputer() extended imputation:")
>>> score_dataset(test_data_ex_imputed)
67.6190476190476

As is common, imputation gives a better result than dropping missing values, and extended imputation may give a better result than simple imputation, or no improvement at all.
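On a current scikit-learn, the comparison can be reproduced end to end with SimpleImputer replacing the removed Imputer. The exact MAE values will differ from the ones above, since RandomForestRegressor is randomized and the toy data set is tiny; fixing random_state at least makes a single run repeatable:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Same rows as test.csv, with np.nan for the empty cells
test_data = pd.DataFrame({
    "Rooms":  [1, 1, 3, 2, 5, np.nan, np.nan, 4],
    "Price":  [300, 400, 400, 200, 700, 400, 300, np.nan],
    "Floors": [1, 1, 1, 1, 3, 2, 1, 2],
    "Area":   [30, 50, 65, 45, 120, 70, 40, 95],
})

def score_dataset(dataset):
    """Split into train/test, fit a random forest, return MAE on the test part."""
    y = dataset["Price"]
    X = dataset.drop(["Price"], axis=1)
    train_X, test_X, train_y, test_y = train_test_split(
        X, y, random_state=0, train_size=0.7, test_size=0.3)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(test_y, model.predict(test_X))

# mean imputation, then score
imputed = pd.DataFrame(SimpleImputer().fit_transform(test_data),
                       columns=test_data.columns)
print("MAE from mean imputation:", score_dataset(imputed))
```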


