Friday, August 17, 2018

Scikit-learn 3. Hands-on Python scikit-learn: handling missing values: dropping, imputing, imputing with state preservation.

Missing Values overview

There are many reasons why data can have missing values (a survey participant didn't answer all questions, a database has missing records, etc.). Most ML libraries will give you an error if your data contains missing values: for example, scikit-learn estimators (an estimator is any object that learns from data; it can be a model, classifier, regressor, etc.) assume that all values in a data set are numerical, and that all of them have and hold meaning.

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95

>>> # read data
>>> test_file_path = "~/test.csv"
>>> import pandas as pd
>>> test_data = pd.read_csv(test_file_path)
>>> # check if data has missing values
>>> test_data.isnull()
>>> # count missing values per column
>>> test_missing_count = test_data.isnull().sum()
>>> test_missing_count
>>> test_missing_count[test_missing_count > 0]
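The snippet above reads test.csv from disk; as a self-contained sketch, the same check can be run with the data built inline (np.nan standing in for the empty cells):

```python
import numpy as np
import pandas as pd

# Same rows as test.csv, with np.nan for the empty cells
test_data = pd.DataFrame({
    "Rooms":  [1, 1, 3, 2, 5, np.nan, np.nan, 4],
    "Price":  [300, 400, 400, 200, 700, 400, 300, np.nan],
    "Floors": [1, 1, 1, 1, 3, 2, 1, 2],
    "Area":   [30, 50, 65, 45, 120, 70, 40, 95],
})

# isnull() marks each cell True/False; sum() counts the Trues per column
test_missing_count = test_data.isnull().sum()

# keep only the columns that actually have missing values
print(test_missing_count[test_missing_count > 0])  # Rooms: 2, Price: 1
```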

Dealing with missing values

There are several approaches to missing data (we will use all of them and then compare the prediction results):
  1. Delete columns or rows with missing data:
    1. drop rows with missing data:
      1. >>> test_data_dropna_0 = test_data.dropna(axis=0)
      2. >>> test_data_dropna_0
    2. drop columns with missing data:
      1. >>> test_data_dropna_1 = test_data.dropna(axis=1)
      2. >>> test_data_dropna_1
    3. if you have both train and test data sets, then the columns must be deleted from both sets:
      1. >>> columns_with_missing = [col for col in test_data.columns if test_data[col].isnull().any()]
      2. any() - returns whether any element is True over the requested axis. Literally, we ask whether any() element of the specified column isnull()
      3. >>> train_data_dropna_1 = train_data.drop(columns_with_missing,axis=1)
      4. >>> test_data_dropna_1 = test_data.drop(columns_with_missing,axis=1)
      5. note that drop() (not dropna()) is used here, because we are deleting columns by name
    4. Dropping missing values is only a good option when the data in those columns is mostly missing
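The two dropna directions above can be compared side by side; a minimal sketch with the data built inline:

```python
import numpy as np
import pandas as pd

# Same rows as test.csv, with np.nan for the empty cells
test_data = pd.DataFrame({
    "Rooms":  [1, 1, 3, 2, 5, np.nan, np.nan, 4],
    "Price":  [300, 400, 400, 200, 700, 400, 300, np.nan],
    "Floors": [1, 1, 1, 1, 3, 2, 1, 2],
    "Area":   [30, 50, 65, 45, 120, 70, 40, 95],
})

# axis=0 drops the 3 rows that contain a NaN, keeping all 4 columns
dropped_rows = test_data.dropna(axis=0)
# axis=1 drops the 2 columns that contain a NaN, keeping all 8 rows
dropped_cols = test_data.dropna(axis=1)
print(dropped_rows.shape, dropped_cols.shape)  # (5, 4) (8, 2)
```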
  2. Impute (in statistics, imputation is the process of replacing missing data with substituted values):
    1. pandas.DataFrame.fillna:
      1. >>> test_data_fillna = test_data.fillna(0)
      2. fillna fills NaN "cells" with 0 (you can use any value you want, see help(pd.DataFrame.fillna))
      3. >>> test_data_fillna
    2. sklearn.preprocessing.Imputer:
      1. >>> from sklearn.preprocessing import Imputer
      2. >>> test_imputer = Imputer()
      3. >>> test_data_imputed = test_imputer.fit_transform(test_data) 
      4. By default, missing (NaN) values are replaced by the mean along the axis; the default axis is 0 (impute along columns; run test_data.describe() to see the mean values)
      5. >>> test_data_imputed
      6. Imputation produces a NumPy array, so we'll convert it back to a pandas DataFrame:
        1. >>> test_data_imputed = pd.DataFrame(test_data_imputed)
        2. >>> test_data_imputed.columns = test_data.columns
        3. >>> test_data_imputed
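Note: in scikit-learn 0.20 and later, Imputer was deprecated and then removed in favor of sklearn.impute.SimpleImputer. A sketch of the same mean imputation with the newer class, on the data built inline:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same rows as test.csv, with np.nan for the empty cells
test_data = pd.DataFrame({
    "Rooms":  [1, 1, 3, 2, 5, np.nan, np.nan, 4],
    "Price":  [300, 400, 400, 200, 700, 400, 300, np.nan],
    "Floors": [1, 1, 1, 1, 3, 2, 1, 2],
    "Area":   [30, 50, 65, 45, 120, 70, 40, 95],
})

# SimpleImputer fills NaN with the column mean by default (strategy="mean")
imputer = SimpleImputer()
imputed = pd.DataFrame(imputer.fit_transform(test_data),
                       columns=test_data.columns)
print(imputed)
```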
  3. Extended Imputation - before imputing, we'll create new columns indicating which values were imputed:
    1. >>> test_data_ex_imputed = test_data.copy()
    2. >>> columns_with_missing = [col for col in test_data.columns if test_data[col].isnull().any()]
    3. >>> columns_with_missing
    4. Create a column corresponding to each column with missing values and fill that new column with Boolean values (whether the value was missing in the original data set or not):
      1. >>> for col in columns_with_missing:
        ...  test_data_ex_imputed[col + '_was_missing'] = test_data_ex_imputed[col].isnull()
      2. >>> test_data_ex_imputed
      3. >>> test_data_ex_imputed_columns = test_data_ex_imputed.columns
    5. impute:
      1. >>> test_imputer = Imputer()
      2. >>> test_data_ex_imputed = test_imputer.fit_transform(test_data_ex_imputed)
    6. Convert to DataFrame:
      1. >>> test_data_ex_imputed = pd.DataFrame(test_data_ex_imputed)
      2. >>> test_data_ex_imputed
      3. >>> test_data_ex_imputed.columns = test_data_ex_imputed_columns
      4. Now previously missing data in our "_was_missing" columns is 1, and previously present data is 0 (the imputer converts the Booleans to floats):
        1. >>> test_data_ex_imputed
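The manual "_was_missing" columns above can also be produced directly by newer scikit-learn versions: SimpleImputer(add_indicator=True) appends one 0/1 indicator column per feature that had missing values. A sketch, with the data built inline:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same rows as test.csv, with np.nan for the empty cells
test_data = pd.DataFrame({
    "Rooms":  [1, 1, 3, 2, 5, np.nan, np.nan, 4],
    "Price":  [300, 400, 400, 200, 700, 400, 300, np.nan],
    "Floors": [1, 1, 1, 1, 3, 2, 1, 2],
    "Area":   [30, 50, 65, 45, 120, 70, 40, 95],
})

# add_indicator=True appends the missingness mask after the imputed features:
# 4 imputed columns + 2 indicator columns (for Rooms and Price)
imputer = SimpleImputer(add_indicator=True)
ex_imputed = pd.DataFrame(imputer.fit_transform(test_data))
print(ex_imputed.shape)  # (8, 6)
```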

Checking which method is the best

>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.metrics import mean_absolute_error
>>> from sklearn.model_selection import train_test_split
>>> def score_dataset(dataset):
...       y = dataset.Price
...       X = dataset.drop(['Price'], axis=1)
...       train_X, test_X, train_y, test_y = train_test_split(X,y,random_state=0,train_size=0.7,test_size=0.3)
...       model = RandomForestRegressor()
...       model.fit(train_X, train_y)
...       predictions = model.predict(test_X)
...       return mean_absolute_error(test_y, predictions)
>>> print("MAE from dropping rows with missing values:")
>>> score_dataset(test_data_dropna_0)
70.0
>>> print("MAE from dropping columns with missing values:")
>>> score_dataset(test_data_dropna_1)
AttributeError: 'DataFrame' object has no attribute 'Price'
>>> # This is because the 'Price' column has missing values and we dropped it, so there is nothing left to train on and predict
>>> print("MAE from pandas fillna imputation:")
>>> score_dataset(test_data_fillna)
80.0
>>> print("MAE from sklearn Imputer() imputation:")
>>> score_dataset(test_data_imputed)
97.61904761904763
>>> print("MAE from sklearn Imputer() extended imputation:")
>>> score_dataset(test_data_ex_imputed)
67.6190476190476

As is common, imputation gives a better result than dropping missing values, and extended imputation may give a better result than simple imputation, or no improvement at all.
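On a current scikit-learn, the comparison can be reproduced end to end with SimpleImputer replacing the removed Imputer. The exact MAE values will differ from the ones above, since RandomForestRegressor is randomized and the toy data set is tiny; fixing random_state at least makes a single run repeatable:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Same rows as test.csv, with np.nan for the empty cells
test_data = pd.DataFrame({
    "Rooms":  [1, 1, 3, 2, 5, np.nan, np.nan, 4],
    "Price":  [300, 400, 400, 200, 700, 400, 300, np.nan],
    "Floors": [1, 1, 1, 1, 3, 2, 1, 2],
    "Area":   [30, 50, 65, 45, 120, 70, 40, 95],
})

def score_dataset(dataset):
    """Split into train/test, fit a random forest, return MAE on the test part."""
    y = dataset["Price"]
    X = dataset.drop(["Price"], axis=1)
    train_X, test_X, train_y, test_y = train_test_split(
        X, y, random_state=0, train_size=0.7, test_size=0.3)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(test_y, model.predict(test_X))

# mean imputation, then score
imputed = pd.DataFrame(SimpleImputer().fit_transform(test_data),
                       columns=test_data.columns)
print("MAE from mean imputation:", score_dataset(imputed))
```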


