Scikit-learn 3. Hands-on Python scikit-learn: handling missing values: dropping, imputing, imputing with state preservation.
Missing Values overview
There are many reasons why data can have missing values (a survey participant didn't answer all questions, the database has missing entries, etc.). Most ML libraries will give you an error if your data contains missing values: for example, scikit-learn estimators (an estimator is any object that learns from data - it can be a model, classifier, regressor, etc.) assume that all values in a dataset are numerical and that all of them hold meaning.
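To see the error in action, here is a minimal sketch (the toy array and values are my own, not from the post) showing a scikit-learn estimator refusing input that contains NaN:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with one missing value in X
X = np.array([[1.0], [2.0], [np.nan]])
y = np.array([10.0, 20.0, 30.0])

try:
    LinearRegression().fit(X, y)
    raised = False
except ValueError:
    raised = True  # scikit-learn's input validation rejects NaN values
print(raised)  # True
```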
Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95
>>> # read data
>>> test_file_path = "~/test.csv"
>>> import pandas as pd
>>> test_data = pd.read_csv(test_file_path)
>>> # check if data has missing values
>>> test_data.isnull()
>>> # count missing values per column
>>> test_missing_count = test_data.isnull().sum()
>>> test_missing_count
>>> test_missing_count[test_missing_count > 0]
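Beyond raw counts, the fraction of missing values per column is often more useful for deciding how to handle a column. A small sketch, recreating the test.csv data inline so it is self-contained (`isnull()` gives a Boolean frame; the `mean()` of Booleans is the fraction of True values):

```python
import numpy as np
import pandas as pd

# Same data as test.csv above
test_data = pd.DataFrame({
    "Rooms":  [1, 1, 3, 2, 5, np.nan, np.nan, 4],
    "Price":  [300, 400, 400, 200, 700, 400, 300, np.nan],
    "Floors": [1, 1, 1, 1, 3, 2, 1, 2],
    "Area":   [30, 50, 65, 45, 120, 70, 40, 95],
})

# Fraction of missing values per column: mean of the Boolean isnull() frame
missing_fraction = test_data.isnull().mean()
print(missing_fraction)  # Rooms 0.25, Price 0.125, Floors 0.0, Area 0.0
```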
Dealing with missing values
There are several approaches to missing data (we will use all of them and then compare prediction results):
- Delete columns or rows with missing data:
- drop rows with missing data:
- >>> test_data_dropna_0 = test_data.dropna(axis=0)
- >>> test_data_dropna_0
- drop columns with missing data:
- >>> test_data_dropna_1 = test_data.dropna(axis=1)
- >>> test_data_dropna_1
- if you have both train and test data sets, then the columns must be deleted from both sets:
- >>> columns_with_missing = [col for col in test_data.columns if test_data[col].isnull().any()]
- any() - returns whether any element is True over the requested axis. Literally, we ask whether any() element of the specified column isnull()
- >>> train_data_dropna_1 = train_data.drop(columns_with_missing, axis=1)
- >>> test_data_dropna_1 = test_data.drop(columns_with_missing, axis=1)
- Dropping missing values is only a good idea when the data in those columns is mostly missing
- Impute (in statistics, imputation is the process of replacing missing data with substituted values):
- pandas.DataFrame.fillna:
- >>> test_data_fillna = test_data.fillna(0)
- fillna fills NaN "cells" with 0 (you can use any value you want, see help(pd.DataFrame.fillna))
- >>> test_data_fillna
- sklearn.preprocessing.Imputer:
- >>> from sklearn.preprocessing import Imputer
- >>> test_imputer = Imputer()
- >>> test_data_imputed = test_imputer.fit_transform(test_data)
- By default, missing (NaN) values are replaced by the mean along the axis; the default axis is 0 (impute along columns - run test_data.describe() to see the mean values)
- >>> test_data_imputed
- After imputation an array is created; we'll convert this array to a pandas DataFrame:
- >>> test_data_imputed = pd.DataFrame(test_data_imputed)
- >>> test_data_imputed.columns = test_data.columns
- >>> test_data_imputed
- Extended Imputation - before imputing, we'll create new columns indicating which values were changed:
- >>> test_data_ex_imputed = test_data.copy()
- >>> columns_with_missing = [col for col in test_data.columns if test_data[col].isnull().any()]
- >>> columns_with_missing
- Create a new column corresponding to each column with missing values and fill it with Boolean values (whether the value was missing in the original data set or not):
- >>> for col in columns_with_missing:
- ...     test_data_ex_imputed[col + '_was_missing'] = test_data_ex_imputed[col].isnull()
- >>> test_data_ex_imputed
- >>> test_data_ex_imputed_columns = test_data_ex_imputed.columns
- impute:
- >>> test_imputer = Imputer()
- >>> test_data_ex_imputed = test_imputer.fit_transform(test_data_ex_imputed)
- Convert to DataFrame:
- >>> test_data_ex_imputed = pd.DataFrame(test_data_ex_imputed)
- >>> test_data_ex_imputed
- >>> test_data_ex_imputed.columns = test_data_ex_imputed_columns
- Now previously missing data in our "_was_missing" columns is 1, and previously non-missing data is 0:
- >>> test_data_ex_imputed
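As a side note, newer scikit-learn versions (0.21+) can build these "_was_missing"-style indicator columns automatically via SimpleImputer's add_indicator parameter (SimpleImputer, in sklearn.impute, is the replacement for the Imputer class used above). A sketch under that assumption, with the test.csv data recreated inline:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # assumes scikit-learn >= 0.21

# Same data as test.csv above
test_data = pd.DataFrame({
    "Rooms":  [1, 1, 3, 2, 5, np.nan, np.nan, 4],
    "Price":  [300, 400, 400, 200, 700, 400, 300, np.nan],
    "Floors": [1, 1, 1, 1, 3, 2, 1, 2],
    "Area":   [30, 50, 65, 45, 120, 70, 40, 95],
})

# add_indicator=True appends one 0/1 column per feature that had NaNs,
# mirroring the hand-built '_was_missing' columns
imputer = SimpleImputer(strategy="mean", add_indicator=True)
result = imputer.fit_transform(test_data)
print(result.shape)  # (8, 6): 4 original columns + 2 indicator columns
```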
Checking which method is best
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.metrics import mean_absolute_error
>>> from sklearn.model_selection import train_test_split
>>> def score_dataset(dataset):
... y = dataset.Price
... X = dataset.drop(['Price'], axis=1)
... train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0, train_size=0.7, test_size=0.3)
... model = RandomForestRegressor()
... model.fit(train_X, train_y)
... predictions = model.predict(test_X)
... return mean_absolute_error(test_y, predictions)
>>> print("MAE from dropping rows with missing values:")
>>> score_dataset(test_data_dropna_0)
70.0
>>> print("MAE from dropping columns with missing values:")
>>> score_dataset(test_data_dropna_1)
AttributeError: 'DataFrame' object has no attribute 'Price'
>>> #This is because 'Price' column has missing values and we dropped it, so nothing left to train and predict
>>> print("MAE from pandas fillna imputation:")
>>> score_dataset(test_data_fillna)
80.0
>>> print("MAE from sklearn Imputer() imputation:")
>>> score_dataset(test_data_imputed)
97.61904761904763
>>> print("MAE from sklearn Imputer() extended imputation:")
>>> score_dataset(test_data_ex_imputed)
67.6190476190476
As is common, imputation gives a better result than dropping missing values, and extended imputation may improve on simple imputation or give no improvement at all.
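One final caveat: the Imputer class used throughout this post was deprecated in scikit-learn 0.20 and removed in 0.22. A hedged sketch of the same mean-imputation flow with its replacement, SimpleImputer, plus a pandas-only equivalent (the inline data mirrors test.csv):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # replaces Imputer in scikit-learn >= 0.20

test_data = pd.DataFrame({
    "Rooms": [1, 1, 3, 2, 5, np.nan, np.nan, 4],
    "Price": [300, 400, 400, 200, 700, 400, 300, np.nan],
})

# Mean imputation, matching the old Imputer() default behaviour
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(test_data), columns=test_data.columns)

# A pandas-only equivalent: fill each column's NaNs with that column's mean
pandas_imputed = test_data.fillna(test_data.mean())
print(imputed)
```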