Tuesday, August 21, 2018

Scikit-learn 4. Hands-on python scikit-learn: using categorical data (encoding, one-hot encoding, hashing).

Categorical data is data that takes only a predefined number of values. Most ML models will raise an error if you try to use categorical data without converting it first. So to use categorical data, we first need to encode those values as corresponding numeric values. For example, if we have names of colors in our data, then we can use:
  1. Encoding - give each color its own number: red will be 1, yellow will be 2, green will be 3, etc. This is simple, but the problem is that 3 (green) is bigger than 1 (red), and a model may treat that ordering as meaningful, even though green should not carry more weight than red during training or prediction.
  2. One-hot encoding - we have 3 colors (red, yellow, green) in the "Color" column of our data set, so we create 3 additional columns (Color_red, Color_yellow, Color_green) that hold the value of each color for that row, and then the original column with categorical data is removed. A row with the red color will have 1 in the first column, 0 in the second and 0 in the third (100); yellow becomes 010 and green becomes 001. With this approach no categorical value is treated as having more weight than another.
  3. Hashing (or the hashing trick) - one-hot encoding is good, but it has problems when the data set has a huge number of distinct values, when the training data does not contain every value of the categorical feature, or when the data changes and the categorical feature receives new values. In those cases one-hot encoding creates too many additional columns, which makes predictions slow or even impossible (when values that were never seen during training appear in the test data). In such situations hashing is used (it is not reviewed here).
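
The three approaches above can be compared side by side on a tiny "Color" column. This is only a minimal sketch: the integer mapping, the choice of 8 hash columns and all variable names are assumptions made for illustration, and FeatureHasher from scikit-learn is just one possible way to apply the hashing trick.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

colors = pd.DataFrame({"Color": ["red", "yellow", "green"]})

# 1. Encoding: map each color to an integer (this mapping is an arbitrary assumption).
color_codes = {"red": 1, "yellow": 2, "green": 3}
encoded = colors["Color"].map(color_codes)

# 2. One-hot encoding: one indicator column per distinct color,
#    the original "Color" column is dropped by get_dummies.
one_hot = pd.get_dummies(colors, columns=["Color"]).astype(int)

# 3. Hashing trick: every color is hashed into one of n_features columns,
#    so the output width stays fixed even if new colors appear later.
#    n_features=8 is an illustrative choice, not a recommendation.
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([[c] for c in colors["Color"]]).toarray()

print(encoded.tolist())   # [1, 2, 3]
print(one_hot)
print(hashed.shape)       # (3, 8)
```

Note that get_dummies orders the new columns alphabetically (Color_green, Color_red, Color_yellow), and that with hashing two different colors can land in the same column when n_features is small.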

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown
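
If you would rather not create a file, the same rows can be loaded from a string. This is a small sketch, assuming pandas is installed; CSV_TEXT and missing_per_column are names introduced here for illustration. It also shows which columns contain missing values that will need imputation.

```python
import io
import pandas as pd

# The same rows as in test.csv, kept inline as a string.
CSV_TEXT = """Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown
"""

test_data = pd.read_csv(io.StringIO(CSV_TEXT))

# Count missing values per column: Rooms has 2, Price has 1.
missing_per_column = test_data.isnull().sum()
print(missing_per_column)

# HouseColor is the only non-numerical column; it has 5 distinct values.
print(test_data["HouseColor"].unique())
```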

Using one-hot encoding

>>> import pandas as pd
>>> test_file_path = "~/test.csv"
>>> test_data = pd.read_csv(test_file_path)
>>> test_data
>>> test_data.describe() # HouseColor is not present
>>> test_data.info() # because HouseColor type is object - non-numerical (categorical data)
>>> test_data.dtypes
>>> # create new data-set without NaN values (we'll use imputation)
>>> # Imputer was removed in scikit-learn 0.22; SimpleImputer (0.20+) replaces it
>>> from sklearn.impute import SimpleImputer
>>> test_imputer = SimpleImputer()
>>> # before imputation - fill dataset only with numerical data
>>> test_data_numerical = test_data.select_dtypes(exclude=['object'])
>>> test_data_imputed = test_imputer.fit_transform(test_data_numerical)
>>> test_data_imputed
>>> # convert imputed dataset into Pandas DataFrame
>>> test_data_imputed = pd.DataFrame(test_data_imputed)
>>> test_data_imputed
>>> test_data_imputed.columns = test_data.select_dtypes(exclude=['object']).columns
>>> test_data_imputed
>>> # add categorical data columns
>>> test_data_categorical = test_data.select_dtypes(include=['object'])
>>> test_data_imputed = test_data_imputed.join(test_data_categorical)
>>> test_data_imputed
>>> # use one-hot encoding
>>> test_data_one_hot = pd.get_dummies(test_data_imputed)
>>> test_data_one_hot
>>> # select non-categorical values
>>> test_data_wo_categoricals = test_data_imputed.select_dtypes(exclude=['object'])
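
The interactive steps above can also be collected into one script. This is a sketch under a few assumptions: it uses SimpleImputer (available from scikit-learn 0.20) with the mean strategy, reads the data from an inline string instead of ~/test.csv, and selects columns with the "number" dtype rather than by excluding "object". The variable names are introduced here for illustration.

```python
import io
import pandas as pd
from sklearn.impute import SimpleImputer

CSV_TEXT = """Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown
"""
data = pd.read_csv(io.StringIO(CSV_TEXT))

# Split into numerical and categorical parts.
numeric = data.select_dtypes(include=["number"])
categorical = data.select_dtypes(exclude=["number"])

# Fill NaN values in the numerical columns with the column mean.
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(
    imputer.fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,
)

# Put the categorical column back and one-hot encode it.
combined = imputed.join(categorical)
one_hot = pd.get_dummies(combined, columns=["HouseColor"])

# Data set without categoricals, for comparison.
wo_categoricals = combined.select_dtypes(include=["number"])

print(one_hot.columns.tolist())
print(one_hot.shape)  # (8, 9): 4 numerical columns + 5 color columns
```

Passing index=numeric.index when rebuilding the DataFrame keeps the rows aligned for the join, which matters if the original index is not the default 0..n-1 range.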

Measuring dropping categoricals vs using one-hot encoding

>>> from sklearn.model_selection import train_test_split
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.metrics import mean_absolute_error
>>> def score_dataset(dataset):
...      y = dataset.Price
...      X = dataset.drop(['Price'], axis=1)
...      # split X and y in a single call so the rows stay aligned
...      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.7, test_size=0.3)
...      model = RandomForestRegressor()
...      model.fit(X_train, y_train)
...      predictions = model.predict(X_test)
...      return mean_absolute_error(y_test, predictions)
>>> print("MAE when not using categoricals")
>>> score_dataset(test_data_wo_categoricals)
100.0
>>> print("MAE when using categoricals with one-hot encoding")
>>> score_dataset(test_data_one_hot)
70.0
>>>
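
The scores above are mean absolute errors (MAE): the average of |actual price - predicted price| over the test rows, so lower is better. As a small worked example, here is the same calculation done by hand and with scikit-learn. The numbers below are made up for illustration and are not predictions from the model above.

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted prices (illustrative values only).
y_true = [300, 400, 200, 700]
y_pred = [350, 380, 230, 600]

# Absolute errors are 50, 20, 30 and 100, so the mean is 200 / 4.
manual_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# The library function computes the same quantity.
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae)   # 50.0
print(sklearn_mae)
```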
