Tuesday, August 28, 2018

Scikit-learn 6. Hands-on python scikit-learn: Cross-Validation.

train_test_split helps us measure the quality of model predictions, but there is a better approach - cross_val_score. Its measurements are more reliable. How it works:
  1. the data is automatically split into parts (so you don't need separate train and test datasets)
  2. on each iteration (the iteration count equals the number of parts), all parts except the current one are used for model training and the current part is used as the test set (for example, on the 3rd iteration the 3rd part is the test set and all other parts are the training set)
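The splitting logic above can be seen directly with scikit-learn's KFold (the splitter that cross_val_score uses under the hood for regression); this is an illustrative sketch with a tiny made-up array, not part of the original session:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # 6 samples, 2 features

kf = KFold(n_splits=3)
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # on iteration i, fold i is the test set; the remaining folds are training data
    print(f"iteration {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

With 6 samples and 3 splits, each iteration holds out a different pair of rows as the test set and trains on the other four.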

Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown

>>> import pandas as pd
>>> test_file_path = "~/test.csv"
>>> test_data = pd.read_csv(test_file_path)
>>> # drop row with NaN values for Price column
>>> test_data.dropna(axis=0, subset=['Price'], inplace=True)
>>> test_data
>>> y = test_data.Price
>>> X = test_data.select_dtypes(exclude='object').drop('Price',axis=1)
>>> from sklearn.preprocessing import Imputer
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.pipeline import make_pipeline
>>> test_pipeline = make_pipeline(Imputer(), RandomForestRegressor())
>>> # cross_val_score returns negative error values (sklearn uses the convention that a higher metric value is better)
>>> scores = cross_val_score(test_pipeline,X,y,scoring='neg_mean_absolute_error')
>>> scores
array([-116.66666667, -205.        ,  -75.        ])
>>> # to get positive values
>>> print("Mean Absolute Error: {}".format(-1 * scores.mean()))
Mean Absolute Error: 132.22222222222223
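The whole session can be collected into one self-contained script. Note that in newer scikit-learn releases (0.22+) Imputer was removed in favor of sklearn.impute.SimpleImputer, so this sketch uses SimpleImputer and rebuilds the same dataset inline instead of reading test.csv; it also passes cv=5 to request 5 folds explicitly instead of relying on the default:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer  # modern replacement for Imputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# the same numeric data as in test.csv, inlined so the script runs standalone
data = pd.DataFrame({
    'Rooms':  [1, 1, 3, 2, 5, np.nan, np.nan, 4],
    'Price':  [300, 400, 400, 200, 700, 400, 300, np.nan],
    'Floors': [1, 1, 1, 1, 3, 2, 1, 2],
    'Area':   [30, 50, 65, 45, 120, 70, 40, 95],
})

# drop rows where the target (Price) is missing, then split features/target
data.dropna(axis=0, subset=['Price'], inplace=True)
y = data.Price
X = data.drop('Price', axis=1)

# pipeline: fill NaN feature values, then fit a random forest on each fold
pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor(random_state=0))
scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=5)
print("MAE per fold:", -scores)
print("Mean MAE:", -scores.mean())
```

The exact numbers will differ from the 3-fold run above, since more folds mean smaller test sets and a different train/test partitioning on each iteration.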
