Scikit-learn 1. Hands-on Python scikit-learn: intro, using the Decision Tree Regression model, MAE, overfitting, cross-validation, underfitting.
Scikit-learn is a Python machine-learning library. To install it:
pip install scipy
pip install scikit-learn
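To verify the installation (an optional check; the printed version depends on what pip installed):
python -c "import sklearn; print(sklearn.__version__)"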
To use Scikit-learn, we must go through several steps:
- Prepare data - choose appropriate data to use in the model and prediction
- Define - choose appropriate model (decision tree, random forest etc.)
- Fit - capture patterns from provided data
- Predict
- Evaluate - determine how accurate the predictions are
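As a preview, these steps map onto scikit-learn calls roughly as follows (a minimal sketch with made-up toy data; the toy_* names are only illustrative - the real data is prepared below):
>>> from sklearn.tree import DecisionTreeRegressor
>>> from sklearn.metrics import mean_absolute_error
>>> toy_X = [[1], [2], [3]]                       # Prepare data - features
>>> toy_y = [10, 20, 30]                          # Prepare data - prediction target
>>> toy_model = DecisionTreeRegressor()           # Define
>>> toy_model.fit(toy_X, toy_y)                   # Fit
>>> toy_predictions = toy_model.predict(toy_X)    # Predict
>>> mean_absolute_error(toy_y, toy_predictions)   # Evaluate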
Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area
1,300,1,30
1,400,1,50
3,400,1,65
2,200,1,45
5,700,3,120
,400,2,70
,300,1,40
4,,2,95
Prepare Data
To learn how to use pandas, go to it-tuff.blogspot.com/pandas-1python
>>> test_file_path = "~/test.csv"
>>> import pandas as pd
>>> test_data = pd.read_csv(test_file_path)
>>> # our data has NaN values; the simplest approach is to remove all rows with NaN data
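>>> # optional sketch: count the missing values per column before dropping them
>>> test_data.isna().sum()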
>>> test_data = test_data.dropna(axis=0)
>>> # we need to select prediction target, by convention called "y"
>>> test_data.columns.values
>>> y = test_data.Price
>>> # we need to choose the "prediction input" - features - columns (except the prediction target) which will be fed into our model and used to make predictions
>>> # by convention the features are called "X"
>>> test_data_features = ['Rooms','Floors','Area']
>>> X = test_data[test_data_features]
>>> # verify the data (maybe something weird is in there)
>>> X.head()
>>> X.describe()
>>> X.info()
Define
Our prediction target is price; price can theoretically be any real number, so we'll use the Decision Tree Regression model (to read more about Decision Trees, go to it-tuff.blogspot.com/machine-learning-1).
>>> from sklearn.tree import DecisionTreeRegressor
>>> # make our test_model an instance of the DecisionTreeRegressor class
>>> # the decision tree algorithm involves randomness (e.g. when candidate splits are equally good), so the result can differ between runs; to get the same result on every run, set the random_state seed
>>> test_model = DecisionTreeRegressor(random_state=1)
Fit
>>> # Build a decision tree regressor from the training set (X, y) - in this step our Decision Tree finds patterns in the training set
>>> test_model.fit(X,y)
Predict
First we'll make predictions for our training data to check how good the model is.
>>> # Making prediction for the features
>>> X
>>> # Real values are
>>> y
>>> # Model predictions are
>>> test_model.predict(X)
array([300., 400., 400., 200., 700.])
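To compare predictions with the actual prices side by side, both can be put into a small DataFrame (an optional sketch, reusing the pandas import from above):
>>> pd.DataFrame({"actual": y, "predicted": test_model.predict(X)})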
Evaluate
If you want, you can view your decision tree model:
First, export the model in DOT format:
>>> from sklearn.tree import export_graphviz
>>> export_graphviz(test_model,out_file="test_model.dot",feature_names=test_data_features)
Install Graphviz:
yum install graphviz
Then convert the DOT file to a PNG image:
dot -Tpng test_model.dot -o test_model.png
Description of the parameters in the PNG file:
- samples - how many objects are in a node, waiting for prediction (the root node has samples=5 because all 5 flat prices are waiting to be predicted)
- mse - several functions are available to measure the quality of a split; mse (mean squared error) is the default - it is always non-negative, and values closer to zero are better
- value - the predicted price for that node
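If Graphviz is not convenient, recent scikit-learn versions can also draw the tree with matplotlib via plot_tree (an alternative sketch, assuming matplotlib is installed):
>>> import matplotlib.pyplot as plt
>>> from sklearn.tree import plot_tree
>>> plot_tree(test_model, feature_names=test_data_features)
>>> plt.savefig("test_model_plt.png")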
To evaluate our predictions, we can use many metrics; here we'll use MAE (Mean Absolute Error). To calculate MAE:
- Find the absolute error - the absolute difference of the price:
- |actual_price - predicted_price|
- This is done for every actual price and prediction pair in the training set
- Find the mean of all errors (sum up the absolute errors and divide by the count of errors); a worked sketch follows this list
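Written out by hand, the calculation looks like this (a sketch using the actual and predicted prices from above):
>>> actual = [300, 400, 400, 200, 700]
>>> predicted = [300, 400, 400, 200, 700]
>>> # sum of the absolute differences divided by their count
>>> sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
0.0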
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = y
>>> y_predicted = test_model.predict(X)
>>> mean_absolute_error(y_true,y_predicted)
0.0
This measure is called an "in-sample" measure, because we used the same sample for both training and validation. It is misleading because, for example, if all apartments with red door mats (had this parameter been in the data) in our sample were expensive ones, the parameter "door mat color" would be used when predicting apartment rent price, which is incorrect (door mat color has no relation to the apartment rent price).
In-sample prediction and validation will show that our model is ideal or close to ideal. This is called overfitting - a model matches the training data almost perfectly but does poorly on new data. It happens because each successive decision tree split has fewer and fewer corresponding values (apartments in our case). Leaves with only a few apartments make very accurate predictions, close to the actual values, which makes the model look perfect on the training data but unreliable on new data. In effect, every parameter in the training data is treated as a perfect indicator of the predicted value, which is not the case.
On the contrary, if we make only a few splits (low tree depth), our model will not catch important patterns in the data and will perform poorly even on the training data; this is called underfitting.
So to validate predictions correctly, we need to use different samples for training and validation. The simplest way to do that is to split the data into training and validation parts (a so-called train-test split, the simplest form of cross-validation):
>>> from sklearn.model_selection import train_test_split
>>> # this function splits the sample data into a training portion and a validation portion (by default 25% of the sample goes to validation) - mnemonic: this is a TRAIN and TEST SPLIT
>>> train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
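>>> # sanity check (optional): how many rows ended up in each portion
>>> len(train_X), len(val_X)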
>>> # now we'll use the split data for training, prediction and validation
>>> test_model = DecisionTreeRegressor(random_state=1)
>>> test_model.fit(train_X, train_y)
>>> y_predicted = test_model.predict(val_X)
>>> mean_absolute_error(val_y,y_predicted)
150.0
As you can see, MAE for the in-sample data was 0.0, while for out-of-sample data it is 150.0. The average price in our data is 400, so the error on new data (data not used during fitting) is about 37%.
So we need to find a compromise between overfitting and underfitting (the point where MAE on the validation data is lowest). To do that we can experiment with the DecisionTreeRegressor max_leaf_nodes parameter (the maximum number of leaves in our model):
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

for max_leaf_nodes in [2, 3, 4, 5]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("max_leaf_nodes: %d \t MAE: %d" % (max_leaf_nodes, my_mae))
Our data will show the same result for all max_leaf_nodes values because our test data set is too small, but I think you see the point of the code above (get_mae and the for loop).
After finding the best value for max_leaf_nodes, train your model on the whole data set (in-sample):
>>> test_model.fit(X,y)
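A compact way to pick the winning value and refit on the whole data set could look like this (a sketch building on get_mae above; candidate_nodes, best_max_leaf_nodes and final_model are only illustrative names):
>>> candidate_nodes = [2, 3, 4, 5]
>>> # choose the value that gives the lowest validation MAE
>>> best_max_leaf_nodes = min(candidate_nodes, key=lambda n: get_mae(n, train_X, val_X, train_y, val_y))
>>> # refit on all the data with that setting
>>> final_model = DecisionTreeRegressor(max_leaf_nodes=best_max_leaf_nodes, random_state=1)
>>> final_model.fit(X, y)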