Monday, August 27, 2018

XGBoost 1.

XGBoost stands for eXtreme Gradient Boosting and implements Gradient Boosted Decision Trees. 
Gradient Boosting is an ML technique used for regression and classification problems. It produces a prediction model in the form of an ensemble of weak prediction models - decision trees:

  • A weak model is one whose predictions are only slightly better than random guessing
  • After building each weak model we:
    • calculate the errors (residuals)
    • build a model that predicts those errors
    • add that model to the ensemble
  • To make a prediction, sum the predictions from all models in the ensemble
XGBoost models are a leading choice when working with tabular data (data without images or video - in other words, data that fits in a Pandas DataFrame).
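To make the cycle above concrete, here is a minimal hand-rolled sketch of gradient boosting built from plain sklearn decision trees. The toy data (loosely based on the Area/Price columns used later), the tree depth and the number of rounds are made up for illustration; XGBoost does the same thing in a far more optimized way:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy data, just for illustration (Area -> Price)
X = np.array([[30], [50], [65], [45], [120]], dtype=float)
y = np.array([300, 400, 400, 200, 700], dtype=float)

learning_rate = 0.1
ensemble = []
prediction = np.zeros_like(y)              # start from a zero prediction

for _ in range(100):                       # each round adds one weak model
    errors = y - prediction                # calculate the errors (residuals)
    weak_model = DecisionTreeRegressor(max_depth=2)
    weak_model.fit(X, errors)              # build a model predicting the errors
    ensemble.append(weak_model)            # add it to the ensemble
    prediction += learning_rate * weak_model.predict(X)

# to predict: sum the (scaled) predictions from all models in the ensemble
def predict(new_X):
    return sum(learning_rate * m.predict(new_X) for m in ensemble)

print(predict(np.array([[70.0]])))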


Our data will be:
[admin@localhost ~]$ cat > test.csv
Rooms,Price,Floors,Area,HouseColor
1,300,1,30,red
1,400,1,50,green
3,400,1,65,blue
2,200,1,45,green
5,700,3,120,yellow
,400,2,70,blue
,300,1,40,blue
4,,2,95,brown

To install XGBoost:
pip install xgboost

Using XGBoost Regressor


>>> import pandas as pd
>>> test_file_path = "~/test.csv"
>>> test_data = pd.read_csv(test_file_path)
>>> # drop rows with NaN values in the Price column
>>> test_data.dropna(axis=0, subset=['Price'], inplace=True)
>>> test_data
>>> y = test_data.Price
>>> X = test_data.drop(['Price'], axis=1).select_dtypes(exclude=['object'])
>>> from sklearn.model_selection import train_test_split
>>> # split into train/test sets and get the result as arrays, not DataFrames
>>> X_train, X_test, y_train, y_test = train_test_split(X.values,y.values,random_state=0,test_size=0.25)
>>> from sklearn.preprocessing import Imputer
>>> test_imputer = Imputer()
>>> X_train = test_imputer.fit_transform(X_train)
>>> X_test = test_imputer.transform(X_test)
>>> from xgboost import XGBRegressor
>>> test_model = XGBRegressor()
>>> test_model.fit(X_train,y_train)
>>> predictions = test_model.predict(X_test)
>>> from sklearn.metrics import mean_absolute_error as mae
>>> print("MAE XGBR: " + str(mae(predictions,y_test)))
MAE XGBR: 10.416168212890625
>>> 
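With the fitted model you can also score a previously unseen house. The values below are made up for illustration, and the column order must match X (Rooms, Floors, Area):

>>> import numpy as np
>>> new_house = np.array([[3, 2, 80]])   # Rooms, Floors, Area - made-up values
>>> test_model.predict(new_house)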

XGBoost Regressor parameters


>>> test_model
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

  • n_estimators - how many times to go through the XGBoost modelling cycle, i.e. how many trees to build
    • a value that is too low causes underfitting; one that is too high causes overfitting
    • typical values are 100-1000, depending on the learning_rate
    • to find the optimal value, use the early_stopping_rounds option. It stops training when the model stops improving. Occasionally training can stop after a single bad round, so to avoid that set "early_stopping_rounds=5", which stops only after 5 consecutive rounds without improvement
    • it's good practice to set a high n_estimators together with early_stopping_rounds; this helps find the optimal value (eval_set is a list of (X, y) tuple pairs used as the validation set for early stopping):
      • model = XGBRegressor(n_estimators=1000)
      • model.fit(train_X, train_y, early_stopping_rounds=5, eval_set=[(test_X,test_y)])
      • after training, re-train your model on the entire dataset with the number of estimators that early stopping found:
        • for example, the found value was 97
        • model = XGBRegressor(n_estimators=97)
        • model.fit(X, y)
  • learning_rate - on each iteration we multiply the predictions from each component model by a small number before adding them to the ensemble. Each decision tree added to the ensemble then helps less, which in practice reduces the model's tendency to overfit, so you can use a higher n_estimators value without overfitting:
    • model = XGBRegressor(n_estimators=1000,learning_rate=0.05)
    • model.fit(train_X, train_y, early_stopping_rounds=5, eval_set=[(test_X,test_y)])
  • n_jobs - on big datasets, set this to the number of CPU cores on your machine to use multi-threading and fit the model faster; on small datasets it will not help (see the combined example after this list)
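Putting these parameters together with the train/test split from above, a typical setup might look like the following. The exact values are illustrative, and verbose=False simply silences the per-round evaluation log:

>>> test_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
>>> test_model.fit(X_train, y_train, early_stopping_rounds=5, eval_set=[(X_test, y_test)], verbose=False)
>>> predictions = test_model.predict(X_test)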
