Hyperparameter Tuning GBDT

Hyperparameter tuning is a necessity when working with AI models in order to optimize their performance. There are different approaches to this; in this short article I am going to cover the brute-force one: GridSearchCV.

What is GridSearchCV?

  • GridSearch : the process of performing hyperparameter tuning in order to determine the optimal values for a given model.
  • CV : Cross-Validation
  • How does it work? GridSearchCV tries all the combinations of the values passed in the dictionary and evaluates the model for each combination using cross-validation. After running it we therefore get a score (e.g. accuracy or ROC AUC) for every combination of hyperparameters and can choose the one with the best performance; a minimal sketch follows this list.
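As an illustration, here is a minimal, self-contained sketch on toy data, with a deliberately tiny made-up two-parameter grid, showing how GridSearchCV scores every combination with cross-validation and exposes the best one:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=42)

# 3 x 2 = 6 combinations, each evaluated with 5-fold cross-validation (30 fits)
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.05, 0.1],
}

gs = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5, scoring='roc_auc')
gs.fit(X, y)

print(gs.best_params_)                     # combination with the best mean CV score
print(gs.best_score_)                      # its mean cross-validated ROC AUC
print(gs.cv_results_['mean_test_score'])   # one score per combination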

Parameters to tune

XGBoost

  • subsample : Each tree is trained on only a fraction of the training examples, with values between 0 and 1. Lowering this value stops subsets of training examples from dominating the model and allows greater generalisation.
  • colsample_bytree : Similar to subsample but for columns rather than rows. Again you can set values between 0 and 1, where lower values can make the model generalise better by stopping any one field from having too much prominence, a prominence that might not exist in the test data.
  • colsample_bylevel : the fraction of columns to be randomly sampled at each level (depth) of every tree.
  • n_estimators : the number of boosting iterations the model will perform, in other words the number of trees that will be built.
  • learning_rate : in layman's terms, how much each new tree's contribution is scaled when it is added to the model. Set the learning rate too high and the algorithm might overshoot the optimum, but set it too low and it will need many more trees to get there. A short sketch showing where these parameters are set follows this list.
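To make the list above concrete, here is a minimal sketch (the values are arbitrary placeholders, not recommendations) showing that all of these are plain keyword arguments of the sklearn-style XGBClassifier:

import xgboost as xgb

xgb_example = xgb.XGBClassifier(
    n_estimators=200,        # number of boosting iterations, i.e. trees built
    learning_rate=0.05,      # shrinkage applied to each new tree's contribution
    subsample=0.8,           # fraction of rows sampled for each tree
    colsample_bytree=0.8,    # fraction of columns sampled for each tree
    colsample_bylevel=0.5,   # fraction of columns sampled at each tree level
    objective='binary:logistic',
    random_state=42
)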

LightGBM

  • subsample [0-1] : Each tree is trained on only a fraction of the training examples. Lowering this value stops subsets of training examples from dominating the model and allows greater generalisation.
  • colsample_bytree [0-1] : Similar to subsample but for columns rather than rows. Lower values can make the model generalise better by stopping any one field from having too much prominence, a prominence that might not exist in the test data.
  • n_estimators : the number of boosting iterations the model will perform, in other words the number of trees that will be built.
  • learning_rate : in layman's terms, how much each new tree's contribution is scaled when it is added to the model. Set the learning rate too high and the algorithm might overshoot the optimum, but set it too low and it will need many more trees to get there.
  • num_leaves : the maximum number of leaves each weak learner can have. A large num_leaves increases accuracy on the training set, but also the chance of getting hurt by overfitting.
  • feature_fraction : deals with column sampling (it is LightGBM's alias for colsample_bytree) and can be used to speed up training or to deal with overfitting.
  • reg_alpha, reg_lambda : L1 and L2 regularization. A similar placeholder-valued sketch for LightGBM follows this list.
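As with XGBoost, here is a sketch with placeholder values showing where these knobs live on LGBMClassifier (since feature_fraction is just the alias for colsample_bytree, only one of the two is set):

import lightgbm as lgb

lgb_example = lgb.LGBMClassifier(
    n_estimators=200,       # number of boosting iterations, i.e. trees built
    learning_rate=0.05,     # shrinkage applied to each new tree's contribution
    num_leaves=63,          # max leaves per weak learner; larger = more complex trees
    subsample=0.8,          # row sampling; only active when subsample_freq > 0
    subsample_freq=5,       # perform bagging every 5 iterations
    colsample_bytree=0.8,   # column sampling (alias: feature_fraction)
    reg_alpha=0.01,         # L1 regularization
    reg_lambda=0.1,         # L2 regularization
    random_state=42
)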

Code

import numpy as np
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from settings import config

def pipeline_GridSearch(X_train_data, X_test_data, y_train_data,
                        model, param_grid, cv=10, scoring_fit='roc_auc',
                        do_probabilities=False):
    # Exhaustively evaluate every parameter combination with cross-validation
    gs = GridSearchCV(
        estimator=model,
        param_grid=param_grid,
        cv=cv,
        n_jobs=-1,
        scoring=scoring_fit,
        verbose=2
    )
    fitted_model = gs.fit(X_train_data, y_train_data)

    # Predict on the held-out test set with the best estimator found
    if do_probabilities:
        pred = fitted_model.predict_proba(X_test_data)
    else:
        pred = fitted_model.predict(X_test_data)

    return fitted_model, pred





# load_features_drebin is a project-specific helper (defined elsewhere) that loads and splits the Drebin dataset
X, y, X_train, X_test, y_train, y_test, X_validation, y_validation, vec = load_features_drebin(config['Drebin_X_file'], config['Drebin_Y_file'])
xgb_Classifier = xgb.XGBClassifier(
    objective='binary:logistic',
    nthread=4,
    seed=42
)
# Grid search over XGBoost hyperparameters
gbm_param_grid = {
    'colsample_bytree': np.linspace(0.1, 0.5, 5),
    'subsample': np.linspace(0.1, 1.0, 10),
    'colsample_bylevel': np.linspace(0.1, 0.5, 5),
    'n_estimators': list(range(60, 340, 40)),
    'max_depth': list(range(2, 16)),
    'learning_rate': np.logspace(-3, -0.8, 5)
}
model_xgb, preds = pipeline_GridSearch(X_train, X_test, y_train, xgb_Classifier,
                                       gbm_param_grid, cv=5, scoring_fit='roc_auc')

fixed_params = {'objective': 'binary',
                'metric': 'auc',
                'is_unbalance': True,
                'bagging_freq': 5,
                'boosting': 'dart'}
# num_boost_round is already covered by n_estimators in the grid below, and
# early_stopping_rounds would need an explicit validation set that GridSearchCV
# does not provide, so those two options are left out of fixed_params here.
lgb_classifier = lgb.LGBMClassifier(**fixed_params)
lgb_param_grid = {
    'n_estimators': list(range(60, 340, 40)),
    'colsample_bytree': np.linspace(0.1, 0.5, 5),
    'max_depth': list(range(2, 16)),
    'num_leaves': list(range(50, 200, 50)),
    'reg_alpha': np.logspace(-3, -2, 3),
    'reg_lambda': np.logspace(-2, 1, 4),
    'subsample': np.linspace(0.1, 1.0, 10),
    # feature_fraction is LightGBM's alias for colsample_bytree, so tuning both
    # would just duplicate the same knob; it is therefore not searched here.
    'learning_rate': np.logspace(-3, -0.8, 5)
}
model_lgbm, preds = pipeline_GridSearch(X_train, X_test, y_train, lgb_classifier,
                                        lgb_param_grid, cv=5, scoring_fit='roc_auc')
print("Grid Search Best parameters found LGBM: ", model_lgbm.best_params_)

print("Grid Search Best parameters found XGB: ", model_xgb.best_params_)

There are also other ways to tune hyperparameters of course :) e.g. Bayesian optimization, or the RandomizedSearchCV imported above; a quick sketch of the latter follows.
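Since the grids above are very large (the XGBoost grid alone has over 100,000 combinations before multiplying by the CV folds), here is a minimal sketch reusing the script's own xgb_Classifier and gbm_param_grid with RandomizedSearchCV, which samples a fixed budget of combinations instead of trying them all:

# Sample 50 random combinations from the same XGBoost grid instead of all of them
rs = RandomizedSearchCV(
    estimator=xgb_Classifier,
    param_distributions=gbm_param_grid,
    n_iter=50,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42
)
rs.fit(X_train, y_train)
print("Randomized Search best parameters found XGB: ", rs.best_params_)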

Happy tuning!