GridSearchCV
tags: #python/data_science/model_selection
What is GridSearchCV?
GridSearchCV is a method in scikit-learn that performs an exhaustive search over a specified parameter grid to determine the best hyperparameters for a given estimator. The method uses cross-validation to estimate the performance of each combination of hyperparameters.
What is a parameter grid?
param_grid is a dictionary that specifies the hyperparameters to be tuned and the values to be searched. The keys in the dictionary are the names of the hyperparameters, and the values are the lists of values to be searched over:
param_grid = {
    "hyperparameter": [LIST OF VALUES],
    ...
}
Example: Random Forest
For example, suppose we want to find the best hyperparameters for a RandomForestClassifier estimator. We might specify a param_grid dictionary as follows:
param_grid = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}
This specifies that we want to search over three hyperparameters: n_estimators, max_depth, and min_samples_split. For each hyperparameter, we provide a list of values to be searched over.
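Because the search is exhaustive, the number of fitted models grows multiplicatively with the grid. As a quick sketch (using scikit-learn's ParameterGrid helper, which enumerates the same combinations GridSearchCV will try):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

# Three values per hyperparameter gives 3 * 3 * 3 = 27 combinations;
# with 5-fold cross-validation that means 27 * 5 = 135 model fits.
n_combinations = len(ParameterGrid(param_grid))
print(n_combinations)  # 27
```

This is worth keeping in mind before adding more values: the cost of the search is the grid size times the number of CV folds.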
We can then use GridSearchCV to perform the search:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Create a Random Forest Classifier
rf = RandomForestClassifier()
# Specify the parameter grid to search
param_grid = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}
# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5)
# Fit the grid search to the data
grid_search.fit(X_train, y_train)
# Print the best hyperparameters
print("Best hyperparameters: ", grid_search.best_params_)
# Print best score
print("Best Score: {}".format(grid_search.best_score_))
The code above creates a RandomForestClassifier, defines a param_grid dictionary, and runs GridSearchCV with 5-fold cross-validation, then prints the best hyperparameters and the corresponding cross-validation score.
Note: instead of an integer, cv can also be a cross-validation splitter object such as KFold.
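A minimal sketch of passing a KFold splitter as cv (the toy dataset here just stands in for your own training data). Note that for classifiers, an integer cv defaults to stratified folds, so passing plain KFold also changes the splitting strategy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Toy data standing in for the user's own training set.
X_train, y_train = make_classification(n_samples=60, random_state=0)

# A KFold splitter gives explicit control over the splits
# (number of folds, shuffling, seed).
kf = KFold(n_splits=5, shuffle=True, random_state=42)

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [10, 50]},  # tiny grid to keep this fast
    cv=kf,
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```

Fixing random_state on the splitter makes the folds themselves reproducible across runs.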
The output will be something like:
Best hyperparameters: {'max_depth': 15, 'min_samples_split': 2, 'n_estimators': 1000}
This means that the best hyperparameters for the RandomForestClassifier are max_depth=15, min_samples_split=2, and n_estimators=1000.
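Beyond best_params_, the fitted search object exposes the retrained best model and the per-combination scores. A small self-contained sketch (toy data and a deliberately tiny grid, standing in for the real setup above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy data standing in for the user's own dataset.
X, y = make_classification(n_samples=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 50]},  # small grid to keep this fast
    cv=3,
)
grid_search.fit(X_train, y_train)

# With the default refit=True, the best combination is retrained on all
# of X_train and exposed as best_estimator_, ready for prediction.
y_pred = grid_search.best_estimator_.predict(X_test)

# cv_results_ records the mean cross-validation score of every combination.
for params, score in zip(grid_search.cv_results_['params'],
                         grid_search.cv_results_['mean_test_score']):
    print(params, round(score, 3))
```

Inspecting cv_results_ is useful for seeing how close the runner-up combinations were, not just which one won.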
Setting Seed
To get reproducible GridSearchCV results, set NumPy's global random seed in addition to the random_state (or seed) parameter of the estimator, since any randomness not controlled by random_state falls back to NumPy's global generator.
import numpy as np
np.random.seed(42)
Setting the seed with NumPy applies to all subsequent cells in the notebook.
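A small sketch of the two seeds together, plus a check that reseeding really does restore the same random stream:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Seed NumPy's global RNG once near the top of the notebook.
np.random.seed(42)

# Also fix the estimator's own randomness; this is what makes an
# individual model's training reproducible.
rf = RandomForestClassifier(random_state=42)

# Reseeding restores the global stream, so the same draws come back.
np.random.seed(42)
first = np.random.rand(3)
np.random.seed(42)
second = np.random.rand(3)
print(np.array_equal(first, second))  # True
```

Relying on the estimator's random_state is generally the more robust of the two, since the global seed can be disturbed by any other code that draws random numbers in between.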