Finding the Best Model

tags: #python/data_science/model_selection

The issue with model building

We do not always know in advance which machine learning model will perform best for a given task and dataset.

Even if one model performs with a high degree of accuracy, it is possible that other estimators are even more accurate.

How can we find the best model?

We can train several different models on the same data, compare their performance, and pick the one with the highest mean accuracy and the lowest spread (standard deviation) across runs.

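This can be done using k-fold cross-validation, which splits the training data into k folds and scores each model k times, each time holding out a different fold for evaluation.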
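
To make the splitting step concrete, here is a minimal sketch of what `KFold` does to the sample indices. The toy array `X_demo` is a hypothetical stand-in for real features and is not part of the original example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples with a single feature each.
X_demo = np.arange(10).reshape(-1, 1)

# Shuffle the samples once, then cut them into 5 folds of equal size.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
folds = list(kfold.split(X_demo))

# Each fold holds out a different 2 samples for testing and trains on the other 8.
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f'fold {fold_number}: train={train_idx}, test={test_idx}')
```

Every sample lands in the test set exactly once, so each model is scored on all of the data without ever being scored on a sample it was trained on in that round.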

from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# map each estimator's display name to an unfitted estimator instance
estimators = {
    'KNNClassifier': KNeighborsClassifier(),
    'SVC': SVC(gamma='scale'),
    'GaussianNB': GaussianNB()
}

# X_train and y_train are assumed to already hold the training features and labels
# use the same shuffled 5-fold split for every estimator so the scores are comparable
kfold = KFold(n_splits=5, random_state=0, shuffle=True)

# perform k-fold cross-validation and print the mean accuracy and standard deviation for each estimator
for estimator_name, estimator_object in estimators.items():
    scores = cross_val_score(estimator=estimator_object, X=X_train, y=y_train, cv=kfold)
    print(f'{estimator_name:>20}: mean accuracy={scores.mean():.2%}; standard deviation={scores.std():.2%}')
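
The snippet above assumes `X_train` and `y_train` already exist. As a rough end-to-end sketch, the version below fills that gap with scikit-learn's bundled Iris dataset and a stratified train/test split; both are assumptions made for illustration, not choices from the original note. It then picks the estimator with the highest mean cross-validation accuracy and checks it once on the held-out test set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Assumption: Iris stands in for whatever dataset is actually being modelled.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

estimators = {
    'KNNClassifier': KNeighborsClassifier(),
    'SVC': SVC(gamma='scale'),
    'GaussianNB': GaussianNB(),
}

# One shared splitter so every estimator is scored on identical folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Store (mean accuracy, standard deviation) per estimator name.
results = {}
for name, estimator in estimators.items():
    scores = cross_val_score(estimator, X_train, y_train, cv=kfold)
    results[name] = (scores.mean(), scores.std())
    print(f'{name:>20}: mean accuracy={scores.mean():.2%}; standard deviation={scores.std():.2%}')

# Pick the estimator with the highest mean accuracy (ties go to the first one listed).
best_name = max(results, key=lambda n: results[n][0])

# Refit the winner on the full training set and score it once on unseen data.
best_model = estimators[best_name].fit(X_train, y_train)
test_accuracy = best_model.score(X_test, y_test)
print(f'best: {best_name}; test accuracy={test_accuracy:.2%}')
```

Selecting on mean accuracy alone is the simplest rule. When two estimators have similar means, the standard deviation stored alongside each mean can be used to prefer the more stable one.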