K-Fold Cross-Validation

tags: #python/data_science/model_selection

What is k-fold cross-validation?

When is this preferred to a train/test/validation split?

Cross-validation is a technique used to evaluate machine learning models by splitting the data into multiple subsets, or "folds".

In general, cross-validation is preferred when the dataset is small or the modeling task is complex, while a train/test/validation split is a good choice for larger datasets and simpler tasks.

How does this work?

The data is split into k equal-sized folds. The model is trained on k − 1 folds and evaluated on the remaining fold; this repeats k times so that each fold serves as the validation set exactly once. The k resulting scores are then averaged to estimate model performance.
How does this work in python?

We can implement k-fold cross-validation using the KFold class and the cross_val_score function from the scikit-learn library's model_selection module:

from sklearn.model_selection import KFold, cross_val_score
Python: KFold

The KFold class in scikit-learn splits the data into k equal-sized folds to be used in cross-validation.

# define k-fold for cross-validation
kf = KFold(n_splits=5, random_state=0, shuffle=True) 
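To see what this splitter actually produces, here is a small sketch (the 10-sample array is made up for illustration) that iterates over the index splits KFold generates:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 samples, 1 feature
kf = KFold(n_splits=5, random_state=0, shuffle=True)

# each iteration yields index arrays for the train and test folds
for train_idx, test_idx in kf.split(X):
    print(len(train_idx), len(test_idx))  # 8 train, 2 test per fold
```

With 10 samples and 5 splits, every fold holds 2 test samples, and the remaining 8 are used for training.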

Other variations of k-fold cross-validation use stratification (i.e., class proportions are preserved in each fold). Examples: StratifiedKFold and RepeatedStratifiedKFold.

from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold

# repeated: runs stratified k-fold n_repeats times with different splits each time
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

# single sampling
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
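A quick sketch of what stratification guarantees, using made-up labels with a 2:1 class ratio; every test fold preserves that ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy labels: 8 samples of class 0, 4 of class 1 (a 2:1 ratio)
X = np.zeros((12, 1))
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for _, test_idx in skf.split(X, y):
    # each test fold holds 2 class-0 and 1 class-1 samples
    print(np.bincount(y[test_idx]))  # [2 1] for every fold
```

A plain KFold on the same data could easily produce a fold containing no class-1 samples at all, which would make per-fold metrics misleading.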
Python: cross_val_score

To actually run the cross-validation, we use the cross_val_score function from scikit-learn. The splitter defined with KFold is passed as an argument to the cv parameter:

from sklearn.model_selection import cross_val_score

# returns an array with one score per fold of the validation
scores = cross_val_score(model, X, y, cv=kf)

# cv - specifies the cross-validation strategy -> here, the KFold splitter

A performance score is generated for each fold.

With each iteration, it trains the model on k − 1 folds and tests the model on the remaining fold. This process is repeated k times, with each fold used exactly once as the validation data.
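Conceptually, that iteration can be written out by hand. The sketch below reproduces it with a synthetic dataset and LogisticRegression (both chosen purely for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=100, random_state=0)
kf = KFold(n_splits=5, random_state=0, shuffle=True)
model = LogisticRegression(max_iter=1000)

scores = []
for train_idx, test_idx in kf.split(X):
    # fit on the k-1 training folds, score on the held-out fold
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(len(scores))  # one score per fold
```

cross_val_score packages this loop (plus cloning the estimator for each fold) into a single call.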

This can be applied to the training dataset as follows:

scores = cross_val_score(model, X_train, y_train, cv=kf)


Getting the mean score across folds:
scores.mean()

We can choose the metric with the scoring parameter:

scores = cross_val_score(model, X_train, y_train, cv=kf, scoring="accuracy")

If the scoring parameter is set to a string such as 'accuracy', the accuracy score is used to evaluate the model's performance. Other possible scoring strings include 'precision', 'recall', 'f1', and 'roc_auc', depending on the type of problem being addressed.
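For example, to score each fold with F1 instead of accuracy (the synthetic data and model are again just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# scoring="f1" computes the F1 score on each held-out fold
f1_scores = cross_val_score(model, X, y, cv=kf, scoring="f1")
print(f1_scores.shape)  # (5,) - one F1 score per fold
```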
