K-Fold Cross-Validation

tags: #python/data_science/model_selection

What is k-fold cross-validation?

When is this preferred to a train/test/validation split?

In general, cross-validation is preferred when the dataset is small or the modeling task is complex, because every observation is used for both training and validation. A single train/validation/test split is usually a good choice for larger datasets and simpler tasks.

Cross-validation is a technique used to evaluate machine learning models by splitting the data into multiple subsets or "folds".

How does this work?

In k-fold cross-validation, the dataset is split into $k$ folds, which are then used to estimate how well the model performs on new data.
i.e., $k$ refers to the number of groups the data sample is split into.
The algorithm repeatedly trains the model on $k - 1$ folds and tests it on the remaining fold.
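To make the splitting concrete, here is a minimal sketch that prints which indices land in the training and test portion of each fold. The toy array `X` and the choice of 5 folds are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 toy samples (hypothetical data, only the indices matter here)
X = np.arange(10).reshape(-1, 1)

# 5 folds without shuffling, so each test fold is a contiguous block
kf = KFold(n_splits=5)

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # train_idx covers k - 1 folds, test_idx covers the remaining fold
    test_folds.append(test_idx.tolist())
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Every sample index appears in exactly one test fold, which is what lets each observation be used for both training and testing across the run.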

How does this work in Python?

We can implement k-fold cross-validation using the KFold class and the cross_val_score function from scikit-learn's model_selection module:

from sklearn.model_selection import KFold, cross_val_score

Python: `KFold`

The KFold class in scikit-learn defines how the data is split into k folds of (approximately) equal size for use in cross-validation.

# define k-fold for cross-validation
kf = KFold(n_splits=5, random_state=0, shuffle=True)

Other variations of k-fold validation use stratification (i.e., class proportions are preserved in each fold). Examples: StratifiedKFold or RepeatedStratifiedKFold.

Stratification is generally preferable when dealing with an imbalanced dataset.
It helps ensure that minority class(es) are not excluded from the training or testing set.

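from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold

# repeated (arguments must be passed by keyword; the values here are illustrative)
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

# single sampling
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)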
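As a rough illustration of what "class proportions are preserved" means, the sketch below counts minority-class samples in each test fold. The 90/10 label vector is made-up data, and the feature matrix is a placeholder because only the labels drive the stratified split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features, not used by the splitter

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# number of minority-class samples that end up in each test fold
minority_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(minority_per_fold)
```

With 10 minority samples and 5 folds, each test fold receives the same share of the minority class, whereas a plain KFold gives no such guarantee.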

Repeated Stratification

Both methods address imbalanced datasets by preserving the class distribution in every fold. The difference is that:

StratifiedKFold performs stratified sampling once to generate the k folds
RepeatedStratifiedKFold repeats the process n times (set with the n_repeats parameter). This produces n different sets of k folds, giving a more robust performance estimate
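The difference in the number of train/test splits can be checked directly. This is a small sketch on made-up labels; the values `n_splits=5` and `n_repeats=3` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold

# hypothetical imbalanced labels with placeholder features
y = np.array([0] * 40 + [1] * 10)
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

# count how many (train, test) pairs each splitter yields
n_single = sum(1 for _ in skf.split(X, y))
n_repeated = sum(1 for _ in rskf.split(X, y))
print(n_single, n_repeated)
```

The repeated splitter yields n_splits * n_repeats splits in total, so passing it to a cross-validation routine produces that many scores.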

Python: `cross_val_score`

To actually run the cross-validation, we use the cross_val_score function from scikit-learn. The splitter defined with KFold is passed as an argument to the cv parameter:

from sklearn.model_selection import cross_val_score

# returns an array with one score per fold
scores = cross_val_score(model, X, y, cv=kf)

# cv specifies the cross-validation strategy (here, the KFold object kf)

A performance score is generated for each of the $k$ folds.

With each iteration, it trains the model on $k - 1$ folds and tests the model on the remaining fold. This process is repeated $k$ times, with each fold used exactly once as the validation data.
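The loop described above can be sketched by hand and compared against cross_val_score. The iris dataset and LogisticRegression are assumptions chosen only to make the example self-contained; they are not part of the original note.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# manual version: fit a fresh copy on k - 1 folds, score on the held-out fold
manual_scores = []
for train_idx, test_idx in kf.split(X):
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# library version using the same splitter
scores = cross_val_score(model, X, y, cv=kf)
print(np.round(manual_scores, 3), np.round(scores, 3))
```

Because the splitter uses a fixed integer random_state, both versions see the same folds, so the two score arrays should agree.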

This can be applied to the training dataset as follows:

scores = cross_val_score(model, X_train, y_train, cv=kf)
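One possible end-to-end workflow, sketched under assumptions: hold out a test set first, cross-validate on the training portion only, then fit once on all training data for a final check. The dataset, model, and 80/20 split are illustrative choices, not taken from the note.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# hold out 20% of the data as a final test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# cross-validate on the training data only
scores = cross_val_score(model, X_train, y_train, cv=kf)

# cross_val_score fits clones, so the model itself still needs fitting
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(scores.mean(), test_accuracy)
```

Keeping the test set out of the cross-validation loop means the final score is measured on data that played no part in model selection.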


Getting the accuracy

scores.mean()

We can specify the scoring method by setting the scoring parameter:

scores = cross_val_score(model, X_train, y_train, cv=kf, scoring="accuracy")

If the scoring parameter is set to a string such as 'accuracy', then it will use the accuracy score to evaluate the model's performance. Other possible scoring methods include 'precision', 'recall', 'f1', 'roc_auc', and others, depending on the type of problem being addressed.
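As a hedged sketch of comparing several metrics, the snippet below runs cross_val_score once per scoring string on a synthetic binary problem. The dataset from make_classification, the class weights, and the model are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic, mildly imbalanced binary classification data (hypothetical)
X, y = make_classification(
    n_samples=200, n_features=10, weights=[0.8, 0.2], random_state=0
)

model = LogisticRegression(max_iter=1000)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# one array of per-fold scores for each metric name
results = {
    name: cross_val_score(model, X, y, cv=skf, scoring=name)
    for name in ("accuracy", "f1", "roc_auc")
}

for name, metric_scores in results.items():
    print(f"{name}: mean={metric_scores.mean():.3f}")
```

On imbalanced data, accuracy can look high while 'f1' is noticeably lower, which is one reason to look at more than one metric.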