Partitioning the Dataset

tags: #python/data_science/appendix

We can evaluate our model based on how well it is able to generalize to previously unseen data.

To evaluate the performance of our model, we can split the dataset into a training and a testing set using the train_test_split function from the sklearn.model_selection module:

from sklearn.model_selection import train_test_split

X = df[LIST_OF_FEATURE_COLS]
y = df[TARGET_COL]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)  # stratify=y if the dataset is imbalanced
Arguments

This function has the following arguments:

  • X, y − X is the feature matrix and y is the response vector.

  • test_size − The fraction of the dataset reserved for testing; e.g., setting test_size = 0.3 for 150 rows of X produces a test set of 150 * 0.3 = 45 rows.

  • random_state − Seed for the random number generator; guarantees that the split is always the same. This is useful in situations where you want reproducible results.

  • stratify − When set to y, the split preserves the proportion of each class in both partitions; recommended when the dataset is imbalanced.

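As an illustration of stratify, the sketch below uses a made-up imbalanced label array (90 zeros, 10 ones) to show that the class ratio is preserved in both partitions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# The 90:10 class ratio carries over to both partitions
print(np.bincount(y_train))  # → [63  7]
print(np.bincount(y_test))   # → [27  3]
```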
In what ratio should I split my dataset?

A commonly used ratio is 80:20, which means 80% of the data is for training and 20% for testing. Other ratios such as 70:30, 60:40, and even 50:50 are also used in practice.
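The row counts these ratios produce can be worked out directly; the sketch below uses an arbitrary 150-row dataset as an example:

```python
# Row counts produced by common split ratios on a 150-row dataset
n_rows = 150
for train_pct, test_pct in [(80, 20), (70, 30), (60, 40), (50, 50)]:
    n_test = n_rows * test_pct // 100
    n_train = n_rows - n_test
    print(f"{train_pct}:{test_pct} -> {n_train} train rows, {n_test} test rows")
```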

Training Set

Subset of the dataset used to "train" the model, where it "learns" the underlying patterns and relationships between the input features (X) and the output variable (y).

The algorithm learns by adjusting its parameters or weights to minimize a loss function, which quantifies the difference between the predicted output and the true target.
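As a minimal sketch of the training step, assuming scikit-learn's built-in iris data as a stand-in dataset, fit() adjusts the model's coefficients to minimize the loss on the training partition:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# fit() learns one weight vector per class by minimizing log-loss on X_train
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.coef_.shape)  # → (3, 4): 3 classes, 4 features
```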

Testing Set

Subset of the dataset used to provide an unbiased evaluation of the model's ability to generalize to previously unseen data.
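Continuing the sketch above (iris data assumed as a stand-in), score() measures accuracy on rows the model never saw during fitting:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# score() reports accuracy on the held-out test partition only
print(model.score(X_test, y_test))
```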

Parameter: test_size

To specify the proportion of the dataset reserved for testing, pass a float between 0.0 and 1.0 to the test_size parameter (an int is instead interpreted as an absolute number of test samples).

How does this function work?

This function shuffles the data to randomize it, then splits the sample and target arrays into training and testing sets of the desired proportions.
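The shuffling behavior can be sketched with a small synthetic array: the same random_state reproduces the same shuffle, while shuffle=False skips randomization entirely and takes the test rows from the end:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Same random_state → identical shuffle → identical split
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=1)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=1)
print(np.array_equal(a_test, b_test))  # → True

# shuffle=False keeps the original row order; the test set is the tail
c_train, c_test, _, _ = train_test_split(X, y, test_size=0.3, shuffle=False)
print(c_test.ravel())  # → [7 8 9]
```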

Return Values

The function returns a tuple of four elements, in order: the training and testing feature sets, followed by the training and testing labels (X_train, X_test, y_train, y_test).
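The return order can be verified directly; the sketch below uses an arbitrary 20-row array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# For each array passed in, train_test_split returns its train part then its
# test part: [X_train, X_test, y_train, y_test]
splits = train_test_split(X, y, test_size=0.3, random_state=1)
print(len(splits))             # → 4
print(splits[0].shape[0])      # → 14 training rows (20 - 6 test rows)
```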

Examining the Set Size
# training
X_train.shape

# testing
X_test.shape

Alternative: Sample Code

import pandas as pd

from fast_ml.model_development import train_valid_test_split

X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(
    df, target='TARGET', train_size=0.7, valid_size=0.1, test_size=0.2
)

print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
print(X_test.shape, y_test.shape)