Dummy Encoding

tags: #python/data_science/preprocessing

What is dummy encoding?

Dummy encoding, a close variant of one-hot encoding (the two terms are often used interchangeably), represents categorical variables as numerical variables so they can be used in machine learning models.

This works by creating a new binary variable for each class of a categorical variable and setting its value to 1 if an observation belongs to that class and 0 otherwise.
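
As a minimal sketch of the idea, the snippet below encodes a small made-up column. The DataFrame, the column name `color`, and its values are hypothetical; `dtype=int` is passed so the output shows 0/1 rather than booleans, which newer pandas versions return by default.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with three classes.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary column per class; each row has a 1 only in the column of its own class.
dummies = pd.get_dummies(df["color"], dtype=int)
print(dummies)
```

Each row of `dummies` sums to 1, because every observation belongs to exactly one class.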

How many dummy variables do you need?

For a categorical variable with $k$ classes, you only need $k - 1$ dummy variables.

This is because if all $k - 1$ dummy variables equal 0, the observation must belong to the remaining class, so a $k$-th dummy variable would add no new information.

The omitted class is usually called the reference class. It is represented by zeros in all of the resulting dummy variables, while every other class is represented by a 1 in its own dummy variable.
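
The redundancy of the $k$-th column can be checked directly. The sketch below uses hypothetical data and arbitrarily picks `blue` as the reference class; it shows that the dropped column can be rebuilt from the remaining $k - 1$ columns.

```python
import pandas as pd

# Hypothetical toy data with k = 3 classes.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

full = pd.get_dummies(colors, dtype=int)   # k = 3 columns
reduced = full.drop(columns="blue")        # k - 1 = 2 columns; "blue" is the reference

# A row is "blue" exactly when all remaining dummy variables are 0.
recovered_blue = 1 - reduced.sum(axis=1)
print((recovered_blue == full["blue"]).all())
```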

Which class do you drop?

By default, we would want to drop the class with the most samples. If the dataset is balanced and all classes have roughly the same number of samples, the choice of reference class matters less.
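
One way to apply this rule is to look up the most frequent class and drop its dummy column. This is a sketch on hypothetical data (the column `city` and its values are made up), not the only way to do it.

```python
import pandas as pd

# Hypothetical toy data: "NY" is the most frequent class.
df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})

# Find the class with the most samples to use as the reference class.
reference = df["city"].value_counts().idxmax()

# Create all k dummy variables, then drop the one for the reference class.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)
encoded = encoded.drop(columns=f"city_{reference}")
print(reference)
print(encoded.columns.tolist())
```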

Purpose for linear regression (LR):
Choosing the most frequent class as the reference category makes the resulting model more interpretable. Each regression coefficient associated with a dummy variable represents the difference between the mean value of the response variable for that category and the mean value of the response variable for the reference category.
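
This interpretation can be verified numerically in the simplest case: a model with an intercept and a single categorical predictor. The sketch below uses made-up data and NumPy's least-squares solver in place of a full regression library; with additional predictors in the model, the coefficients become adjusted differences rather than raw differences in means.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group means are a = 2, b = 6, c = 11.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "y": [1.0, 3.0, 5.0, 7.0, 10.0, 12.0],
})

# k - 1 dummy variables; drop_first drops "a", which becomes the reference class.
X = pd.get_dummies(df["group"], drop_first=True, dtype=float)
X.insert(0, "intercept", 1.0)

# Ordinary least squares fit of y on the intercept and the dummy variables.
coef, *_ = np.linalg.lstsq(X.to_numpy(), df["y"].to_numpy(), rcond=None)
means = df.groupby("group")["y"].mean()

print(dict(zip(X.columns, coef.round(6))))
print(means)
```

The intercept matches the mean of the reference class, and each dummy coefficient matches that class's mean minus the reference mean.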

Values of Dummy Variables

Dummy variables (indicator variables) have two possible values: 0 or 1.

A 1 encodes the presence of a category
A 0 encodes the absence of a category

Steps to Dummy Encoding

Convert categorical variables into the category data type

df["COLUMN_NAME"] = df["COLUMN_NAME"].astype("category")

# to convert multiple columns at once, pass a dictionary mapping column names to dtypes
dtype_d = {
    "COL_1": "category",
    "COL_2": "category",
    # ...
}

df = df.astype(dtype_d)
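
A concrete version of the conversion, as a sketch on a hypothetical DataFrame (the column names `size`, `color`, and `price` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: two categorical columns and one numeric column.
df = pd.DataFrame({
    "size": ["S", "M", "L", "M"],
    "color": ["red", "blue", "red", "green"],
    "price": [10.0, 12.5, 15.0, 12.0],
})

# Convert only the listed columns; columns not in the dictionary keep their dtype.
df = df.astype({"size": "category", "color": "category"})

print(df.dtypes)
print(df["size"].cat.categories.tolist())  # categories are stored in sorted order
```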

Convert categorical variables into dummy variables

# requires: import pandas as pd
df = pd.get_dummies(df, columns=['COLUMN_NAME'])

# to dummify multiple columns at once, pass a list of column names
df = pd.get_dummies(df, columns=['COL_1', 'COL_2'])

By default, pd.get_dummies creates a dummy variable for every class of the categorical variable.

Therefore, it is important to remember to manually drop the column for the reference class of each encoded variable.

df = df.drop(['LIST OF COLUMN NAMES'], axis=1)

Alternatively, we can set the parameter drop_first=True to automatically drop the first class of each encoded variable:

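# columns must be passed by keyword; the second positional argument of get_dummies is prefix
df = pd.get_dummies(df, columns=['COLUMN_NAME'], drop_first=True)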
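
Note that drop_first drops the first class in category order, which is not necessarily the most frequent one. The sketch below, on hypothetical data, shows the default behavior and one possible way to choose the reference class by putting it first in an explicit category order.

```python
import pandas as pd

# Hypothetical toy data: "M" is the most frequent class.
df = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# Default: classes are sorted, so "L" comes first and is the one dropped.
default = pd.get_dummies(df, columns=["size"], drop_first=True, dtype=int)
print(default.columns.tolist())

# Put the desired reference class ("M") first in the category order.
df["size"] = pd.Categorical(df["size"], categories=["M", "L", "S"])
custom = pd.get_dummies(df, columns=["size"], drop_first=True, dtype=int)
print(custom.columns.tolist())
```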