Dummy Encoding

tags: #python/data_science/preprocessing

What is dummy encoding?

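Dummy encoding, closely related to (and often used interchangeably with) one-hot encoding, represents categorical variables as numerical variables so that machine learning models can use them. Strictly speaking, one-hot encoding keeps one binary variable per class, while dummy encoding drops one of them as a reference class.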

It works by creating a new binary variable for each class of a categorical variable, set to 1 if an observation belongs to that class and 0 otherwise.
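
To make the mechanism concrete, here is a minimal sketch that builds the binary variables by hand, without any encoding helper. The `color` data and column names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: one categorical variable with three classes.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One binary indicator column per class: 1 where the row belongs to
# that class, 0 otherwise.
dummies = pd.DataFrame(
    {f"color_{cls}": (colors == cls).astype(int) for cls in sorted(colors.unique())}
)

print(dummies)
```

Each row contains exactly one 1, in the column of the class that row belongs to.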

How many dummy variables do you need?

For a categorical variable with k classes, you only need k - 1 dummy variables.

This is because if all k - 1 dummy variables equal 0, the observation must belong to the one remaining class, so a k-th dummy variable would add no new information.

The omitted class is usually called the reference class. It is represented by zeros in all of the resulting dummy variables, while every other class is represented by a 1 in its own dummy variable.
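
The redundancy argument above can be checked on a small example. This is a sketch on hypothetical `size` data: it encodes the variable with all k columns and with k - 1 columns, then rebuilds the dropped column from the remaining ones.

```python
import pandas as pd

# Hypothetical toy data with k = 3 classes.
sizes = pd.Series(["small", "medium", "large", "medium"], name="size")

# All k columns (classes are ordered alphabetically: large, medium, small).
full = pd.get_dummies(sizes, prefix="size", dtype=int)

# Only k - 1 columns: the first class ("large") becomes the reference class.
reduced = pd.get_dummies(sizes, prefix="size", drop_first=True, dtype=int)

# The dropped column is fully determined by the others:
# it is 1 exactly where every remaining dummy is 0.
reconstructed = 1 - reduced.sum(axis=1)

print(full.shape[1], reduced.shape[1])
print((reconstructed == full["size_large"]).all())
```

The row for "large" is all zeros in `reduced`, which is how the reference class shows up in the encoded data.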

Values of Dummy Variables

Dummy variables (indicator variables) have two possible values: 0 or 1.
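
One practical caveat, stated as an assumption about your environment: in recent pandas releases (2.0 and later, as far as I am aware) `pd.get_dummies` returns boolean `True`/`False` columns by default rather than 0/1. The sketch below uses the `dtype` parameter to request integer 0/1 values explicitly; the `pet` column is hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"pet": ["cat", "dog", "cat"]})

# dtype=int forces the dummy columns to hold integer 0/1 values,
# regardless of the default dtype of the installed pandas version.
dummies = pd.get_dummies(df, columns=["pet"], dtype=int)

print(dummies)
```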

Steps to Dummy Encoding

  1. Convert categorical variables into category data type
df["COLUMN_NAME"] = df["COLUMN_NAME"].astype("category")

# to convert multiple columns at once, pass a dictionary mapping column names to dtypes
dtype_d = {
    "COL_1": "category",
    "COL_2": "category",
    # ...
}

df = df.astype(dtype_d)
  2. Convert categorical variables into dummy variables
df = pd.get_dummies(df, columns = ['COLUMN_NAME'])

# to dummify multiple columns at once
df = pd.get_dummies(df, columns = ['LIST OF COL NAMES'])
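
The two steps can be sketched end to end on a toy DataFrame. The column names (`city`, `plan`, `age`) and values are hypothetical, and `dtype=int` is an optional choice made here so the output shows 0/1 values.

```python
import pandas as pd

# Hypothetical toy data: two categorical columns and one numeric column.
df = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Paris"],
    "plan": ["free", "pro", "pro"],
    "age": [31, 45, 27],
})

# Step 1: convert the categorical columns to the category dtype.
df = df.astype({"city": "category", "plan": "category"})

# Step 2: replace those columns with dummy variables.
# Columns not listed in `columns` (here, age) are left unchanged.
encoded = pd.get_dummies(df, columns=["city", "plan"], dtype=int)

print(encoded.columns.tolist())
```

The new columns are named `<original column>_<class>`, and the untouched columns come first in the result.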

By default, pd.get_dummies creates a dummy variable for every class of the categorical variable.

It is therefore important to remember to manually drop the column(s) corresponding to the reference class:

df = df.drop(['LIST OF COLUMN NAMES'], axis=1)

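Alternatively, we can pass drop_first=True to automatically drop the first class. Note that columns must be passed as a keyword argument, because the second positional argument of pd.get_dummies is prefix:

df = pd.get_dummies(df, columns=['LIST OF COL NAMES'], drop_first=True)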
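
Because `drop_first=True` always drops the first class, one way to choose the reference class is to put it first in the category order. The sketch below assumes a hypothetical `grade` column with "low" as the desired reference, and checks that the automatic drop matches dropping that column by hand.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"grade": ["low", "high", "medium", "low"]})

# Put the desired reference class ("low") first in the category order.
df["grade"] = pd.Categorical(df["grade"], categories=["low", "medium", "high"])

# Automatic: drop_first removes the dummy for the first category.
auto = pd.get_dummies(df, columns=["grade"], drop_first=True, dtype=int)

# Manual: create every dummy, then drop the reference column ourselves.
manual = pd.get_dummies(df, columns=["grade"], dtype=int).drop(["grade_low"], axis=1)

print(auto.columns.tolist())
print(auto.equals(manual))
```

Without an explicit category order, the first class is simply the first one in sorted order, which may not be the reference you want.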