III. General Data Science
Refer to IV. ML Models for building ML algorithms.
Appendix
| File | Comments |
|---|---|
| Introduction to scikit-learn | Brief overview of using scikit-learn as well as the data representation used. |
| Summary of Steps in Building a Model | General Steps to follow when building a model. |
| Ensemble Learning | - |
General Data Science Steps
1. Exploratory Data Analysis (EDA)
Data exploration, also known as exploratory data analysis (EDA), is the process of understanding the data with statistical and visualization methods. This step helps identify patterns and problems in the dataset, and informs which model or algorithm to use in subsequent steps.
For example, if your data violates the assumptions of your model or contains errors, even a well-specified model will not give the desired results. Without data exploration, you may spend most of your time debugging your model without realizing the problem lies in the dataset.
- **Outliers**: Observations with values much larger or smaller than the majority of observations; they can have a large or dominating impact on results. Although techniques to deal with them vary, it is important to know whether they exist. Boxplots and Cleveland dot plots are common tools for outlier detection.
- **Homogeneity of variance**: An important assumption in analysis of variance, regression-related models, and multivariate techniques: the variance should be similar across groups or fitted values. This can be validated by exploring the residuals of the model, i.e., plotting residuals vs. fitted values and checking that the spread of the residuals is roughly constant.
- **Normally distributed data**: Various statistical techniques, such as linear regression and t-tests, assume normality. Histograms can be used to inspect data distributions.
- **Zeros (missing values) in the data**: Zero (or null) values make the analysis more complicated; observations may be labeled incorrectly as similar simply because they are all zeros.
- **Collinearity in covariates**: A high degree of correlation between two or more independent variables. Ways to detect it include variance inflation factors (VIF), pairwise scatter plots of covariates, correlation coefficients, or a PCA biplot applied to all covariates.
- **Interaction between variables**: The relationship between variables changes according to the values of other variables. This type of information can be found by observing the weights of the variables when performing linear regression.
- **Independence in the dataset**: Data points in a dataset should be drawn independently. This can be assessed by modeling any spatial or temporal relationships, or by nesting the data in a hierarchical structure.
| File | Comments |
|---|---|
| 1. Displaying Dataset Description and Shape | Getting summary of the DataFrame object (including data types, number of columns and rows in the dataset) and shape (# of features and records). |
| 2. Checking for Duplicate Rows | Checking for duplicated rows (e.g., total number of duplicates and whether there exist duplicated rows in the overall dataset) |
| 3. Checking for Missing Values | Checking for missing values in the dataset (e.g., total number of missing values and whether there are missing values in the overall dataset) |
| 4. Getting Number of Unique Values | The pandas `nunique()` method returns the number of distinct values in a Series, or per column/row along a specified axis of a DataFrame. |
| 5. Checking for Samples and Target Sizes | Confirming that the number of samples matches the number of target values. |
| 6. Multicollinearity | If conducting Linear Regression or Binary Logistic Regression, which require an underlying independence assumption, check for multicollinearity. |
| 7. Getting a Quick Statistical Summary | Getting a quick statistical summary including count, mean, standard deviation, median, mode, minimum, maximum, and range. |
| 8. Univariate Analysis | - |
| 9. Bivariate Analysis | - |
| 10. Multivariate Analysis | - |
| 11. Data Transformations | - |
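Several of the checks in the table above can be run with one-liners in pandas. The sketch below strings them together on a toy DataFrame; the column names and data are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47, 32, None],
    "city": ["NY", "LA", "NY", "LA", "SF"],
    "target": [0, 1, 0, 1, 1],
})

print(df.shape)                 # (5, 3): records x features
print(df.duplicated().sum())    # total fully-duplicated rows
print(df.isnull().sum().sum())  # total missing values across the dataset
print(df.nunique())             # distinct values per column
assert len(df) == len(df["target"])  # sample count matches target size
print(df.describe())            # count, mean, std, min, quartiles, max (numeric columns)
```

`df.info()` gives the complementary dtype/non-null summary mentioned in step 1.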
2. Data Preprocessing Techniques
Feature Selection
| File | Comments |
|---|---|
| Importance of Feature Selection | What feature selection is, and approaches to it (e.g., univariate). |
| (Method) Wrapper-Based | Wrapper-based feature selection methods. This includes recursive feature elimination and forward/backward selection. |
| (Method) Filter-based | Filter-based method e.g., SelectKBest, VarianceThreshold in selecting best features using univariate statistical approach. |
| Finding Optimal n Using RFECV | Using the RFE method with cross-validation to find the optimal number of features for a given base model. WARNING: COMPUTATIONALLY INTENSIVE! |
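A minimal sketch of the filter-based and wrapper-based approaches listed above, using scikit-learn's `SelectKBest` and `RFE` on the built-in iris data. The scoring function, base model, and choice of `k` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter-based: keep the k features with the highest ANOVA F-scores.
filt = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(filt.get_support())  # boolean mask of selected columns

# Wrapper-based: recursively drop the weakest feature per the model's weights.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print(rfe.support_)
```

`RFECV` follows the same pattern as `RFE` but chooses `n_features_to_select` via cross-validation, hence the computational-cost warning above.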
Other Processing Steps
| File | Comments |
|---|---|
| Dummy Encoding | Dummy encoding of categorical variables. This should be done before feature selection. |
| Label Encoding | To encode classes of a target variable. Also ideal for encoding nominal data without inherent order. |
| Normalizing or Standardizing Features | How to standardize continuous variables. This should be done before feature selection. This includes standardizing the entire feature matrix, or selected features. |
| Ordinal Encoding | The `OrdinalEncoder` is used for encoding features and assumes an order or ranking among the categories. |
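The encoding and scaling steps above can be sketched as follows with pandas and scikit-learn; the toy columns and category ordering are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": ["S", "L", "M"],
                   "height": [1.0, 2.0, 3.0]})

# Dummy encoding for nominal features (done before feature selection).
dummies = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding when categories have an inherent order: S < M < L.
ord_enc = OrdinalEncoder(categories=[["S", "M", "L"]])
sizes = ord_enc.fit_transform(df[["size"]])  # S->0, M->1, L->2

# Label encoding for the classes of a target variable.
labels = LabelEncoder().fit_transform(["cat", "dog", "cat"])

# Standardizing a continuous feature to zero mean, unit variance.
scaled = StandardScaler().fit_transform(df[["height"]])
```

Note that `LabelEncoder` is intended for targets, while `OrdinalEncoder` and `get_dummies` operate on feature columns, matching the distinction drawn in the table.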
Miscellaneous Code Snippets for Processing
| File | Comments |
|---|---|
3. Model Evaluation: Data Partitioning into Train, Test, and Validation Sets
| File | Comments |
|---|---|
| Partitioning the Dataset | Splitting the dataset into test and training sets. The validation dataset is also explained. |
| K-Fold Cross-Validation | Breakdown of k-fold cross-validation for model selection and hyperparameter tuning. |
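A minimal sketch of the partitioning and k-fold steps above with scikit-learn; the model, split ratio, and random seeds are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set; stratify to preserve class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold cross-validation on the training portion only,
# keeping the test set untouched for the final evaluation.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=cv)
print(scores.mean())  # mean accuracy across the 5 folds
```

A separate validation split (or the CV folds themselves) serves for tuning, so the test set estimates generalization only once.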
Optimization
Model Selection
| File | Comments |
|---|---|
| Finding the Best Model | Finding the best model for a task and dataset using K-Fold Cross-Validation. |
| GridSearchCV | How to use GridSearchCV to fine-tune hyperparameters, and for feature selection. WARNING! COMPUTATIONALLY INTENSIVE. |
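A small sketch of hyperparameter tuning with `GridSearchCV`; the estimator and parameter grid are assumptions for illustration, and even this tiny grid illustrates why the table flags the method as computationally intensive (every parameter combination is fit once per fold).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Exhaustive search over a small hyperparameter grid with 5-fold CV:
# 3 values of C x 2 kernels x 5 folds = 30 fits.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

print(search.best_params_)  # best parameter combination found
print(search.best_score_)   # its mean cross-validated score
```

`search.best_estimator_` is refit on the full data and can be used directly for prediction.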