III. General Data Science
Refer to IV. ML Models for building ML algorithms.
Appendix
| File | Comments |
|---|---|
| Introduction to scikit-learn | Brief overview of using scikit-learn as well as the data representation used. |
| Summary of Steps in Building a Model | General Steps to follow when building a model. |
| Ensemble Learning | - |
General Data Science Steps
1. Exploratory Data Analysis (EDA)
Data exploration, also known as exploratory data analysis (EDA), is the process of understanding the data with statistical and visualization methods. This step helps identify patterns and problems in the dataset, and informs which model or algorithm to use in subsequent steps.
For example, if your data violates the assumptions of your model or contains errors, even a well-specified model will not give the desired results. Without data exploration, you may spend most of your time debugging your model without realizing the problem lies in the dataset.
- **Outliers**: Observations with values much larger or smaller than the majority of observations; they can have a large or dominating impact on results. Although techniques to deal with them vary, it is important to know whether they exist. Boxplots and Cleveland dot plots are common tools for outlier detection.
- **Homogeneity of variance**: An important assumption in analysis of variance, regression-related models, and multivariate techniques: the variance should be similar across groups or fitted values. This can be validated by exploring the residuals of the model, i.e., plotting residuals vs. fitted values and checking that the spread of the residuals is roughly constant.
- **Normally distributed data**: Various statistical techniques, such as linear regression and t-tests, assume normality. Histograms can be used to inspect data distributions.
- **Zeros (missing values) in the data**: Zero (or null) values make the analysis more complicated; observations may be labeled incorrectly as similar simply because they are all zeros.
- **Collinearity in covariates**: A high degree of correlation between two or more independent variables. Ways to detect it include variance inflation factors (VIF), pairwise scatter plots of covariates, correlation coefficients, or a PCA biplot applied to all covariates.
- **Interaction between variables**: The relationship between variables changes according to the values of other variables. This type of information can be found by observing the weights of the variables when performing linear regression.
- **Independence in the dataset**: Data points in a dataset should be drawn independently. This can be assessed by modeling any spatial or temporal relationships, or by nesting the data in a hierarchical structure.
| File | Comments |
|---|---|
| 1. Displaying Dataset Description and Shape | Getting summary of the DataFrame object (including data types, number of columns and rows in the dataset) and shape (# of features and records). |
| 2. Checking for Duplicate Rows | Checking for duplicated rows (e.g., total number of duplicates and whether there exist duplicated rows in the overall dataset) |
| 3. Checking for Missing Values | Checking for missing values in the dataset (e.g., total number of missing values and whether there are missing values in the overall dataset) |
| 4. Getting Number of Unique Values | The pandas `nunique()` method returns the number of distinct values in a Series, or per column/row along a specified axis of a DataFrame. |
| 5. Checking for Samples and Target Sizes | Confirming that the number of samples matches the number of target values. |
| 6. Multicollinearity | If conducting Linear Regression or Binary Logistic Regression, which require an underlying independence assumption, check for multicollinearity. |
| 7. Getting a Quick Statistical Summary | Getting a quick statistical summary including count, mean, standard deviation, median, mode, minimum, maximum, and range. |
| 8. Univariate Analysis | - |
| 9. Bivariate Analysis | - |
| 10. Multivariate Analysis | - |
| 11. Data Transformations | - |
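Several of the checks in the table above can be run with one-liners in pandas. The sketch below strings them together on a toy DataFrame; the column names and data are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 47, 32, None],
    "city": ["NY", "LA", "NY", "LA", "SF"],
    "target": [0, 1, 0, 1, 1],
})

print(df.shape)                 # (5, 3): records x features
print(df.duplicated().sum())    # total fully-duplicated rows
print(df.isnull().sum().sum())  # total missing values across the dataset
print(df.nunique())             # distinct values per column
assert len(df) == len(df["target"])  # sample count matches target size
print(df.describe())            # count, mean, std, min, quartiles, max (numeric columns)
```

`df.info()` gives the complementary dtype/non-null summary mentioned in step 1.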
2. Data Preprocessing Techniques
Feature Selection
| File | Comments |
|---|---|
| Importance of Feature Selection | What feature selection is, and approaches to it (e.g., univariate). |
| (Method) Wrapper-Based | Wrapper-based feature selection methods. This includes recursive feature elimination and forward/backward selection. |
| (Method) Filter-based | Filter-based method e.g., SelectKBest, VarianceThreshold in selecting best features using univariate statistical approach. |
| Finding Optimal n Using RFECV | Using the RFE method with cross-validation to find the optimal number of features for a given base model. WARNING: COMPUTATIONALLY INTENSIVE! |
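A minimal sketch of the filter-based and wrapper-based approaches listed above, using scikit-learn's `SelectKBest` and `RFE` on the built-in iris data. The scoring function, base model, and choice of `k` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter-based: keep the k features with the highest ANOVA F-scores.
filt = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(filt.get_support())  # boolean mask of selected columns

# Wrapper-based: recursively drop the weakest feature per the model's weights.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print(rfe.support_)
```

`RFECV` follows the same pattern as `RFE` but chooses `n_features_to_select` via cross-validation, hence the computational-cost warning above.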
Other Processing Steps
| File | Comments |
|---|---|
| Dummy Encoding | Dummy encoding of categorical variables. This should be done before feature selection. |
| Label Encoding | To encode classes of a target variable. Also ideal for encoding nominal data without inherent order. |
| Normalizing or Standardizing Features | How to standardize continuous variables. This should be done before feature selection. This includes standardizing the entire feature matrix, or selected features. |
| Ordinal Encoding | The `OrdinalEncoder` is used for encoding features and assumes an order or ranking among the categories. |
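The encoding and scaling steps above can be sketched as follows with pandas and scikit-learn; the toy columns and category ordering are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": ["S", "L", "M"],
                   "height": [1.0, 2.0, 3.0]})

# Dummy encoding for nominal features (done before feature selection).
dummies = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding when categories have an inherent order: S < M < L.
ord_enc = OrdinalEncoder(categories=[["S", "M", "L"]])
sizes = ord_enc.fit_transform(df[["size"]])  # S->0, M->1, L->2

# Label encoding for the classes of a target variable.
labels = LabelEncoder().fit_transform(["cat", "dog", "cat"])

# Standardizing a continuous feature to zero mean, unit variance.
scaled = StandardScaler().fit_transform(df[["height"]])
```

Note that `LabelEncoder` is intended for targets, while `OrdinalEncoder` and `get_dummies` operate on feature columns, matching the distinction drawn in the table.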
Miscellaneous Code Snippets for Processing
| File | Comments |
|---|---|
3. Model Evaluation: Data Partitioning into Train, Test, and Validation Sets
| File | Comments |
|---|---|
| Partitioning the Dataset | Splitting the dataset into test and training sets. The validation dataset is also explained. |
| K-Fold Cross-Validation | Breakdown of k-fold cross-validation for model selection and hyperparameter tuning. |
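A minimal sketch of the partitioning and k-fold steps above with scikit-learn; the model, split ratio, and random seeds are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set; stratify to preserve class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold cross-validation on the training portion only,
# keeping the test set untouched for the final evaluation.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=cv)
print(scores.mean())  # mean accuracy across the 5 folds
```

A separate validation split (or the CV folds themselves) serves for tuning, so the test set estimates generalization only once.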
Optimization
Model Selection
| File | Comments |
|---|---|
| Finding the Best Model | Finding the best model for a task and dataset using K-Fold Cross-Validation. |
| GridSearchCV | How to use GridSearchCV to fine-tune hyperparameters, and for feature selection. WARNING! COMPUTATIONALLY INTENSIVE. |
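A small sketch of hyperparameter tuning with `GridSearchCV`; the estimator and parameter grid are assumptions for illustration, and even this tiny grid illustrates why the table flags the method as computationally intensive (every parameter combination is fit once per fold).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Exhaustive search over a small hyperparameter grid with 5-fold CV:
# 3 values of C x 2 kernels x 5 folds = 30 fits.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

print(search.best_params_)  # best parameter combination found
print(search.best_score_)   # its mean cross-validated score
```

`search.best_estimator_` is refit on the full data and can be used directly for prediction.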