III. General Data Science

Refer to IV. ML Models for building ML algorithms.

Appendix

File Comments
Introduction to scikit-learn Brief overview of using scikit-learn as well as the data representation used.
Summary of Steps in Building a Model General Steps to follow when building a model.
Ensemble Learning -
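As a rough illustration of the scikit-learn data representation mentioned above, the sketch below loads a bundled dataset as a feature matrix and target vector, then runs the usual fit/predict cycle. The choice of the iris dataset and of `LogisticRegression` is an assumption made for the example, not something the linked notes prescribe.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Feature matrix X has shape (n_samples, n_features);
# target vector y has shape (n_samples,).
X, y = load_iris(return_X_y=True)

# scikit-learn estimators share one interface: construct, fit, predict.
model = LogisticRegression(max_iter=1000)  # assumed model, for illustration
model.fit(X, y)
predictions = model.predict(X[:5])

print(X.shape, y.shape, predictions.shape)
```

The same `fit`/`predict` pattern applies to most estimators, which is what makes swapping models during model selection straightforward.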

General Data Science Steps

1. Exploratory Data Analysis (EDA)

File Comments
1. Displaying Dataset Description and Shape Getting summary of the DataFrame object (including data types, number of columns and rows in the dataset) and shape (# of features and records).
2. Checking for Duplicate Rows Checking for duplicated rows (e.g., the total number of duplicates and whether any duplicated rows exist in the dataset).
3. Checking for Missing Values Checking for missing values in the dataset (e.g., total number of missing values and whether there are missing values in the overall dataset)
4. Getting Number of Unique Values The pandas nunique() method counts the distinct values in a Series, or along a specified axis of a DataFrame.
5. Checking for Samples and Target Sizes Confirm that the number of samples matches the number of target values.
6. Multicollinearity If fitting a linear regression or binary logistic regression, which assume the predictors are not highly correlated with one another, check for multicollinearity.
7. Getting a Quick Statistical Summary Getting a quick statistical summary including count, mean, standard deviation, median, mode, minimum value, maximum value, and range.
8. Univariate Analysis -
9. Bivariate Analysis -
10. Multivariate Analysis -
11. Data Transformations -
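The first seven checks in the list above can be sketched with pandas as follows. The DataFrame and its column names are hypothetical toy data invented for this example, and the pairwise correlation matrix is only a simple first look at multicollinearity, not a full diagnostic.

```python
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, 32, 47, None],
    "income": [40000, 52000, 52000, 81000, 61000],
    "city": ["Austin", "Boston", "Boston", "Austin", "Denver"],
    "target": [0, 1, 1, 0, 1],
})

# 1. Description and shape: dtypes, non-null counts, rows and columns.
df.info()
n_rows, n_cols = df.shape

# 2. Duplicate rows: total count and whether any exist.
n_duplicates = df.duplicated().sum()
has_duplicates = df.duplicated().any()

# 3. Missing values: per column, in total, and whether any exist.
missing_per_column = df.isna().sum()
n_missing = missing_per_column.sum()
has_missing = df.isna().any().any()

# 4. Number of unique values per column (NaN is not counted by default).
unique_counts = df.nunique()

# 5. Sample and target sizes should match.
X = df.drop(columns="target")
y = df["target"]
sizes_match = len(X) == len(y)

# 6. Multicollinearity: pairwise correlations between numeric features.
correlations = X.select_dtypes("number").corr()

# 7. Quick statistical summary, plus median and range, which
#    describe() does not report under those names.
numeric = df.select_dtypes("number")
summary = numeric.describe()
medians = numeric.median()
value_range = numeric.max() - numeric.min()
```

Correlations close to 1 or -1 in `correlations` are a hint that two predictors carry overlapping information and deserve a closer look.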

2. Data Preprocessing Techniques

Feature Selection
File Comments
Importance of Feature Selection What feature selection is, and approaches to it (e.g., univariate).
(Method) Wrapper-Based Wrapper-based feature selection methods, including recursive feature elimination and forward/backward selection.
(Method) Filter-Based Filter-based methods (e.g., SelectKBest, VarianceThreshold) that select the best features using a univariate statistical approach.
Finding Optimal n Using RFECV Using the RFE method with cross-validation to find the optimal number of features for a given base model. WARNING: COMPUTATIONALLY INTENSIVE.
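The filter-based and wrapper-based methods listed above might look like the following in scikit-learn. This is a minimal sketch on synthetic data: the dataset, the `LogisticRegression` base model, and the values of `k`, `n_features_to_select`, and `cv` are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    RFE,
    RFECV,
    SelectKBest,
    SequentialFeatureSelector,
    VarianceThreshold,
    f_classif,
)
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumed for illustration): 10 features, 4 informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4,
    n_redundant=2, random_state=0,
)

# Filter-based: drop constant features, then keep the k best by ANOVA F-score.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)
kbest = SelectKBest(score_func=f_classif, k=5)
X_kbest = kbest.fit_transform(X, y)

# Wrapper-based: recursive feature elimination around a base model.
base_model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=base_model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

# Wrapper-based: forward selection (use direction="backward" for backward).
forward = SequentialFeatureSelector(
    base_model, n_features_to_select=3, direction="forward", cv=3,
)
X_forward = forward.fit_transform(X, y)

# RFECV: RFE plus cross-validation to choose the number of features.
# Cost grows with the number of features and folds.
rfecv = RFECV(estimator=base_model, step=1, cv=5, scoring="accuracy")
rfecv.fit(X, y)
optimal_n = rfecv.n_features_
selected_mask = rfecv.support_
```

`selected_mask` is a boolean array over the original columns, so it can be used to index a list of feature names.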
Other Processing Steps
File Comments
Dummy Encoding Dummy encoding of categorical variables. This should be done before feature selection.
Label Encoding Encodes the classes of a target variable as integers. Can also be applied to nominal data without inherent order, though the integer codes imply a ranking that some models will treat as meaningful.
Normalizing or Standardizing Features How to standardize continuous variables. This should be done before feature selection. Covers standardizing the entire feature matrix as well as selected features only.
Ordinal Encoding The OrdinalEncoder is used for encoding features and assumes an order or ranking among the categories.
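The four processing steps above can be sketched on a small table as shown here. The DataFrame, its column names, and the category order passed to `OrdinalEncoder` are hypothetical values made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],     # nominal feature
    "size": ["small", "large", "medium", "small"],  # ordinal feature
    "height": [1.0, 2.0, 3.0, 4.0],                 # continuous feature
    "label": ["cat", "dog", "cat", "bird"],         # target classes
})

# Dummy encoding: one indicator column per category, dropping the first
# category to avoid a redundant column.
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Label encoding: map target classes to integers (sorted alphabetically).
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(df["label"])

# Ordinal encoding: pass the category order explicitly so the integer
# codes follow the intended ranking rather than alphabetical order.
ordinal_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_encoded = ordinal_encoder.fit_transform(df[["size"]])

# Standardizing selected features only: zero mean, unit variance.
scaler = StandardScaler()
df_scaled = df.copy()
df_scaled[["height"]] = scaler.fit_transform(df[["height"]])
```

In a real workflow the encoders and scaler would be fit on the training split only and then applied to the test split, so that no information leaks from the test data.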
Miscellaneous Code Snippets for Processing
File Comments

3. Model Evaluation: Data Partitioning Intro Train, Test, and Validation Sets

File Comments
Partitioning the Dataset Splitting the dataset into training and test sets. The validation set is also explained.
K-Fold Cross-Validation Breakdown of k-fold cross-validation for model selection and hyperparameter tuning.
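A minimal sketch of both partitioning approaches named above, assuming the iris dataset, a 60/20/20 train/validation/test split, and five folds. None of these choices come from the linked notes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as the test set, keeping class proportions.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42,
)

# Split the remainder again: 25% of the remaining 80% is 20% of the
# original data, which becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25,
    stratify=y_train_full, random_state=42,
)

# K-fold cross-validation on the non-test data: each fold serves once
# as the validation fold while the others are used for training.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_train_full, y_train_full, cv=kfold,
)
mean_score = scores.mean()
```

With cross-validation, a separate validation split is often unnecessary, because every training sample is used for validation exactly once. The test set stays untouched until the final evaluation.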

Optimization

Model Selection

File Comments
Finding the Best Model Finding the best model for a task and dataset using K-Fold Cross-Validation.
GridSearchCV How to use GridSearchCV to tune hyperparameters and to select features. WARNING: COMPUTATIONALLY INTENSIVE.
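The two model selection topics above could be sketched as follows: compare candidate models by mean cross-validated score, then grid-search a pipeline so that the number of selected features is tuned alongside a model hyperparameter. The candidate models, the parameter grid, and the step names `select` and `clf` are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# Finding the best model: mean 5-fold CV accuracy per candidate.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
mean_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(mean_scores, key=mean_scores.get)

# GridSearchCV over a pipeline: "select__k" tunes feature selection and
# "clf__n_neighbors" tunes the model. Every combination is refit once per
# fold (3 x 3 x 5 = 45 fits here), which is why large grids get expensive.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", KNeighborsClassifier()),
])
param_grid = {
    "select__k": [2, 3, 4],
    "clf__n_neighbors": [3, 5, 7],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

Putting the selector inside the pipeline means it is refit on each training fold, so the cross-validated score is not inflated by features chosen with knowledge of the validation fold.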