Introduction to scikit-learn
tags: #python/data_science/appendix
What is it?
Scikit-learn (Sklearn) is a robust and commonly used library for machine learning in Python.
The library includes functions for modelling data, including classification, regression, clustering, and dimensionality reduction.
Installing scikit-learn
To install scikit-learn:
pip install -U scikit-learn
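A quick way to confirm the install worked is to import the package and read its version string. This is only a minimal check; note that the package is installed as scikit-learn but imported as sklearn.

```python
import sklearn

# The package installs as "scikit-learn" but is imported as "sklearn".
installed_version = sklearn.__version__
print(installed_version)
```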
Importing scikit-learn
The library is organized into modules that provide different machine learning algorithms, preprocessing tools, and utilities for working with data.
Using scikit-learn linear models
from sklearn.linear_model import LogisticRegression # logit
from sklearn.linear_model import LinearRegression # linear regression
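As a rough sketch of how these two estimators are used, the example below fits each one with the usual fit/predict pattern. The data is synthetic and made up purely for illustration (a perfect line for the regression, a simple threshold for the classification), and the variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic data (an assumption for illustration): y = 2x + 1 exactly.
X_reg = np.arange(10, dtype=float).reshape(-1, 1)  # shape (10, 1)
y_reg = 2.0 * X_reg.ravel() + 1.0

# Linear regression: fit, then read the learned slope and intercept.
lin = LinearRegression()
lin.fit(X_reg, y_reg)
slope = lin.coef_[0]
intercept = lin.intercept_

# Synthetic binary labels: class 1 when x >= 5.
y_clf = (X_reg.ravel() >= 5).astype(int)

# Logistic regression: fit, then predict classes for two new points.
logit = LogisticRegression()
logit.fit(X_reg, y_clf)
predicted = logit.predict(np.array([[0.0], [9.0]]))
```

Both estimators expect a 2-D feature array, which is why the single feature is reshaped to (10, 1) before fitting.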
The library also includes modules for evaluating model performance.
# example performance metrics for classification
from sklearn.metrics import classification_report, confusion_matrix
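A small sketch of how these two metrics are called. The true and predicted labels below are hypothetical values invented for the example; in practice they would come from a held-out test set and a fitted model.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels, made up for illustration.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall, and F1 as a formatted string.
report = classification_report(y_true, y_pred)
print(report)
```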
Using scikit-learn datasets
Scikit-learn also has a module containing sample datasets, such as iris and digits for classification and diabetes for regression. (The Boston house prices dataset was removed in scikit-learn 1.2.)
from sklearn.datasets import load_DATASET  # placeholder, e.g. load_iris or load_digits
data = load_DATASET()  # returns a Bunch (dict-like), not a DataFrame
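For a concrete instance of the placeholder above, here is a sketch using load_iris. By default the loaders return a dict-like Bunch of NumPy arrays; passing as_frame=True asks for pandas objects instead (this assumes pandas is installed).

```python
from sklearn.datasets import load_iris

# as_frame=True asks the loader for pandas objects instead of NumPy arrays.
iris = load_iris(as_frame=True)

df = iris.frame   # DataFrame with the feature columns plus a "target" column
X = iris.data     # DataFrame of the four feature columns
y = iris.target   # Series of integer class labels

print(df.shape)
```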
We can also load datasets as a pandas DataFrame using the seaborn library (the data is downloaded, so this needs an internet connection):
import seaborn as sns
df = sns.load_dataset('DATASET NAME')
Data Representation in scikit-learn
Sklearn expects data as a two-dimensional matrix of features and a one-dimensional target array:
Feature Matrix (X)
The feature matrix is two-dimensional, with shape [n_samples, n_features], and is most often stored as a pandas DataFrame or NumPy array. It is conventionally denoted X.
features = df[['LIST OF FEATURE COLS']]
X = features
Target Vector (y)
In addition to the feature matrix X, we also generally work with a label or target array (vector), which by convention we usually call y.
The target array is usually one-dimensional, with length n_samples, and is generally stored as a NumPy array or pandas Series.
y = df['TARGET COL']
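To make the two shapes concrete, here is a small sketch that builds X and y from a toy DataFrame. The column names and values are hypothetical, chosen only to show the difference between double-bracket and single-bracket selection.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
    "label": [0, 0, 1, 1],
})

X = df[["height_cm", "weight_kg"]]  # double brackets give a 2-D DataFrame
y = df["label"]                     # single brackets give a 1-D Series

print(X.shape)  # (4, 2)
print(y.shape)  # (4,)
```

Selecting even a single feature with double brackets keeps X two-dimensional, which is the shape sklearn estimators expect.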