Introduction to scikit-learn

tags: #python/data_science/appendix

What is it?

Scikit-learn (Sklearn) is a robust and commonly used library for machine learning in Python.

The library includes functions for modelling data, including classification, regression, clustering, and dimensionality reduction.

Installing scikit-learn

To install scikit-learn:

pip install -U scikit-learn
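
As a quick check that the installation worked, the package can be imported and its version printed. This is a minimal sketch; note that the package is installed as scikit-learn but imported under the name sklearn.

```python
import sklearn

# Print the installed version to confirm the package is importable.
print(sklearn.__version__)
```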

Importing scikit-learn

The library is organised into modules that provide different machine learning algorithms, preprocessing tools, and utilities for working with data.

Using scikit-learn linear models
from sklearn.linear_model import LogisticRegression # logit
from sklearn.linear_model import LinearRegression # linear regression
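
To illustrate how an imported model is typically used, here is a minimal sketch that fits a LinearRegression on toy data. The data values are made up for illustration and are not from any real dataset; LogisticRegression follows the same fit and predict pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (made up for illustration): y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])  # 2-dimensional: 4 samples, 1 feature
y = np.array([1.0, 3.0, 5.0, 7.0])          # 1-dimensional: 4 target values

model = LinearRegression()
model.fit(X, y)

# Learned slope and intercept; these should be close to 2 and 1.
print(model.coef_, model.intercept_)

# Predict for a new sample; the input must also be 2-dimensional.
print(model.predict(np.array([[4.0]])))
```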

The library also includes modules for evaluating model performance.

# example performance metrics for classification
from sklearn.metrics import classification_report, confusion_matrix
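
As a rough sketch of how these two metrics are called, the example below compares a list of true labels with a list of predicted labels. Both lists are hypothetical and stand in for the output of a real classifier.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels for a binary classifier.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall, and F1 score as formatted text.
report = classification_report(y_true, y_pred)
print(report)
```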

Using scikit-learn datasets

Scikit-learn also has a module containing sample datasets, such as iris and digits for classification and the Boston house prices for regression (note that the Boston loader was removed in scikit-learn 1.2).

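from sklearn.datasets import load_DATASET # placeholder: replace DATASET with a dataset name, e.g. iris
dataset = load_DATASET() # returns a dict-like Bunch object, not a DataFrame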
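
As a concrete instance of the placeholder pattern above, this sketch loads the iris dataset. It assumes pandas is installed for the as_frame=True call, which asks the loader for pandas objects rather than NumPy arrays.

```python
from sklearn.datasets import load_iris

# load_iris returns a Bunch, a dict-like object with named fields.
iris = load_iris()
print(iris.data.shape)      # feature values as a NumPy array
print(iris.target.shape)    # class labels as a NumPy array
print(iris.feature_names)   # names of the feature columns

# as_frame=True returns pandas objects; .frame holds features and target together.
iris_df = load_iris(as_frame=True).frame
print(iris_df.head())
```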

We can also download datasets in the form of a Pandas DataFrame using the seaborn library:

import seaborn as sns

df = sns.load_dataset('DATASET NAME')

Data Representation in scikit-learn

Scikit-learn expects data as a two-dimensional matrix of features and a one-dimensional target array:

[Figure: the 2-dimensional feature matrix, denoted as "X", with one row per sample (sample 1, sample 2, ...) and one column per feature (F1 to F6), next to the target variable, denoted as "y", a 1D array.]

Feature Matrix (X)

The feature matrix is two-dimensional, with shape [n_samples, n_features], and is most often stored as a Pandas DataFrame. By convention, it is denoted as X.

features = df[['LIST OF FEATURE COLS']]
X = features
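
To make the shape convention concrete, here is a small sketch using a hypothetical DataFrame. The column names and values are invented for illustration only.

```python
import pandas as pd

# Hypothetical data: three samples with two features and a label column.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0],
    "weight_kg": [50.0, 60.0, 70.0],
    "label": [0, 0, 1],
})

# Double brackets select a list of columns and return a 2-dimensional DataFrame.
X = df[["height_cm", "weight_kg"]]
print(X.shape)  # (n_samples, n_features)
```

Selecting with a list of column names keeps the result two-dimensional even when the list contains a single column.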

Target Vector (y)

In addition to the feature matrix X, we also generally work with a label or target array (vector), which by convention we will usually call y.

The target array is usually one-dimensional, with length n_samples, and is generally stored in a NumPy array or Pandas Series.

y = df['TARGET COL']
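
Putting the two pieces together, the sketch below builds X and y from a hypothetical DataFrame and passes them to an estimator. The column names, values, and the choice of LogisticRegression are assumptions made for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: six samples, two features, and a binary target.
df = pd.DataFrame({
    "f1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "target": [0, 0, 0, 1, 1, 1],
})

X = df[["f1", "f2"]]  # double brackets: 2-dimensional DataFrame
y = df["target"]      # single brackets: 1-dimensional Series

# The number of rows in X must match the length of y.
print(X.shape, y.shape)

# Estimators take X and y together in fit, and X alone in predict.
clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict(X))
```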