Essential Tools: Scikit-Learn

Introduction

In this chapter we discuss another important Python package, Scikit-Learn (sklearn).

Scikit-Learn (sklearn) is a powerful Python package for machine learning. The theoretical underpinnings of the methods introduced in this lecture will be covered in later lectures. The intent of this lecture is to demonstrate

  1. how to implement a machine learning model using Scikit-Learn and
  2. how the Scikit-Learn API is structured.

The second point is key, because understanding the generality of the API will allow you to easily work with different sklearn machine learning models.

The general framework for implementing a machine learning model in sklearn is

  1. Import the sklearn objects you need.
  2. Prepare the dataset.
  3. Instantiate a machine learning object.
  4. Train your model on the dataset.
  5. Use the trained model to make predictions.
  6. Evaluate the performance of the model.
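
The six steps above can be sketched end to end in a few lines. This is only an illustration: it assumes a small synthetic dataset from make_regression (a stand-in chosen so that nothing needs to be downloaded), and the variable names are arbitrary.

```python
# 1. Import the sklearn objects you need.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 2. Prepare the dataset (synthetic stand-in: 200 samples, 3 features).
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# 3. Instantiate a machine learning object.
model = LinearRegression()

# 4. Train the model on the training data.
model.fit(X_train, y_train)

# 5. Use the trained model to make predictions.
y_pred = model.predict(X_test)

# 6. Evaluate the performance of the model (R^2 on the test set).
test_r2 = model.score(X_test, y_test)
print("Test R^2:", test_r2)
```

Swapping LinearRegression for another sklearn regressor would leave steps 4 to 6 unchanged, which is the point of the shared API.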

We will demonstrate the sklearn machine learning framework by working with the sklearn LinearRegression object. This object can be used to train a linear model that predicts continuous values.

In this example we will work with the California housing dataset. We will see how to predict the median house value of a district from features such as the median house age and the average number of bedrooms.

We will cover theoretical details of linear regression in the Linear regression lecture.

We will also use sklearn when we cover

Import sklearn

Let’s first import the sklearn module and the submodules that will be used in the following demonstrations.

import sklearn
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

print(sklearn.__version__)
1.6.1

Fetching Datasets

Scikit-Learn provides a variety of dataset loaders in the datasets submodule that take care of retrieving the data for you.

Function                        Description
fetch_20newsgroups              Load the filenames and data from the 20 newsgroups dataset (classification).
fetch_20newsgroups_vectorized   Load and vectorize the 20 newsgroups dataset (classification).
fetch_california_housing        Load the California housing dataset (regression).
fetch_covtype                   Load the covertype dataset (classification).
fetch_kddcup99                  Load the kddcup99 network intrusion detection dataset (classification).
fetch_lfw_pairs                 Load the Labeled Faces in the Wild (LFW) pairs dataset (classification).
fetch_lfw_people                Load the Labeled Faces in the Wild (LFW) people dataset (classification).
fetch_olivetti_faces            Load the Olivetti faces data-set from AT&T (classification).
fetch_rcv1                      Load the Reuters Corpus Volume I multilabel dataset of 800K newswire stories (classification).
fetch_species_distributions     Loader for species distribution dataset from Phillips et.

Loading Built-in Toy Datasets

Function            Description
load_breast_cancer  Load the breast cancer dataset (classification).
load_diabetes       Load the diabetes dataset (regression).
load_digits         Load the digits dataset (classification).
load_iris           Load the iris dataset (classification).
load_linnerud       Load the linnerud physiological and exercise dataset for 20 subjects (regression).
load_wine           Load the wine dataset (classification).
load_boston         Load the Boston housing dataset (regression). Removed in scikit-learn 1.2, so it is not available in version 1.6.1.

Other Dataset Loading Functions

Function      Description
fetch_file    Fetch a file from the web if not already present in the local folder, otherwise return file path. Check SHA256 checksum when provided.
fetch_openml  Fetch datasets from the over 6K available at openml by name or dataset id.

  • Below we import the California housing dataset.
    • The task is to predict the median house value of a district based on 8 features.
  • We store the data as NumPy arrays called
    • X, the features (2-D), and
    • y, the target (labels) (1-D).
# Fetch data
housing_dataset = fetch_california_housing()
X = housing_dataset["data"]
y = housing_dataset["target"]
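
The object returned by a loader is a dictionary-like Bunch, so its fields can be read either by key or by attribute. The sketch below illustrates this with load_diabetes, used here only as a stand-in because it ships with scikit-learn and needs no download. The same access pattern should apply to the housing data.

```python
from sklearn.datasets import load_diabetes

# Stand-in dataset bundled with scikit-learn (no network access needed).
dataset = load_diabetes()

# A Bunch supports both dictionary-style and attribute-style access.
X = dataset["data"]
y = dataset.target

print(X.shape)                # (n_samples, n_features): a 2-D array
print(y.shape)                # (n_samples,): a 1-D array
print(dataset.feature_names)  # names of the columns of X

# Passing return_X_y=True skips the Bunch and returns the two arrays directly.
X2, y2 = load_diabetes(return_X_y=True)
```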

  • We can see a description of the dataset by using the .DESCR attribute.
print(housing_dataset.DESCR)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. rubric:: References

- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
  Statistics and Probability Letters, 33 (1997) 291-297

Visualization?

Can we visualize the dataset?

Not directly because the dataset has 8 features (8-dimensional).

Later we will talk about dimensionality reduction and clustering techniques that can help us visualize the dataset.

  • We are going to train a model to predict the median house value.
  • To train and test the model we are going to split the dataset into 80% training data and 20% test data.
  • We use the train_test_split function from sklearn.
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
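
To see what train_test_split does, here is a small sketch on hypothetical toy arrays (X_toy and y_toy are made up for illustration). It checks the 80/20 sizes, that rows of X stay paired with their targets after shuffling, and that a fixed random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, row i is [2i, 2i + 1] and has target i.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42
)
print(X_tr.shape, X_te.shape)  # (80, 2) (20, 2)

# Rows are shuffled, but each row of X keeps its matching target.
print(np.array_equal(X_te[:, 0], 2 * y_te))  # True

# Reusing the same random_state reproduces exactly the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42
)
print(np.array_equal(X_te, X_te2))  # True
```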
  • As part of the data preparation we will want to scale the dataset.
  • Datasets often contain data with different orders of magnitude.
  • Scaling the dataset prevents data with a large magnitude from potentially dominating the model.

We will use the StandardScaler object, which scales each feature to have zero mean and unit variance (i.e., a standard deviation of 1).

There are of course other scaling objects. See the sklearn.preprocessing documentation for the alternatives.

For this example we are not going to scale the target variable because it represents a price (in units of $100,000). Depending on the application, you may need to scale your target variable.

# Scale the features: fit the scaler on the training set only,
# then apply the same transformation to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
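
Note that the scaler is fitted on the training set and only applied to the test set. The following sketch, on made-up data with two features of very different magnitude, shows that the fitted scaler stores the training mean and standard deviation (mean_ and scale_) and reuses them in transform().

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 0 is of order 1, feature 1 of order 1000.
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=[5.0, 3000.0], scale=[2.0, 800.0], size=(200, 2))
X_te = rng.normal(loc=[5.0, 3000.0], scale=[2.0, 800.0], size=(50, 2))

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from training data
X_te_scaled = scaler.transform(X_te)      # reuse the training mean/std

# After scaling, each training feature has mean ~0 and standard deviation ~1.
print(X_tr_scaled.mean(axis=0))
print(X_tr_scaled.std(axis=0))

# transform() is just (x - training mean) / training standard deviation.
manual = (X_te - scaler.mean_) / scaler.scale_
print(np.allclose(manual, X_te_scaled))  # True
```

The scaled test features will generally not have exactly zero mean, which is expected: the test set must not influence the statistics used for scaling.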

Instantiate and train the machine learning object

With our scaled dataset, we are now in a position to instantiate and train our model. The following code accomplishes this.

# Create regression model
reg = LinearRegression()
# Train
reg.fit(X_train, y_train)
LinearRegression()

We instantiate a LinearRegression() object and store it in a variable called reg. We then call the fit(X_train, y_train) method to train the model.

The fit() method is the common way to train a wide variety of models in sklearn.
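
As a rough illustration of this shared interface, the sketch below fits three different regressors with the same call. The choice of Ridge and DecisionTreeRegressor and the synthetic data are assumptions made for the example. It also shows that fit() returns the estimator itself, which is why a notebook cell ending in reg.fit(...) displays LinearRegression().

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 100 samples, 4 features.
X, y = make_regression(n_samples=100, n_features=4, noise=1.0, random_state=0)

models = [LinearRegression(), Ridge(alpha=1.0), DecisionTreeRegressor(random_state=0)]
for model in models:
    fitted = model.fit(X, y)  # same call for every model
    print(type(model).__name__, fitted is model)  # fit() returns the model itself

# After fitting, a linear model exposes its learned parameters.
reg = models[0]
print(reg.coef_.shape)  # (4,): one coefficient per feature
print(reg.intercept_)
```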

Prediction with the model

The linear regression model has now been trained. To make predictions we use the predict() method. This method is also shared across many machine learning model classes.

# Predict on test set
y_pred = reg.predict(X_test)

The values in y_pred are the model's predictions of the median house values based on the input features of X_test.
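
For a linear regression model, predict() amounts to a matrix product with the learned coefficients plus the intercept. A minimal check on synthetic data (the dataset is an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: 50 samples, 3 features.
X, y = make_regression(n_samples=50, n_features=3, noise=2.0, random_state=1)

reg = LinearRegression().fit(X, y)
y_pred = reg.predict(X)

# The same predictions computed by hand: X @ coefficients + intercept.
manual = X @ reg.coef_ + reg.intercept_
print(np.allclose(y_pred, manual))  # True
```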

Model Evaluation – \(R^2\)

To evaluate the performance of the model, we can use the score() method, which is also shared across many model objects in sklearn.

It calculates the \(R^2\) value (the coefficient of determination), which is at most 1 and provides a measure of how good the fit of the model is.

A value of 1 means the model fits the data perfectly, while a value of 0 means the model does no better than always predicting the mean of the target. On held-out data \(R^2\) can even be negative, which indicates a fit worse than that constant prediction.

\[ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \]

where \(\bar{y}\) is the sample mean of the target vector.

Note: More on this in a later lecture.
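
The formula can be checked by hand on a tiny example. The helper r_squared below is a hypothetical name written for this sketch, and its results are compared with sklearn's r2_score. The three cases show a perfect fit, a constant prediction equal to the mean, and predictions that are worse than the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_hat):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]

print(r_squared(y_true, [1.0, 2.0, 3.0]))  # 1.0  (perfect fit)
print(r_squared(y_true, [2.0, 2.0, 2.0]))  # 0.0  (always predicting the mean)
print(r_squared(y_true, [3.0, 2.0, 1.0]))  # -3.0 (worse than the mean)

# The hand-written version agrees with sklearn's implementation.
print(np.isclose(r_squared(y_true, [3.0, 2.0, 1.0]), r2_score(y_true, [3.0, 2.0, 1.0])))
```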

Model Evaluation – MSE

The mean squared error (MSE) is given by the formula

\[ \text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \]

where

  • \(n\) is the number of data points in the target vector,
  • \(y_i\) are the true values of the test set (y_test), and
  • \(\hat{y}_i\) are the predicted values (y_pred).
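
As a small worked example, the MSE can be computed directly from the formula with NumPy and compared with mean_squared_error. The four values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted values.
y_true = np.array([3.0, 1.0, 2.0, 4.0])
y_hat = np.array([2.5, 1.0, 3.0, 3.0])

# Errors are 0.5, 0, -1, 1, so the squared errors are 0.25, 0, 1, 1.
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_manual)   # 0.5625
print(mse_sklearn)  # 0.5625
```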

Computing \(R^2\) and MSE

# R^2 value
r2 = reg.score(X_test, y_test)
print("The R^2 score is : ", r2)

# Report Mean Square Error (mse)
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error: ", mse)
The R^2 score is :  0.5757877060324508
Mean squared error:  0.5558915986952444
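
Because the MSE is in squared units of the target, it can be easier to interpret its square root (the RMSE). The sketch below takes the MSE printed above and, using the fact that the target is expressed in units of $100,000 (see the dataset description), converts it to an approximate typical error in dollars.

```python
import math

# MSE reported above, in squared units of $100,000.
mse = 0.5558915986952444

# The root mean squared error is back in the units of the target.
rmse = math.sqrt(mse)

# Convert from units of $100,000 to dollars.
rmse_dollars = rmse * 100_000

print(f"RMSE: {rmse:.3f} (about ${rmse_dollars:,.0f})")
```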

Summary

  • We demonstrated how to use Scikit-Learn to train a linear regression model.

We saw how to:

  • use the train_test_split() function to split the dataset into training and test sets.
  • use the StandardScaler class to scale the dataset.
  • use the LinearRegression class to create the model and its fit() method to train it.
  • use the predict() method to make predictions.
  • use the score() method and the mean_squared_error() function to evaluate the performance of the model.

You’ll get a chance to practice this in the homework.
