Essential Tools: Scikit-Learn

Introduction

In this chapter we discuss another important Python package, Scikit-Learn (sklearn).

Scikit-Learn (sklearn) is a powerful Python package for machine learning. The theoretical underpinnings of the methods introduced in this lecture will be covered in later lectures. The intent of this lecture is to

demonstrate how to implement a machine learning model using Scikit-Learn and
explain the structure of the Scikit-Learn API.

The second point is key: because the API is shared across models, understanding it once lets you work with many different sklearn machine learning models.

The general framework for implementing a machine learning model in sklearn is

Import the sklearn objects you need.
Prepare the dataset.
Instantiate a machine learning object.
Train your model on the dataset.
Use the trained model to make predictions.
Evaluate the performance of the model.
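
The six steps above can be sketched end to end as follows. This is only a minimal sketch, not the housing example used in this lecture: it assumes a synthetic dataset generated with make_regression so that it runs without downloading anything, and the variable names (X_demo, model, y_hat, and so on) are illustrative.

```python
# 1. Import the sklearn objects you need
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 2. Prepare the dataset (synthetic data, an assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=3, noise=5.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# 3. Instantiate a machine learning object
model = LinearRegression()

# 4. Train the model on the training data
model.fit(X_tr, y_tr)

# 5. Use the trained model to make predictions
y_hat = model.predict(X_te)

# 6. Evaluate the performance of the model
print("R^2:", model.score(X_te, y_te))
print("MSE:", mean_squared_error(y_te, y_hat))
```

Swapping LinearRegression for another sklearn regressor would typically leave steps 2 and 4 to 6 unchanged.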

We will demonstrate the sklearn machine learning framework by working with the sklearn LinearRegression object. This object can be used to train a linear model that predicts continuous values.

In this example we will work with the California housing dataset. We will see how to predict the median house value of a district from features such as the median house age and the average number of bedrooms per household.

We will cover theoretical details of linear regression in the Linear regression lecture.

We will also use sklearn when we cover

Import sklearn

Let’s first import the sklearn module and the submodule objects that will be used in the following demonstrations.

import sklearn
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

print(sklearn.__version__)

1.5.2

Prepare the dataset

Scikit-Learn provides a variety of datasets in the datasets submodule. These can be used to train simple models. Below we fetch the California housing dataset. We store the features as a 2-D NumPy array called \(X\) and the target (labels) as a 1-D NumPy array called \(y\). We can print a description of the dataset using the .DESCR attribute.

# Fetch data
housing_dataset = fetch_california_housing()
X = housing_dataset["data"]
y = housing_dataset["target"]
print(housing_dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. rubric:: References

- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
  Statistics and Probability Letters, 33 (1997) 291-297
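
The object returned by the dataset functions supports both dictionary-style and attribute-style access. The sketch below illustrates this using load_diabetes as a stand-in, which is an assumption made so that the example runs without a download; the object returned by fetch_california_housing exposes the same data, target, feature_names and DESCR entries.

```python
from sklearn.datasets import load_diabetes

# Stand-in dataset bundled with sklearn (not the housing data)
demo_dataset = load_diabetes()

# Dictionary-style and attribute-style access return the same arrays
X_demo = demo_dataset["data"]
y_demo = demo_dataset.target

print(X_demo.shape)                # (number of samples, number of features)
print(y_demo.shape)                # one target value per sample
print(demo_dataset.feature_names)  # column names for the features in X_demo
```

Checking the shapes and feature names before training is a quick way to confirm that each column of X corresponds to the attribute you expect.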

We are going to train a model to predict the median house value. To train and test the model we are going to split the dataset into 80% training data and 20% test data. We use the train_test_split function from sklearn.

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
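
To see what train_test_split does, the following sketch applies it to a small toy array (an assumption for illustration, not the housing data). It checks the 80/20 sizes, that rows of the features and targets stay paired after shuffling, and that fixing random_state makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of X_demo is [2*i, 2*i + 1] and its target is i
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)
print(X_a.shape, X_b.shape)  # 80 training rows and 20 test rows

# Features and targets are shuffled together, so each row keeps its target
print(np.array_equal(X_a[:, 0], 2 * y_a))

# The same random_state reproduces the same split
X_a2, X_b2, y_a2, y_b2 = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)
print(np.array_equal(X_b, X_b2))
```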

As part of the data preparation we will want to scale the dataset. Datasets often contain data with different orders of magnitude. Scaling the dataset prevents data with a large magnitude from potentially dominating the model.

We will use the StandardScaler object, which scales each feature to have zero mean and unit variance (i.e., a standard deviation of 1).

Other scaling objects are available; see the documentation of the sklearn.preprocessing submodule.

For this example we are not going to scale the target variable, because it represents a price (in units of $100,000) that we want to predict directly. Depending on the application, you may need to scale your target variable.

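# Scale the data
scaler = StandardScaler()
# Learn the mean and standard deviation from the training data and scale it
X_train = scaler.fit_transform(X_train)
# Apply the same (training) statistics to the test data
X_test = scaler.transform(X_test)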
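
The sketch below illustrates what the scaler does, using randomly generated data with two features on very different scales (an assumption for illustration). It checks that the scaled training data has zero mean and unit standard deviation, and that transform applies the statistics stored in the mean_ and scale_ attributes, which were learned from the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features with very different magnitudes
X_tr = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 200.0], size=(500, 2))
X_te = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 200.0], size=(100, 2))

demo_scaler = StandardScaler()
X_tr_s = demo_scaler.fit_transform(X_tr)  # learn statistics and scale
X_te_s = demo_scaler.transform(X_te)      # reuse the training statistics

print(X_tr_s.mean(axis=0))  # approximately 0 for each feature
print(X_tr_s.std(axis=0))   # approximately 1 for each feature
print(X_te_s.mean(axis=0))  # close to 0, but not exactly 0

# transform() is equivalent to subtracting mean_ and dividing by scale_
manual = (X_te - demo_scaler.mean_) / demo_scaler.scale_
print(np.allclose(manual, X_te_s))
```

Fitting the scaler on the training data alone keeps information about the test set out of the training process.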

Instantiate and train the machine learning object

With our scaled dataset, we are now in a position to instantiate and train our model. The following code accomplishes this.

# Create regression model
reg = LinearRegression()
# Train
reg.fit(X_train, y_train)

LinearRegression()

We instantiate a LinearRegression() object and store it in a variable called reg. We then call the fit(X_train, y_train) method to train the model.

The fit() method is shared by many models in sklearn, so other models are trained in the same way.
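
As a rough illustration of this shared interface, the sketch below trains three different regressors with the same fit() call. The synthetic dataset and the choice of Ridge and DecisionTreeRegressor are assumptions made for illustration; they are not covered in this lecture.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=100, n_features=4, noise=1.0, random_state=0
)

estimators = [
    LinearRegression(),
    Ridge(alpha=1.0),
    DecisionTreeRegressor(random_state=0),
]

returned = []
for est in estimators:
    fitted = est.fit(X_demo, y_demo)  # same call for every model
    returned.append(fitted is est)    # fit() returns the estimator itself
    print(type(est).__name__, fitted is est)
```

Because fit() returns the estimator itself, a notebook displays the model (for example LinearRegression()) when fit() is the last statement in a cell.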

Prediction with the model

The linear regression model has now been trained. To make predictions we use the predict() method. This method is also shared across many machine learning model classes.

# Predict on test set
y_pred = reg.predict(X_test)

The values in y_pred are the model's predictions of the median house values based on the input features in X_test.
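
For a linear regression model, the prediction is a weighted sum of the input features plus an intercept. The following sketch checks this on synthetic data (an assumption for illustration) using the coef_ and intercept_ attributes of a fitted LinearRegression object.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))
# Targets generated from known weights and intercept, plus a little noise
true_weights = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_weights + 3.0 + rng.normal(scale=0.1, size=50)

demo_reg = LinearRegression().fit(X_demo, y_demo)

# predict() is equivalent to X @ coef_ + intercept_
manual_pred = X_demo @ demo_reg.coef_ + demo_reg.intercept_
print(np.allclose(manual_pred, demo_reg.predict(X_demo)))

# The learned parameters should be close to the ones used to generate the data
print(demo_reg.coef_, demo_reg.intercept_)
```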

Evaluating the performance of the model

To evaluate the performance of the model, we can use the score() method, which is also shared across many model objects in sklearn.

In addition, we can use specific functions to evaluate the error in our predictions.

The code below computes the \(R^2\) value of the model and the mean squared error.

The \(R^2\) value measures how well the model fits the data. A value of 1 means the model predicts the data perfectly, while a value of 0 means the model does no better than always predicting the mean of the observed values. On a test set the value can even be negative if the model performs worse than that baseline.
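
As a sketch of how this score is obtained, the example below computes \(R^2\) by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with r2_score from sklearn.metrics. The small arrays are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values (illustration only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))

# Perfect predictions give a score of 1
r2_perfect = r2_score(y_true, y_true)

# Always predicting the mean of the observed values gives a score of 0
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)
print(r2_perfect, r2_baseline)
```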

The mean squared error (MSE) is given by the formula

\[ \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \]

where \(n\) is the number of data points in the target vector, \(y_i\) are the true values of the test set (y_test), and \(\hat{y}_i\) are the predicted values (y_pred).
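
The formula can be checked directly with NumPy. The sketch below uses a few made-up values (an assumption for illustration) and compares the hand-computed result with mean_squared_error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted values (illustration only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# (1/n) * sum of squared differences
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_manual)
print(np.isclose(mse_manual, mse_sklearn))
```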

# R^2 value
r2 = reg.score(X_test, y_test)
print("The R^2 score is : ", r2)

# Report Mean Square Error (mse)
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error: ", mse)

The R^2 score is :  0.575787706032451
Mean squared error:  0.5558915986952442
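
Because the target is expressed in units of $100,000, the mean squared error is in squared units and is hard to interpret directly. One common option, sketched below, is to take its square root (the root mean squared error), which is in the same units as the target. The value of mse is copied from the output above.

```python
import math

mse = 0.5558915986952442  # value printed above
rmse = math.sqrt(mse)      # same units as the target ($100,000)
rmse_dollars = rmse * 100_000

print(f"RMSE: {rmse:.3f} (about ${rmse_dollars:,.0f})")
```

An error of roughly 0.75 in these units means the predictions are typically off by about $75,000.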