Pros: Allows for model tuning and unbiased evaluation.
Cons: Requires more data, more complex.
Key Differences
Train/Test Split:
Simpler, faster.
May lead to overfitting or underfitting.
Train/Validate/Test Sets:
More comprehensive.
Better for hyperparameter tuning and model selection.
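For concreteness, here is a minimal sketch of the three-way split using scikit-learn's train_test_split (the dataset and split proportions are assumed for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out a test set (20%), then split the remainder into
# train (60% of the total) and validation (20% of the total).
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)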
K-Fold Cross Validation
What is k-Fold Cross-Validation?
Definition: A technique to evaluate the performance of a model by dividing the data into k subsets (folds).
Process:
Split the dataset into k equal-sized folds.
Train the model on k-1 folds.
Validate the model on the remaining fold.
Repeat the process k times, each time with a different fold as the validation set.
Average the results to get the final performance metric.
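A minimal sketch of this procedure, using scikit-learn's KFold (the model and dataset are assumed for illustration):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kf.split(X):
    model = RandomForestClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])                 # train on k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))    # validate on the held-out fold

print("Mean accuracy:", np.mean(scores))                  # average the k results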
Benefits
More Reliable Estimates: Provides a better estimate of model performance compared to a single train/test split.
Efficient Use of Data: Utilizes the entire dataset for both training and validation.
Reduces Overfitting: Helps in detecting overfitting by validating on multiple subsets.
Example
5-Fold Cross-Validation: The dataset is split into 5 folds. The model is trained and validated 5 times, each time with a different fold as the validation set.
Key Points
Choice of k: Common choices are 5 or 10, but it can vary depending on the dataset size.
Computational Cost: More computationally intensive than a single train/test split.
Implementing 5-Fold Cross-Validation in sklearn
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

data = load_iris()
X, y = data.data, data.target

model = RandomForestClassifier()
scores = cross_val_score(model, X, y, cv=5)
print("Cross-Validation Scores:", scores)
print("Mean Score:", scores.mean())
Naive Bayes
The classification problem is: given a data point \(\mathbf{x}\), predict the corresponding label \(y\).
A fundamental way to attack this problem is via probability.
A key tool will be Bayes Rule.
Bayes Rule
Let \(A, B\) be events from a sample space \(\Omega\). Recall that Bayes’ rule states
\[ P(A \vert B) = \frac{P(B\vert A) \cdot P(A)}{P(B)}. \]
We’ve now seen several examples of the importance of Bayes’ rule (e.g., GMMs) where we computed the unknown quantity \(P(A \vert B)\) by using the known values for \(P(B\vert A)\), \(P(A)\), and \(P(B)\).
Using Bayes Rule in a Classifier
Let’s start with a simple example
A doctor knows that meningitis causes a stiff neck 75% of the time, i.e., \(P(S\vert M) = 0.75\)
If a patient comes in with a stiff neck, what is the probability she has meningitis, i.e., what is \(P(M\vert S)\)?
We only currently know \(P(S\vert M)\) so we would need additional information to compute \(P(M\vert S)\).
If the doctor also knows the following information
1 in 20 people in the population have a stiff neck at any given time, i.e., \(P(S) = 0.05\)
1 in 10,000 people in the population have meningitis at any given time, i.e., \(P(M) = 0.0001\)
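Applying Bayes' rule with these numbers gives
\[ P(M\vert S) = \frac{P(S\vert M)\,P(M)}{P(S)} = \frac{0.75 \times 0.0001}{0.05} = 0.0015 = 15 \cdot P(M). \]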
The above expression shows the essence of Bayesian reasoning:
We have prior knowledge about the probability of someone having meningitis, i.e., a random person has probability 1/10000 of having meningitis.
When we learn that the person has a stiff neck, it increases their probability of having meningitis by a factor of 15.
Priors and Posteriors
We give these probabilities names according to their roles.
The random person's probability (1/10000) is called the prior probability.
The specific patient's probability (15 \(\cdot\) 1/10000) is called the posterior probability.
We use this same principle to construct a classifier.
Probabilistic Classification
In classification problems we assume that we are given a set of data points \(\mathbf{x}\in\mathbb{R}^{d}\) with \(d\) features and corresponding class labels \(y\).
In our meningitis example, there is a single attribute: stiff neck.
The class label is either meningitis or no meningitis.
How does the value of each feature change the probability of the class label?
More generally, consider a data point \(\mathbf{x}\) having attributes \(x_1, x_2, \dots, x_d\) and various classes (labels) for items: \(C_1, \dots, C_k\).
Our goal is to predict the class of \(\mathbf{x}\).
To do that, we will compute \[P(C_1\vert\mathbf{x}), P(C_2\vert\mathbf{x}), \ldots, P(C_k\vert\mathbf{x}).\]
These form a soft classification of \(\mathbf{x}\).
Maximum A Posteriori
From the soft labels, we can form a hard classification. One way would be to choose the class with the highest posterior probability:
\[ \hat{y} = \arg\max_{i} P(C_i \vert \mathbf{x}). \]
By Bayes' rule, \(P(C_i\vert\mathbf{x}) \propto P(x_1, x_2, \ldots, x_d\vert C_i)\,P(C_i)\), so this gives us a possible way to solve the problem.
The issue that remains is how to estimate
\[ P(x_1, x_2, \ldots, x_d\vert C_i).\]
Why is this difficult?
Imagine if we tried to compute \(P(x_1, x_2, \ldots, x_d\vert C_i)\) directly from data.
We could use a histogram to estimate the necessary distribution, i.e., count how many times we see each combination of feature values.
If there were 20 features (\(d = 20\)) and 10 possible values for each feature, then for each class \(C_i\) we would need to construct a histogram with \(10^{20}\) bins.
This is impossible.
The underlying problem we face is the high dimensionality of the feature space.
The size of our histogram is exponential in the number of features.
So, we need to find a way to reduce the exponential size of the estimation problem.
We will do that by factoring the distribution \(P(x_1, x_2, \ldots, x_d\vert C_i)\).
Here is where the naive part comes in.
Naive Bayes
We will assume that, conditional on the class, attributes are independent in their assignment to items.
That is, within a class, the values of the attributes in one set do not affect the values in any other set; in particular, all correlations among attributes are zero.
This is indeed a naive assumption … but it can be surprisingly effective in practice.
Under this assumption, the class-conditional distribution factors as
\[ P(x_1, x_2, \ldots, x_d\vert C_i) = \prod_{j=1}^{d} P(x_j\vert C_i). \]
This is very helpful computationally, because each individual factor \(P(x_j\vert C_i)\) resides in a much lower-dimensional (single-attribute) space than the full distribution.
In a naive Bayes model, the quantity we calculate for each class \(C_i\) is
\[ P(C_i) \prod_{j=1}^{d} P(x_j\vert C_i), \]
and we assign \(\mathbf{x}\) to the class for which this quantity is largest.
One problem that can arise is when a histogram bin has zero entries. Then the conditional probability for this attribute value is zero, which overrides all the other factors and yields a zero probability.
There are various strategies for making small corrections to the counts that avoid this problem.
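One common strategy is additive (Laplace) smoothing, which adds a small pseudo-count to every bin; in scikit-learn this is the alpha parameter of the discrete naive Bayes classifiers. A minimal sketch, with toy categorical data assumed for illustration:

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy categorical data (assumed): 3 attributes, each taking values in {0, 1, 2}
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 3))
y = rng.integers(0, 2, size=100)

# alpha=1.0 is add-one (Laplace) smoothing: every count starts at 1,
# so no conditional probability P(x_j | C_i) can be exactly zero.
clf = CategoricalNB(alpha=1.0).fit(X, y)
print(clf.predict(X[:5]))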
Continuous Attributes
Continuous attributes can be handled via histograms as well, by binning the values into ranges.
In the previous example, we could create bins to hold ranges of the continuous values for Taxable Income.
However, another commonly used approach is to assume that the data follow a parametric probability distribution, such as a Normal (Gaussian) distribution.
We can form conditional probabilities for Taxable Income by fitting, for each class \(C_i\), a Normal distribution:
\[ P(x_j\vert C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left(-\frac{(x_j - \mu_{ij})^2}{2\sigma_{ij}^2}\right), \]
where \(\mu_{ij}\) and \(\sigma_{ij}\) are the mean and standard deviation of attribute \(x_j\) over the training points in class \(C_i\).
Naive Bayes solves the classification problem through probability.
Training is simple, based on estimating class-conditional histograms or parametric densities of features.
Naive Bayes can work well in high-dimensional settings
Many times the correct label is the MAP estimate, even if individual probabilities are less accurate.
Its principal drawback is the assumption of independence among the features.
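For concreteness, a minimal end-to-end sketch using scikit-learn's GaussianNB (the dataset choice is assumed for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Training just estimates a per-class mean and variance for each feature.
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print("Mean CV accuracy:", scores.mean())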
Support Vector Machines
We now turn to the support vector machine (SVM).
The SVM is based on explicit geometric considerations about how best to build a classifier.
As an example, here is a set of training data, considered as points in \(\mathbb{R}^d\):
Linear Separator
We will start with the idea of a linear separator. This is a hyperplane that forms a decision boundary.
Here is one possible separator
Here is another possible separator
Which separator is better?
They both perfectly separate the two classes in the training data.
But what we really care about is accuracy on the test data, i.e., generalization ability.
It seems that \(B_1\) is a better choice because it is farther from both classes.
New data falling in the region between training classes is more likely to be correctly classified by \(B_1\).
Criterion for Best Separator
This leads to a principle for choosing the best separator:
We are concerned with the margin between the separator and the data, and
We prefer the separator that maximizes the margin.
In fact, there are theoretical results suggesting that this is an optimal strategy for choosing a separator that has good generalization ability.
Linear SVM: Separable Case
Let’s see how we can train an SVM.
Consider a training dataset consisting of tuples \(\{(\mathbf{x}_i, y_i)\}\), where \(\mathbf{x}_i \in \mathbb{R}^d\) and \(y_i \in \{-1, 1\}\).
We’re going to assume that our data can be perfectly separated by a hyperplane.
Any hyperplane (such as \(B_1\) below) can be written as:
\[ w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b = 0\]
or more concisely:
\[ \mathbf{w}^T\mathbf{x} + b = 0. \]
So our decision boundary (i.e., our classifier) has parameters \(\{w_i\}\) and \(b\).
For any \(\mathbf{x}_+\) from the positive class (circle) located above the decision boundary,
\[ \mathbf{w}^T\mathbf{x}_+ + b = k \]
for some positive \(k\).
Likewise for any \(\mathbf{x}_-\) from the negative class (square) located below the decision boundary,
\[ \mathbf{w}^T\mathbf{x}_- + b = -k \]
for the same \(k\).
Rescaling the parameters \(\mathbf{w}\) and \(b\) by \(k\), we obtain new equations for the same hyperplanes
\[
\begin{align*}
b_{11}: \mathbf{w}^T\mathbf{x}_+ + b &= 1 \\
b_{12}: \mathbf{w}^T\mathbf{x}_- + b &= -1
\end{align*}
\]
How far apart are these hyperplanes?
The vector \(\mathbf{w}\) is orthogonal to the hyperplanes \(b_{11}\) and \(b_{12}\).
So the distance between the two hyperplanes is the magnitude of the component of \((\mathbf{x}_+ - \mathbf{x}_-)\) that is in the direction of \(\mathbf{w}\).
Specifically, it is the magnitude of the projection of \((\mathbf{x}_+ - \mathbf{x}_-)\) onto \(\mathbf{w}\):
\[ \text{margin} = \frac{\mathbf{w}^T(\mathbf{x}_+ - \mathbf{x}_-)}{\Vert\mathbf{w}\Vert} = \frac{2}{\Vert\mathbf{w}\Vert}. \]
To maximize the margin, we therefore minimize \(\Vert\mathbf{w}\Vert^2\), subject to the constraints that every training point lies on the correct side of its hyperplane:
\[
\begin{align*}
\mathbf{w}^T\mathbf{x}_i + b \ge 1 \text{ if } y_i &= 1, \\
\mathbf{w}^T\mathbf{x}_i + b \le -1 \text{ if } y_i &= -1.
\end{align*}
\]
This is a constrained optimization problem with a quadratic objective function.
The quadratic form \(\Vert \mathbf{w}\Vert^{2}\) is positive definite, i.e., it is strictly convex and has a unique global minimum.
Quadratic Programs
Such problems are called quadratic programs.
There are standard methods for solving them. The methods are effective but can be slow.
The complexity of the problem grows with the number of constraints, i.e., the number of training points.
Ultimately, only a subset of the training points (constraints) will determine the final solution.
The points that determine the solution support it and are called the support vectors.
Linear SVM: Non-Separable Case
It may well happen that there is no hyperplane that perfectly separates the classes.
In this case, we allow points to fall on the “wrong” side of their separator, but we add a penalty for this occurrence.
To express this formally, we introduce slack variables \(\xi_i\).
Each \(\xi_i\) measures how far the data point is on the “wrong side” of its separator.
The new problem is to find
\[
\mathbf{w}^* = \arg\min_\mathbf{w} \left(\Vert\mathbf{w}\Vert^2 + C \left(\sum_{i=1}^N \xi_i\right)\right),
\]
subject to
\[
\begin{align*}
\mathbf{w}^T\mathbf{x}_i + b \ge 1-\xi_i \text{ if } y_i &= 1, \\
\mathbf{w}^T\mathbf{x}_i + b \le -1+\xi_i \text{ if } y_i &= -1,\\
\xi_i &\ge 0.
\end{align*}
\]
Notice we have introduced a hyperparameter \(C\), which controls the complexity of the SVM. This hyperparameter needs to be set by cross-validation.
A small value of \(C\) allows many points to fall on the wrong side of their corresponding hyperplane (having a nonzero \(\xi_i\)) and results in a large number of support vectors. Small \(C\) results in a more stable decision boundary.
Large \(C\) values will penalize violations heavily, and will result in fewer nonzero \(\xi_i\)’s, leading to fewer support vectors.
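For example, \(C\) is often set by a cross-validated grid search; a minimal sketch, with an assumed grid of candidate values:

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# X, y: training data assumed available from the surrounding notebook.
# cv=5 evaluates each candidate C by 5-fold cross-validation.
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}   # assumed grid
search = GridSearchCV(svm.SVC(kernel='linear'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)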
Nonlinear SVM
Finally, we consider the case in which the decision boundary is strongly nonlinear.
The basic idea here is that we take the data and transform it into another, higher-dimensional space.
Here is the same dataset on transformed coordinates:
Note that \(\Vert\mathbf{x} - \mathbf{y}\Vert^2 = \mathbf{x}^T\mathbf{x} - 2\,\mathbf{x}^T\mathbf{y} + \mathbf{y}^T\mathbf{y}\), so the Euclidean distance can be defined entirely in terms of inner products.
To train an SVM with a different kernel, we replace all the inner products with calls to our new kernel function.
The result is that we can obtain highly curved decision boundaries.
In practice, RBF works well in many cases.
SVM: Summary
In practice, SVMs have shown good results on many problems.
In particular, they are effective at dealing with high-dimensional data through the use of kernels.
Since all data is represented as vectors, and we are relying on distance functions like Euclidean distance, it is important to pay attention to feature scaling when using SVMs.
from sklearn import svm

svc = svm.SVC(kernel='linear')
svc.fit(X_train, y_train)
y_pred_test = svc.predict(X_test)
print(f'Accuracy of SVM on test set: {svc.score(X_test, y_test):0.3f}')
Accuracy of SVM on test set: 0.764
Let’s visualize the decision boundaries.
from matplotlib.colors import ListedColormap

# Create color maps for 3-class classification problem, as with iris
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

def plot_estimator(estimator, X, y):
    try:
        X, y = X.values, y.values
    except AttributeError:
        pass
    estimator.fit(X, y)
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    plt.axis('tight')
    plt.tight_layout()
Note that in practice we should pay attention to feature scaling when using SVMs. We haven’t done that here.
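If we did want to scale, a minimal sketch would standardize the features inside a pipeline (reusing the X_train/X_test split from above):

from sklearn import svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance before the SVM.
scaled_svc = make_pipeline(StandardScaler(), svm.SVC(kernel='linear'))
scaled_svc.fit(X_train, y_train)
print(f'Accuracy with scaling: {scaled_svc.score(X_test, y_test):0.3f}')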
As described already, the SVM gets its name from the samples in the dataset from each class that lie closest to the other class.
These training samples are called “support vectors” because changing their position in the \(d\)-dimensional feature space would change the location of the decision boundary.
In scikit-learn, the support vectors themselves can be found in the support_vectors_ attribute of the SVC object (their indices are in the support_ attribute).
Here, we will use just two of the three classes for clarity.
Since the classes are not linearly separable, there are nonzero slack variables, each of which is associated with a support vector.
Therefore we should consider how regularization is tuned via the \(C\) parameter.
In practice, a large \(C\) value means that the number of support vectors is small (less regularization, more model complexity), while a small \(C\) implies many support vectors (more regularization, less model complexity).
scikit-learn sets a default value of \(C=1\).
Large \(C\)
svc = svm.SVC(kernel='linear', C=1e4)
plot_estimator(svc, X, y)
plt.scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1],
            s=80, facecolors='none', edgecolors='k', linewidths=2, zorder=10)
plt.title(f'C = 10000: small number of support vectors (acc: {svc.score(X, y):0.3f})')
plt.show()
Small \(C\)
svc = svm.SVC(kernel='linear', C=1e-2)
plot_estimator(svc, X, y)
plt.scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1],
            s=80, facecolors='none', edgecolors='k', linewidths=2, zorder=10)
plt.title(f'C = 0.01: high number of support vectors (acc: {svc.score(X, y):0.3f})')
plt.show()
Kernels
We can also choose from a suite of available kernels:
linear,
poly,
rbf,
sigmoid.
We can also pass in a custom kernel.
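scikit-learn's SVC accepts any callable that returns the Gram (kernel) matrix between two sets of points. A minimal sketch with an assumed quadratic kernel:

from sklearn import svm

# k(x, y) = (x . y + 1)^2, computed for all pairs of rows of A and B
def quadratic_kernel(A, B):
    return (A @ B.T + 1) ** 2

# X, y assumed available from the surrounding examples.
clf = svm.SVC(kernel=quadratic_kernel).fit(X, y)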
Note that the radial basis function (rbf) kernel is just a Gaussian kernel, \(k(\mathbf{x}, \mathbf{y}) = \exp\left(-\gamma \Vert\mathbf{x} - \mathbf{y}\Vert^2\right)\), with parameter \(\gamma = \frac{1}{\sigma^2}\).
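The plot below references arrays C_vals, acc, and stderr that were computed in a portion of the notes not shown here; a minimal sketch (assumed, not the original code) of how they might be produced:

import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score

C_vals = np.logspace(-3, 3, 7)   # assumed grid of C values
acc, stderr = [], []
for C in C_vals:
    scores = cross_val_score(svm.SVC(kernel='linear', C=C), X, y, cv=5)
    acc.append(scores.mean())                           # mean CV accuracy
    stderr.append(scores.std() / np.sqrt(len(scores)))  # standard error of the mean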
plt.errorbar(np.log10(C_vals), acc, stderr)
plt.xlabel('log10(C)')
plt.ylabel('Accuracy')
plt.title(r'Generalization Accuracy of Linear SVM as Function of $C$')
plt.show()
SVM and NB: the Iris Dataset
To compare SVM and Naive Bayes, we’ll look at the Iris dataset. We will use two features for visualization.
We will not hold out data since we’re just interested in the shapes of the decision boundaries.
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

C = 1.0
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)
poly_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X, y)
To use Naive Bayes, one has to treat all the features as one of the following:
Gaussian
Multinomial (Categorical)
Binary
scikit-learn provides a Naive Bayes classifier for each of these cases.
We’ll use the Gaussian.
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit(X, y)
# create a mesh to plot in
h = .02  # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# titles for the plots
titles = ['SVM with linear kernel',
          'Naive Bayes',
          'SVM with RBF kernel',
          'SVM with poly kernel']

fig = plt.figure(figsize=(12, 12))
for i, clf in enumerate((svc, gnb, rbf_svc, poly_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max] x [y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i], size=20)
plt.tight_layout()
plt.show()