Pros: Allows for model tuning and unbiased evaluation.
Cons: Requires more data, more complex.
Key Differences
Train/Test Split:
Simpler, faster.
May lead to overfitting or underfitting.
Train/Validate/Test Sets:
More comprehensive.
Better for hyperparameter tuning and model selection.
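For concreteness, here is a minimal sketch of the three-way split using scikit-learn's train_test_split (the dataset and split proportions are assumed for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out a test set (20%), then split the remainder into
# train (60% of the total) and validation (20% of the total).
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)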
K-Fold Cross Validation
What is k-Fold Cross-Validation?
Definition: A technique to evaluate the performance of a model by dividing the data into k subsets (folds).
Process:
Split the dataset into k equal-sized folds.
Train the model on k-1 folds.
Validate the model on the remaining fold.
Repeat the process k times, each time with a different fold as the validation set.
Average the results to get the final performance metric.
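A minimal sketch of this procedure, using scikit-learn's KFold (the model and dataset are assumed for illustration):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in kf.split(X):
    model = RandomForestClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])                 # train on k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))    # validate on the held-out fold

print("Mean accuracy:", np.mean(scores))                  # average the k results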
Benefits
More Reliable Estimates: Provides a better estimate of model performance compared to a single train/test split.
Efficient Use of Data: Utilizes the entire dataset for both training and validation.
Reduces Overfitting: Helps in detecting overfitting by validating on multiple subsets.
Example
5-Fold Cross-Validation: The dataset is split into 5 folds. The model is trained and validated 5 times, each time with a different fold as the validation set.
Key Points
Choice of k: Common choices are 5 or 10, but it can vary depending on the dataset size.
Computational Cost: More computationally intensive than a single train/test split.
Implementing 5-Fold Cross-Validation in sklearn
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

data = load_iris()
X, y = data.data, data.target

model = RandomForestClassifier()
scores = cross_val_score(model, X, y, cv=5)
print("Cross-Validation Scores:", scores)
print("Mean Score:", scores.mean())
Naive Bayes
The classification problem is: given a data point \(\mathbf{x}\), predict the corresponding label \(y\).
A fundamental way to attack this problem is via probability.
A key tool will be Bayes Rule.
Bayes Rule
Let \(A, B\) be events from a sample space \(\Omega\). Recall that Bayes’ rule states
\[ P(A \vert B) = \frac{P(B\vert A) \cdot P(A)}{P(B)}. \]
We’ve now seen several examples of the importance of Bayes’ rule (e.g., GMMs) where we computed the unknown quantity \(P(A \vert B)\) by using the known values for \(P(B\vert A)\), \(P(A)\), and \(P(B)\).
Using Bayes Rule in a Classifier
Let’s start with a simple example
A doctor knows that meningitis causes a stiff neck 75% of the time, i.e., \(P(S\vert M) = 0.75\)
If a patient comes in with a stiff neck, what is the probability she has meningitis, i.e., what is \(P(M\vert S)\)?
We only currently know \(P(S\vert M)\) so we would need additional information to compute \(P(M\vert S)\).
If the doctor also knows the following information
1 in 20 people in the population have a stiff neck at any given time, i.e., \(P(S) = 0.05\)
1 in 10,000 people in the population have meningitis at any given time, i.e., \(P(M) = 0.0001\)
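Applying Bayes' rule with these numbers gives
\[ P(M\vert S) = \frac{P(S\vert M)\,P(M)}{P(S)} = \frac{0.75 \times 0.0001}{0.05} = 0.0015 = 15 \cdot P(M). \]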
The above expression shows the essence of Bayesian reasoning:
We have prior knowledge about the probability of someone having meningitis, i.e., a random person has probability 1/10000 of having meningitis.
When we learn that the person has a stiff neck, it increases their probability of having meningitis by a factor of 15.
Priors and Posteriors
We give these probabilities names according to their roles.
The random person's probability (1/10000) is called the prior probability.
The specific patient's probability (15 \(\cdot\) 1/10000) is called the posterior probability.
We use this same principle to construct a classifier.
Probabilistic Classification
In classification problems we assume that we are given a set of data points \(\mathbf{x}\in\mathbb{R}^{d}\) with \(d\) features and corresponding class labels \(y\).
In our meningitis example, there is a single attribute: stiff neck.
The class label is either meningitis or no meningitis.
How does the value of each feature change the probability of the class label?
More generally, consider a data point \(\mathbf{x}\) having attributes \(x_1, x_2, \dots, x_d\) and various classes (labels) for items: \(C_1, \dots, C_k\).
Our goal is to predict the class of \(\mathbf{x}\).
To do that, we will compute \[P(C_1\vert\mathbf{x}), P(C_2\vert\mathbf{x}), \ldots, P(C_k\vert\mathbf{x}).\]
These form a soft classification of \(\mathbf{x}\).
Maximum A Posteriori
From the soft labels, we can form a hard classification. One way would be to choose the class with the highest posterior probability:
\[ \hat{y} = \arg\max_{i} P(C_i \vert \mathbf{x}). \]
By Bayes' rule, \(P(C_i\vert\mathbf{x}) \propto P(x_1, x_2, \ldots, x_d\vert C_i)\,P(C_i)\), so this gives us a possible way to solve the problem.
The issue that remains is how to estimate
\[ P(x_1, x_2, \ldots, x_d\vert C_i).\]
Why is this difficult?
Imagine if we tried to compute \(P(x_1, x_2, \ldots, x_d\vert C_i)\) directly from data.
We could use a histogram to estimate the necessary distribution, i.e., count how many times we see each combination of feature values.
If there were 20 features (\(d = 20\)) and 10 possible values for each feature, then for each class \(C_i\) we would need to construct a histogram with \(10^{20}\) bins.
This is impossible.
The underlying problem we face is the high dimensionality of the feature space.
The size of our histogram is exponential in the number of features.
So, we need to find a way to reduce the exponential size of the estimation problem.
We will do that by factoring the distribution \(P(x_1, x_2, \ldots, x_d\vert C_i)\).
Here is where the naive part comes in.
Naive Bayes
We will assume that, conditional on the class, attributes are independent in their assignment to items.
That is, within a class, the values of the attributes in one set do not affect the values in any other set; in particular, all correlations among attributes are zero.
This is indeed a naive assumption … but it can be surprisingly effective in practice.
Under this assumption, the class-conditional distribution factors as
\[ P(x_1, x_2, \ldots, x_d\vert C_i) = \prod_{j=1}^{d} P(x_j\vert C_i). \]
This is very helpful computationally, because each individual factor \(P(x_j\vert C_i)\) resides in a much lower-dimensional (single-attribute) space than the full distribution.
In a naive Bayes model, the quantity we calculate for each class \(C_i\) is
\[ P(C_i) \prod_{j=1}^{d} P(x_j\vert C_i), \]
and we assign \(\mathbf{x}\) to the class for which this quantity is largest.
One problem that can arise is when a histogram bin has zero entries. Then the conditional probability for this attribute value is zero, which overrides all the other factors and yields a zero probability.
There are various strategies for making small corrections to the counts that avoid this problem.
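One common strategy is additive (Laplace) smoothing, which adds a small pseudo-count to every bin; in scikit-learn this is the alpha parameter of the discrete naive Bayes classifiers. A minimal sketch, with toy categorical data assumed for illustration:

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy categorical data (assumed): 3 attributes, each taking values in {0, 1, 2}
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 3))
y = rng.integers(0, 2, size=100)

# alpha=1.0 is add-one (Laplace) smoothing: every count starts at 1,
# so no conditional probability P(x_j | C_i) can be exactly zero.
clf = CategoricalNB(alpha=1.0).fit(X, y)
print(clf.predict(X[:5]))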
Continuous Attributes
Continuous attributes can be handled via histograms as well, by binning the values into ranges.
In the previous example, we could create bins to hold ranges of the continuous values for Taxable Income.
However, another commonly used approach is to assume that the data follow a parametric probability distribution, such as a Normal (Gaussian) distribution.
We can form conditional probabilities for Taxable Income by fitting, for each class \(C_i\), a Normal distribution:
\[ P(x_j\vert C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left(-\frac{(x_j - \mu_{ij})^2}{2\sigma_{ij}^2}\right), \]
where \(\mu_{ij}\) and \(\sigma_{ij}\) are the mean and standard deviation of attribute \(x_j\) over the training points in class \(C_i\).
Naive Bayes solves the classification problem through probability.
Training is simple, based on estimating class-conditional histograms or parametric densities of features.
Naive Bayes can work well in high-dimensional settings
Many times the correct label is the MAP estimate, even if individual probabilities are less accurate.
Its principal drawback is the assumption of independence among the features.
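For concreteness, a minimal end-to-end sketch using scikit-learn's GaussianNB (the dataset choice is assumed for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Training just estimates a per-class mean and variance for each feature.
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print("Mean CV accuracy:", scores.mean())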
Support Vector Machines
We now turn to the support vector machine (SVM).
The SVM is based on explicit geometric considerations about how best to build a classifier.
As an example, here is a set of training data, considered as points in \(\mathbb{R}^d\):
Linear Separator
We will start with the idea of a linear separator. This is a hyperplane that forms a decision boundary.
Here is one possible separator
Here is another possible separator
Which separator is better?
They both perfectly separate the two classes in the training data.
But what we really care about is accuracy on the test data, i.e., generalization ability.
It seems that \(B_1\) is a better choice because it is farther from both classes.
New data falling in the region between training classes is more likely to be correctly classified by \(B_1\).
Criterion for Best Separator
This leads to a principle for choosing the best separator:
We are concerned with the margin between the separator and the data, and
We prefer the separator that maximizes the margin.
In fact, there are theoretical results suggesting that this is an optimal strategy for choosing a separator that has good generalization ability.
Linear SVM: Separable Case
Let’s see how we can train an SVM.
Consider a training dataset consisting of tuples \(\{(\mathbf{x}_i, y_i)\}\), where \(\mathbf{x}_i \in \mathbb{R}^d\) and \(y_i \in \{-1, 1\}\).
We’re going to assume that our data can be perfectly separated by a hyperplane.
Any hyperplane (such as \(B_1\) below) can be written as:
\[ w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b = 0\]
or more concisely:
\[ \mathbf{w}^T\mathbf{x} + b = 0. \]
So our decision boundary (i.e., our classifier) has parameters \(\{w_i\}\) and \(b\).
For any \(\mathbf{x}_+\) from the positive class (circle) located above the decision boundary,
\[ \mathbf{w}^T\mathbf{x}_+ + b = k \]
for some positive \(k\).
Likewise for any \(\mathbf{x}_-\) from the negative class (square) located below the decision boundary,
\[ \mathbf{w}^T\mathbf{x}_- + b = -k \]
for the same \(k\).
Rescaling the parameters \(\mathbf{w}\) and \(b\) by \(k\), we obtain new equations for the same hyperplanes
\[
\begin{align*}
b_{11}: \mathbf{w}^T\mathbf{x}_+ + b &= 1 \\
b_{12}: \mathbf{w}^T\mathbf{x}_- + b &= -1
\end{align*}
\]
How far apart are these hyperplanes?
The vector \(\mathbf{w}\) is orthogonal to the hyperplanes \(b_{11}\) and \(b_{12}\).
So the distance between the two hyperplanes is the magnitude of the component of \((\mathbf{x}_+ - \mathbf{x}_-)\) that is in the direction of \(\mathbf{w}\).
Specifically, it is the magnitude of the projection of \((\mathbf{x}_+ - \mathbf{x}_-)\) onto \(\mathbf{w}\):
\[ \text{margin} = \frac{\mathbf{w}^T(\mathbf{x}_+ - \mathbf{x}_-)}{\Vert\mathbf{w}\Vert} = \frac{2}{\Vert\mathbf{w}\Vert}. \]
To maximize the margin, we therefore minimize \(\Vert\mathbf{w}\Vert^2\), subject to the constraints that every training point lies on the correct side of its hyperplane:
\[
\begin{align*}
\mathbf{w}^T\mathbf{x}_i + b \ge 1 \text{ if } y_i &= 1, \\
\mathbf{w}^T\mathbf{x}_i + b \le -1 \text{ if } y_i &= -1.
\end{align*}
\]
This is a constrained optimization problem with a quadratic objective function.
The quadratic form \(\Vert \mathbf{w}\Vert^{2}\) is positive definite, i.e., it is strictly convex and has a unique global minimum.
Quadratic Programs
Such problems are called quadratic programs.
There are standard methods for solving them. The methods are effective but can be slow.
The complexity of the problem grows with the number of constraints, i.e., the number of training points.
Ultimately, only a subset of the training points (constraints) will determine the final solution.
The points that determine the solution support it and are called the support vectors.
Linear SVM: Non-Separable Case
It may well happen that there is no hyperplane that perfectly separates the classes.
In this case, we allow points to fall on the “wrong” side of their separator, but we add a penalty for this occurrence.
To express this formally, we introduce slack variables \(\xi_i\).
Each \(\xi_i\) measures how far the data point is on the “wrong side” of its separator.
The new problem is to find
\[
\mathbf{w}^* = \arg\min_\mathbf{w} \left(\Vert\mathbf{w}\Vert^2 + C \left(\sum_{i=1}^N \xi_i\right)\right),
\]
subject to
\[
\begin{align*}
\mathbf{w}^T\mathbf{x}_i + b \ge 1-\xi_i \text{ if } y_i &= 1, \\
\mathbf{w}^T\mathbf{x}_i + b \le -1+\xi_i \text{ if } y_i &= -1,\\
\xi_i &\ge 0.
\end{align*}
\]
Notice we have introduced a hyperparameter \(C\), which controls the complexity of the SVM. This hyperparameter needs to be set by cross-validation.
A small value of \(C\) allows many points to fall on the wrong side of their corresponding hyperplane (having a nonzero \(\xi_i\)) and results in a large number of support vectors. Small \(C\) results in a more stable decision boundary.
Large \(C\) values will penalize violations heavily, and will result in fewer nonzero \(\xi_i\)’s, leading to fewer support vectors.
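For example, \(C\) is often set by a cross-validated grid search; a minimal sketch, with an assumed grid of candidate values:

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# X, y: training data assumed available from the surrounding notebook.
# cv=5 evaluates each candidate C by 5-fold cross-validation.
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}   # assumed grid
search = GridSearchCV(svm.SVC(kernel='linear'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)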
Nonlinear SVM
Finally, we consider the case in which the decision boundary is strongly nonlinear.
The basic idea here is that we take the data and transform it into another, higher-dimensional space.
Here is the same dataset on transformed coordinates:
Note that \(\Vert\mathbf{x} - \mathbf{y}\Vert^2 = \mathbf{x}^T\mathbf{x} - 2\,\mathbf{x}^T\mathbf{y} + \mathbf{y}^T\mathbf{y}\), so the Euclidean distance can be defined entirely in terms of inner products.
To train an SVM with a different kernel, we replace all the inner products with calls to our new kernel function.
The result is that we can obtain highly curved decision boundaries.
In practice, RBF works well in many cases.
SVM: Summary
In practice, SVMs have shown good results on many problems.
In particular, they are effective at dealing with high-dimensional data through the use of kernels.
Since all data is represented as vectors, and we are relying on distance functions like Euclidean distance, it is important to pay attention to feature scaling when using SVMs.
from sklearn import svm

svc = svm.SVC(kernel='linear')
svc.fit(X_train, y_train)
y_pred_test = svc.predict(X_test)
print(f'Accuracy of SVM on test set: {svc.score(X_test, y_test):0.3f}')
Accuracy of SVM on test set: 0.764
Let’s visualize the decision boundaries.
from matplotlib.colors import ListedColormap

# Create color maps for 3-class classification problem, as with iris
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

def plot_estimator(estimator, X, y):
    try:
        X, y = X.values, y.values
    except AttributeError:
        pass
    estimator.fit(X, y)
    x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    plt.axis('tight')
    plt.tight_layout()
Note that in practice we should pay attention to feature scaling when using SVMs. We haven’t done that here.
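If we did want to scale, a minimal sketch would standardize the features inside a pipeline (reusing the X_train/X_test split from above):

from sklearn import svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance before the SVM.
scaled_svc = make_pipeline(StandardScaler(), svm.SVC(kernel='linear'))
scaled_svc.fit(X_train, y_train)
print(f'Accuracy with scaling: {scaled_svc.score(X_test, y_test):0.3f}')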
As described already, the SVM gets its name from the samples in the dataset from each class that lie closest to the other class.
These training samples are called “support vectors” because changing their position in the \(d\)-dimensional feature space would change the location of the decision boundary.
In scikit-learn, the support vectors themselves can be found in the support_vectors_ attribute of the SVC object (their indices are in the support_ attribute).
Here, we will use just two of the three classes for clarity.
Since the classes are not linearly separable, there are nonzero slack variables, each of which is associated with a support vector.
Therefore we should consider how regularization is tuned via the \(C\) parameter.
In practice, a large \(C\) value means that the number of support vectors is small (less regularization, more model complexity), while a small \(C\) implies many support vectors (more regularization, less model complexity).
scikit-learn sets a default value of \(C=1\).
Large \(C\)
svc = svm.SVC(kernel='linear', C=1e4)
plot_estimator(svc, X, y)
plt.scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1],
            s=80, facecolors='none', edgecolors='k', linewidths=2, zorder=10)
plt.title(f'C = 10000: small number of support vectors (acc: {svc.score(X, y):0.3f})')
plt.show()
Small \(C\)
svc = svm.SVC(kernel='linear', C=1e-2)
plot_estimator(svc, X, y)
plt.scatter(svc.support_vectors_[:, 0], svc.support_vectors_[:, 1],
            s=80, facecolors='none', edgecolors='k', linewidths=2, zorder=10)
plt.title(f'C = 0.01: high number of support vectors (acc: {svc.score(X, y):0.3f})')
plt.show()
Kernels
We can also choose from a suite of available kernels:
linear,
poly,
rbf,
sigmoid.
We can also pass in a custom kernel.
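scikit-learn's SVC accepts any callable that returns the Gram (kernel) matrix between two sets of points. A minimal sketch with an assumed quadratic kernel:

from sklearn import svm

# k(x, y) = (x . y + 1)^2, computed for all pairs of rows of A and B
def quadratic_kernel(A, B):
    return (A @ B.T + 1) ** 2

# X, y assumed available from the surrounding examples.
clf = svm.SVC(kernel=quadratic_kernel).fit(X, y)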
Note that the radial basis function (rbf) kernel is just a Gaussian kernel, \(k(\mathbf{x}, \mathbf{y}) = \exp\left(-\gamma \Vert\mathbf{x} - \mathbf{y}\Vert^2\right)\), with parameter \(\gamma = \frac{1}{\sigma^2}\).
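The plot below references arrays C_vals, acc, and stderr that were computed in a portion of the notes not shown here; a minimal sketch (assumed, not the original code) of how they might be produced:

import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score

C_vals = np.logspace(-3, 3, 7)   # assumed grid of C values
acc, stderr = [], []
for C in C_vals:
    scores = cross_val_score(svm.SVC(kernel='linear', C=C), X, y, cv=5)
    acc.append(scores.mean())                           # mean CV accuracy
    stderr.append(scores.std() / np.sqrt(len(scores)))  # standard error of the mean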
plt.errorbar(np.log10(C_vals), acc, stderr)
plt.xlabel('log10(C)')
plt.ylabel('Accuracy')
plt.title(r'Generalization Accuracy of Linear SVM as Function of $C$')
plt.show()
SVM and NB: the Iris Dataset
To compare SVM and Naive Bayes, we’ll look at the Iris dataset. We will use two features for visualization.
We will not hold out data since we’re just interested in the shapes of the decision boundaries.
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

C = 1.0
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)
poly_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X, y)
To use Naive Bayes, one has to treat all the features as one of the following:
Gaussian
Multinomial (Categorical)
Binary
scikit-learn provides a Naive Bayes classifier for each of these cases.
We’ll use the Gaussian.
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit(X, y)
# create a mesh to plot in
h = .02  # step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# titles for the plots
titles = ['SVM with linear kernel',
          'Naive Bayes',
          'SVM with RBF kernel',
          'SVM with poly kernel']

fig = plt.figure(figsize=(12, 12))
for i, clf in enumerate((svc, gnb, rbf_svc, poly_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max] x [y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i], size=20)
plt.tight_layout()
plt.show()