Distances and Time Series
We will start building some tools for making comparisons of data objects with particular attention to time series.
Working with data, we encounter a wide variety of data objects:
- records of users,
- images,
- videos,
- text (webpages, books),
- strings (DNA sequences), and
- time series.
How can we compare them?
Feature space representation
Usually a data object consists of a set of attributes.
These are also commonly called features.
- (“J. Smith”, 25, $ 200,000)
- (“M. Jones”, 47, $ 45,000)
If a data object has \(d\) real-valued attributes, then we can visualize it as a point in a \(d\)-dimensional vector space.
(25, USD 200000)
\(\rightarrow \begin{bmatrix}25\\200000\end{bmatrix}\).
Likewise, if all features are binary, then we can think of each data object as a binary vector in a vector space.
The space is called feature space.
Vector spaces are such a useful tool that we often use them even for non-numeric data.
For example, consider a categorical variable that can be only one of “house”, “tree”, or “moon”. For such a variable, we can use a one-hot encoding.
We would encode as follows:
- house: \([1, 0, 0]\)
- tree: \([0, 1, 0]\)
- moon: \([0, 0, 1]\)
So an encoding of (25, USD 200000, 'house')
could be: \[\begin{bmatrix}25\\200000\\1\\0\\0\end{bmatrix}.\]
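As a minimal sketch of this encoding in Python (the `one_hot` helper and the category list are just for illustration, not part of any particular library):

```python
import numpy as np

categories = ["house", "tree", "moon"]

def one_hot(value, categories):
    """Return a one-hot vector for a categorical value (illustrative helper)."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

# Encode the record (25, USD 200000, 'house') as a single numeric vector
record = np.concatenate(([25, 200000], one_hot("house", categories)))
print(record)   # contains [25, 200000, 1, 0, 0]
```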
We will see many other encodings that take non-numeric data and encode them into vectors or matrices.
For example, there are vector or matrix encodings for
- graphs,
- images, and
- text.
Matrix representation of data
We generally store data in a matrix form as
\[ \mbox{$m$ data objects}\left\{\begin{array}{c}\;\\\;\\\;\\\;\\\;\end{array}\right.\;\;\overbrace{\left[\begin{array}{ccccc} \begin{array}{c} x_{11} \\ \vdots \\ x_{i1} \\ \vdots \\ x_{m1} \end{array}& \begin{array}{c} \cdots \\ \ddots \\ \cdots \\ \ddots \\ \cdots \end{array}& \begin{array}{c} x_{1j} \\ \vdots \\ x_{ij} \\ \vdots \\ x_{mj} \end{array}& \begin{array}{c} \cdots \\ \ddots \\ \cdots \\ \ddots \\ \cdots \end{array}& \begin{array}{c} x_{1n} \\ \vdots \\ x_{in} \\ \vdots \\ x_{mn} \end{array} \end{array}\right]}^{\mbox{$n$ features}} \]
The number of rows is denoted by \(m\) and the number of columns by \(n\). The rows are instances or records of data and the columns are the features.
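For example, a minimal sketch in NumPy, keeping only the numeric features (age and income) from the records above and adding one made-up row:

```python
import numpy as np

# Rows are data objects (records), columns are features (age, income)
X = np.array([[25, 200000],
              [47,  45000],
              [32,  70000]])   # third row is made up for illustration
m, n = X.shape
print(m, n)   # 3 2
```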
Metrics
A metric is a function \(d(x, y)\) that satisfies the following properties.
- \(d(x, x) = 0\)
- \(d(x, y) > 0 \hspace{1cm} \forall x\neq y\) (positivity)
- \(d(x, y) = d(y, x)\) (symmetry)
- \(d(x, y)\leq d(x, z) + d(z, y)\) (triangle inequality)
We can use a metric to determine how similar or dissimilar two objects are.
A metric is a measure of the dissimilarity between two objects. The larger the measure, the more dissimilar the objects are.
If the objects are vectors, then the metric is also commonly called a distance.
Sometimes we will use “distance” informally, i.e., to refer to a similarity or dissimilarity function even if we are not sure it is a metric.
We’ll try to say “dissimilarity” in those cases though.
The distance matrix is defined as
\[ \mbox{$m$ data objects}\left\{\begin{array}{c}\;\\\;\\\;\\\;\\\;\end{array}\right.\;\; \overbrace{\left[\begin{array}{ccccc} \begin{array}{c} 0 \\ d(x_1, x_2) \\ d(x_1,x_3) \\ \vdots \\ d(x_1,x_m) \end{array} & \begin{array}{c} \; \\ 0 \\ d(x_2,x_3) \\ \vdots \\ d(x_2,x_m) \end{array} & \begin{array}{c} \; \\ \; \\ 0 \\ \vdots \\ \cdots \end{array} & \begin{array}{c} \; \\ \; \\ \; \\ \ddots \\ d(x_{m-1},x_m) \end{array} & \begin{array}{c} \; \\ \; \\ \; \\ \; \\[6pt] 0 \end{array} & \end{array}\right]}^{\mbox{$m$ data objects}}, \]
where \(x_i\) denotes the \(i\)-th row of the data matrix \(X\), i.e., the \(i\)-th data object.
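As a sketch, using SciPy with the Euclidean metric and a small made-up data matrix, the distance matrix can be computed as follows:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data matrix: rows are data objects, columns are features
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# Pairwise Euclidean distances between rows, arranged as an m x m matrix
D = squareform(pdist(X, metric="euclidean"))
print(D)
# [[ 0.  5. 10.]
#  [ 5.  0.  5.]
#  [10.  5.  0.]]
```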
Norms
Let \(\mathbf{u}, \mathbf{v}\in\mathbb{R}^{n}\) and \(a\in\mathbb{R}\). The vector function \(p(\mathbf{v})\) is called a norm if
- \(p(a\mathbf{v}) = |a|p(\mathbf{v})\),
- \(p(\mathbf{u} + \mathbf{v}) \leq p(\mathbf{u}) + p(\mathbf{v})\),
- \(p(\mathbf{v}) = 0\) if and only if \(\mathbf{v} = 0\).
Every norm defines a corresponding metric.
In particular, if \(p()\) is a norm, then \(d(\mathbf{x}, \mathbf{y}) = p(\mathbf{x}-\mathbf{y})\) is a metric.
\(\ell_p\) norm
A general class of norms is the family of \(\ell_p\) norms, defined for \(p \geq 1\):
\[\Vert \mathbf{x} \Vert_p = \left(\sum_{i=1}^d |x_i|^p\right)^{\frac{1}{p}}.\]
The corresponding distance that an \(\ell_p\) norm defines is called the Minkowski distance.
\[\Vert \mathbf{x} - \mathbf{y} \Vert_p = \left(\sum_{i=1}^d |x_i - y_i|^p\right)^{\frac{1}{p}}.\]
\(\ell_2\) norm
A special – but very important – case is the \(\ell_2\) norm.
\[ \Vert \mathbf{x} \Vert_2 = \sqrt{\sum_{i=1}^d |x_i|^2}. \]
We’ve already mentioned it: it is the Euclidean norm.
The distance defined by the \(\ell_2\) norm is the same as the Euclidean distance between two vectors \(\mathbf{x}, \mathbf{y}\).
\[ \Vert \mathbf{x} - \mathbf{y} \Vert_2 = \sqrt{\sum_{i=1}^d (x_i - y_i)^2}. \]
\(\ell_1\) norm
Another important special case is the \(\ell_1\) norm.
\[ \Vert \mathbf{x} \Vert_1 = \sum_{i=1}^d |x_i|.\]
This defines the Manhattan distance, or, for binary vectors, the Hamming distance:
\[ \Vert \mathbf{x} - \mathbf{y} \Vert_1 = \sum_{i=1}^d |x_i - y_i|.\]
\(\ell_\infty\) norm
If we take the limit of the \(\ell_p\) norm as \(p\) gets large we get the \(\ell_\infty\) norm.
We have that
\[ \Vert \mathbf{x} \Vert_{\infty} = \max_{i} \vert x_{i} \vert . \]
What is the metric that this norm induces?
\(\ell_0\) norm
Another related idea is the \(\ell_0\) “norm.” Despite the name, it is not actually a norm (it fails the scaling property), but it is, in a sense, what we get from the \(p\)-norm by taking \(p = 0\), and it gets called a norm anyway.
This “norm” simply counts the number of nonzero elements in a vector.
This is called the vector’s sparsity.
Here is the notion of a “circle” under each of three norms.
That is, for each norm, the set of vectors having norm 1, or distance 1 from the origin.
- \(\ell_1\): \(|x_1| + |x_2| = 1\)
- \(\ell_2\): \(\sqrt{x_1^2 + x_2^2} = 1\)
- \(\ell_\infty\): \(\max(|x_1|, |x_2|) = 1\)
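As a quick numerical check of these definitions, a minimal NumPy sketch (the example vector is made up):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

print(np.linalg.norm(x, 1))       # l1 norm:   |3| + |-4| + |0| = 7
print(np.linalg.norm(x, 2))       # l2 norm:   sqrt(9 + 16)     = 5
print(np.linalg.norm(x, np.inf))  # l_inf norm: max |x_i|       = 4
print(np.count_nonzero(x))        # "l0" count of nonzero entries = 2
```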
Similarity and Dissimilarity Functions
Similarity functions quantify how similar two objects are. The higher the similarity score, the more alike the objects.
Examples
- cosine similarity,
- Jaccard similarity (intersection over union).
Dissimilarity functions quantify the difference between two objects. The higher the dissimilarity score, the more different the objects are.
Examples
- Manhattan distance,
- Hamming distance.
Similarity
We know that the inner product of two vectors can be used to compute the cosine of the angle between them
\[ \cos(\theta) = \frac{\mathbf{x}^T\mathbf{y}}{\Vert\mathbf{x}\Vert \Vert\mathbf{y}\Vert} \equiv \cos(\mathbf{x}, \mathbf{y}) .\]
This value is close to 1 when \(\mathbf{x}\) and \(\mathbf{y}\) point in nearly the same direction. We can use this formula to define a similarity function called the cosine similarity \(\cos(\mathbf{x}, \mathbf{y})\).
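A minimal sketch of cosine similarity in NumPy (the function name and example vectors are just illustrative):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])    # same direction as x, different length
print(cosine_similarity(x, y))   # 1.0 (up to floating point)
```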
Dissimilarity
Given a similarity function \(s(\mathbf{x}, \mathbf{y})\), how could we convert it to a dissimilarity function \(d(\mathbf{x}, \mathbf{y})\)?
Two straightforward ways of doing that are:
\[d(\mathbf{x},\mathbf{y}) = 1\,/\,s(\mathbf{x},\mathbf{y})\]
or
\[d(\mathbf{x},\mathbf{y}) = k - s(\mathbf{x},\mathbf{y})\]
for some properly chosen \(k\).
For cosine similarity, one often uses:
\[ d(\mathbf{x}, \mathbf{y}) = 1 - \cos(\mathbf{x}, \mathbf{y})\]
Note however that this is not a metric!
Bit vectors and Sets
When working with bit vectors, the \(\ell_1\) metric is commonly used and is called the Hamming distance.
This has a natural interpretation: “how well do the two vectors match?”
Or: “What is the smallest number of bit flips that will convert one vector into the other?”
In other cases, however, the Hamming distance is not a very appropriate metric.
Consider the case in which the bit vector is being used to represent a set.
In that case, Hamming distance measures the size of the set difference.
For example, consider two documents. We will use bit vectors to represent the sets of words in each document.
- Case 1: both documents are large, almost identical, but differ in 10 words.
- Case 2: both documents are small and disjoint, with 5 words each.
In both cases the Hamming distance is 10, even though the first pair of documents is nearly identical and the second pair has nothing in common. So what matters is not just the size of the set difference, but also the size of the intersection.
This leads to the Jaccard similarity:
\[J_{Sim}(\mathbf{x}, \mathbf{y}) = \frac{|\mathbf{x} \cap \mathbf{y}|}{|\mathbf{x} \cup \mathbf{y}|}.\]
This takes on values from 0 to 1, so a natural dissimilarity metric is \(1 - J_{Sim}().\)
In fact, this is a metric!
\[J_{Dist}(\mathbf{x}, \mathbf{y}) = 1- \frac{|\mathbf{x} \cap \mathbf{y}|}{|\mathbf{x} \cup \mathbf{y}|}.\]
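A minimal sketch, treating each document as a set of words (the function name and example word sets are made up):

```python
def jaccard_similarity(a, b):
    """Jaccard similarity between two sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0                 # convention: two empty sets are identical
    return len(a & b) / len(a | b)

doc1 = {"the", "quick", "brown", "fox"}
doc2 = {"the", "lazy", "brown", "dog"}
print(jaccard_similarity(doc1, doc2))      # 2 / 6 = 0.333...
print(1 - jaccard_similarity(doc1, doc2))  # Jaccard distance
```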
Let’s revisit the previously introduced cases comparing documents.
Case 1: Very large almost identical documents.
Here \(J_{Sim}(\mathbf{x}, \mathbf{y})\) is almost 1.
Case 2: Very small disjoint documents.
Here \(J_{Sim}(\mathbf{x}, \mathbf{y})\) is 0.
Time Series
A time series is a sequence of real numbers, representing the measurements of a real variable at (possibly unequal) time intervals.
Some examples are
- stock prices,
- the volume of sales over time, and
- daily temperature readings.
A time series database is a large collection of time series.
Similarity of Time Series
Suppose we wish to compare the following time series.
- Stock price movements for companies over a time interval.
- The motion data of two people walking.
- Credit usage patterns for bank clients.
How should we measure the “similarity” of these time series?
There are two problems to address.
- Defining a meaningful similarity (or distance) function.
- Finding an efficient algorithm to compute it.
Norm-based Similarity Measures
We could just view each sequence as a vector.
Then we could use an \(\ell_p\) norm, e.g., \(\ell_1\) or \(\ell_2\), to measure their (dis)similarity.
Advantages
- Easy to compute: linear in the length of the time series, i.e., \(\mathcal{O}(n)\).
- It is a metric.
Disadvantage: it may not be meaningful!
For example, suppose we have three time series \(\mathbf{ts1}\), \(\mathbf{ts2}\), and \(\mathbf{ts3}\), and we believe that \(\mathbf{ts1}\) and \(\mathbf{ts2}\) are the most “similar” pair of time series.
However, according to Euclidean distance:
\[ \Vert \mathbf{ts1} - \mathbf{ts2} \Vert_2 = 26.9,\]
while
\[ \Vert \mathbf{ts1} - \mathbf{ts3} \Vert_2 = 23.2.\]
Feature Engineering
In general, there may be different aspects of a time series that are important in different settings.
The first step therefore is to ask yourself “what is important about time series in my application?”
This is an example of feature engineering.
Feature engineering is the art of computing some derived measure from your data object that makes the important properties usable in a subsequent step.
A reasonable approach is to
- extract the relevant features,
- use a simple method (e.g., a norm) to define similarity over those features.
In the case above, one might think of using
- Fourier coefficients (to capture periodicity),
- histograms,
- or something else!
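For instance, a hedged sketch of the first idea: using magnitudes of low-frequency Fourier coefficients as features (the helper name and the example signals are made up for illustration):

```python
import numpy as np

def fourier_features(ts, k=5):
    """Magnitudes of the first k Fourier coefficients (a crude periodicity feature)."""
    coeffs = np.fft.rfft(ts - np.mean(ts))
    return np.abs(coeffs[1:k + 1])

t = np.linspace(0, 1, 200, endpoint=False)
ts1 = np.sin(2 * np.pi * 5 * t)          # 5 cycles
ts2 = np.sin(2 * np.pi * 5 * t + 1.0)    # same frequency, shifted in time
# Small feature distance: the two series share the same periodicity
print(np.linalg.norm(fourier_features(ts1) - fourier_features(ts2)))
```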
Dynamic Time Warping
One case that arises often is something like the following: “bump hunting”
Both time series have the same key characteristics: four bumps.
But a one-to-one match (à la Euclidean distance) will not detect the similarity.
(Be sure to think about why Euclidean distance will fail here.)
A solution to this is called dynamic time warping.
The basic idea is to allow acceleration or deceleration of signals along the time dimension.
Classic applications
- speech recognition
- handwriting recognition
Specifically
- Consider \(X = x_1, x_2, \dots, x_n\) and \(Y = y_1, y_2, \dots, y_m\).
- We are allowed to modify each sequence by inserting, deleting, or matching elements to form \(X'\) and \(Y'\).
- We then calculate, e.g., Euclidean distance between the extended sequences \(X'\) and \(Y'\).
There is a simple way to visualize this algorithm.
Consider a matrix \(M\) where \(M_{ij} = |x_i - y_j|\) (or some other error measure).
\(M\) measures the amount of error we get if we match \(x_i\) with \(y_j\).
So we seek a path through \(M\) that minimizes the total error.
We need to start in the lower left and work our way up via a continuous path.
The basic restrictions on the path are:
- Monotonicity: the path should not go down or to the left.
- Continuity: no elements may be skipped in a sequence.
This optimization problem can be solved via dynamic programming.
The basic algorithm looks like this:
import numpy as np

def dtw_distance(x, y):
    n, m = len(x), len(y)
    M = np.abs(np.subtract.outer(x, y))   # local error: M[i, j] = |x_i - y_j|
    D = np.full((n + 1, m + 1), np.inf)   # cumulative cost, padded with an "infinite" border
    D[0, 0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = M[i - 1, j - 1] + min(D[i - 1, j],      # insertion
                                            D[i, j - 1],      # deletion
                                            D[i - 1, j - 1])  # match
    return D[n, m]
Unfortunately, the algorithm is still quadratic in \(n\) – it is \(\mathcal{O}(nm)\).
Hence, we may choose to put a restriction on the amount that the path can deviate from the diagonal.
This is implemented by not allowing the path to pass through locations where \(|i - j| > w\).
Then the algorithm is \(\mathcal{O}(nw)\).
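A hedged sketch of this banded variant (the function name is illustrative; the band constraint is the classic Sakoe–Chiba idea, and the example signals are made up):

```python
import numpy as np

def dtw_distance_banded(x, y, w):
    """DTW restricted to the band |i - j| <= w around the diagonal."""
    n, m = len(x), len(y)
    w = max(w, abs(n - m))                  # band must be wide enough to reach (n, m)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two sinusoids that differ mainly by a time shift
t = np.linspace(0, 4 * np.pi, 100)
print(dtw_distance_banded(np.sin(t), np.sin(t + 0.5), w=10))
```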
From Time Series to Strings
A closely related idea concerns strings.
The key point is that, like time series, strings are sequences.
Given two strings, one way to define a ‘distance’ between them is:
- the minimum number of edit operations that are needed to transform one string into the other.
Edit operations are insertion, deletion, and substitution of single characters.
This is called edit distance or Levenshtein distance.
For example, given strings:
s = VIVALASVEGAS
and
t = VIVADAVIS
we would like to
- compute the edit distance, and
- obtain the optimal alignment.
A dynamic programming algorithm can also be used to find this distance, and it is very similar to dynamic time-warping.
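A minimal sketch of that dynamic program (standard Levenshtein distance; recovering the optimal alignment would additionally require backtracking through `D`):

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                               # delete all of s[:i]
    for j in range(m + 1):
        D[0][j] = j                               # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # match / substitution
    return D[n][m]

print(edit_distance("VIVALASVEGAS", "VIVADAVIS"))  # 5
```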
In bioinformatics, closely related dynamic-programming algorithms (e.g., “Smith–Waterman”) are used for sequence alignment.