Given a vector space $V$ over the reals, we say $f : V \to \mathbb{R}$ is linear if $f(x + y) = f(x) + f(y)$ and $f(\alpha x) = \alpha f(x)$. An inner product space has a real symmetric bilinear map $\langle \cdot, \cdot \rangle$ such that $\langle x, x \rangle \ge 0$, strict if the preceding inequality is strict for all $x \ne 0$. Norm and distance are defined as in the Euclidean case: $\|x\| = \sqrt{\langle x, x \rangle}$ and $d(x, y) = \|x - y\|$.
A Hilbert space is an inner product space that is also
complete: contains the limit of every Cauchy sequence (for which $\|x_n - x_m\| \to 0$ as $n, m \to \infty$).
separable: contains a countable dense subset (equiv: contains a countable orthonormal basis).
A kernel is a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $k(x, y) = \langle \phi(x), \phi(y) \rangle$ for some feature function $\phi$ into a Hilbert space.
Note that it is often possible to compute the value of a kernel without explicitly constructing $\phi$: consider $k(x, y) = (x^\top y)^2$ on $\mathbb{R}^2$. Then $k(x, y) = \langle \phi(x), \phi(y) \rangle$ with $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, but we never need to form $\phi(x)$ to evaluate $k$.
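A minimal numerical check of this, using the quadratic kernel above (numpy; the function names are mine):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    """Same quantity computed implicitly, without constructing phi."""
    return np.dot(x, y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(np.dot(phi(x), phi(y)), k(x, y))
```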
The Gram matrix for a set of vectors $x_1, \ldots, x_n$ and a kernel $k$ has $G_{ij} = k(x_i, x_j)$. Claim: every Gram matrix is PSD. Just observe that $G_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$ for some $\phi$, so $v^\top G v = \sum_{i,j} v_i v_j \langle \phi(x_i), \phi(x_j) \rangle = \big\| \sum_i v_i \phi(x_i) \big\|^2$, which is nonnegative.
A binary function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is finitely PSD if its restriction to every finite set of points forms a PSD matrix. Claim: a function is a kernel iff it is finitely PSD. ($\Rightarrow$) comes from the preceding paragraph. (For simplicity assume the function space we construct below is closed under scaling and linear combination; otherwise there's a little extra work.) ($\Leftarrow$): for a function $k$ under consideration, consider the set of functions $f(\cdot) = \sum_i \alpha_i k(x_i, \cdot)$, with $\big\langle \sum_i \alpha_i k(x_i, \cdot), \sum_j \beta_j k(y_j, \cdot) \big\rangle = \sum_{i,j} \alpha_i \beta_j k(x_i, y_j)$. This is clearly real-valued, symmetric, and bilinear; positivity follows from the assumption that $k$ is finitely PSD. So we only need completeness and separability of the function space containing the $f$. We get separability if the input space is countable or the kernel is continuous (also see below). For completeness, observe that by Cauchy-Schwarz $|f_n(x) - f_m(x)| = |\langle f_n - f_m, k(x, \cdot) \rangle| \le \|f_n - f_m\| \sqrt{k(x, x)}$, so Cauchy sequences converge pointwise, and it suffices to additionally make our function space closed under such limits.
We call the space constructed above the reproducing kernel Hilbert space (RKHS) associated with $k$; the name comes from the reproducing property $\langle f, k(x, \cdot) \rangle = f(x)$.
Mercer’s theorem: Suppose $k$ is a continuous kernel on a compact set $\mathcal{X}$ (with a finite measure $\mu$). We can associate with it a linear operator $T_k$ of the form
$$(T_k f)(x) = \int_{\mathcal{X}} k(x, y)\, f(y)\, d\mu(y)$$
for $f$ in $L^2(\mathcal{X})$. Then there is an orthonormal basis $\{\psi_i\}$ of $L^2(\mathcal{X})$ of eigenfunctions of $T_k$ with nonnegative eigenvalues $\lambda_i$, and $k$ has the form
$$k(x, y) = \sum_{i=1}^{\infty} \lambda_i\, \psi_i(x)\, \psi_i(y).$$
Practical consequences: the RKHS constructed above has an orthonormal basis. Indeed, every kernel can be expressed as a (countable) sum, over some set of basis functions, of products of those functions evaluated at the two inputs. In this view, every kernel is the covariance of a distribution over a function class.
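A small sketch of the covariance view: draw function values $f \sim \mathcal{N}(0, K)$ and check that their empirical covariance recovers the Gram matrix (the RBF kernel, the grid, and the sample count below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-3, 3, 25)
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2)   # RBF Gram matrix on the grid

# Draw many functions f ~ N(0, K), evaluated on the grid; each column is one sample.
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(xs)))
fs = L @ rng.standard_normal((len(xs), 100_000))

# Empirical covariance of the sampled function values is close to K.
print(np.max(np.abs(fs @ fs.T / fs.shape[1] - K)))
```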
So the following are equivalent:
Every Gram matrix of $k$ is PSD.
The integral operator $T_k$ is PSD.
We can express $k(x, y)$ as $\langle \phi(x), \phi(y) \rangle$ for some feature map $\phi$.
$k$ is the reproducing kernel of an RKHS.
If $k_1$ and $k_2$ are kernels, $f$ is a function into $\mathbb{R}$, $\phi$ is a function into $\mathbb{R}^N$, and $B$ is a PSD matrix, the following are kernels: $k_1(x, y) + k_2(x, y)$; $a\,k_1(x, y)$ for $a > 0$; $k_1(x, y)\,k_2(x, y)$; $f(x)\,f(y)$; $k_1(\phi(x), \phi(y))$ (for $k_1$ a kernel over $\mathbb{R}^N$); $x^\top B y$; and $\exp(k_1(x, y))$.
The last of these gives us the RBF kernel: normalizing $\exp(x^\top y / \sigma^2)$ by $f(x)\,f(y)$ with $f(x) = \exp(-\|x\|^2 / 2\sigma^2)$ yields $\exp(-\|x - y\|^2 / 2\sigma^2)$.
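A quick numerical check of that last step (the value of $\sigma$ and the points are arbitrary):

```python
import numpy as np

sigma = 1.5
x, y = np.array([0.3, -1.2]), np.array([2.0, 0.7])

f = lambda z: np.exp(-np.dot(z, z) / (2 * sigma ** 2))   # f(x) f(y) is a kernel
k_exp = np.exp(np.dot(x, y) / sigma ** 2)                # exp of a kernel is a kernel
rbf = np.exp(-np.dot(x - y, x - y) / (2 * sigma ** 2))   # RBF kernel

assert np.isclose(f(x) * k_exp * f(y), rbf)
```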
Norm of a feature representation: $\|\phi(x)\| = \sqrt{k(x, x)}$. We can use this to normalize kernels: $\hat{k}(x, y) = k(x, y) / \sqrt{k(x, x)\,k(y, y)}$.
Distance between feature representations: $\|\phi(x) - \phi(y)\|^2 = k(x, x) - 2k(x, y) + k(y, y)$.
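In matrix form, both quantities come straight from a Gram matrix (a sketch; the function names are mine):

```python
import numpy as np

def normalize_gram(K):
    """Divide K[i, j] by sqrt(K[i, i] * K[j, j]) so every point has unit feature norm."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def feature_distances(K):
    """Pairwise distances ||phi(x_i) - phi(x_j)|| recovered from kernel values alone."""
    d = np.diag(K)
    return np.sqrt(np.maximum(d[:, None] - 2 * K + d[None, :], 0.0))
```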
The eigenvectors of the covariance matrix $X^\top X$ (rows of $X$ are centered observations) are the directions of maximum variance. (Let $u$ be a unit vector. Then the variance of the data projected onto $u$ is proportional to $u^\top X^\top X u$. This is a Rayleigh quotient, whose maximum over unit vectors is the largest eigenvalue of $X^\top X$, attained at the corresponding eigenvector.)
So if we project the data onto the first $d$ principal components, we wind up with a low-dimensional orthogonal basis, and it's easy to project other points.
What if we want to work with a kernelized representation? The covariance matrix in feature space is a sum of outer products, not inner products, so we can't hope to compute it (or its eigenvectors) explicitly.
Call $U$ the matrix whose columns are the eigenvectors of the covariance matrix $X^\top X$, and $V$ the matrix whose columns are the eigenvectors of the kernel matrix $K = X X^\top$. It is easy to show that $X^\top v$ is an eigenvector of $X^\top X$ whenever $v$ is an eigenvector of $K$ (with the same eigenvalue), and vice-versa; in particular, $U = X^\top V \Lambda^{-1/2}$ and $V = X U \Lambda^{-1/2}$, where $\Lambda$ holds the shared eigenvalues. Thus the first $d$ columns of $U$ are given by $X^\top V_d \Lambda_d^{-1/2}$. We can use this fact to project any vector $\phi(x)$ onto the first $d$ kernelized principal components:
$$U_d^\top \phi(x) = \Lambda_d^{-1/2} V_d^\top X \phi(x) = \Lambda_d^{-1/2} V_d^\top \kappa(x),$$
where $\kappa(x) = (k(x_1, x), \ldots, k(x_n, x))^\top$.
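A sketch of this projection in code (assuming some kernel function `k`; centering of the kernel matrix, which kernel PCA normally requires, is skipped for brevity, and the helper names are mine):

```python
import numpy as np

def kernel_pca_fit(X, k, d):
    """Return what we need to project onto the first d kernelized components."""
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    lam, V = np.linalg.eigh(K)                  # eigenvalues in ascending order
    lam, V = lam[::-1][:d], V[:, ::-1][:, :d]   # keep the top d (assumed positive)
    return X, V / np.sqrt(lam)                  # columns of V scaled by 1/sqrt(lambda)

def kernel_pca_project(model, k, x):
    """Project a new point: Lambda^{-1/2} V^T kappa(x), with kappa(x)_i = k(x_i, x)."""
    X_train, V_scaled = model
    kappa = np.array([k(xi, x) for xi in X_train])
    return V_scaled.T @ kappa
```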
A simple way to do online learning of a binary classifier: make a prediction $\hat{y} = \operatorname{sign}(w^\top x)$. If the prediction is wrong, update $w \leftarrow w + y\,x$.
Claim: if the data lies in some ball of radius $R$ and is linearly separable with margin $\gamma$, the perceptron algorithm terminates after no more than $R^2 / \gamma^2$ updates. Note that $\gamma = \min_i y_i \langle w^*, x_i \rangle$ for a unit-norm separator $w^*$, which is the (sign-corrected) length of the projection of $x_i$ onto $w^*$, which lies perpendicular to the decision boundary.
General idea: we want to show that the cosine of the angle between $w$ and $w^*$ grows close to 1 as mistakes accumulate. In particular, at any timestep where we update, $\langle w_{t+1}, w^* \rangle = \langle w_t, w^* \rangle + y_t \langle x_t, w^* \rangle \ge \langle w_t, w^* \rangle + \gamma$, so after $M$ updates $\langle w_M, w^* \rangle \ge M\gamma$. But $\|w_{t+1}\|^2 = \|w_t\|^2 + 2 y_t \langle w_t, x_t \rangle + \|x_t\|^2 \le \|w_t\|^2 + R^2$ (we only update when $y_t \langle w_t, x_t \rangle \le 0$), so $\|w_M\|^2 \le M R^2$. Combining these, we have $1 \ge \frac{\langle w_M, w^* \rangle}{\|w_M\|} \ge \frac{M\gamma}{\sqrt{M}\,R}$, whence $M \le R^2 / \gamma^2$.
Easy observation: if $w$ is initially zero, our final weight vector is just a (signed) sum of misclassified observations, $w = \sum_i \alpha_i y_i x_i$. So all we need to make a prediction are dot products between previous observations and the current example, and the perceptron kernelizes.
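A sketch of the kernelized perceptron this suggests (numpy; `k` is any kernel function, and the function names are mine):

```python
import numpy as np

def kernel_perceptron(X, y, k, epochs=1):
    """Mistake-driven updates; the model is just a mistake count per training example."""
    alpha = np.zeros(len(X))
    K = np.array([[k(a, b) for b in X] for a in X])
    for _ in range(epochs):
        for t in range(len(X)):
            score = np.sum(alpha * y * K[:, t])   # w . phi(x_t) without forming w
            if np.sign(score) != y[t]:            # sign(0) counts as a mistake here
                alpha[t] += 1
    return alpha

def predict(alpha, X_train, y_train, k, x):
    return np.sign(sum(a * yi * k(xi, x) for a, yi, xi in zip(alpha, y_train, X_train)))
```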
Our “update only on mistakes” rule is implicitly a zero-one loss $\mathbf{1}[\hat{y} \ne y]$, and we can think of our update as taking a step in the direction of the difference between features that fire on the best predicted label and features that fire on the true label, multiplied by the loss (a little trickiness here with pushing the class label into the feature vector, but it works out). This corresponds to a loss function which is the difference between the model's score on its prediction and on the right answer, $w^\top \phi(x, \hat{y}) - w^\top \phi(x, y)$. The update is $w \leftarrow w + \phi(x, y) - \phi(x, \hat{y})$, where $\hat{y}$ is the model prediction.
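A minimal sketch of that update in the multiclass case, with a hypothetical `features(x, y)` that stacks the input into the block for class `y`:

```python
import numpy as np

def features(x, label, n_classes):
    """Hypothetical joint feature map: copy x into the block for the given label."""
    out = np.zeros(n_classes * len(x))
    out[label * len(x):(label + 1) * len(x)] = x
    return out

def multiclass_perceptron_step(w, x, y, n_classes):
    scores = [w @ features(x, c, n_classes) for c in range(n_classes)]
    y_hat = int(np.argmax(scores))
    if y_hat != y:   # update only on mistakes
        w = w + features(x, y, n_classes) - features(x, y_hat, n_classes)
    return w
```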
Rather than using a procedure which will find some acceptable split of the data eventually, let's explicitly try to maximize the margin. Set up an optimization problem of the form:
$$\min_{w, b} \; \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 \;\; \forall i.$$
Note that the minimal $w$ will assign scores of exactly 1 or -1 to points on the margin (if it gave a higher score then we could shrink $w$ and still be feasible). But then the margin is $1 / \|w\|$, so the objective just wants to maximize the margin.
Let's form the Lagrangian dual of the above objective:
$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \big[ y_i (w^\top x_i + b) - 1 \big], \qquad \alpha_i \ge 0.$$
To form the dual, take the inf over $w$ and $b$ first. Setting derivatives equal to zero, we find
$$w = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0,$$
from which (with a little algebra) we get
$$\max_{\alpha \ge 0,\; \sum_i \alpha_i y_i = 0} \;\; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle,$$
so again, we only need inner products between previously-seen training examples and the current one to make our prediction: $\hat{y} = \operatorname{sign}\big( \sum_i \alpha_i y_i \langle x_i, x \rangle + b \big)$.
Useful fact from optimization: complementary slackness. If duality is tight, then whenever a primal constraint is slack the corresponding dual variable is zero (and whenever a dual variable is nonzero, the constraint is tight). Thus the $\alpha_i$ are sparse: only terms associated with support vectors are greater than zero. (This is because at the optimum $\sum_i \alpha_i \big[ y_i (w^\top x_i + b) - 1 \big] = 0$, every $\alpha_i$ is nonnegative and every $y_i (w^\top x_i + b) - 1$ is nonnegative, so each term in the sum must be zero.)
How do we know if duality is tight? KKT conditions (necessary; also sufficient here since the problem is convex): $w, b$ are primal feasible, the $\alpha_i$ are dual feasible ($\alpha_i \ge 0$), complementary slackness holds, and all partial derivatives of the Lagrangian are zero.
What if there's no feasible solution? New objective, with slack variables $\xi_i$ (the soft margin):
$$\min_{w, b, \xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0.$$
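Working the same Lagrangian steps through this objective (a standard derivation, sketched here rather than taken from these notes) gives a dual identical to the hard-margin one except for a box constraint on $\alpha$:
$$\max_{0 \le \alpha_i \le C,\; \sum_i \alpha_i y_i = 0} \;\; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle,$$
so the kernelization story is unchanged.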
Optimal params for linear regression have the form $w = X^\top \alpha$ for some vector $\alpha$, so the predictor is just a linear combination of training examples. Kernelized version follows.
TODO: in what sense is this a dual?
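A sketch of that kernelized version (kernel ridge regression with ridge parameter `lam`; the closed form $\alpha = (K + \lambda I)^{-1} y$ is the standard one, and the helper names are mine):

```python
import numpy as np

def kernel_ridge_fit(X, y, k, lam=1e-3):
    """Solve (K + lam I) alpha = y; the predictor is a combination of training examples."""
    K = np.array([[k(a, b) for b in X] for a in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(alpha, X_train, k, x):
    return sum(a * k(xi, x) for a, xi in zip(alpha, X_train))
```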
It's convenient that these optimization problems keep winding up with forms that admit kernelization. How generally true is this? Theorem (the representer theorem): fix a kernel $k$ associated with an RKHS $\mathcal{H}$. For any loss function $L$ and nondecreasing $\Omega : [0, \infty) \to \mathbb{R}$, the objective
$$J(f) = L\big( f(x_1), \ldots, f(x_n) \big) + \Omega\big( \|f\|_{\mathcal{H}} \big)$$
is minimized over $\mathcal{H}$ by some
$$f^*(\cdot) = \sum_{i=1}^n \alpha_i\, k(x_i, \cdot).$$
Moreover, if $\Omega$ is increasing, then every minimizer of $J$ can be expressed in this form.
Proof of main claim: consider the projection of a candidate $f$ onto the subspace spanned by the RKHS elements $k(x_i, \cdot)$ associated with the training data. Decompose $f$ into this projection and an orthogonal remainder, $f = f_\parallel + f_\perp$. Then $f(x_i) = \langle f, k(x_i, \cdot) \rangle = \langle f_\parallel, k(x_i, \cdot) \rangle = f_\parallel(x_i)$, so the loss term is unchanged. But $\|f\|^2 = \|f_\parallel\|^2 + \|f_\perp\|^2 \ge \|f_\parallel\|^2$, so $\Omega(\|f_\parallel\|) \le \Omega(\|f\|)$. So $f_\parallel$ (which is a linear combination of training points) is no worse than the best function in the function space.