Prelim notes: Mostly kernel methods

Preliminaries

Given a vector space $V$ over the reals, we say a map $f$ is linear if $f(x + y) = f(x) + f(y)$ and $f(ax) = a f(x)$. An inner product space has a real symmetric bilinear map $\langle \cdot, \cdot \rangle$ such that $\langle x, x \rangle \geq 0$, strict if the preceding inequality is strict for all $x \neq 0$. The norm $||x|| = \sqrt{\langle x, x \rangle}$ and distance $d(x,y) = ||x - y||$ are defined as in the Euclidean case.

A Hilbert space is an inner product space that is also

  1. complete: contains the limit of every Cauchy sequence (for which $||x_m - x_n|| \to 0$ as $m, n \to \infty$).

  2. separable: contains a countable dense subset (equiv: contains a countable orthonormal basis).

A kernel is a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $k(x, y) = \langle \phi(x), \phi(y) \rangle$ for some feature function $\phi$ into a Hilbert space.

Note that it is often possible to compute the value of a kernel without explicitly constructing $\phi$: consider $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$ for $x \in \mathbb{R}^2$. Then $\langle \phi(x), \phi(y) \rangle = (x^\top y)^2$, which we can evaluate directly from $x$ and $y$.
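
A quick numerical check of this (a sketch; the quadratic-kernel feature map above is just a standard example choice):

```python
import numpy as np

# Explicit feature map for the quadratic kernel on R^2: phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = phi(x) @ phi(y)   # inner product in feature space
rhs = (x @ y) ** 2      # same value, computed without constructing phi
assert np.isclose(lhs, rhs)
```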

The Gram matrix $K$ for a set of vectors $x_1, \ldots, x_n$ and a kernel $k$ has $K_{ij} = k(x_i, x_j)$. Claim: every Gram matrix is PSD. Just observe that $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$ for some $\phi$, so $v^\top K v = \sum_{i,j} v_i v_j \langle \phi(x_i), \phi(x_j) \rangle = ||\sum_i v_i \phi(x_i)||^2$, which is nonnegative.
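
A minimal numerical check of the claim, using the quadratic kernel from above as an example:

```python
import numpy as np

X = np.random.randn(20, 3)      # rows are observations
K = (X @ X.T) ** 2              # Gram matrix of the quadratic kernel k(x, y) = (x^T y)^2
eigs = np.linalg.eigvalsh(K)    # K is symmetric, so eigvalsh applies
assert eigs.min() > -1e-8       # PSD up to numerical error
```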

A binary function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is finitely PSD if its restriction to every finite set of input vectors forms a PSD matrix. Claim: a function is a kernel iff it is finitely PSD. ($\Rightarrow$) comes from the preceding paragraph. (For simplicity assume the input space is closed under scaling and linear combination; otherwise there is a little extra work.) ($\Leftarrow$): for a function $k$ under consideration, consider the set of functions $f = \sum_i \alpha_i k(x_i, \cdot)$, with $\langle f, g \rangle = \sum_{i,j} \alpha_i \beta_j k(x_i, x_j)$ for $g = \sum_j \beta_j k(x_j, \cdot)$. This is clearly real-valued, symmetric, and bilinear; positivity follows from the assumption that $k$ is finitely PSD. So we only need completeness and separability of the function space containing the $f$. We get separability if the input space is countable or the kernel is continuous (also see below). For completeness, observe that $|f_m(x) - f_n(x)| = |\langle f_m - f_n, k(x, \cdot) \rangle| \leq ||f_m - f_n|| \sqrt{k(x,x)}$ by Cauchy-Schwarz, so Cauchy sequences converge pointwise, and it suffices to additionally make our function space closed under such limits.
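
In particular, taking $g = k(x, \cdot)$ in the inner product above gives the reproducing property that gives these spaces their name:

\[ \langle f, k(x, \cdot) \rangle = \sum_i \alpha_i k(x_i, x) = f(x), \qquad \text{and in particular} \qquad \langle k(x, \cdot), k(y, \cdot) \rangle = k(x, y). \]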

We call the space constructed above the reproducing kernel Hilbert space associated with $k$.

Mercer’s theorem: Suppose $k$ is a continuous kernel on a compact domain $\mathcal{X}$. We can associate with it a linear operator $T_k$ of the form

\[ [T_k \phi](x) = \int_\mathcal{X} k(x,s) \phi(s) ds \]

for $\phi$ in $L^2(\mathcal{X})$. Then there is an orthonormal basis $\{e_j\}$ of $L^2(\mathcal{X})$ of eigenfunctions of $T_k$ with nonnegative eigenvalues $\lambda_j$, and $k$ has the form

\[ k(s,t) = \sum_{j=1}^\infty \lambda_j e_j(s) e_j(t) \]

Practical consequences: the RKHS constructed above has a countable orthonormal basis. Indeed, every kernel can be expressed as a sum, over some countable set of functions, of their products on pairs of inputs. In this view, every kernel is the covariance function of a distribution over a function class.

So the following are equivalent:

  1. Every Gram matrix of $k$ is PSD.

  2. The integral operator $T_k$ is PSD.

  3. We can express $k$ as $k(s,t) = \sum_j \lambda_j e_j(s) e_j(t)$ with $\lambda_j \geq 0$.

  4. $k$ is the reproducing kernel of an RKHS.

Kernel calculus

If $k_1$ and $k_2$ are kernels, $a > 0$, $f$ is a real-valued function on $\mathcal{X}$, $\psi$ is a function into $\mathbb{R}^N$ with $k_3$ a kernel over $\mathbb{R}^N$, and $B$ is PSD, the following are kernels:

  1. $k_1(x,y) + k_2(x,y)$

  2. $a\, k_1(x,y)$

  3. $k_1(x,y)\, k_2(x,y)$

  4. $f(x) f(y)$

  5. $k_3(\psi(x), \psi(y))$

  6. $x^\top B y$

  7. $\exp(k_1(x,y))$ (and more generally any power series in $k_1$ with nonnegative coefficients)

The last of these, combined with the normalization below, gives us the RBF kernel $k(x,y) = \exp(-||x - y||^2 / (2\sigma^2))$.

Norm of a feature representation: $||\phi(x)|| = \sqrt{k(x,x)}$. We can use this to normalize kernels: $\hat{k}(x,y) = k(x,y) / \sqrt{k(x,x)\, k(y,y)}$.

Distance between feature representations: $||\phi(x) - \phi(y)||^2 = k(x,x) - 2 k(x,y) + k(y,y)$.
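
A small numpy sketch of these identities (the cubic polynomial kernel here is just an arbitrary example choice):

```python
import numpy as np

def k(x, y):
    return (x @ y + 1) ** 3                      # example kernel

def k_normalized(x, y):
    # normalized kernel: features have unit norm, so |k_hat| <= 1
    return k(x, y) / np.sqrt(k(x, x) * k(y, y))

def feature_distance(x, y):
    # ||phi(x) - phi(y)||, computed from kernel evaluations only
    return np.sqrt(max(0.0, k(x, x) - 2 * k(x, y) + k(y, y)))
```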

Principal components analysis

The eigenvectors of the covariance matrix $X^\top X$ (rows of $X$ are centered observations) are the directions of maximum variance. (Let $u$ be a unit vector. Then the variance of the data projected onto $u$ is proportional to $u^\top X^\top X u$. This is a Rayleigh quotient, which is maximized, over unit vectors, by the eigenvector corresponding to the largest eigenvalue of $X^\top X$.)

So if we project the data onto the first $d$ principal components, we wind up with a low-dimensional orthogonal basis, and it’s easy to project other points.

What if we want to work in a kernelized representation? The covariance matrix is a sum of outer products, not inner products, so we can’t hope to compute the eigenvectors of the (kernelized) covariance matrix directly.

Call $U$ the matrix whose columns are the eigenvectors of the covariance matrix $\Phi^\top \Phi$ (rows of $\Phi$ are the feature vectors $\phi(x_i)$), and $A$ the matrix whose columns $\alpha^j$ are the unit-norm eigenvectors of the kernel matrix $K = \Phi \Phi^\top$. It is easy to show that if $\alpha^j$ is an eigenvector of $K$ then $\Phi^\top \alpha^j$ is an eigenvector of $\Phi^\top \Phi$, and vice-versa; in particular, $K \alpha^j = \lambda_j \alpha^j$ implies $\Phi^\top \Phi (\Phi^\top \alpha^j) = \lambda_j \Phi^\top \alpha^j$, and $||\Phi^\top \alpha^j||^2 = \lambda_j$. Thus the first $d$ columns of $U$ are given by $u^j = \lambda_j^{-1/2} \Phi^\top \alpha^j$. We can use this fact to project any vector $\phi(x)$ onto the first $d$ kernelized principal components:

\[ [P(\phi(x))]_j = \sum_{i=1}^l \alpha_i^j k(x_i, x) \]

where the $\alpha^j$ are the eigenvectors of $K$, rescaled so that $||\alpha^j||^2 = 1/\lambda_j$.
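
A minimal kernel-PCA sketch following this recipe (it assumes the kernel matrix has already been centered in feature space, which the notes don’t discuss):

```python
import numpy as np

def kernel_pca_projector(K, d):
    # Eigendecompose the (assumed centered) l x l kernel matrix; eigh returns
    # eigenvalues in ascending order, so reverse to take the top d.
    lam, A = np.linalg.eigh(K)
    lam, A = lam[::-1][:d], A[:, ::-1][:, :d]
    # Columns are alpha^j / sqrt(lambda_j); assumes the top d eigenvalues are positive.
    return A / np.sqrt(lam)

def project(k_vec, coeffs):
    # k_vec[i] = k(x_i, x) for a new point x; returns its d projection coordinates.
    return k_vec @ coeffs
```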

Kernel machines

Perceptron

A simple way to do online learning of a binary classifier: make a prediction $\hat{y} = \mathrm{sign}(\theta^\top x)$. If the prediction is wrong, update $\theta \leftarrow \theta + y x$.
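
A sketch of this loop (labels are assumed to be in $\{-1, +1\}$):

```python
import numpy as np

def perceptron(X, y, epochs=10):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if np.sign(theta @ x_i) != y_i:   # mistake (sign(0) counts as a mistake)
                theta += y_i * x_i            # step toward the correct side
    return theta
```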

Claim: if the data lies in some ball of radius $R$ and is linearly separable with margin $\gamma$, the perceptron algorithm terminates after no more than $R^2 / \gamma^2$ updates. Note that the margin attained by a unit vector $\theta^*$ is $\min_i y_i\, \theta^{*\top} x_i$, which is the (sign-corrected) projection of $x_i$ onto $\theta^*$, which lies perpendicular to the decision boundary.

General idea: we want to show that $\frac{\theta^\top \theta^*}{||\theta||}$ grows with every update but can never exceed 1. In particular, at any timestep an update adds $y_i x_i^\top \theta^* \geq \gamma$ to $\theta^\top \theta^*$, so after $M$ updates $\theta^\top \theta^* \geq M \gamma$. But $||\theta + y_i x_i||^2 = ||\theta||^2 + 2 y_i \theta^\top x_i + ||x_i||^2 \leq ||\theta||^2 + R^2$ (the middle term is nonpositive on a mistake), so $||\theta||^2 \leq M R^2$. Combining these, we have $1 \geq \frac{\theta^\top \theta^*}{||\theta||} \geq \frac{M \gamma}{\sqrt{M} R}$, whence $M \leq R^2 / \gamma^2$.

Kernel perceptron

Easy observation: if $\theta$ is initially zero, our final weight vector is just a signed sum of training observations, $\theta = \sum_i \alpha_i y_i x_i$ with mistake counts $\alpha_i$, so the prediction rule is $\mathrm{sign}(\sum_i \alpha_i y_i \langle x_i, x \rangle)$. So all we need are dot products between previous observations and the current example, which we can replace with kernel evaluations.
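
The same algorithm in dual form, keeping one mistake count per training example (a sketch; `k` is any kernel function and `y` a numpy array of ±1 labels):

```python
import numpy as np

def kernel_perceptron(X, y, k, epochs=10):
    n = len(y)
    alpha = np.zeros(n)                          # mistake count for each example
    K = np.array([[k(a, b) for b in X] for a in X])
    for _ in range(epochs):
        for i in range(n):
            score = np.sum(alpha * y * K[:, i])  # theta^T phi(x_i), implicitly
            if np.sign(score) != y[i]:
                alpha[i] += 1
    return alpha

# Prediction on a new point x: sign(sum_i alpha_i y_i k(x_i, x))
```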

Structured perceptron

Our “update only on mistakes” rule is implicitly a zero-one loss $\ell(y, \hat{y}) = \mathbf{1}[\hat{y} \neq y]$, and we can think of our update as taking a step in the direction of the difference between features that fire on the true label and features that fire on the best predicted label, multiplied by the loss (a little trickiness here with pushing the class label into the feature vector, but it works out). This corresponds to a loss function which is the difference between the model’s score on its prediction and its score on the right answer, $\theta^\top f(x, \hat{y}) - \theta^\top f(x, y)$. The update is $\theta \leftarrow \theta + f(x, y) - f(x, \hat{y})$, where $\hat{y} = \arg\max_{y'} \theta^\top f(x, y')$ is the model prediction.
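
A sketch of that update for a small label set, with a joint feature function $f(x, y)$ (the block-copy feature function below is just one illustrative choice):

```python
import numpy as np

def structured_perceptron_step(theta, x, y_true, labels, f):
    # Predict by argmax over labels, then step toward the true label's features.
    scores = [theta @ f(x, y) for y in labels]
    y_hat = labels[int(np.argmax(scores))]
    if y_hat != y_true:
        theta = theta + f(x, y_true) - f(x, y_hat)
    return theta

def f(x, y, L=3):
    # "Push the label into the feature vector": copy x into the block for label y.
    out = np.zeros(L * len(x))
    out[y * len(x):(y + 1) * len(x)] = x
    return out
```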

SVM

Rather than using a procedure which will find some acceptable split of the data eventually, let’s explicitly try to maximize the margin. Set up an optimization problem of the form:

\[ \min_\theta \frac{1}{2}||\theta||^2 \] s.t. \[ y_i\, \theta^\top x_i \geq 1 \; \forall i \]

Note that the minimal $\theta$ will assign a score of exactly 1 or -1 to points on the margin (if it gave a higher score then we could shrink $\theta$ and still be feasible). But then the margin is $1 / ||\theta||$, so the objective just wants to maximize the margin.

Dual SVM

Let’s form the Lagrangian dual of the above objective.

\[ L(\theta, \alpha) = \frac{1}{2}||\theta||^2 + \sum_i \alpha_i (1 - y_i \theta^\top x_i) \]

Form the dual, and take the inf over $\theta$ first. Setting derivatives with respect to $\theta$ equal to zero, we find

\[ \theta = \sum_i \alpha_i y_i x_i \]

from which (with a little algebra) we get

\[ g(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^\top x_j \]

so again, we only need inner products between training examples (and between training examples and the test point, since $\theta^\top x = \sum_i \alpha_i y_i x_i^\top x$) to make our prediction.
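
For completeness, the omitted algebra: substituting $\theta = \sum_i \alpha_i y_i x_i$ back into the Lagrangian gives

\[ g(\alpha) = \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_i \alpha_i - \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^\top x_j = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^\top x_j, \]

which is maximized subject to $\alpha_i \geq 0$.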

Useful fact from optimization: complementary slackness. If duality is tight, then $\alpha_i (1 - y_i \theta^\top x_i) = 0$ for every $i$; in particular, a dual variable can be nonzero only if the corresponding primal constraint is tight. Thus the $\alpha_i$ are sparse: only terms associated with support vectors are greater than zero. (This is because $\sum_i \alpha_i (1 - y_i \theta^\top x_i) = 0$ at the optimum, every $\alpha_i$ is nonnegative, and every $(1 - y_i \theta^\top x_i)$ is nonpositive, so each term in the sum must be zero.)

How do we know if duality is tight? The KKT conditions: $\theta$ is primal feasible, the $\alpha_i$ are dual feasible ($\alpha_i \geq 0$), complementary slackness holds, and all partial derivatives of the Lagrangian are zero. For a convex problem like this one, these are sufficient for a zero duality gap (and, under mild regularity conditions, necessary at the optimum).

Soft-margin SVM

What if there’s no feasible solution? New objective:

\[ \min_{\theta,\xi} \frac{1}{2}||\theta||^2 + \frac{C}{n}\sum_i \xi_i \] s.t. \[ \xi_i \geq 0 \; \forall i \] \[ y_i \theta^\top x_i \geq 1 - \xi_i \; \forall i \]

The new dual is:

\[ \max_\alpha \sum_i \alpha_i -\frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^\top x_j \] s.t. \[ 0 \leq \alpha_i \leq \frac{C}{n} \; \forall i \]
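
A minimal sketch of solving this dual numerically by projected gradient ascent (just one of many possible solvers, not a method the notes prescribe; it assumes the bias-free formulation above, so there is no equality constraint $\sum_i \alpha_i y_i = 0$):

```python
import numpy as np

def soft_margin_dual(K, y, C=1.0, lr=1e-3, iters=5000):
    # Projected gradient ascent on g(alpha) = sum(alpha) - 1/2 alpha^T Q alpha,
    # Q_ij = y_i y_j K_ij, subject to the box constraint 0 <= alpha_i <= C/n.
    n = len(y)
    Q = (y[:, None] * y[None, :]) * K
    alpha = np.zeros(n)
    for _ in range(iters):
        grad = 1.0 - Q @ alpha
        alpha = np.clip(alpha + lr * grad, 0.0, C / n)
    return alpha

def decision_values(K_new, y, alpha):
    # theta^T phi(x) = sum_i alpha_i y_i k(x_i, x); K_new[m, i] = k(x_i, x_m)
    return K_new @ (alpha * y)
```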

(Dual) linear regression

Optimal params for linear regression have the form $\theta = X^\top \alpha = \sum_i \alpha_i x_i$ for some vector $\alpha$, so the weight vector is just a linear combination of training examples, and predictions $\theta^\top x = \sum_i \alpha_i x_i^\top x$ only need inner products. The kernelized version follows.
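
One concrete instance is kernel ridge regression, a minimal sketch assuming the standard regularized closed form $\alpha = (K + \lambda I)^{-1} y$ (the notes don’t spell this variant out):

```python
import numpy as np

def kernel_ridge_fit(K, y, lam=1e-2):
    # alpha solves (K + lam I) alpha = y; the predictor is f(x) = sum_i alpha_i k(x_i, x)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(K_new, alpha):
    # K_new[m, i] = k(x_i, x_m) for new points x_m
    return K_new @ alpha
```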

TODO: in what sense is this a dual?

The representer theorem

It’s convenient that these optimization problems keep winding up with forms that admit kernelization. How generally true is this? Theorem: fix a kernel $k$ with associated RKHS $\mathcal{H}$. For any loss function $L$ and nondecreasing $\Omega$, the objective

\[ J(f) = L(f(x_1),\ \ldots,\ f(x_n)) + \Omega(||f||_\mathcal{H}^2) \]

is minimized by some

\[ f = \sum_{i=1}^n \alpha_i k(x_i, \cdot) \]

Moreover, if $\Omega$ is strictly increasing, then every minimizer of $J$ can be expressed in this form.

Proof of main claim: consider the projection of a candidate minimizer $f$ onto the subspace spanned by the RKHS elements $k(x_i, \cdot)$ associated with the training data. Decompose $f$ into this projection (a parallel part $f_\parallel$) and an orthogonal part $f_\perp$, so $f = f_\parallel + f_\perp$. Then $f(x_i) = \langle f, k(x_i, \cdot) \rangle = \langle f_\parallel, k(x_i, \cdot) \rangle = f_\parallel(x_i)$, so the loss term is unchanged. But $||f||^2 = ||f_\parallel||^2 + ||f_\perp||^2 \geq ||f_\parallel||^2$, so $\Omega(||f_\parallel||^2) \leq \Omega(||f||^2)$. So $f_\parallel$ (which is a linear combination of the $k(x_i, \cdot)$) is no worse than the best function in the function space.

— 28 July 2014