
Prelim notes: Mostly kernel methods

Preliminaries

Given a vector space over the reals, we say $f$ is linear if $f(cx) = cf(x)$ and $f(x+y) = f(x) + f(y)$. An inner product space has a real symmetric bilinear map $\langle \cdot, \cdot \rangle$ with $\langle x, x \rangle \ge 0$; the inner product is strict (positive definite) if $\langle x, x \rangle > 0$ whenever $x \ne 0$. The norm and distance are then defined from the inner product as in the Euclidean case.

A Hilbert space is an inner product space that is also

  1. complete: contains the limit of every Cauchy sequence (one for which $\sup_{m > n} \|h_n - h_m\| \to 0$ as $n \to \infty$).

  2. separable: contains a countable dense subset (equiv: contains a countable orthonormal basis).

A kernel is a function $k$ such that $k(x,y) = \langle \phi(x), \phi(y) \rangle$ for some feature function $\phi$ into a Hilbert space.

Note that it is often possible to compute the value of a kernel without explicitly constructing $\phi$: consider $\phi(x) = (x_1^2,\, x_2^2,\, \sqrt{2}\,x_1 x_2)$ for $x \in \mathbb{R}^2$. Then $k(x,y) = \langle \phi(x), \phi(y) \rangle = \langle x, y \rangle^2$.
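A quick numerical check of this (a minimal sketch; the particular test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel on R^2.
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k(x, y):
    # Kernel trick: same value without constructing phi.
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
assert np.isclose(np.dot(phi(x), phi(y)), k(x, y))
```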

The Gram matrix for a set of vectors $\{x_i\}$ and a kernel $k$ has $G_{ij} = k(x_i, x_j)$. Claim: every Gram matrix is PSD: just observe that $G = P^\top P$ for some $P$ (whose columns are the feature vectors $\phi(x_i)$), so $z^\top P^\top P z = \|Pz\|^2$, which is nonnegative.
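A small sanity check of the claim, using the quadratic kernel above on random points (the tolerance just absorbs floating-point error):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))            # 10 points in R^2
G = np.dot(X, X.T) ** 2                 # Gram matrix of the quadratic kernel
eigenvalues = np.linalg.eigvalsh(G)
assert np.all(eigenvalues >= -1e-9)     # PSD up to numerical error
```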

A binary function is finitely PSD if its restriction to every finite set of points forms a PSD matrix. Claim: a function is a kernel iff it is finitely PSD. ($\Rightarrow$) comes from the preceding paragraph. (For simplicity assume $X$ is closed under scaling and linear combination; otherwise there's a little extra work.) ($\Leftarrow$): for a function $k$ under consideration, consider the set of functions $k_x = k(x, \cdot)$, with $\langle k_x, k_y \rangle = k(x,y)$. This is clearly real-valued, symmetric, and bilinear; positivity follows from the assumption that $k$ is finitely PSD. So we only need completeness and separability of the function space containing the $k_x$. We get separability if the input space is countable or the kernel is continuous (also see below). For completeness, observe that $(k_a(x) - k_b(x))^2 = \langle k_a - k_b, k_x \rangle^2 \le \|k_a - k_b\|^2\, k(x,x)$ by Cauchy-Schwarz, so pointwise convergence is controlled by norm convergence and it suffices to additionally close our function space under such limits.

We call the space constructed above the reproducing kernel Hilbert space associated with k.
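The name comes from the reproducing property, which follows directly from the construction: for $f = \sum_i c_i\, k_{x_i}$,

$$\langle f, k_x \rangle = \sum_i c_i \langle k_{x_i}, k_x \rangle = \sum_i c_i\, k(x_i, x) = f(x),$$

so evaluating a function at $x$ is the same as taking an inner product with $k_x$.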

Mercer’s theorem: Suppose $k$ is a continuous kernel on a compact domain $X$. We can associate with it a linear operator $T_k$ of the form

$$[T_k \phi](x) = \int_X k(x, s)\,\phi(s)\,ds$$

for $\phi \in L^2(X)$. Then there is an orthonormal basis $\{e_j\}$ of $L^2(X)$ consisting of eigenfunctions of $T_k$ with nonnegative eigenvalues $\lambda_j$, and $k$ has the form

$$k(s,t) = \sum_{j=1}^{\infty} \lambda_j\, e_j(s)\, e_j(t)$$

Practical consequences: the RKHS constructed above has an orthonormal basis. Indeed, every kernel can be expressed as a sum, over some countable set of functions, of products of their values at the two inputs. In this view, every kernel is the covariance function of some distribution over a function class.
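To make the covariance view concrete, here is a minimal sketch (assuming the RBF kernel; the grid and jitter term are just for illustration and numerical stability) that treats the Gram matrix as the covariance of a Gaussian over function values:

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    # Gaussian (RBF) kernel on scalars.
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

xs = np.linspace(-3, 3, 50)
K = np.array([[rbf(a, b) for b in xs] for a in xs])

# Treat K as a covariance: Cov(f(x_i), f(x_j)) = k(x_i, x_j).
# Each draw is then a random function evaluated on the grid.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(xs)),
                                  K + 1e-8 * np.eye(len(xs)), size=3)
```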

So the following are equivalent:

  1. Every Gram matrix is PSD

  2. The integral operator $T_k$ is PSD

  3. We can express $k(x,y)$ as $\sum_i \lambda_i\, \phi_i(x)\, \phi_i(y)$

  4. k is the reproducing kernel of an RKHS.

Kernel calculus

If $k_1$ and $k_2$ are kernels, $f$ is a function into $\mathbb{R}$, $\phi$ is a function into $\mathbb{R}^n$, and $B$ is PSD, the following are kernels:

  1. $k_1(x,y) + k_2(x,y)$

  2. $c\,k_1(x,y)$ for $c \ge 0$

  3. $k_1(x,y)\,k_2(x,y)$

  4. $f(x)\,f(y)$

  5. $k_1(\phi(x), \phi(y))$

  6. $x^\top B y$

  7. $\exp(k_1(x,y))$

The last of these gives us the RBF kernel: $\exp(-\|x-y\|^2/2\sigma^2)$ is $\exp(\langle x,y\rangle/\sigma^2)$ multiplied by $f(x)f(y)$ with $f(x) = \exp(-\|x\|^2/2\sigma^2)$ (equivalently, it is the normalized version of $\exp(\langle x,y\rangle/\sigma^2)$; see below).

Norm of a feature representation: $\|\phi(x)\| = \sqrt{k(x,x)}$. We can use this to normalize kernels: $\hat{k}(x,y) = k(x,y)/\sqrt{k(x,x)\,k(y,y)}$.

Squared distance between feature representations: $\|\phi(x) - \phi(y)\|^2 = k(x,x) + k(y,y) - 2k(x,y)$.
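A small numerical illustration of these formulas (a sketch; the base kernel $\exp(\langle x, y\rangle/\sigma^2)$ and the test points are just for illustration): normalizing the exponentiated linear kernel recovers the RBF kernel, and the distance formula applies to any kernel.

```python
import numpy as np

sigma = 1.5

def k_exp(x, y):
    # Exponentiated (scaled) linear kernel.
    return np.exp(np.dot(x, y) / sigma**2)

def k_normalized(x, y):
    # Normalized kernel: k(x, y) / sqrt(k(x, x) k(y, y)).
    return k_exp(x, y) / np.sqrt(k_exp(x, x) * k_exp(y, y))

def feature_distance_sq(k, x, y):
    # Squared distance between feature representations.
    return k(x, x) + k(y, y) - 2 * k(x, y)

x, y = np.array([1.0, 0.5]), np.array([-0.2, 2.0])
rbf = np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))
assert np.isclose(k_normalized(x, y), rbf)
print(feature_distance_sq(k_normalized, x, y))
```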

Principal components analysis

The eigenvectors of the covariance matrix $X^\top X$ (rows of $X$ are observations, assumed centered) are the directions of maximum variance. (Let $w$ be a unit vector. Then the variance of the data projected onto $w$ is proportional to $w^\top X^\top X w$. This is a Rayleigh quotient, whose maximum value is the largest eigenvalue of $X^\top X$, attained when $w$ is the corresponding eigenvector.)

So if we project the data onto the first k principal components, we wind up with a low-dimensional orthogonal basis, and it’s easy to project other points.

What if we want to work in a kernelized representation? The covariance matrix is a sum of outer products, not inner products, so we can't hope to compute the eigenvectors of the (kernelized) covariance matrix directly.

Call $U$ the matrix whose columns are the unit eigenvectors of the covariance matrix $C = X^\top X$, and $V$ the matrix whose columns are the unit eigenvectors of the kernel matrix $K = X X^\top$. It is easy to show that $X^\top v$ is an eigenvector of $C$ whenever $v$ is an eigenvector of $K$, and vice versa; in particular, $\lambda^{-1/2} X^\top v = u$ and $\lambda^{-1/2} X u = v$. Thus the first $t$ columns of $U$ are given by $X^\top V_t \Lambda_t^{-1/2}$. We can use this fact to project any vector onto the first $t$ kernelized principal components:

$$P_j(\phi(x)) = \sum_{i=1}^{n} \alpha^j_i\, k(x_i, x)$$

where $\alpha^j = \lambda_j^{-1/2} v_j$ is the $j$-th column of $V_t \Lambda_t^{-1/2}$.
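A minimal kernel PCA sketch along these lines (it assumes the data is already centered in feature space; with the linear kernel it reduces to ordinary PCA up to sign):

```python
import numpy as np

def kernel_pca_projection(K, K_new, t):
    """Project points onto the first t kernelized principal components.

    K     : (n, n) Gram matrix of the training points
    K_new : (m, n) kernel values k(x_i, x) between training and new points
    t     : number of components
    """
    eigvals, eigvecs = np.linalg.eigh(K)        # ascending order
    order = np.argsort(eigvals)[::-1][:t]       # top t eigenpairs
    lam, V = eigvals[order], eigvecs[:, order]
    alpha = V / np.sqrt(lam)                    # alpha^j = lambda_j^{-1/2} v_j
    return K_new @ alpha                        # (m, t) projections

# Example with the linear kernel.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X -= X.mean(axis=0)
K = X @ X.T
Z = kernel_pca_projection(K, K, t=2)
```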

Kernel machines

Perceptron

A simple way to do online learning of a binary classifier: make a prediction $\mathrm{sgn}(\theta^\top x_i)$. If the prediction is wrong, $\theta \leftarrow \theta + y_i x_i$.
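In code, a minimal sketch (rows of `X` are examples, labels `y` are ±1):

```python
import numpy as np

def perceptron(X, y, n_epochs=10):
    # Online perceptron: predict sign(theta . x), update only on mistakes.
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            if np.sign(theta @ x_i) != y_i:
                theta += y_i * x_i
    return theta
```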

Claim: if the data lies in some ball of radius $R$ and is linearly separable with margin $\gamma$, the perceptron algorithm terminates after no more than $R^2/\gamma^2$ updates. Here $\gamma = \min_i y_i\,\theta^\top x_i / \|\theta\|$ for a fixed separating $\theta$, which is the (sign-corrected) projection of $x_i$ onto $\theta$; $\theta$ lies perpendicular to the decision boundary.

General idea: we want to show that $\theta_t^\top \theta / (\|\theta_t\|\,\|\theta\|)$ grows toward 1. In particular, at any update step $\theta_{t+1}^\top \theta = (\theta_t + y_i x_i)^\top \theta \ge \theta_t^\top \theta + \gamma\|\theta\|$, so $\theta_t^\top \theta \ge t\gamma\|\theta\|$. But $\|\theta_{t+1}\|^2 = \|\theta_t + y_i x_i\|^2 = \|\theta_t\|^2 + \|x_i\|^2 + 2 y_i\,\theta_t^\top x_i \le \|\theta_t\|^2 + R^2$ (the cross term is nonpositive because we only update on mistakes), so $\|\theta_t\|^2 \le t R^2$. Combining these, we have $t\gamma\|\theta\| \le \theta_t^\top \theta \le \|\theta_t\|\,\|\theta\| \le \sqrt{t}\,R\,\|\theta\|$, whence $t \le R^2/\gamma^2$.

Kernel perceptron

Easy observation: if $\theta$ is initially zero, the final weight vector is just a signed sum of training examples. So all we need to make predictions are dot products between previous examples and the current one.
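A kernelized sketch along these lines, where `alpha[i]` counts the updates made on example `i` (any kernel from the calculus above can be plugged in; `y` is a ±1 array):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, n_epochs=10):
    # theta = sum_i alpha_i y_i phi(x_i), so we only ever need kernel values.
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    for _ in range(n_epochs):
        for i in range(n):
            score = np.sum(alpha * y * K[:, i])
            if np.sign(score) != y[i]:
                alpha[i] += 1
    return alpha

def predict(alpha, X_train, y_train, kernel, x):
    return np.sign(sum(a * yi * kernel(xi, x)
                       for a, yi, xi in zip(alpha, y_train, X_train)))
```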

Structured perceptron

Our “update only on mistakes” rule implicitly corresponds to a zero-one loss $\mathbf{1}[y_i\,\theta^\top x_i < 0]$, and we can think of our update as taking a step in the direction of the difference between the features that fire on the true label and the features that fire on the best predicted label, multiplied by the loss (a little trickiness here with pushing the class label into the feature vector, but it works out). This corresponds to a loss function which is the difference between the model's score on its prediction and its score on the right answer. The update is $\theta \leftarrow \theta + \phi(x, y) - \phi(x, \hat{y})$ where $\hat{y}$ is the model prediction.
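A sketch of one structured update, where the joint feature map `phi(x, y)` and the candidate set `labels` are placeholders to be supplied by the task (a real implementation replaces the `max` with a task-specific decoder):

```python
import numpy as np

def structured_perceptron_step(theta, phi, labels, x, y_true):
    # Predict the highest-scoring label, then move theta toward the true
    # label's features and away from the predicted label's features.
    y_pred = max(labels, key=lambda y: theta @ phi(x, y))
    if y_pred != y_true:
        theta = theta + phi(x, y_true) - phi(x, y_pred)
    return theta
```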

SVM

Rather than using a procedure which will find some acceptable split of the data eventually, let’s explicitly try to maximize the margin. Set up an optimization problem of the form:

$$\min_\theta \|\theta\| \quad \text{s.t.} \quad y_i\,\theta^\top x_i \ge 1 \;\;\forall i$$

Note that the minimal $\theta$ will assign exactly 1 or -1 to the points closest to the boundary (if every constraint were slack, we could scale $\theta$ down and remain feasible). But then the margin width is $2|\theta^\top x_i|/\|\theta\| = 2/\|\theta\|$, so minimizing $\|\theta\|$ is exactly maximizing the margin.

Dual SVM

Let’s form the Lagrangian of the above objective (using the equivalent $\frac{1}{2}\|\theta\|^2$ in place of $\|\theta\|$).

$$L(\theta, \alpha) = \frac{1}{2}\|\theta\|^2 + \sum_i \alpha_i (1 - y_i\,\theta^\top x_i)$$

Form the dual, and take the inf over θ first. Setting derivatives equal to zero, we find

$$\theta = \sum_i \alpha_i y_i x_i$$

from which (with a little algebra) we get

$$g(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j$$

so again, we only need inner products between previously-seen training examples to make our prediction.

Useful fact from optimization: complementary slackness. If duality is tight, then whenever a primal constraint is slack, the corresponding dual variable is zero. Thus the $\alpha$s are sparse: only terms associated with support vectors are greater than zero. (This is because $\sum_i \lambda_i f_i(x) = 0$ at the optimum, every $\lambda_i$ is nonnegative and every $f_i$ is nonpositive, so each term in the sum must be zero.)

How do we know if duality is tight? KKT conditions: $\theta$ is primal feasible, $\alpha$ is dual feasible, complementary slackness holds, and the partial derivatives of the Lagrangian vanish. These are necessary for optimality, and sufficient here since the problem is convex.

Soft-margin SVM

What if there’s no feasible solution? New objective:

$$\min_{\theta, \xi}\; \frac{1}{2}\|\theta\|^2 + \frac{C}{n}\sum_i \xi_i \quad \text{s.t.} \quad \xi_i \ge 0,\;\; y_i\,\theta^\top x_i \ge 1 - \xi_i$$

The new dual is:

$$\max_\alpha\; \sum_i \alpha_i - \frac{1}{2}\sum_{ij} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.} \quad 0 \le \alpha_i \le \frac{C}{n}$$
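A toy sketch of solving this dual by projected gradient ascent (nothing like a real SMO solver; since the primal above has no bias term, there is no equality constraint on $\alpha$ and the projection is just a clip to the box):

```python
import numpy as np

def soft_margin_dual(K, y, C, n_steps=1000, lr=1e-3):
    # Maximize sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K_ij
    # subject to 0 <= alpha_i <= C/n.
    n = len(y)
    Q = (y[:, None] * y[None, :]) * K
    alpha = np.zeros(n)
    for _ in range(n_steps):
        grad = 1.0 - Q @ alpha          # gradient of the dual objective
        alpha = np.clip(alpha + lr * grad, 0.0, C / n)
    return alpha

# Predictions again need only kernel values:
# f(x) = sign(sum_i alpha_i y_i k(x_i, x)).
```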

(Dual) linear regression

Optimal parameters for linear regression have the form $(X^\top X)^{-1} X^\top y = (X^\top X)(X^\top X)^{-2} X^\top y = X^\top \alpha$ with $\alpha = X (X^\top X)^{-2} X^\top y$, so the predictor is just a linear combination of training examples. The kernelized version follows.

TODO: in what sense is this a dual?
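A sketch of the kernelized predictor, with a small ridge term added for invertibility (the derivation above is the unregularized case, so the $\lambda$ here is an extra assumption):

```python
import numpy as np

def kernel_ridge_fit(K, y, lam=1e-6):
    # Solve (K + lam * I) alpha = y; the predictor is then a linear
    # combination of kernel evaluations at the training points.
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(alpha, X_train, kernel, x):
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train))
```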

The representer theorem

It’s convenient that these optimization problems keep winding up with forms that admit kernelization. How generally true is this? Theorem: fix a kernel k associated with an RKHS H. For any loss function L and nondecreasing Ω,

$$J(f) = L(f(x_1), \ldots, f(x_n)) + \Omega(\|f\|_{\mathcal{H}}^2)$$

is minimized by some

$$f = \sum_{i=1}^{n} \alpha_i\, k(x_i, \cdot)$$

Moreover, if Ω is increasing, then every minimizer of J(f) can be expressed in this form.

Proof of main claim: consider the projection of $f$ onto the subspace spanned by the RKHS elements $k(x_i, \cdot)$ associated with the training data. Decompose $f$ into this projection (a parallel part $f_\parallel$) and an orthogonal part $f_\perp$. Then $f(x_i) = \langle f, k(x_i, \cdot)\rangle = \langle f_\parallel, k(x_i, \cdot)\rangle = f_\parallel(x_i)$, so the loss term is unchanged. But $\|f\|^2 \ge \|f_\parallel\|^2$ and $\Omega$ is nondecreasing, so $J(f) = L(\ldots) + \Omega(\|f\|^2) \ge L(\ldots) + \Omega(\|f_\parallel\|^2) = J(f_\parallel)$. So for any $f$, its projection $f_\parallel$ (which is a linear combination of the $k(x_i, \cdot)$) does at least as well, which gives the claim.

— 28 July 2014