Given a vector space over the reals, we say f is linear if f(cx) = cf(x) and f(x+y) = f(x) + f(y). An inner product space has a real, symmetric, bilinear map ⟨·,·⟩ with ⟨x,x⟩ ≥ 0; the inner product is strict (positive definite) if ⟨x,x⟩ > 0 whenever x ≠ 0. Norms and distances are then defined as in the Euclidean case.
A Hilbert space is an inner product space that is also
complete: contains the limit of every Cauchy sequence (one for which sup_{m>n} ||h_n − h_m|| → 0 as n → ∞).
separable: contains a countable dense subset (equiv: contains a countable orthonormal basis).
A kernel is a function k such that k(x,y)=⟨ϕ(x),ϕ(y)⟩ for some feature function ϕ into a Hilbert space.
Note that it is often possible to compute the value of a kernel without explicitly constructing ϕ: consider ϕ(x) = (x_1², x_2², √2·x_1x_2). Then k(x,y) = ⟨ϕ(x), ϕ(y)⟩ = ⟨x,y⟩².
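A minimal numpy sketch of this example (the function names are mine):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k(x, y):
    """Kernel trick: the same inner product, with no explicit feature map."""
    return np.dot(x, y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(np.dot(phi(x), phi(y)), k(x, y))  # both give <x, y>^2 = 1.0
```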
The Gram matrix for a set of vectors {x_i} and a kernel k has G_ij = k(x_i, x_j). Claim: every Gram matrix is PSD—just observe that G = P⊤P, where the i-th column of P is ϕ(x_i), so z⊤Gz = z⊤P⊤Pz = ||Pz||², which is nonnegative.
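A quick numerical check of this claim, using the degree-2 polynomial kernel above (the random data is my own):

```python
import numpy as np

# Build a Gram matrix for k(x, y) = <x, y>^2 and confirm numerically that it is PSD
# (all eigenvalues >= 0 up to floating-point error).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))        # rows are points x_i
G = (X @ X.T) ** 2                  # G_ij = k(x_i, x_j) = <x_i, x_j>^2
eigvals = np.linalg.eigvalsh(G)     # G is symmetric, so use eigvalsh
assert eigvals.min() > -1e-8        # PSD up to numerical tolerance
```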
A binary function k is finitely PSD if its restriction to every finite set of points forms a PSD matrix. Claim: a function is a kernel iff it is finitely PSD. (⇒) follows from the preceding paragraph. (For simplicity assume X is closed under scaling and linear combination—otherwise there’s a little extra work.) (⇐): for a function k under consideration, consider the span of the functions k_x = k(x,⋅), with ⟨k_x, k_y⟩ = k(x,y). This inner product is clearly real-valued, symmetric, and bilinear; positivity follows from the assumption that k is finitely PSD. So we only need completeness and separability of the function space containing the k_x. We get separability if the input space is countable or the kernel is continuous (also see below). For completeness, observe that (k_a(x) − k_b(x))² = ⟨k_a − k_b, k_x⟩² ≤ ||k_a − k_b||² k(x,x) by Cauchy–Schwarz, so Cauchy sequences converge pointwise, and it suffices to additionally close our function space under such pointwise limits.
We call the space constructed above the reproducing kernel Hilbert space associated with k.
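The construction can be made concrete for finite linear combinations of the k_x (completeness, of course, is not visible in finite code). A sketch, with my own choice of kernel and points:

```python
import numpy as np

# Represent f = sum_i c_i k(x_i, .) by its coefficient vector c over a fixed set of
# points. Then <f, g> = c^T K d, and the reproducing property says f(x) = <f, k(x, .)>.

def k(x, y):
    return (1.0 + np.dot(x, y)) ** 2              # any kernel works; this one is polynomial

points = np.array([[0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]])
K = np.array([[k(a, b) for b in points] for a in points])

c = np.array([0.5, -1.0, 2.0])                    # f = sum_i c_i k(points[i], .)

def f(x):
    return sum(ci * k(pi, x) for ci, pi in zip(c, points))

# Reproducing property: f(points[1]) equals <f, k_{points[1]}> = (K @ c)[1]
assert np.isclose(f(points[1]), (K @ c)[1])
# Norm: ||f||^2 = c^T K c >= 0 since K is PSD
assert c @ K @ c >= 0
```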
Mercer’s theorem: Suppose k is a continuous kernel on a compact domain X. We can associate with it a linear operator T_k of the form
(T_k f)(x) = ∫_X k(x, s) f(s) ds
for f in L2(X). Then there is an orthonormal basis {ϕ_i} of L2(X) consisting of eigenfunctions of T_k with nonnegative eigenvalues λ_i, and k has the form
k(x, y) = Σ_i λ_i ϕ_i(x) ϕ_i(y).
Practical consequences: the RKHS constructed above has a countable orthonormal basis. Indeed, every continuous kernel can be expressed as a sum, over some countable set of functions, of the products of those functions evaluated at the two inputs. In this view, every kernel is the covariance function of some distribution over a function class.
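One way to see the covariance view numerically: draw function values at a grid of inputs from a zero-mean Gaussian whose covariance is the Gram matrix. A Gaussian-process-style sketch; the RBF kernel and the jitter term are my choices:

```python
import numpy as np

# Sample random functions whose covariance is the kernel: f ~ N(0, K) at grid points.
xs = np.linspace(-3, 3, 100)
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2)   # RBF Gram matrix on the grid
K += 1e-6 * np.eye(len(xs))                           # jitter for numerical stability
L = np.linalg.cholesky(K)
rng = np.random.default_rng(1)
samples = L @ rng.normal(size=(len(xs), 3))           # three draws; Cov[f(x), f(y)] = k(x, y)
```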
So the following are equivalent:
Every Gram matrix is PSD
The integral operator T_k is PSD
We can express k(x,y) as Σ_i λ_i ϕ_i(x) ϕ_i(y)
k is the reproducing kernel of an RKHS.
If k1 and k2 are kernels, f is a function into R, ϕ is a function into R^n, and B is PSD, the following are kernels:
c·k1(x,y) for c ≥ 0, and k1(x,y) + k2(x,y)
k1(x,y)·k2(x,y)
f(x)·f(y)
k1(ϕ(x), ϕ(y))
x⊤By
exp(k1(x,y))
The last of these, applied to the scaled linear kernel ⟨x,y⟩/σ² and followed by the normalization described below, gives us the RBF kernel exp(−||x−y||²/(2σ²)).
Norm of a feature representation: ||ϕ(x)|| = √k(x,x). We can use this to normalize kernels: k_norm(x,y) = k(x,y)/√(k(x,x)·k(y,y)).
Squared distance between feature representations: ||ϕ(x) − ϕ(y)||² = k(x,x) + k(y,y) − 2k(x,y).
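A sketch of a few of these operations together (closure under exponentiation, normalization, and the feature-space distance); the kernels and test points are my choices:

```python
import numpy as np

def k_lin(x, y):
    return np.dot(x, y)

# Closure: sums, products, and exponentials of kernels are again kernels.
def k_sum(x, y):  return k_lin(x, y) + k_lin(x, y) ** 2
def k_exp(x, y):  return np.exp(k_lin(x, y))

# Normalization: k_norm(x, y) = k(x, y) / sqrt(k(x, x) k(y, y)).
def normalize(k):
    return lambda x, y: k(x, y) / np.sqrt(k(x, x) * k(y, y))

# Normalizing exp(<x, y>/sigma^2) recovers the RBF kernel exp(-||x - y||^2 / (2 sigma^2)).
sigma = 1.5
k_rbf = normalize(lambda x, y: np.exp(k_lin(x, y) / sigma ** 2))

# Squared feature-space distance: ||phi(x) - phi(y)||^2 = k(x,x) + k(y,y) - 2 k(x,y).
def dist2(k, x, y):
    return k(x, x) + k(y, y) - 2 * k(x, y)

x, y = np.array([1.0, 0.0]), np.array([0.5, -0.5])
assert np.isclose(k_rbf(x, y), np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2)))
```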
The eigenvectors of the covariance matrix X⊤X (rows of X are centered observations) are the directions of maximum variance. (Let w be a unit vector. Then the variance of the data projected onto w is proportional to w⊤X⊤Xw. This is a Rayleigh quotient, which is maximized, with value the largest eigenvalue of X⊤X, when w is the corresponding leading eigenvector.)
So if we project the data onto the first k principal components, we wind up with a low-dimensional orthogonal basis, and it’s easy to project other points.
What if we want to work in a kernelized representation? The covariance matrix in feature space is a sum of outer products, not inner products, so we can’t hope to compute its eigenvectors directly.
Call C = X⊤X the covariance matrix and K = XX⊤ the kernel matrix; let U be the matrix whose columns are the eigenvectors of C, and V the matrix whose columns are the eigenvectors of K. It is easy to show that if v is an eigenvector of K with eigenvalue λ, then X⊤v is an eigenvector of C with the same eigenvalue, and vice-versa; after normalizing to unit length, u = λ^{−1/2} X⊤v and v = λ^{−1/2} Xu. Thus the first t columns of U are given by X⊤V_t Λ_t^{−1/2}. We can use this fact to project any new point x onto the first t kernelized principal components: the j-th coordinate of the projection is
Σ_i (α_j)_i k(x_i, x),
where α_j = λ_j^{−1/2} v_j.
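A sketch of the kernel PCA projection along these lines, assuming a polynomial kernel and ignoring the centering of the data in feature space that a careful implementation would include:

```python
import numpy as np

def k(x, y):
    return (1.0 + x @ y) ** 2                     # polynomial kernel, my choice

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # rows are (assumed centered) observations
K = np.array([[k(a, b) for b in X] for a in X])   # kernel (Gram) matrix

lam, V = np.linalg.eigh(K)                        # eigh returns ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]                    # sort descending

t = 2
alphas = V[:, :t] / np.sqrt(lam[:t])              # column j is alpha_j = lambda_j^{-1/2} v_j

def project(x):
    """Coordinates of phi(x) on the first t kernel principal components."""
    kx = np.array([k(xi, x) for xi in X])         # k(x_i, x) for each training point
    return alphas.T @ kx                          # j-th entry: sum_i (alpha_j)_i k(x_i, x)

z = project(rng.normal(size=3))
```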
A simple way to do online learning of a binary classifier: make a prediction sgn(θ⊤x_i). If the prediction is wrong, update θ ← θ + y_i x_i.
Claim: if the data lies in some ball of radius R and is linearly separable with margin γ, the perceptron algorithm terminates after no more than R²/γ² updates. Here γ = min_i y_i θ*⊤x_i / ||θ*|| for a separating θ*, which is the (sign-corrected) projection of x_i onto θ*, and θ* lies perpendicular to the decision boundary.
General idea: we want to show that θ⊤θ*/(||θ|| ||θ*||) grows close to 1. In particular, whenever we update at timestep t, θ_{t+1}⊤θ* = (θ_t + y_i x_i)⊤θ* ≥ θ_t⊤θ* + γ||θ*||, so θ_t⊤θ* ≥ tγ||θ*||. But ||θ_{t+1}||² = ||θ_t + y_i x_i||² = ||θ_t||² + ||x_i||² + 2y_i θ_t⊤x_i ≤ ||θ_t||² + R² (the cross term is nonpositive because we only update on mistakes), so ||θ_t||² ≤ tR². Combining these, we have tγ||θ*|| ≤ θ_t⊤θ* ≤ ||θ_t|| ||θ*|| ≤ √t R ||θ*||, whence t ≤ R²/γ².
Easy observation: if θ is initially zero, our final θ is just a (signed) sum of the observations we made mistakes on. So to make predictions, all we need are dot products between those previous observations and the current example.
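A sketch of the kernelized perceptron this observation suggests: store a mistake count per example instead of θ, so predictions only ever touch kernel evaluations. The kernel choice and function names are mine:

```python
import numpy as np

def k(x, y):
    return np.exp(-np.sum((x - y) ** 2))           # RBF kernel, my choice

def kernel_perceptron(X, y, epochs=10):
    """Keep a count c_i of mistakes on each example; implicitly theta = sum_i c_i y_i phi(x_i)."""
    n = len(X)
    c = np.zeros(n)
    K = np.array([[k(a, b) for b in X] for a in X])
    for _ in range(epochs):
        for i in range(n):
            score = np.sum(c * y * K[:, i])        # theta^T phi(x_i), via kernel values only
            if np.sign(score) != y[i]:             # a score of exactly 0 counts as a mistake
                c[i] += 1
    return c

def predict(X, y, c, x_new):
    return np.sign(np.sum(c * y * np.array([k(xi, x_new) for xi in X])))
```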
Our “update only on mistakes” rule implicitly uses a zero-one loss, and we can think of our update as taking a step in the direction of the difference between the features that fire on the true label and those that fire on the predicted label, multiplied by the loss (there is a little trickiness in pushing the class label into the feature vector, but it works out). This corresponds to a loss function which is the difference between the model’s score on its prediction and on the right answer. The update is θ ← θ + ϕ(x, y) − ϕ(x, y*), where y* is the model prediction.
Rather than using a procedure which will find some acceptable split of the data eventually, let’s explicitly try to maximize the margin. Set up an optimization problem of the form:
min_θ ½||θ||² subject to y_i θ⊤x_i ≥ 1 for all i.
Note that the minimizing θ will assign a score of exactly 1 or −1 to the points closest to the boundary (if it gave them scores of larger magnitude we could shrink θ and remain feasible). But then those points lie at distance |θ⊤x_i|/||θ|| = 1/||θ|| from the boundary, so the margin is 2/||θ||, and the objective just wants to maximize the margin.
Let’s form the Lagrangian dual of the above objective.
Form the dual, and take the inf over θ first. Setting derivatives equal to zero, we find
θ = Σ_i α_i y_i x_i,
from which (with a little algebra) we get the dual problem
max_α Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_i⊤x_j subject to α_i ≥ 0 for all i,
so again, we only need inner products (among training examples, and between training examples and the test point) to train and make predictions.
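A sketch of solving this hard-margin dual by projected gradient ascent (a real implementation would hand it to a QP solver; the step size, iteration count, and toy data are my choices, and the hard-margin dual assumes separable data):

```python
import numpy as np

def svm_dual(K, y, steps=2000, lr=1e-3):
    """Maximize sum(alpha) - 0.5 alpha^T Q alpha with alpha >= 0, where Q_ij = y_i y_j K_ij."""
    Q = np.outer(y, y) * K
    alpha = np.zeros(len(y))
    for _ in range(steps):
        grad = 1.0 - Q @ alpha
        alpha = np.maximum(0.0, alpha + lr * grad)   # ascent step, then project onto alpha >= 0
    return alpha

def decision(alpha, y, X, k, x_new):
    """sgn(theta^T phi(x)) = sgn(sum_i alpha_i y_i k(x_i, x)): inner products only."""
    return np.sign(sum(a * yi * k(xi, x_new) for a, yi, xi in zip(alpha, y, X)))

# Tiny separable example (my own data): one point on either side of the origin.
X = np.array([[2.0], [-2.0]])
y = np.array([1.0, -1.0])
K = X @ X.T                                          # linear-kernel Gram matrix
alpha = svm_dual(K, y)
assert decision(alpha, y, X, lambda a, b: a @ b, np.array([1.0])) == 1.0
```

At the optimum, most entries of α should be numerically zero, matching the sparsity argument below.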
Useful fact from optimization: complementary slackness. If duality is tight, then whenever a primal constraint is not tight the corresponding dual variable is zero (and whenever a dual variable is positive, the constraint is tight). Thus the αs are sparse—only terms associated with support vectors are greater than zero. (This is because Σ_i λ_i* f_i(x*) = 0, every λ_i* is nonnegative and every f_i(x*) is nonpositive, so each term in the sum must be zero.)
How do we know if duality is tight? The KKT conditions: θ is primal feasible, the α are dual feasible, complementary slackness holds, and the gradient of the Lagrangian with respect to θ vanishes. These are necessary at an optimum (under a constraint qualification), and for a convex problem like this one they are also sufficient.
What if there’s no feasible solution? New objective:
min_{θ,ξ} ½||θ||² + C Σ_i ξ_i subject to y_i θ⊤x_i ≥ 1 − ξ_i and ξ_i ≥ 0 for all i.
Optimal params for linear regression have the form (X⊤X)^{−1}X⊤y = (X⊤X)(X⊤X)^{−2}X⊤y = X⊤α, with α = X(X⊤X)^{−2}X⊤y—so the predictor is just a linear combination of training examples. The kernelized version follows.
TODO: in what sense is this a dual?
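A sketch of the kernelized version alluded to above; I add a small ridge term for numerical stability, so this is really kernel ridge regression rather than plain least squares. The kernel, data, and names are mine:

```python
import numpy as np

def k(x, y):
    return np.exp(-0.5 * np.sum((x - y) ** 2))       # RBF kernel, my choice

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

K = np.array([[k(a, b) for b in X] for a in X])
ridge = 1e-3
alpha = np.linalg.solve(K + ridge * np.eye(len(X)), y)   # predictor is sum_i alpha_i k(x_i, .)

def predict(x_new):
    return sum(a * k(xi, x_new) for a, xi in zip(alpha, X))
```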
It’s convenient that these optimization problems keep winding up with forms that admit kernelization. How generally true is this? Theorem: fix a kernel k associated with an RKHS H. For any loss function L and nondecreasing Ω, the objective
J(f) = L(f(x_1), …, f(x_n)) + Ω(||f||²)
is minimized by some
f(⋅) = Σ_i α_i k(x_i, ⋅).
Moreover, if Ω is increasing, then every minimizer of J(f) can be expressed in this form.
Proof of the main claim: consider the projection of f onto the subspace spanned by the RKHS elements k(x_i,⋅) associated with the training data. Decompose f into this projection, a parallel part f_∥, plus an orthogonal part f_⊥. Then f(x_i) = ⟨f, k(x_i,⋅)⟩ = ⟨f_∥, k(x_i,⋅)⟩ = f_∥(x_i), so the loss term is unchanged. But ||f||² = ||f_∥||² + ||f_⊥||² ≥ ||f_∥||², so J(f) = L(f(x_1),…,f(x_n)) + Ω(||f||²) ≥ L(f_∥(x_1),…,f_∥(x_n)) + Ω(||f_∥||²) = J(f_∥). So f_∥ (which is a linear combination of the k(x_i,⋅)) is at least as good as any other function in the space.
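A small numerical illustration of this argument (entirely my own construction): add to a function in the span of the k(x_i,⋅) a component orthogonal to that span; the values at the training points do not change, while the norm can only grow, so J cannot improve.

```python
import numpy as np

def k(x, y):
    return np.exp(-0.5 * np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                        # training points
z = rng.normal(size=2)                             # an extra point outside the training set
K = np.array([[k(a, b) for b in X] for a in X])
kz = np.array([k(xi, z) for xi in X])

c = rng.normal(size=5)                             # f_par = sum_i c_i k(x_i, .)
b = np.linalg.solve(K, kz)                         # projection of k(z, .) onto the span
# f_perp = k(z, .) - sum_i b_i k(x_i, .) is orthogonal to every k(x_i, .)

def f_par(x):  return sum(ci * k(xi, x) for ci, xi in zip(c, X))
def f_perp(x): return k(z, x) - sum(bi * k(xi, x) for bi, xi in zip(b, X))

# Orthogonality: <f_perp, k(x_i, .)> = f_perp(x_i) = 0, so the loss term cannot tell
# f_par and f_par + f_perp apart on the training data...
assert np.allclose([f_perp(xi) for xi in X], 0.0)
# ...but the norm only grows: ||f_par + f_perp||^2 = ||f_par||^2 + ||f_perp||^2.
norm2_perp = k(z, z) - kz @ b                      # = <f_perp, f_perp>
assert norm2_perp >= -1e-12
```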