Well-behavedness of self-normalizing models (pt. 1)

Models that end with a softmax (either log-linear or fancy neural things) show up everywhere, but getting normalized probabilities out of them is expensive if, like in language modeling, you have a large label space (& consequently lots of terms in the denominator).

If we just want to plug in the model score as a feature in some downstream task (e.g. MT) it seems fine if the scores don’t actually sum to 1 over all outputs (though we want them to be reasonably close, so that scores from different contexts are on basically the same scale). Thus it would be nice if we could encourage the denominator to be 1 everywhere—then we could just treat the numerator (a single dot product) as a “normalized” score.
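<br>
To make that concrete, here's a minimal numpy sketch of what "using the numerator as a normalized score" looks like; the dimensions, label count, and random parameters are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 10000                  # feature dim, label-space size (made up)
theta = rng.normal(scale=0.01, size=(k, d))
x = rng.normal(size=d)            # feature vector for one context

def exact_log_prob(y):
    # Requires the full denominator: k dot products and a sum.
    scores = theta @ x
    return scores[y] - np.log(np.sum(np.exp(scores)))

def self_normalized_score(y):
    # Just the numerator: a single dot product. Only a good stand-in
    # for the log-probability if log Z(x) is close to 0.
    return theta[y] @ x

y = 42
print(exact_log_prob(y), self_normalized_score(y))
print("log Z(x) =", np.log(np.sum(np.exp(theta @ x))))   # want this near 0
```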

There are models (e.g. NCE) that promise to do this automatically, if you just neglect to normalize your predictions, but people (e.g. Devlin NNJM paper) also seem to get reasonable results by just adding a penalty term to the objective which encourages the denominator to be one only on training samples, which initially seemed to me like it might mean nothing at all.
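<br>
For reference, here's a sketch of what that penalized objective looks like; this is my paraphrase of the idea (a squared-log-partition penalty with weight alpha), not the exact form from the paper:

```python
import numpy as np

def penalized_nll(theta, X, Y, alpha=0.1):
    """Softmax negative log-likelihood plus a penalty pushing log Z(x)
    toward 0 on the training examples (alpha is a hypothetical weight)."""
    total = 0.0
    for x, y in zip(X, Y):
        scores = theta @ x                         # one score per label
        log_z = np.log(np.sum(np.exp(scores)))     # log of the denominator
        total += -(scores[y] - log_z)              # ordinary softmax loss
        total += alpha * log_z ** 2                # penalize |log Z| on train only
    return total / len(X)
```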

Obviously we can’t expect this to work perfectly in either case—if we imagine explicitly encoding a sum-to-one constraint for every subset of features, we wind up with one equation per subset ($2^d$ of them, for binary features) in only $dk$ unknowns. But maybe we can do this approximately (DLWH says: “within a nat is pretty good”).
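<br>
Concretely, with binary indicator features (an assumption I'm making just for the counting), the exact version of the constraint would be

\[ \sum_{i=1}^{k} \exp(\theta_i^\top x) = 1 \quad \text{for every } x \in \{0,1\}^d, \]

which is $2^d$ equations in only the $dk$ entries of $\theta$.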

So two questions we should ask:

  1. For what distributions over observations should we actually expect this to work?

  2. Regardless of distributional assumptions, if I observe that I’m mostly fitting my training normalizers pretty well, how close should I expect to get to 1 at test time?

We’ll return to 1 some other day.

2 has the form of a kind of weird empirical risk minimization problem, where my goal is to produce the same output (a normalizer of 1) on all training examples. We can use most of the standard tools for ERM here. We want to know, in particular, how quickly (and indeed whether)

\[ \sup_{\theta \in \mathbb{R}^{d \times k}} \left| \hat{\mathbb{E}}\left[R\left(\sum_i \exp(\theta_i^\top X)\right)\right] - \mathbb{E}\left[R\left(\sum_i \exp(\theta_i^\top X)\right)\right] \right| \]

goes to zero, where $R$ penalizes the normalizer’s deviation from 1: something like $R(z) = |\log z|$ or $R(z) = \log^2 z$.
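<br>
As a sanity check on what this quantity measures, here's a small Monte Carlo sketch for a single fixed $\theta$ (the sup over $\theta$ is the hard part, of course); the Gaussian feature distribution and all sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n_train, n_test = 20, 100, 500, 100_000
theta = rng.normal(scale=0.1, size=(k, d))

def R(z):                        # one candidate penalty: squared log-normalizer
    return np.log(z) ** 2

def normalizer(X):               # Z(x) = sum_i exp(theta_i . x), rowwise
    return np.sum(np.exp(X @ theta.T), axis=1)

X_train = rng.normal(size=(n_train, d))   # "training" sample
X_pop = rng.normal(size=(n_test, d))      # large sample standing in for E[.]

emp = np.mean(R(normalizer(X_train)))     # empirical expectation of R(Z)
pop = np.mean(R(normalizer(X_pop)))       # approximation to E[R(Z)]
print(f"empirical {emp:.3f}  population {pop:.3f}  gap {abs(emp - pop):.3f}")
```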

We can bound this quantity in terms of the Rademacher complexity of the normalizer, which we can in turn bound by a “scale-sensitive” (fat-shattering) dimension. In particular (this is due to Alon et al.), call the above quantity $\Delta(R)$. Then for all $\epsilon > 0$ and $\delta \in (0, 1)$,

\[ Pr[\Delta(R) > \epsilon] \leq \delta \]

for

\[ n \in O\left( \frac{1}{\epsilon^2} \left(d\ \log^2 \frac{d}{\epsilon} + \log\frac{1}{\delta}\right)\right) \]

where $d$ is the $\epsilon$-fat-shattering dimension of the function class. Note that for this to work, we need the normalizer to be bounded. This is fine if the procedure that selects $\theta$ from training data is regularized (and the features are bounded)—then the Ivanov view of regularization tells us that we’re effectively choosing $\theta$ from inside a ball of fixed radius. So all we need is a bound on the fat-shattering dimension.

Super-crude way of doing this: there’s a standard way of getting VC bounds on parametric families involving basic arithmetic operations and conditionals, with dimension $O(dm)$ for $d$ parameters and $m$ operations. If we allow exponentials as well, the complexity becomes $O(d^2 m^2)$. For our denominator, we have $dk$ parameters, $dk$ multiplications and $(d-1)k$ additions (inside the exponents), and then an additional $k$ exponentials and $k-1$ additions (for a total of $O(dk)$ operations). Thus we get an upper bound on VC dimension of $O(d^4 k^4)$. Plugging this in above gives a PAC-style sample complexity of $\tilde{O}(d^4 k^4 / \epsilon^2)$. We need to do a little extra work to show that this also gives a scale-sensitive dimension for all choices of scale, but that’s straightforward.
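<br>
Spelling out that arithmetic, with $p$ parameters and $m$ operations (exponentials allowed) and the $O(p^2 m^2)$ form of the bound:

\[ p = dk, \qquad m = \underbrace{dk}_{\text{mult.}} + \underbrace{(d-1)k}_{\text{add.}} + \underbrace{k}_{\exp} + \underbrace{k-1}_{\text{add.}} = O(dk), \]

\[ \text{VC-dim} = O(p^2 m^2) = O(d^4 k^4), \qquad n = \tilde{O}\!\left(\frac{d^4 k^4}{\epsilon^2}\right). \]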

This is pretty dreadful, as in applications we expect both $d$ and $k$ to be quite large, but at least we have concentration.

It’s almost certainly possible to get a tighter bound, but I’m too lazy to do it right now. Intuition: the level sets of the normalizer are basically “soft” unions of halfspaces, and a union of $k$ halfspaces in $d$ dimensions has VC dimension roughly $dk$. I don’t think the softness buys you anything, so something on the order of $dk$ seems right here as well—for a much more civilized sample complexity of $\tilde{O}(dk / \epsilon^2)$. We also know that the arithmetic method given above way overestimates the complexity of single halfspaces, so it’s not surprising that it’s pretty sloppy here as well.
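<br>
One way to make the “soft union of halfspaces” intuition precise: since $\max_i \exp(\theta_i^\top x) \le \sum_i \exp(\theta_i^\top x) \le k \max_i \exp(\theta_i^\top x)$, every superlevel set of the normalizer is sandwiched between two unions of $k$ halfspaces:

\[ \bigcup_{i} \{x : \theta_i^\top x > \log c\} \;\subseteq\; \Big\{x : \sum_i \exp(\theta_i^\top x) > c\Big\} \;\subseteq\; \bigcup_{i} \{x : \theta_i^\top x > \log c - \log k\}. \]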

So questions left to answer: is this the real sample complexity? And from above:

  1. For what distributions over observations should we actually expect this to work?

— 20 July 2014