N-grams are sparse. How do we solve the sparsity problem?
Laplace smoothing. Add one to each count, then renormalize: $P_{\mathrm{Lap}}(w) = \frac{c(w)+1}{N+V}$. Equivalently, the adjusted count is $c^* = (c+1)\frac{N}{N+V}$, where $N$ is the number of tokens and $V$ the vocabulary size. This is bad: it moves far too much probability mass onto unseen events.
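A minimal sketch of add-one smoothing for a unigram model (Python; counts is a dict from word to count and V is the vocabulary size, both assumed inputs):
\begin{verbatim}
# Add one to each count, then renormalize; unseen words implicitly get 1 / (N + V).
def laplace_probs(counts, V):
    N = sum(counts.values())
    return {w: (c + 1) / (N + V) for w, c in counts.items()}
\end{verbatim}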
Good-Turing discounting. For every type that occurs $c$ times, allocate mass to it from the types that occur $c+1$ times. In particular, $c^* = (c+1)\frac{N_{c+1}}{N_c}$, where $N_c$ is the number of types that occur $c$ times. Observe that the sum of all adjusted counts is $\sum_{c \geq 1} N_c\, c^* = \sum_{c \geq 1} (c+1)\, N_{c+1} = N - N_1$, from which we see that a total count of $N_1$ is missing. We divide this among all events observed zero times.
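A sketch of the adjusted counts (Python; counts and V as above). Real implementations smooth the $N_c$ curve (e.g.\ Simple Good-Turing); here we just fall back to the raw count when $N_{c+1} = 0$:
\begin{verbatim}
from collections import Counter

def good_turing(counts, V):
    N = sum(counts.values())
    Nc = Counter(counts.values())        # Nc[c] = number of types seen c times
    adjusted = {w: (c + 1) * Nc[c + 1] / Nc[c] if Nc[c + 1] else c
                for w, c in counts.items()}
    probs = {w: c_star / N for w, c_star in adjusted.items()}
    # The missing mass N_1 / N is split among the V - len(counts) unseen types.
    unseen_prob = Nc[1] / N / max(V - len(counts), 1)
    return probs, unseen_prob
\end{verbatim}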
Linear interpolation. Just take a linear combination of high- and low-order models, e.g.\ $\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \lambda_3 P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_1 P(w_i)$, with weights set by search on a held-out corpus.
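A sketch of the combined estimate (Python; p3, p2, p1 are assumed MLE lookup tables for trigram, bigram, and unigram probabilities, and the lambdas are placeholders for the held-out-tuned weights):
\begin{verbatim}
def interp_prob(w, ctx2, ctx1, p3, p2, p1, lambdas=(0.6, 0.3, 0.1)):
    l3, l2, l1 = lambdas   # must sum to 1
    return (l3 * p3.get((ctx2, ctx1, w), 0.0)
            + l2 * p2.get((ctx1, w), 0.0)
            + l1 * p1.get(w, 0.0))
\end{verbatim}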
Kneser-Ney. Intuition is that when we back off to a lower-order model, it might be useful to use something other than the raw count. Example: “Francisco” is very common, but only after “San”. So maybe it's better to approximate the probability of a word by the number of different contexts it has appeared in. So the general (bigram) form of K-N is $P_{\mathrm{KN}}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - d,\, 0)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\mathrm{cont}}(w_i)$, where $P_{\mathrm{cont}}(w_i) = \frac{|\{w' : c(w' w_i) > 0\}|}{|\{(w', w) : c(w' w) > 0\}|}$ and $\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} |\{w : c(w_{i-1} w) > 0\}|$.
In fact we have higher-order analogs, where we back off recursively to type-count models of different granularities.
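A sketch of the bigram case (Python; bigrams is a dict from (previous word, word) to count, d is the absolute discount, and we assume the conditioning word was actually observed as a context):
\begin{verbatim}
from collections import defaultdict

def kneser_ney(bigrams, d=0.75):
    left_ctx = defaultdict(set)     # distinct words seen before w
    followers = defaultdict(set)    # distinct words seen after prev
    ctx_count = defaultdict(int)    # c(prev) as a context
    for (prev, w), c in bigrams.items():
        left_ctx[w].add(prev)
        followers[prev].add(w)
        ctx_count[prev] += c
    n_types = len(bigrams)          # number of distinct bigram types

    def prob(w, prev):
        cont = len(left_ctx[w]) / n_types                  # continuation prob.
        lam = d * len(followers[prev]) / ctx_count[prev]   # mass freed by discounting
        return (max(bigrams.get((prev, w), 0) - d, 0) / ctx_count[prev]
                + lam * cont)
    return prob
\end{verbatim}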
General setup: a noisy channel model. Given a sequence of observations $O$, we want to find the most likely sequence of words $W$ that produced them; we write $\hat{W} = \arg\max_W P(W \mid O) = \arg\max_W P(O \mid W)\, P(W)$. $P(W)$ is just a language model. What does the state space look like for $P(O \mid W)$? Easiest thing is just an HMM over phones, but the beginning, middle, and end of each phone look different; instead we split each phone state into three substates, and learn a model of these. (Note that this forces us to assume that phone lengths are geometrically distributed; we’ll come back to this.)
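A tiny illustration of the resulting state space (Python; the phone inventory and substate names are just placeholders):
\begin{verbatim}
# Expand a word's phone sequence into begin/middle/end HMM substates. With
# self-loop probability p on each substate, the number of frames spent there
# is geometric: P(d frames) = (1 - p) * p**(d - 1).
def word_states(phones, substates=("b", "m", "e")):
    # e.g. ["n", "iy", "d"] -> ["n_b", "n_m", "n_e", "iy_b", ...]
    return [f"{ph}_{s}" for ph in phones for s in substates]
\end{verbatim}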
What representation of sound do we actually generate? Theoretically possible to generate the waveform directly, but we get much better performance by doing a bunch of preprocessing and then generating the processed representation. In particular, we turn each frame of the recording into an MFCC vector, as follows (a sketch of the whole pipeline appears after this list):
Preemphasis: apply a high-pass filter to increase the prominence of high frequencies (this prevents high formants from being masked by lower ones).
Windowing: divide signal into a bunch of overlapping frames. For these use a Hamming window, which passes the signal through a sinusoidal envelope.
DFT: take the discrete Fourier transform of each windowed frame to get its spectrum.
Mel filter bank: human auditory perception isn’t uniform, but is instead roughly linear below 1000 Hz and logarithmic above; transform onto this scale. In practice, this is implemented with a bunch of band-pass-ish filters with triangular responses.
Cepstral transform: find the spectrum of the log of the spectrum. Intuitively, this separates out effects of the fundamental (whose overtones show up periodically across the spectrum) from other filter effects. Keep around the first 12 components.
Feature computation: from the cepstral transform, use the 12 components, estimates of their first and second derivatives, and the corresponding features (energy, $\Delta$ energy, $\Delta\Delta$ energy) on the total energy in the frame, for 39 features in all.
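A sketch of the pipeline above (Python with numpy/scipy; x is a mono signal sampled at sr Hz, and the frame, hop, FFT, and filter-bank sizes are illustrative defaults rather than values prescribed here; np.gradient is a crude stand-in for the usual regression-based deltas):
\begin{verbatim}
import numpy as np
from scipy.fftpack import dct

def mfcc(x, sr, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=12):
    x = np.asarray(x, dtype=float)
    # Preemphasis: high-pass filter y[t] = x[t] - 0.97 x[t-1].
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])
    # Windowing: overlapping frames passed through a Hamming envelope.
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(frame_len)
    # DFT: power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Mel filter bank: triangular filters evenly spaced on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    unmel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = np.floor((n_fft + 1) * unmel(np.linspace(0, mel(sr / 2), n_mels + 2)) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = pts[m - 1], pts[m], pts[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # Cepstral transform: DCT of the log mel spectrum; keep the first 12 coefficients.
    ceps = dct(logmel, type=2, axis=1, norm='ortho')[:, :n_ceps]
    # Append log frame energy, then first and second derivatives (39 features).
    feats = np.hstack([ceps, np.log(power.sum(axis=1) + 1e-10)[:, None]])
    delta = np.gradient(feats, axis=0)
    return np.hstack([feats, delta, np.gradient(delta, axis=0)])
\end{verbatim}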
The resulting vector is real-valued, so we still need a way of generating it from the discrete hidden state space. Na\"ively, just discretize (with clustering alg.\ of your choice), then generate from a categorical; more cleverly, represent each emission function as a mixture of Gaussians. This just increases the hidden state space (the mixture component becomes part of the state). In practice, these Gaussians are usually assumed to have diagonal covariance.
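A sketch of such an emission density (Python; the weights, means, and diagonal variances are assumed to have been estimated already, e.g.\ by EM):
\begin{verbatim}
import numpy as np

# Log-density of one frame x (shape (D,)) under a diagonal-covariance Gaussian
# mixture with weights (K,), means (K, D), and variances (K, D).
def gmm_log_density(x, weights, means, variances):
    log_comp = -0.5 * (np.log(2 * np.pi * variances)
                       + (x - means) ** 2 / variances).sum(axis=1)
    return np.logaddexp.reduce(np.log(weights) + log_comp)
\end{verbatim}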
How to do this faster?
Run $k$-means rather than EM when estimating model parameters
Do coarse-to-fine (where refinement is e.g.\ over LM context length)
A*
How to do this more accurately?
Further state refinement—Markovization of state space.
Discriminative training
The first thing we need for learning a statistical MT system from parallel data is an alignment between source and target words. A few ways to do this:
Model 1. Generative process is
Draw an alignment between source and target uniformly at random.
Draw a target word conditioned on its aligned source word.
The only parameters here are the word translation probabilities $t(f \mid e)$ of a target word $f$ given a source word $e$. During EM training, there’s a latent variable associated with each target word indicating which source word it’s aligned to. EM finds the global maximum! And MAP alignment is easy, as each target word just chooses its most likely source.
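A sketch of the EM loop (Python; pairs is a list of (source tokens, target tokens) sentence pairs, a NULL token could be prepended to each source sentence, and initialization and convergence checks are simplified):
\begin{verbatim}
from collections import defaultdict

def model1_em(pairs, iterations=10):
    t = defaultdict(lambda: 1.0)             # t[(f, e)] ~ P(target f | source e)
    for _ in range(iterations):
        count, total = defaultdict(float), defaultdict(float)
        for src, tgt in pairs:
            for f in tgt:                    # E-step: posterior over alignments of f
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        t = defaultdict(float,               # M-step: renormalize expected counts
                        {(f, e): c / total[e] for (f, e), c in count.items()})
    return t
\end{verbatim}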
Model 3. Generative process is
For each source word, draw a number of children (“fertility model”) and duplicate accordingly.
Insert NULLs.
Translate lexically.
Distort.
The main new contribution is the fertility model. Model parameters are the fertility distributions, the NULL insertion probability, the lexical parameters, and the distortion parameters. Hidden variables are the chosen fertilities and the per-word alignments.
Inference requires sampling.
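A deliberately partial sketch of the generative story above (Python): it implements only the fertility and lexical steps, crudely simplifies NULL insertion, and omits the distortion step (which would permute the target positions); fert and lex are assumed samplers for the fertility and translation distributions.
\begin{verbatim}
import random

def model3_partial(src, fert, lex, p1=0.02):
    expanded = [e for e in src for _ in range(fert(e))]    # duplicate by fertility
    n_null = sum(random.random() < p1 for _ in expanded)   # simplified NULL fertility
    expanded += ["<NULL>"] * n_null
    return [lex(e) for e in expanded]                      # translate lexically
\end{verbatim}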
HMM model. Generative process is
Draw a target length conditioned on source length.
Draw a (target-to-source) alignment variable conditioned on the previous alignment variable.
Draw a target word conditioned on the aligned source word.
In the second step, the distribution is parameterized only by the difference between the two indices, rather than the absolute indices themselves.
This model captures the intuition that nearby words tend to align near each other. We can do inference with forward–backward, so EM training is easy.
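A sketch of the forward pass (Python; trans_jump and emit are assumed parameter tables, the initial alignment is taken to be uniform, and real code would work in log space):
\begin{verbatim}
import numpy as np

# States are source positions; transitions depend only on the jump i - k, and
# target word j is emitted by its aligned source word. Returns p(tgt | src)
# under the model (the length term is omitted).
def hmm_align_forward(src, tgt, trans_jump, emit):
    I, J = len(src), len(tgt)
    alpha = np.zeros((J, I))
    alpha[0] = [emit(tgt[0], src[i]) / I for i in range(I)]
    for j in range(1, J):
        for i in range(I):
            alpha[j, i] = emit(tgt[j], src[i]) * sum(
                alpha[j - 1, k] * trans_jump(i - k) for k in range(I))
    return alpha[-1].sum()
\end{verbatim}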
The actual translations we get out of these word-based models tend to be terrible; it’s more useful to memorize larger many-to-many (phrase) alignments. Various hacky ways of extracting these once we have MAP word alignments for each sentence.
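One such sketch, using the standard consistency criterion (Python; align is a set of (source index, target index) links, e.g.\ from symmetrized MAP alignments; the usual expansion over adjacent unaligned words is omitted):
\begin{verbatim}
def extract_phrases(src, tgt, align, max_len=7):
    phrases = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            js = [j for (i, j) in align if i1 <= i <= i2]  # targets linked to the span
            if not js:
                continue
            j1, j2 = min(js), max(js)
            # Consistent: no link from inside the target span back outside the source span.
            if any(j1 <= j <= j2 and not i1 <= i <= i2 for (i, j) in align):
                continue
            if j2 - j1 < max_len:
                phrases.append((tuple(src[i1:i2 + 1]), tuple(tgt[j1:j2 + 1])))
    return phrases
\end{verbatim}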
Now we have a phrase table, with weights on each rule (don’t worry about where these weights came from). How do we translate? Looks like a search problem: incrementally consume some block of words (anywhere) in the source sentence, place its translation next in the target sentence, and repeat until the entire source is used up. Decoder state space looks like (source coverage bitstring, target LM context). Just beam search.
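A toy sketch of that beam search (Python; phrase_table maps a source phrase, as a tuple of words, to (target phrase, score) candidates, lm is an assumed bigram LM score function, the LM context is just the last word, and distortion/future costs are ignored):
\begin{verbatim}
import heapq

def decode(src, phrase_table, lm, beam=50):
    n = len(src)
    beams = [dict() for _ in range(n + 1)]   # beams[k]: states covering k words
    beams[0][(0, "<s>")] = (0.0, [])         # state -> (score, output so far)
    for k in range(n):
        best = heapq.nlargest(beam, beams[k].items(), key=lambda kv: kv[1][0])
        for (cover, prev), (score, out) in best:
            for i in range(n):               # pick an uncovered source block i..j
                for j in range(i, n):
                    block = ((1 << (j - i + 1)) - 1) << i
                    if cover & block:
                        break                # overlaps already-covered words
                    for tgt, p_score in phrase_table.get(tuple(src[i:j + 1]), []):
                        new_score, last = score + p_score, prev
                        for w in tgt:
                            new_score += lm(last, w)
                            last = w
                        state = (cover | block, last)
                        k2 = k + (j - i + 1)
                        if state not in beams[k2] or beams[k2][state][0] < new_score:
                            beams[k2][state] = (new_score, out + list(tgt))
    finished = beams[n].values()
    return max(finished, key=lambda v: v[0])[1] if finished else None
\end{verbatim}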
Hypergraph inside–outside.
For each inside step, $\beta(v) = \sum_{e \in \mathrm{IN}(v)} w(e) \prod_{u \in \mathrm{tail}(e)} \beta(u)$.
For each outside step, $\alpha(u) = \sum_{e \,:\, u \in \mathrm{tail}(e)} w(e)\, \alpha(\mathrm{head}(e)) \prod_{u' \in \mathrm{tail}(e),\, u' \neq u} \beta(u')$.
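A sketch of both recursions on an acyclic hypergraph in the plain $(+, \times)$ semiring (Python; each edge is (head, tails, weight), nodes are assumed to be topologically ordered with tails before heads, root is the goal node, and no node appears twice in a single tail):
\begin{verbatim}
import math
from collections import defaultdict

def inside_outside(nodes, edges, root):
    in_edges, out_edges = defaultdict(list), defaultdict(list)
    for head, tails, w in edges:
        in_edges[head].append((tails, w))
        for t in tails:
            out_edges[t].append((head, tails, w))
    beta = {}                                 # inside scores, bottom-up
    for v in nodes:
        beta[v] = (sum(w * math.prod(beta[t] for t in tails)
                       for tails, w in in_edges[v])
                   if in_edges[v] else 1.0)   # leaves get inside score 1
    alpha = {v: 0.0 for v in nodes}           # outside scores, top-down
    alpha[root] = 1.0
    for v in reversed(nodes):
        for head, tails, w in out_edges[v]:
            alpha[v] += alpha[head] * w * math.prod(
                beta[t] for t in tails if t != v)
    return beta, alpha
\end{verbatim}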