Prelim notes: NLP

Language modeling

N-grams are sparse. How do we solve the sparsity problem? Smoothing (add-k, Good–Turing, Kneser–Ney), backoff, and interpolation with lower-order models (a small sketch follows).
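
A minimal sketch of one of these fixes, add-k smoothing of a bigram model, so unseen bigrams still get nonzero probability; the toy corpus, the value of k, and the function names are made up for illustration:

```python
from collections import Counter

def train_bigram_lm(sentences, k=0.5):
    """Add-k smoothed bigram LM (k and the helper name are illustrative)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab_size = len(unigrams)

    def prob(word, prev):
        # add-k: unseen bigrams get count k instead of 0
        return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

    return prob

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
p = train_bigram_lm(corpus)
print(p("sat", "cat"))  # seen bigram
print(p("dog", "cat"))  # unseen bigram still gets nonzero probability
```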

Speech

General setup: a noisy channel model. Given a sequence of acoustic observations $O = o_1 \dots o_T$, we want to find the most likely sequence of words $W$ that produced them; we write $\hat{W} = \arg\max_W P(W \mid O) = \arg\max_W P(O \mid W)\,P(W)$. $P(W)$ is just a language model. What does the state space look like for $P(O \mid W)$? Easiest thing is just an HMM with one state per phone, but the beginning, middle, and end of each phone look different—instead we split each phone state into three substates, and learn a model of these. (Note that this forces us to assume that phone lengths are geometrically distributed—we’ll come back to this.)
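
A minimal Viterbi sketch for decoding such an HMM, assuming the observations have already been quantized to a discrete alphabet; in the setup above the states would be (phone, substate) pairs, and the self-loop on each substate is exactly what gives the geometric duration assumption. The array names and shapes are illustrative, not from the notes:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, obs):
    """Most likely state sequence under an HMM (all inputs in log space).
    log_init: (S,), log_trans: (S, S) indexed [current, next], log_emit: (S, V),
    obs: list of observation indices."""
    S, T = log_trans.shape[0], len(obs)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        # cand[s_prev, s] = score of being in s at time t having come from s_prev
        cand = score[t - 1][:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, obs[t]]
    # follow backpointers from the best final state
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```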

What representation of sound do we actually generate? It’s theoretically possible to generate the waveform directly, but we get much better performance by doing a bunch of preprocessing and then generating the processed representation. In particular, we turn each frame of the recording into an MFCC vector, as follows (a code sketch of the whole pipeline appears after this list):

  1. Preemphasis: apply a high-pass filter to increase the prominence of high frequencies (this prevents high formants from being masked by lower ones).

  2. Windowing: divide the signal into a bunch of overlapping frames. For these use a Hamming window, which tapers each frame with a raised-cosine envelope so there are no sharp discontinuities at the frame edges.

  3. DFT: take the discrete Fourier transform of each windowed frame to get its spectrum.

  4. Mel filter bank: human auditory perception isn’t uniform in frequency: it is roughly linear below 1000 Hz and logarithmic above, so warp the frequency axis accordingly (the mel scale). In practice, this is implemented with a bank of band-pass-ish filters with triangular responses.

  5. Cepstral transform: find the spectrum of the log of the spectrum. Intuitively, this separates out effects of the fundamental (whose overtones show up periodically across the spectrum) from other filter effects. Keep around the first 12 components.

  6. Feature computation: from the cepstral transform, use the 12 components, estimates of their first and second derivatives (deltas and delta-deltas), and the corresponding features (value, delta, delta-delta) of the total energy in the frame, for 39 features per frame in all.
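
A rough sketch of the whole pipeline using librosa (an assumption, the notes don't name a library); the 16 kHz sample rate, 25 ms window, and 10 ms hop are common defaults rather than anything prescribed above, and librosa's zeroth cepstral coefficient stands in for the frame-energy feature:

```python
import numpy as np
import librosa  # assumed library choice; any MFCC implementation would do

def mfcc_features(path):
    """Preemphasis -> windowed DFT -> mel filter bank -> log + cepstrum -> deltas.
    Returns one 39-dimensional feature vector per frame."""
    y, sr = librosa.load(path, sr=16000)                 # assumed sample rate
    y = librosa.effects.preemphasis(y, coef=0.97)        # step 1
    mfcc = librosa.feature.mfcc(                         # steps 2-5
        y=y, sr=sr, n_mfcc=13,                           # c0..c12; c0 ~ energy
        n_fft=int(0.025 * sr),                           # 25 ms frames
        hop_length=int(0.010 * sr),                      # every 10 ms
        window="hamming",
    )
    d1 = librosa.feature.delta(mfcc)                     # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)            # second derivatives
    return np.vstack([mfcc, d1, d2]).T                   # (frames, 39)
```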

The resulting vector is real-valued, so we still need a way of generating it from the discrete hidden state space. Naïvely, just discretize (with a clustering algorithm of your choice), then generate from a categorical distribution; more cleverly, represent each emission distribution as a mixture of Gaussians. (The choice of mixture component can be folded into the hidden state, so this just increases the hidden state space.) In practice, these Gaussians are usually assumed to have diagonal covariance.
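
A sketch of what the emission score looks like with a diagonal-covariance mixture of Gaussians; the function name and parameter shapes are illustrative:

```python
import numpy as np

def diag_gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector x (D,) under a GMM with
    mixture weights (M,), means (M, D), and diagonal variances (M, D)."""
    # per-component Gaussian log-density; diagonal covariance makes it a sum over dims
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    log_quad = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    comp = np.log(weights) + log_norm + log_quad
    # log-sum-exp over mixture components
    m = comp.max()
    return m + np.log(np.exp(comp - m).sum())
```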

How to do this faster?

How to do this more accurately?

Machine translation

Alignment

First thing we need for learning a statistical MT system from parallel data is an alignment between source and target words. A couple of ways to do this: the classic one is IBM Model 1, where each target word is generated independently by some source word and the translation probabilities are learned with EM (a sketch appears below); HMM alignment models additionally condition each alignment position on the previous one.
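
A sketch of the EM loop for IBM Model 1; the toy bitext and the direction of the translation table (target given source) are choices made here for illustration:

```python
from collections import defaultdict

def ibm_model1(bitext, iters=10):
    """EM for IBM Model 1 word-translation probabilities t(target | source).
    bitext: list of (source_tokens, target_tokens); a NULL source word is added."""
    t = defaultdict(lambda: 1.0)  # improper uniform init; normalized after one M-step
    for _ in range(iters):
        count = defaultdict(float)   # expected co-occurrence counts
        total = defaultdict(float)   # expected counts per source word
        for src, tgt in bitext:
            src = ["<null>"] + src
            for f in tgt:
                # E-step: posterior over which source word generated f
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalize expected counts into probabilities
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

bitext = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]
t = ibm_model1(bitext)
print(t[("house", "haus")])  # should approach 1 after a few iterations
```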

The actual translations we get out of this model tend to be terrible—more useful to let the model memorize large many-to-many units (phrase pairs). Various hacky heuristics exist for extracting these (e.g. symmetrizing alignments run in both directions and keeping the phrase pairs consistent with the result) once we have MAP word alignments for each sentence.

Decoding

Now we have a phrase table, with weights on each rule (don’t worry about where these weights came from). How do we translate? It looks like a search problem: incrementally consume some contiguous block of words (from anywhere) in the source sentence, append its translation next in the target sentence, and repeat until the entire source is used up. The decoder state space looks like (source coverage bitstring, target LM context). Just beam search (see the sketch below).
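
A toy sketch of that search, under some simplifying assumptions: the phrase_table format and the lm_score function are invented for illustration, the LM context is truncated to two words, and there is no distortion penalty or future-cost estimate (a real decoder needs both so that hypotheses within a stack are comparable):

```python
import heapq
from collections import defaultdict

def beam_decode(src, phrase_table, lm_score, beam=5):
    """Phrase-based stack decoding. A hypothesis is
    (score, coverage bitmask, LM context, output so far).
    phrase_table: {source_phrase_tuple: [(target_phrase_tuple, log_weight), ...]}
    lm_score(context, target_words) -> incremental LM log-probability."""
    n = len(src)
    stacks = defaultdict(list)           # stacks[c]: hypotheses covering c source words
    stacks[0] = [(0.0, 0, ("<s>",), ())]
    for c in range(n):
        for score, cov, ctx, out in heapq.nlargest(beam, stacks[c]):
            # try translating any uncovered contiguous source span [i, j)
            for i in range(n):
                for j in range(i + 1, n + 1):
                    span = ((1 << (j - i)) - 1) << i
                    if cov & span:
                        continue         # overlaps words already translated
                    for tgt, w in phrase_table.get(tuple(src[i:j]), []):
                        stacks[c + j - i].append((
                            score + w + lm_score(ctx, tgt),
                            cov | span,
                            (ctx + tgt)[-2:],   # keep a short LM context
                            out + tgt))
    best = max(stacks[n]) if stacks[n] else None
    return best[3] if best else None
```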

General dynamic programming

— 11 August 2014