Markov Chains, Bayesian Inference and Natural Language

Just a little idea I’ve been working on over the past week. It seems that by combining Markovian models and Bayesian inference, it’s possible to do some really neat things with recognizing natural language!

By now, most geeks are familiar with the idea of Bayesian inference because of its use in spam blockers. The idea is pretty simple: build a statistical model of which words are likely to follow which other words in a “valid” text (e.g. anything that is not spam). Use Bayes’ inference rule to “chain together” probabilistic judgments about how likely successive pairs of words are to appear next to one another, and after observing enough text, you arrive at a conclusion about whether this text “valid.”

A Markov chain is a discrete-time stochastic process where the next state solely depends on the present state, and has no direct dependency on previous states. Applied to natural language, Markovian models can be used to build a model of how words or phrases in a language “look” – which letters/words are likely to follow other letters/words. Among other things, a Markovian model can be employed to generate text that looks similar to some previously-observed text, but which is actually random.

My insight was that these techniques can be combined. By scanning a corpus of English text, we can build a Markovian model of English words. The Markovian model is capable of answering questions such as : given that we have just seen the characters o m s what is the probability of seeing i as the next character? (Answer: very low! In English, seeing “oms” almost always indicates that we are at the end of a word.)

Our Markovian model lets us make snap judgements about individual characters in a word. Bayesian inference lets us chain together these snap judgements so that they extend to the entire word. After running a word through this process, we arrive at a specific confidence level that this word is “valid” according to our Markovian model. The word may not be an English word, but it sure looks like one.

So: after all of this rigmarole, we have basically reinvented the spell checker, only our spell checker is dumb and it can be fooled. What’s the big deal?

The big deal is that our spell checker is very compact. The Markovian model for the Purdue dictionary of English serializes down to 300k of Ruby-marshalled binary data, and it can be compressed to 100k using bzip2. Using some simple tricks to compactify it further, I estimate that I could shrink the model to 25k binary, or about 100k of JSON. This means that the entire scoring process could be implemented in JavaScript!

Our spell checker is also tolerant of language drift. It isn’t checking against a known dictionary; it’s simply identifying words that look like they could be English words.

Drop me a line if you’re interested in this technique, or if you have an idea for another application I haven’t thought of.