Fun with Markov Chains and Clojure

Markov chains are a simple concept. Start with a state (e.g. a word, a location on a map, a musical note) and move to another one at random. Repeat until you have a chain of states that you can use for something. The key is that you don’t choose randomly from all possible states, only from those that have some probability of following the current state.

For example, suppose you know that a sentence starts with the word “She.” It’s reasonable to bet on the next word being “is” or “was.” It’s possible (but less likely) to see less common words such as “asserted” or “obliterated.” Most likely your text doesn’t contain instances of phrases like “She chemistry” or “She absolutism.” If that’s the case, those words have zero probability of being chosen.

Let’s take the following text as an example:

He is a boy. She is a girl. He is young. She is younger.

Here is a list of all the transitions we see in the sentences from this text:

{"She" ("is" "is"), "a" ("boy." "girl."), 
"is" ("a" "a" "young." "younger."),
"He" ("is" "is"), "*START*" ("He" "She" "He" "She")}

Note that the word “is” has four words that it could transition to. Two of them are “a” so if we start generating random sentences, “is” will be followed by “a” half the time. Let’s choose a word that follows the *START* state (nothing) and continue until we find a word with a period. Here are some sentences we could see:

"He is younger." "She is a girl." "She is a boy." "He is a boy."

Ok, time to automate this with some Clojure code (or skip to the “generate gibberish” button below if you’re not a programmer):

The first function generates a map of transitions like the one shown above. The second one chooses words from the map until it finds a word that ends in a period. It’s easy to compile this into JavaScript via ClojureScript. I’ve done that, check out the full code on Github. You can test it below: click the “generate gibberish” button and check the output. The input box contains Blowin’ in the Hamlet by Bob Shakespeare. Try replacing that with a few paragraphs of your own.


Output:

——-

Side note: the code above is the simplest algorithm I came up with for generating a Markov chain out of a few paragraphs, but it doesn’t scale for large amounts of text. If you try running it against a book, it will run out of memory. In a follow-up post I’ll show how to put together a solution that scales better. It will be more about Clojure than about Markov Chains.