A couple of years ago, I worked on a natural language processing application along with fellow C# developer Michael Cronin. While working on this, I did extensive research into Markov Chains. Although there are many articles on the internet, I found that none of them did a good job of explaining the concepts behind Markov Chains.
Markov Chains are used to produce random text that still largely conforms to the rules of the English language. This random text is sometimes used by authors and musicians as a source of inspiration, although it is also used by spammers to trick spam filters.
Markov chains are generated by analzying a piece of text to determine the probability of each word. Here’s an example, calculating the probability data for the word Jack in the nursery rhyme “Jack and Jill”:
Jack and Jill went up the hill to fetch a pail of water
Jack fell down and broke his crown
And Jill came tumbling after.
Up got Jack, and home did trot
As fast as he could caper
He went to bed and bound his head
With vinegar and brown paper.
There are three instances of the word Jack in the text. There are two cases where Jack is followed by and, and one instance where it is followed by fell.
A Markov chain sees this as a 66% chance (2/3) of the word Jack being followed by the word and, and a 33% chance (1/3) of Jack being followed by the word fell.
To produce a Markov chain, you need to calculate probabilities for every word in the text. This produces the following probability data:
and | Jill (33%) | broke (16%) | home (16%) | bound (16%) | brown (16%) |
Jack | and (33%) | fell (66%) | |||
as | fast (50%) | he (50%) | |||
he | could (50%) | went (50%) | |||
his | crown (50%) | head (50%) | |||
Jill | went (50%) | came (50%) | |||
to | fetch (50%) | bed (50%) | |||
up | the (50%) | got (50%) | |||
went | up (50%) | to (50%) |
All of the other words in the rhyme are followed by only one word (100% probability).
Let’s now create some example chains from this data, using the word Jack as a starting point. We’ll stop at 7 words for the sake of simplicity.
Jack fell down and home did trot (80%)
Jack fell down and brown paper (76%)
Jack fell down and Jill came tumbling (75%)
Jack fell down and broke his crown (72%)
Jack fell down and broke his head (72%)
Jack fell down and bound his crown (72%)
Jack fell down and bound his head (72%)
All of these chains could potentially be generated from the text, although some are more likely than others.
Here’s a breakdown of the first chain:
Jack has a 66% chance of being followed by fell
fell has a 100% chance of being followed by down
down has a 100% chance of being followed by and
and has a 16% chance of being followed by home
home has a 100% chance of being followed by did
did has a 100% chance of being followed by trot
The average of these percentages is around 80%.
The actual probability of this result is around 10%.