Markov chains in ASP.NET

My previous article covers the concepts behind Markov chains. In this article, we’ll look at how Markov chains can be generated using C# code. You can  download the complete source code of this example on the Downloads page.

To start with, the text will need to be sanitized to remove any symbols and punctuation marks that might interfere with the analysis of the text. You can do this using a regular expression, as in the following code:

Input = Regex.Replace(Input, @"[^A-Za-z \.]", "");

Regular expressions like this one are covered in Session 10 of my upcoming ASP.NET, C# and Visual Studio Expert Skills course.

You’ll notice that the full-stop characters are left in place. This is because they’ll be used to indicate the end of a chain. For this to work, however, they’ll need to be separated from the words in the text using the following code:

Input = Input.Replace(".", " . ");

You could do more sanitization at this point to remove double-spaces or HTML tags, but this is enough for this example.

Next the individual words must be extracted from the text and placed in an array. This can be done using the Split method:

string[] Words = Input.Split(' ');

Now the individual words have been placed in the Words array.

Next you’ll need to analyze the text to get the probability data needed to create a Markov chain. To do this, you’ll need a class structure to store the probabilities:

private class MarkovWord
    public MarkovWord(string NewWord, int NewInstances)
        Word = NewWord;
        Instances = NewInstances;
        FollowingWords = new List<MarkovWord>();
    public string Word;
    public int Instances;
    public double Probability;
    public List<MarkovWord> FollowingWords;

Now you’ll need to use this class to extract the probabilities from the text. First you’ll need to extract a unique list of words to use as a ‘top level’:

string[] UniqueWords = Words.Select(UniqueWord => UniqueWord).Distinct().ToArray();

This code uses LINQ to extract a unique list of words from the Words collection. Using LINQ in this way is covered in Session 10 of my ASP.NET, C# and Visual Studio Essential Skills course.

Now you’ll need some more complex code to extract the probability data from the text. You’ll need to create a new collection of MarkovWord objects and use a foreach loop to cycle through each of the unique words.

List<MarkovWord> MarkovData = new List<MarkovWord>();
foreach (string Word in UniqueWords)

Now you’ll need to create a new MarkovWord object for the current word and a collection of MarkovWord objects to store each of the words that follows the current word:

MarkovWord NewWord = new MarkovWord(Word, 1);
List<MarkovWord> FollowingWords = new List<MarkovWord>();

Next, you need to loop through every word in the complete text and look for the current word. You can do this with the following for loop:

for (int Counter = 0; Counter < Words.Length - 1; Counter++)

When you find an instance of the current word, you need to get the word that follows it in the Words array and add it to the FollowingWords collection. If it already exists, you’ll increase the Instances property of the existing object instead of adding a new one:

if (Words[Counter] == Word)
    string FollowingWord = Words[Counter + 1];
    MarkovWord ExistingWord = FollowingWords.SingleOrDefault(FWord => FWord.Word == FollowingWord);
    if (ExistingWord == null) FollowingWords.Add(new MarkovWord(FollowingWord, 1));
    else ExistingWord.Instances++;

Now it’s time to calculate the probability of each following word. This is easy to do by simply adding up the total number of following words and dividing the number of instances of each following word:

double TotalFollowingWords = FollowingWords.Sum(FWord => FWord.Instances);
foreach (MarkovWord FollowingWord in FollowingWords)
    FollowingWord.Probability = FollowingWord.Instances / TotalFollowingWords;

Now all that’s needed is to add the finished MarkovWord object to the MarkovData collection:

NewWord.FollowingWords = FollowingWords;

You have now created the complete data structure needed to create a Markov chain.

To create the Markov chain, you’ll need a Random object:

Random Randomizer = new Random();

You’ll use this to randomly select words for the chain.

Next, you need to select a starting word and create variables to store the random probability and output string:

MarkovWord CurrentWord = MarkovData[0];
double RandomValue;
string OutputString = CurrentWord.Word;

Now you’ll create a while loop to cycle until a full stop is found:

while (CurrentWord.Word != ".")

Within the while loop, you will need to generate a random value. This will determine which word is selected from the FollowingWords collection.

bool WordPicked = false;
RandomValue = Randomizer.NextDouble();
double CurrentProbability = 0;

The NextDouble method will populate the RandomValue variable with a number from 0.0 to 1.0. The WordPicked and CurrentProbability variables will be used later.

Next you’ll need to loop through all of the words in the FollowingWords collection and use the RandomValue to determine which one is selected:

foreach (MarkovWord FollowingWord in CurrentWord.FollowingWords)
    if (!WordPicked)
        CurrentProbability += FollowingWord.Probability;
        if (CurrentProbability >= RandomValue)
            WordPicked = true;
            OutputString += " " + FollowingWord.Word;
            CurrentWord = MarkovData.Single(BaseWord => BaseWord.Word == FollowingWord.Word);

This works by adding the probability of each following word to CurrentProbability and selecting the word if CurrentProbability is greater than or equal to RandomValue.

For example, using the following probabilities:
and (0.33333)
fell (0.66666)

…the word and will only be selected if RandomValue is less than or equal to 0.33333.

Your code is now complete. A Markov chain is generated and placed in the OutputString variable. You can download a complete working version of this code on the Downloads page.

One response to “Markov chains in ASP.NET

Leave a Reply

Your email address will not be published.

Blue Captcha Image