Text Prediction with Markov Chains in Haskell

Let’s work through a few prediction techniques that generate text resembling some source material. Using Haskell, we can reuse a lot of code as we work up to a Markov chain. First, we will predict words at random by choosing, with replacement, from all unique words in the material. Next, we will apply that technique to every word position in a sentence. Finally, we will implement a Markov chain, which predicts the next word given a chunk of words from the material: for each chunk of words in the source, we keep track of the unique words that follow it, build a cumulative distribution function (CDF) over each list of found words, and then build the next chunk using a word chosen at random from that CDF.

Running code

The code in this project uses no GHC extensions. Besides ghc and the Prelude, it depends on the containers and time packages (which ship with GHC) and the random and split packages (which you may need to install with cabal or stack).

Choosing from all words, with replacement

Let’s build up the file piece by piece. First we have the imports. Each is imported qualified to make the origin of the various functions clear. We use Map to create histograms that are then transformed into Map.Map [String] CDF, where CDF is [(Double, String)]. Random is used to generate samples between zero and one, and Time is used to seed the random number generator.

import qualified Data.Map        as Map
import qualified System.Random   as Random
import qualified Data.Time       as Time

Here is the filename of the source material in text format. We will do a little cleanup when we initially parse it, but you may need to clean up differently depending on your source material.

fileName = "dracula.txt" -- [Project Gutenberg, Dracula](https://www.gutenberg.org/files/345/345-0.txt)

We build a histogram by counting the occurrences of each word and storing each word with its associated count in a Map.

histogram :: (Ord k, Num a) => [k] -> Map.Map k a -> Map.Map k a
histogram [] map = map
histogram (w:ws) map =
  case Map.lookup w map of
    Nothing -> histogram ws (Map.insert w 1 map)
    Just v -> histogram ws (Map.insert w (v + 1) map)
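
To see it in action, here is a quick GHCi check (assuming the definitions above are loaded):

```haskell
ghci> histogram ["the","cat","sat","the"] Map.empty
fromList [("cat",1),("sat",1),("the",2)]
```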

Now for the workhorse of our implementation: we need to create cumulative distribution functions for each word chunk. This will allow us to generate random samples between zero and one and select the corresponding prediction from the CDF. First, createPDF creates a probability density function, which tells us the probability of choosing each prediction. Then createCDF runs a cumulative sum over the probabilities, which sampleCDF can then sample from.

createPDF :: Integral a => Map.Map k a -> Int -> Map.Map k Double
createPDF histo n = fmap (\v -> fromIntegral v / fromIntegral n) histo

createCDF :: Map.Map String Double -> [(Double, String)]
createCDF map = scanl ( \( cummProb, word ) (key, prob) -> ( cummProb + prob, key ) ) (0, "") (Map.toList map)

sampleCDF :: Ord a => [(a, b)] -> a -> (a, b)
sampleCDF cdf sample = head $ dropWhile ( \(p, _) -> sample > p ) cdf
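
Here is how the three pieces fit together on a toy histogram, again as a GHCi sketch with the definitions above loaded:

```haskell
ghci> let histo = Map.fromList [("bat", 1), ("cat", 3)]
ghci> let cdf = createCDF (createPDF histo 4)
ghci> cdf
[(0.0,""),(0.25,"bat"),(1.0,"cat")]
ghci> sampleCDF cdf 0.1
(0.25,"bat")
ghci> sampleCDF cdf 0.5
(1.0,"cat")
```

Any sample at or below 0.25 predicts "bat"; anything above predicts "cat", matching the 1-to-3 ratio in the histogram.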

Just a few helper functions. cleanup will remove the listed characters, and now gives us an integer value to seed the random number generator.

cleanup :: String -> String
cleanup s = filter (\c -> not (c `elem` ['\\', '\"', '\'', '.'])) s

now :: IO Int
now = (read <$> Time.formatTime Time.defaultTimeLocale "%s" <$> Time.getCurrentTime)

Now, to use the pieces we’ve built up, we read the file from disk, clean it up, then break it into a list of words to process. You can follow the process by reading the variable names. First we create a histogram and get the total number of words. We then create our PDF and CDF from the histogram and generate random samples. For each sample, we choose a prediction from the CDF.

split_by_word =
  words . cleanup <$> readFile fileName
  >>= \rawWs -> do
    let histo = histogram rawWs (Map.empty)

    let totalWs = length rawWs
    let pdf = createPDF histo totalWs
    let cdf = createCDF pdf

    epoch_int <- now
    let gen = Random.mkStdGen epoch_int
    let samples = Random.randomRs (0.0 :: Double, 1.0 :: Double) gen

    putStrLn $ unwords . fmap snd . take 12 $ fmap (sampleCDF cdf) samples

main = split_by_word

Some interesting results when using Dracula as a source text:

fears how on you diary do and his She a for the
fumbling from complete it the with wife way Morris:-- in got lest
up plainly credentials; the click steel that _2 that wonder And to

Choosing from words in each position

From here, we will only make additions or modifications to the contents listed above.

We need a few additions to help us build up to the task of creating CDFs for each word position. We will add Split to break the material up into sentences and Char to make all the words lower case.

import qualified Data.Map        as Map
import qualified System.Random   as Random
import qualified Data.Time       as Time
import qualified Data.List.Split as Split
import qualified Data.Char       as Char

Next is a little helper to define our bounds when generating random samples.

zeroToOne :: (Double, Double)
zeroToOne = (0.0, 1.0)

The main difference in our algorithm is creating histograms from each word position. This is done in histogramByPosition. If you were so inclined, you could break up the material by sentence, then transpose it to get a list of lists where the inner lists correspond to a word position.

histogramByPosition :: [[String]] -> Map.Map String Int -> Int -> Map.Map String Int
histogramByPosition sentences map pos = histogram listOfWords map
  where
    listOfWords = concat . fmap (take 1 . drop pos) $ sentences
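
The transpose route mentioned above might look like this; histogramsByTranspose is a hypothetical helper (with its Data.List import included), not part of the final program:

```haskell
import qualified Data.List as List

-- One histogram per word position: transpose turns a list of
-- sentences into a list of positions, where each inner list
-- holds every word that appeared at that position.
histogramsByTranspose :: [[String]] -> [Map.Map String Int]
histogramsByTranspose sentences =
  fmap (\ws -> histogram ws Map.empty) (List.transpose sentences)
```

Data.List.transpose handles ragged lists, so sentences of different lengths are fine; shorter sentences simply stop contributing to later positions.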

This time we keep sentence-ending punctuation, because we are using it to separate the material into sentences; cleanup now strips commas instead of periods.

cleanup :: String -> String
cleanup s = filter (\c -> not (c `elem` ['\\', '\"', '\'', ','])) s

There is a little more cleanup before we are able to start parsing the sentences. We remove unwanted punctuation in cleanup, make all words lower case, split on sentence-ending punctuation, split each sentence into words, then remove empty sentences and sentences of two words or fewer. To generate the CDFs for each word position, we map each word position (starting from zero) to a histogram using histogramByPosition. Then for each histogram we generate a PDF and a CDF, so that cdfs holds a CDF for each word position. The final task is generating samples and zipping them with the CDFs. To process each (sample, cdf) tuple, we fmap over the list and call sampleCDF on each sample and cdf.

split_by_sentence =
  (filter (\x -> (not . null $ x) && (length x > 2))
    . fmap (words)
    . Split.splitOneOf ".!?\n"
    . fmap Char.toLower
    . cleanup)
    <$> readFile fileName >>= \sentences -> do
      let sentenceLength = 10
      let cdfs = fmap (\histo -> createCDF (createPDF histo (sum histo)))
               . fmap (histogramByPosition sentences (Map.empty))
               $ [0..sentenceLength - 1]

      time <- now
      let gen = Random.mkStdGen time
      let samples = take sentenceLength $ Random.randomRs zeroToOne gen
      let sampleCDFs = zip samples cdfs
      let sentence = fmap (\(sample, cdf) -> sampleCDF cdf sample) sampleCDFs
      putStrLn $ unwords $ fmap snd $ sentence

main = split_by_sentence

Some interesting results when using Dracula as a source text:

i looks down harkers helsing in reading a own lloyds
he back looked yer is placed we fair that at
instead arrangements eyes would go who got i all me

Using Markov Chains

There is a significant amount of work to build up the Markov chain compared to the previous prediction attempts. First we’ll deal with the different structure of our mapping. histogram' builds a histogram over the keys by simply taking the length of each key’s list of predictions. This gives us a starting point and a way to choose the first chunk at random.

histogram' :: (Functor f, Foldable t) => f (t a) -> f Int
histogram' map = fmap length map

Next we need to account for storing the key as a list of words rather than just a single word. The last thing we need to change is in createCDF' where we create [(Double,[String])] as tuples of a cumulative probability and a list of words.

createCDF' :: Map.Map [String] Double -> [(Double, [String])]
createCDF' map = scanl ( \( cummProb, word ) (key, prob) -> ( cummProb + prob, key ) ) (0, [""]) (Map.toList map)

Now for the key step in building predictions based on a chunk of words: buildTuples takes a list of words and the size of the chunks. We keep the next-word prediction as a list of String since we will be joining these words in the next step.

buildTuples :: [a] -> Int -> [([a], [a])]
buildTuples [] _ = []
buildTuples ws n = (key, nextWord) : buildTuples (drop 1 ws) n
  where
    key = take n ws
    nextWord = take 1 . drop n $ ws
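
A quick GHCi run shows the sliding window (with the definitions above loaded):

```haskell
ghci> buildTuples ["the","cat","sat","down"] 2
[(["the","cat"],["sat"]),(["cat","sat"],["down"]),(["sat","down"],[]),(["down"],[])]
```

Note that the tail of the list produces short keys and empty predictions; a more careful implementation might drop windows that lack a full chunk and a following word.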

Here we gather the possible words that follow our chunks into lists. After this step, we will have Map.Map [String] [String]. From here we could sample uniformly from the values after choosing a key at random. But we will be generating CDFs from each predicted word list for each chosen word chunk.

gatherPredictions :: Ord k => Map.Map k [a] -> [(k, [a])] -> Map.Map k [a]
gatherPredictions map [] = map
gatherPredictions map ((keys,predWords):tuples) =
  case Map.lookup keys map of
    Nothing -> gatherPredictions (Map.insert keys predWords map) tuples
    Just v -> gatherPredictions (Map.insert keys (v ++ predWords) map) tuples 
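
For example, two occurrences of the same chunk merge their predictions (GHCi, definitions above loaded):

```haskell
ghci> gatherPredictions Map.empty [(["the","cat"],["sat"]),(["the","cat"],["ran"])]
fromList [(["the","cat"],["sat","ran"])]
```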

The last building blocks get the next key, by joining the previous key with the predicted word while dropping the leftmost word, and then build a sentence by starting at some key and choosing from the associated CDF using the given samples. buildSentence uses getNextKey to generate the next chunk to predict from while accumulating the predicted words.

getNextKey map allKeys key sample = 
  case Map.lookup key map of
    Just cdf -> (drop 1 key) ++ [snd $ sampleCDF cdf sample]
    Nothing -> snd $ sampleCDF allKeys sample 

buildSentence tuples keysCDF key [] = []
buildSentence tuples keysCDF key (s:samples) = take 1 key : buildSentence tuples keysCDF nextKey samples
  where
    nextKey = getNextKey tuples keysCDF key s
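
A small GHCi walkthrough of both functions, using hand-built CDFs (definitions above loaded):

```haskell
ghci> let cdfs = Map.fromList [(["a","b"],[(0.0,""),(1.0,"c")]),(["b","c"],[(0.0,""),(1.0,"d")])]
ghci> let keysCDF = [(0.0,[""]),(1.0,["a","b"])]
ghci> getNextKey cdfs keysCDF ["a","b"] 0.5
["b","c"]
ghci> getNextKey cdfs keysCDF ["x","y"] 0.5   -- unknown chunk: restart from keysCDF
["a","b"]
ghci> buildSentence cdfs keysCDF ["a","b"] [0.5, 0.5]
[["a"],["b"]]
```

Each step keeps only the first word of the current chunk, so concatenating the result gives the generated words in order.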

To use these extra building blocks, we first build up the tuples of word chunks and next words, then gather those next words into lists. We run the same sort of processing as before to generate the CDFs in our mapping. To make sense of it, read the last line: there we build the sentence from our map of word chunks to CDFs, plus the CDF over the word chunks themselves, starting at the chunk chosen by the first sample. For each word, we sample from the CDF associated with the current chunk. You can experiment with the chunk size by changing the argument in the call to buildTuples. I found chunk sizes above three mostly reproduce the source material verbatim, since the number of word choices shrinks as the chunk grows.

markov_chain =
  filter (\w -> not (w == ""))
  . words
  . fmap Char.toLower
  . cleanup
  <$> readFile fileName >>= \ws -> do
    let predictions = gatherPredictions Map.empty $ buildTuples ws 3
    let tuples = fmap (\nextWs -> createCDF (createPDF (histogram nextWs Map.empty) (length nextWs))) predictions

    time <- now
    let gen = Random.mkStdGen time

    let keyHisto = histogram' predictions
    let keysCDF = createCDF' (createPDF keyHisto (sum keyHisto))
    let (sample, gen') = Random.randomR zeroToOne $ gen
    let firstSample = sampleCDF keysCDF sample
    let samples = take 100 $ Random.randomRs zeroToOne gen'

    putStrLn $ unwords . concat $ buildSentence tuples keysCDF (snd firstSample) samples

main = markov_chain

Some interesting results when using Dracula as a source text, with the chunk size shown before each line:

1 a wreck and infinite pity me as we find him i could not have always wakes what we have helped

2 of crew petrofsky was missing and we had come with us it seemed to me and i were like me if

3 dear lucys death and all that sort of nonsense you might as well ask a man to eat molecules with