Text Prediction with Markov Chains in Haskell
Let’s work through a few prediction techniques that can give us text resembling some source material. Using Haskell, we can reuse lots of code as we work up to a Markov chain. First, we will predict words at random by choosing, with replacement, from all unique words in the material. Next, we will apply that same technique to every word position in a sentence. Finally, we will implement a Markov chain, which predicts the next word given a chunk of words from the material: for each chunk of words in the source, we keep track of the unique words that follow it, build a cumulative distribution function (CDF) from each list of found words, and then build the next chunk using a word chosen at random from that CDF.
The code in this project uses no GHC extensions or external modules. The only requirement is GHC with the libraries that ship with it.
Choosing from all words, with replacement
Let’s build up the file piece by piece. First we have the imports. Each is imported qualified to make it clear where the various functions come from. We use Map to create histograms that are eventually transformed into Map.Map [String] CDF, where CDF is [(Double, String)]. Random is used to generate samples between zero and one, and Time is used to seed the random number generator.
```haskell
import qualified Data.Map as Map
import qualified System.Random as Random
import qualified Data.Time as Time
```
Here is the filename of the source material in text format. We will do a little cleanup when we initially parse it, but you may need to clean up differently depending on your source material.
```haskell
-- Project Gutenberg, Dracula: https://www.gutenberg.org/files/345/345-0.txt
fileName = "dracula.txt"
```
We build a histogram by counting the occurrences of each word and storing the word with its associated count in a Map.
```haskell
histogram :: (Ord k, Num a) => [k] -> Map.Map k a -> Map.Map k a
histogram [] map = map
histogram (w:ws) map = case Map.lookup w map of
  Nothing -> histogram ws (Map.insert w 1 map)
  Just v  -> histogram ws (Map.insert w (v + 1) map)
```
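To see histogram in action, here is a minimal standalone check (the sample sentence is made up):

```haskell
import qualified Data.Map as Map

histogram :: (Ord k, Num a) => [k] -> Map.Map k a -> Map.Map k a
histogram [] m = m
histogram (w:ws) m = case Map.lookup w m of
  Nothing -> histogram ws (Map.insert w 1 m)
  Just v  -> histogram ws (Map.insert w (v + 1) m)

main :: IO ()
main = print (histogram (words "to be or not to be") Map.empty :: Map.Map String Int)
-- prints fromList [("be",2),("not",1),("or",1),("to",2)]
```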
Now for the workhorse of our implementation: we need to create cumulative distribution functions for each word chunk. This will allow us to generate random samples between zero and one and use them to select the corresponding prediction from the CDF. First, createPDF creates a probability density function, which tells us the probability of choosing each prediction. Then createCDF runs a cumulative sum over the probabilities, which sampleCDF can then sample from.
```haskell
createPDF histo n = fmap (\v -> fromIntegral v / fromIntegral n) histo

createCDF :: Map.Map String Double -> [(Double, String)]
createCDF map = scanl (\(cummProb, word) (key, prob) -> (cummProb + prob, key)) (0, "") (Map.toList map)

sampleCDF :: Ord a => [(a, b)] -> a -> (a, b)
sampleCDF cdf sample = head $ dropWhile (\(p, _) -> sample > p) cdf
```
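A small worked example helps check the three functions together; the toy histogram below is invented. Note that the (0, "") seed from scanl stays in the CDF, but any sample above zero skips past it:

```haskell
import qualified Data.Map as Map

createPDF :: Map.Map String Int -> Int -> Map.Map String Double
createPDF histo n = fmap (\v -> fromIntegral v / fromIntegral n) histo

createCDF :: Map.Map String Double -> [(Double, String)]
createCDF m = scanl (\(acc, _) (key, prob) -> (acc + prob, key)) (0, "") (Map.toList m)

sampleCDF :: Ord a => [(a, b)] -> a -> (a, b)
sampleCDF cdf sample = head $ dropWhile (\(p, _) -> sample > p) cdf

main :: IO ()
main = do
  let histo = Map.fromList [("a", 1), ("b", 3)]  -- 4 words total
      cdf = createCDF (createPDF histo 4)
  print cdf                 -- [(0.0,""),(0.25,"a"),(1.0,"b")]
  print (sampleCDF cdf 0.1) -- (0.25,"a")
  print (sampleCDF cdf 0.5) -- (1.0,"b")
```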
Just a few helper functions.
cleanup will remove the listed characters, and
now gives us an integer value to seed the random number generator.
```haskell
cleanup :: String -> String
cleanup s = filter (\c -> not (c `elem` ['\\', '\"', '\'', '.'])) s

now :: IO Int
now = read <$> Time.formatTime Time.defaultTimeLocale "%s" <$> Time.getCurrentTime
```
Now to use the pieces we’ve built up, we read the file from disk, clean it up, then break it into a list of words to process. You can easily follow this process by reading the variable names. First we create a histogram and get the total number of words. We then create our PDF and CDF from the histogram and generate random samples; for each sample, we choose a prediction from the CDF.
```haskell
split_by_word =
  words . cleanup <$> readFile fileName >>= \rawWs -> do
    let histo = histogram rawWs Map.empty
    let totalWs = length rawWs
    let pdf = createPDF histo totalWs
    let cdf = createCDF pdf
    epoch_int <- now
    let gen = Random.mkStdGen epoch_int
    let samples = Random.randomRs (0.0 :: Double, 1.0 :: Double) gen
    print $ unwords . fmap snd . take 12 $ fmap (sampleCDF cdf) samples

main = split_by_word
```
Some interesting results when using Dracula as a source text:
fears how on you diary do and his She a for the fumbling from complete it the with wife way Morris:-- in got lest up plainly credentials; the click steel that _2 that wonder And to
Choosing from words in each position
From here, we will only make additions or modifications to the contents listed above.
We need a few additions to help us build up to the task of creating CDFs for each word position. We will add Split to break the material up into sentences and Char to make all the words lower case.
```haskell
import qualified Data.Map as Map
import qualified System.Random as Random
import qualified Data.Time as Time
import qualified Data.List.Split as Split
import qualified Data.Char as Char
```
Next is a little helper to define our bounds when generating random samples.
```haskell
zeroToOne :: (Double, Double)
zeroToOne = (0.0, 1.0)
```
The main difference in our algorithm is creating histograms from each word position. This is done in
histogramByPosition. If you were so inclined, you could break up the material by sentence, then transpose it to get a list of lists where the inner lists correspond to a word position.
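That transpose idea, sketched on a couple of invented sentences using Data.List.transpose:

```haskell
import Data.List (transpose)

main :: IO ()
main = do
  let sentences = [["the", "cat", "sat"], ["a", "dog", "ran", "off"]]
  -- Each inner list of the transpose collects the words at one position.
  print (transpose sentences)
  -- prints [["the","a"],["cat","dog"],["sat","ran"],["off"]]
```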
```haskell
histogramByPosition :: [[String]] -> Map.Map String Int -> Int -> Map.Map String Int
histogramByPosition sentences map pos = histogram listOfWords map
  where listOfWords = concat . fmap (take 1 . drop pos) $ sentences
```
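A quick check of the positional counting, with histogram from earlier and invented sentences (this sketch drops the accumulator argument for brevity, since it is always Map.empty at the call site):

```haskell
import qualified Data.Map as Map

histogram :: (Ord k, Num a) => [k] -> Map.Map k a -> Map.Map k a
histogram [] m = m
histogram (w:ws) m = case Map.lookup w m of
  Nothing -> histogram ws (Map.insert w 1 m)
  Just v  -> histogram ws (Map.insert w (v + 1) m)

-- Count which words occur at a given position across all sentences.
histogramByPosition :: [[String]] -> Int -> Map.Map String Int
histogramByPosition sentences pos = histogram listOfWords Map.empty
  where listOfWords = concat . fmap (take 1 . drop pos) $ sentences

main :: IO ()
main = do
  let sentences = [["the", "cat"], ["the", "dog"], ["a", "bird"]]
  print (histogramByPosition sentences 0) -- fromList [("a",1),("the",2)]
  print (histogramByPosition sentences 1) -- fromList [("bird",1),("cat",1),("dog",1)]
```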
This time, don’t remove the sentence-ending punctuation, because we use it to separate the material into sentences.
```haskell
cleanup :: String -> String
cleanup s = filter (\c -> not (c `elem` ['\\', '\"', '\'', ','])) s
```
There is a little more cleanup before we can start parsing the sentences. We remove unwanted punctuation in cleanup, make all words lower case, split on sentence punctuation, split each sentence into words, then remove empty sentences and sentences of two words or fewer. To generate the CDFs for each word position, we map each word position (starting from zero) to a histogram using histogramByPosition. Then for each histogram we generate a PDF and a CDF, so that cdfs holds a CDF for each word position. The final task is generating samples and zipping them with the CDFs. To process each (sample, cdf) tuple, we fmap over the list and call sampleCDF on each sample and CDF.
```haskell
split_by_sentence =
  (filter (\x -> (not . null $ x) && (length x > 2)) . fmap words . Split.splitOneOf ".!?\n" . fmap Char.toLower . cleanup)
    <$> readFile fileName >>= \sentences -> do
      let sentenceLength = 10
      let cdfs = fmap createCDF . fmap createPDF . fmap (histogramByPosition sentences Map.empty) $ [0 .. sentenceLength - 1]
      time <- now
      let gen = Random.mkStdGen time
      let samples = take sentenceLength $ Random.randomRs zeroToOne gen
      let sampleCDFs = zip samples cdfs
      let sentence = fmap (\(sample, cdf) -> sampleCDF cdf sample) sampleCDFs
      print $ unwords $ fmap snd sentence

main = split_by_sentence
```
Some interesting results when using Dracula as a source text:
i looks down harkers helsing in reading a own lloyds he back looked yer is placed we fair that at instead arrangements eyes would go who got i all me
Using Markov Chains
There is a significant amount of work to build up the Markov chain when compared to the previous prediction attempts. First we’ll deal with the different structure of our mapping. We create a histogram over the keys where each count is simply the length of that key’s list of predictions; this gives us a starting point, and a fallback for choosing the next prediction at random.
```haskell
histogram' :: (Functor f, Foldable t) => f (t a) -> f Int
histogram' map = fmap length map
```
Next we need to account for storing the key as a list of words rather than just a single word. The last thing we need to change is in
createCDF' where we create
[(Double,[String])] as tuples of a cumulative probability and a list of words.
```haskell
createCDF' map = scanl (\(cummProb, word) (key, prob) -> (cummProb + prob, key)) (0, [""]) (Map.toList map)
```
Now for the key to building predictions based on a chunk of words, we use
buildTuples to take a list of words and the size of the chunks. We will keep the next word prediction as a list of String since we will be joining these words in the next step.
```haskell
buildTuples :: [a] -> Int -> [([a], [a])]
buildTuples [] _ = []
buildTuples ws n = [(key, nextWord)] ++ buildTuples (drop 1 ws) n
  where key = take n ws
        nextWord = take 1 . drop n $ ws
```
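On a tiny invented word list, buildTuples with chunks of size two produces:

```haskell
buildTuples :: [a] -> Int -> [([a], [a])]
buildTuples [] _ = []
buildTuples ws n = [(key, nextWord)] ++ buildTuples (drop 1 ws) n
  where key = take n ws
        nextWord = take 1 . drop n $ ws

main :: IO ()
main = print (buildTuples (words "the cat sat on") 2)
-- prints [(["the","cat"],["sat"]),(["cat","sat"],["on"]),(["sat","on"],[]),(["on"],[])]
```

Note that windows near the end of the list yield chunks shorter than n and empty prediction lists; they simply contribute nothing useful when gathered.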
Here we gather the possible words that follow our chunks into lists. After this step, we will have
Map.Map [String] [String]. From here we could sample uniformly from the values after choosing a key at random. But we will be generating CDFs from each predicted word list for each chosen word chunk.
```haskell
gatherPredictions :: Ord k => Map.Map k [a] -> [(k, [a])] -> Map.Map k [a]
gatherPredictions map [] = map
gatherPredictions map ((keys, predWords):tuples) = case Map.lookup keys map of
  Nothing -> gatherPredictions (Map.insert keys predWords map) tuples
  Just v  -> gatherPredictions (Map.insert keys (v ++ predWords) map) tuples
```
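Feeding buildTuples into gatherPredictions on a contrived, repetitive string shows repeated successors accumulating:

```haskell
import qualified Data.Map as Map

buildTuples :: [a] -> Int -> [([a], [a])]
buildTuples [] _ = []
buildTuples ws n = (take n ws, take 1 (drop n ws)) : buildTuples (drop 1 ws) n

gatherPredictions :: Ord k => Map.Map k [a] -> [(k, [a])] -> Map.Map k [a]
gatherPredictions m [] = m
gatherPredictions m ((keys, predWords):tuples) = case Map.lookup keys m of
  Nothing -> gatherPredictions (Map.insert keys predWords m) tuples
  Just v  -> gatherPredictions (Map.insert keys (v ++ predWords) m) tuples

main :: IO ()
main = print (gatherPredictions Map.empty (buildTuples (words "a b a b a c") 1))
-- prints fromList [(["a"],["b","b","c"]),(["b"],["a","a"]),(["c"],[])]
```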
The last building blocks we need get the next key, by joining the previous key with the predicted word while dropping the leftmost word, and build a sentence starting at some key by repeatedly choosing from the associated CDF using the given samples. We use getNextKey to generate the next chunk to use for prediction, while buildSentence drives the chain and collects the predicted words.
```haskell
getNextKey map allKeys key sample = case Map.lookup key map of
  Just cdf -> drop 1 key ++ [snd $ sampleCDF cdf sample]
  Nothing  -> snd $ sampleCDF allKeys sample

buildSentence tuples keysCDF key [] = []
buildSentence tuples keysCDF key (s:samples) = take 1 key : buildSentence tuples keysCDF nextKey samples
  where nextKey = getNextKey tuples keysCDF key s
```
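A deterministic walk through these two functions can be sketched with a hand-built, contrived model (the corpus and CDFs below are invented; each size-1 chunk has exactly one successor, so each CDF puts all of its mass on that word):

```haskell
import qualified Data.Map as Map

sampleCDF :: Ord a => [(a, b)] -> a -> (a, b)
sampleCDF cdf sample = head $ dropWhile (\(p, _) -> sample > p) cdf

getNextKey :: Map.Map [String] [(Double, String)] -> [(Double, [String])] -> [String] -> Double -> [String]
getNextKey m allKeys key sample = case Map.lookup key m of
  Just cdf -> drop 1 key ++ [snd (sampleCDF cdf sample)]
  Nothing  -> snd (sampleCDF allKeys sample)

buildSentence :: Map.Map [String] [(Double, String)] -> [(Double, [String])] -> [String] -> [Double] -> [[String]]
buildSentence _ _ _ [] = []
buildSentence tuples keysCDF key (s:samples) =
  take 1 key : buildSentence tuples keysCDF nextKey samples
  where nextKey = getNextKey tuples keysCDF key s

-- Hand-built model: every chunk deterministically predicts one word.
model :: Map.Map [String] [(Double, String)]
model = Map.fromList
  [ (["the"], [(0, ""), (1, "cat")])
  , (["cat"], [(0, ""), (1, "sat")])
  , (["sat"], [(0, ""), (1, "down")])
  ]

keysCDF :: [(Double, [String])]
keysCDF = [(0, [""]), (1, ["the"])]

main :: IO ()
main = print (concat (buildSentence model keysCDF ["the"] [0.5, 0.5, 0.5]))
-- prints ["the","cat","sat"]
```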
To utilize these extra building blocks we first build up the tuples of word chunks and next words, then gather those next words into lists. We run the same sort of processing to generate the CDFs in our mappings. To make sense of this, read the last line. On the last line, we build the sentence from our tuples of word chunks and CDFs and the CDF of word chunks starting at the first sample. For each word, we need to sample from our main mapping of word chunk keys and CDF values. You can experiment with the size of the word chunk by changing the chunk size in the call to
buildTuples. I found chunk sizes above three to mostly give verbatim source material since the number of word choices decreases given a chunk of words of increasing size.
```haskell
markov_chain =
  filter (\w -> not (w == "")) . words . fmap Char.toLower . cleanup <$> readFile fileName >>= \ws -> do
    let predictions = gatherPredictions Map.empty $ buildTuples ws 3
    let tuples = fmap (createCDF . createPDF . (\ws -> histogram ws Map.empty)) predictions
    time <- now
    let gen = Random.mkStdGen time
    let keysCDF = createCDF' . createPDF $ histogram' predictions
    let (sample, gen') = Random.randomR zeroToOne gen
    let firstSample = sampleCDF keysCDF sample
    let samples = take 100 $ Random.randomRs zeroToOne gen'
    print $ unwords . concat $ buildSentence tuples keysCDF (snd firstSample) samples

main = markov_chain
```
Some interesting results when using Dracula as a source text, with the chunk size preceding each result:

1 a wreck and infinite pity me as we find him i could not have always wakes what we have helped

2 of crew petrofsky was missing and we had come with us it seemed to me and i were like me if

3 dear lucys death and all that sort of nonsense you might as well ask a man to eat molecules with