Text Prediction with Markov Chains in Haskell
Let’s work through a few prediction techniques that can give us text resembling some source material. Using Haskell, we can reuse lots of code while we work up to a Markov chain. First, we will predict words at random by choosing, with replacement, from all words in the material. Next, we will apply that same technique to every word position in a sentence. Then we will implement a Markov chain, which lets us predict the next word given a chunk of words from the material: for each chunk of words in the source, we keep track of the words that follow it, build a cumulative distribution function (CDF) over each list of following words, and then build the next chunk using a word chosen at random from that CDF.
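To make the Markov-chain step concrete before we start, here is a tiny hand-worked illustration with a made-up corpus and a chunk size of two (the shapes shown are what the code later in this post produces):

-- Corpus: "the cat sat on the cat ran off", chunk size 2.
-- Words following each two-word chunk (a few shown):
--   ["the","cat"] -> ["sat","ran"]
--   ["cat","sat"] -> ["on"]
-- CDF over the followers of ["the","cat"] (Map.toList orders "ran" before "sat"):
--   [(0.0,""), (0.5,"ran"), (1.0,"sat")]
-- A random sample of 0.3 falls at or below 0.5, so we predict "ran",
-- and the next chunk becomes ["cat","ran"].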
Running code
The code in this project uses no GHC extensions. Most of the imports (the Prelude, Data.Map, Data.Time) come with a stock GHC installation; depending on your setup you may also need the random package for System.Random and, for the later sections, the split package for Data.List.Split.
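For reference, one minimal way to run each version, assuming the code for a section is saved in a file named markov.hs (a name chosen here purely for illustration):

runghc markov.hs

or, compiled:

ghc markov.hs && ./markov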
Choosing from all words, with replacement
Let’s build up the file piece by piece. First we have the imports. Each is imported qualified to make the origins of various functions clear. We use Map to create histograms that are eventually (in the Markov-chain section) transformed into a Map.Map [String] CDF, where CDF is [(Double, String)]. Random is used to generate samples between zero and one, and Time is used to seed the random number generator.
import qualified Data.Map as Map
import qualified System.Random as Random
import qualified Data.Time as Time
Here is the filename of the source material in text format. We do a little cleanup when we initially parse it, but you may need to clean up differently depending on your source material.
fileName = "dracula.txt" -- [Project Gutenberg, Dracula](https://www.gutenberg.org/files/345/345-0.txt)
We build a histogram by counting the occurrences of each word and storing the word with its associated count in a Map.
histogram :: (Ord k, Num a) => [k] -> Map.Map k a -> Map.Map k a
histogram [] m = m
histogram (w:ws) m =
  case Map.lookup w m of
    Nothing -> histogram ws (Map.insert w 1 m)
    Just v  -> histogram ws (Map.insert w (v + 1) m)
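A quick GHCi check of what this produces:

-- ghci> histogram ["a","b","a"] Map.empty
-- fromList [("a",2),("b",1)]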
Now for the workhorse of our implementation: we need to create cumulative distribution functions for each word chunk. This lets us randomly generate samples between zero and one and use them to select the corresponding prediction from the CDF. First, createPDF creates a probability density function, which tells us the probability of choosing each prediction. Then createCDF runs a cumulative sum on the probabilities, which sampleCDF can then sample from.
createPDF histo n = fmap (\v -> fromIntegral v / fromIntegral n) histo

createCDF :: Map.Map String Double -> [(Double, String)]
createCDF m = scanl (\(cumProb, _) (key, prob) -> (cumProb + prob, key)) (0, "") (Map.toList m)

-- Walk the CDF until the running probability reaches the sample.
-- Assumes 0 <= sample <= the final cumulative probability.
sampleCDF :: Ord a => [(a, b)] -> a -> (a, b)
sampleCDF cdf sample = head $ dropWhile (\(p, _) -> sample > p) cdf
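Continuing in GHCi with a three-word example (probabilities shown approximately):

-- ghci> let cdf = createCDF (createPDF (histogram ["a","b","a"] Map.empty) 3)
-- ghci> cdf
-- [(0.0,""),(0.67,"a"),(1.0,"b")]
-- ghci> sampleCDF cdf 0.5
-- (0.67,"a")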
Just a few helper functions: cleanup will remove the listed characters, and now gives us an integer value to seed the random number generator.
cleanup :: String -> String
cleanup s = filter (\c -> not (c `elem` ['\\', '\"', '\'', '.'])) s
now :: IO Int
now = read . Time.formatTime Time.defaultTimeLocale "%s" <$> Time.getCurrentTime -- seconds since the Unix epoch
Now, to use the pieces we’ve built up, we read the file from disk, clean it up, then break it into a list of words to process. You can follow the process by reading the variable names: first we create a histogram and get the total number of words, then we create our PDF and CDF from the histogram and generate random samples. For each sample, we choose a prediction from the CDF.
split_by_word = do
  rawWs <- words . cleanup <$> readFile fileName
  let histo = histogram rawWs Map.empty
  let totalWs = length rawWs
  let pdf = createPDF histo totalWs
  let cdf = createCDF pdf
  epoch_int <- now
  let gen = Random.mkStdGen epoch_int
  let samples = Random.randomRs (0.0 :: Double, 1.0 :: Double) gen
  print $ unwords . fmap snd . take 12 $ fmap (sampleCDF cdf) samples
main = split_by_word
Some interesting results when using Dracula as a source text:
fears how on you diary do and his She a for the
fumbling from complete it the with wife way Morris:-- in got lest
up plainly credentials; the click steel that _2 that wonder And to
Choosing from words in each position
From here, we will only make additions or modifications to the contents listed above.
We need a few additions to help us build CDFs for each word position. We add Split (from the split package mentioned earlier) to break the material up into sentences and Char to make all the words lower case.
import qualified Data.Map as Map
import qualified System.Random as Random
import qualified Data.Time as Time
import qualified Data.List.Split as Split
import qualified Data.Char as Char
Next is a little helper to define our bounds when generating random samples.
zeroToOne :: (Double, Double)
zeroToOne = (0.0, 1.0)
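One modification is also needed: the pipelines below (and in the Markov-chain section) apply createPDF to a histogram alone, so it must now compute the total count itself rather than take it as a second argument. A version that type-checks with the code below:

createPDF :: Map.Map k Int -> Map.Map k Double
createPDF histo = fmap (\v -> fromIntegral v / fromIntegral total) histo
  where
    -- Normalize by the total number of observations in this histogram.
    total = sum (Map.elems histo)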
The main difference in our algorithm is creating a histogram for each word position. This is done in histogramByPosition. If you were so inclined, you could instead break the material up by sentence and then transpose it to get a list of lists where each inner list corresponds to a word position (see the sketch after the next code block).
histogramByPosition :: [[String]] -> Map.Map String Int -> Int -> Map.Map String Int
histogramByPosition sentences m pos = histogram listOfWords m
  where
    -- Collect the word at index pos (if any) from every sentence.
    listOfWords = concat . fmap (take 1 . drop pos) $ sentences
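As an aside, the transpose alternative mentioned above could look something like this sketch (List.transpose simply skips sentences that have run out of words):

import qualified Data.List as List

histogramsByTranspose :: [[String]] -> [Map.Map String Int]
histogramsByTranspose sentences = fmap (\ws -> histogram ws Map.empty) (List.transpose sentences)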
This time, cleanup does not remove sentence punctuation, because we use it to separate the material into sentences.
cleanup :: String -> String
cleanup s = filter (\c -> not (c `elem` ['\\', '\"', '\'', ','])) s
There is a little more work before we are able to start parsing the sentences. We remove unwanted punctuation in cleanup, make all words lower case, split on sentence punctuation, split each sentence into words, then remove empty sentences and sentences of two words or fewer. To generate the CDFs for each word position, we map each word position (starting from zero) to a histogram using histogramByPosition. Then, for each histogram, we generate a PDF and a CDF, so that cdfs holds a CDF for each word position. The final task is generating samples and zipping them with the CDFs. To process each (sample, cdf) tuple, we fmap over the list and call sampleCDF on each pair.
split_by_sentence = do
  sentences <- filter (\x -> (not . null $ x) && (length x > 2))
               . fmap words
               . Split.splitOneOf ".!?\n"
               . fmap Char.toLower
               . cleanup
               <$> readFile fileName
  let sentenceLength = 10
  let cdfs = fmap createCDF
           . fmap createPDF
           . fmap (histogramByPosition sentences Map.empty)
           $ [0 .. sentenceLength - 1]
  time <- now
  let gen = Random.mkStdGen time
  let samples = take sentenceLength $ Random.randomRs zeroToOne gen
  let sampleCDFs = zip samples cdfs
  let sentence = fmap (\(sample, cdf) -> sampleCDF cdf sample) sampleCDFs
  print $ unwords $ fmap snd sentence
main = split_by_sentence
Some interesting results when using Dracula as a source text:
i looks down harkers helsing in reading a own lloyds
he back looked yer is placed we fair that at
instead arrangements eyes would go who got i all me
Using Markov Chains
There is a significant amount of work to build up the Markov chain compared to the previous prediction attempts. First we’ll deal with the different structure of our mapping. histogram' counts how many predictions each word chunk has by taking the length of its list of following words; we use these counts to build a CDF of starting chunks, giving us a starting point and a place to fall back to when choosing the next prediction at random.
histogram' :: (Functor f, Foldable t) => f (t a) -> f Int
histogram' m = fmap length m
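A quick check of what this produces:

-- ghci> histogram' (Map.fromList [(["a","b"], ["c","d","c"]), (["b","c"], ["a"])])
-- fromList [(["a","b"],3),(["b","c"],1)]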
Next we need to account for storing the key as a list of words rather than just a single word. The last thing we need to change is createCDF', where we create [(Double, [String])]: tuples of a cumulative probability and a list of words.
createCDF' m = scanl (\(cumProb, _) (key, prob) -> (cumProb + prob, key)) (0, [""]) (Map.toList m)
Now for the key to building predictions based on a chunk of words: buildTuples takes a list of words and the chunk size, and pairs each chunk with the word that follows it. We keep the next word as a list of String since we will be joining these words in a later step. Note that we stop once there are not enough words left to form a full chunk plus a following word.
buildTuples :: [a] -> Int -> [([a], [a])]
buildTuples ws n
  -- Stop when we can no longer form a full chunk plus a following word.
  | length (take (n + 1) ws) < n + 1 = []
  | otherwise = (key, nextWord) : buildTuples (drop 1 ws) n
  where
    key = take n ws
    nextWord = take 1 . drop n $ ws
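A quick check of the shape buildTuples produces:

-- ghci> buildTuples ["a","b","c","d"] 2
-- [(["a","b"],["c"]),(["b","c"],["d"])]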
Here we gather the possible words that follow our chunks into lists. After this step, we will have a Map.Map [String] [String]. From here we could sample uniformly from the values after choosing a key at random, but instead we will generate a CDF from each predicted word list for each chosen word chunk.
gatherPredictions :: Ord k => Map.Map k [a] -> [(k, [a])] -> Map.Map k [a]
gatherPredictions m [] = m
gatherPredictions m ((keys, predWords) : tuples) =
  case Map.lookup keys m of
    Nothing -> gatherPredictions (Map.insert keys predWords m) tuples
    Just v  -> gatherPredictions (Map.insert keys (v ++ predWords) m) tuples
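For example, duplicate chunks have their followers appended:

-- ghci> gatherPredictions Map.empty [(["a","b"],["c"]), (["b","c"],["d"]), (["a","b"],["e"])]
-- fromList [(["a","b"],["c","e"]),(["b","c"],["d"])]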
The last building blocks we need: getNextKey produces the next key by dropping the leftmost word of the previous key and appending the predicted word; if the key is missing from the mapping, it falls back to choosing a fresh key from the CDF of all word chunks. buildSentence then uses getNextKey to generate the next chunk to use for prediction while accumulating the predicted words.
getNextKey m allKeys key sample =
  case Map.lookup key m of
    -- Shift the chunk left by one and append the sampled prediction.
    Just cdf -> drop 1 key ++ [snd $ sampleCDF cdf sample]
    -- Unknown chunk: restart from a randomly chosen known chunk.
    Nothing  -> snd $ sampleCDF allKeys sample
buildSentence tuples keysCDF key [] = []
buildSentence tuples keysCDF key (s : samples) =
  -- Emit the first word of the current chunk, then step the chain.
  take 1 key : buildSentence tuples keysCDF nextKey samples
  where
    nextKey = getNextKey tuples keysCDF key s
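To see the chain step by step, here is a small hand-worked trace on a made-up corpus (assuming the single-argument createPDF from the previous section):

-- Corpus: ["a","b","c","a","b","d"], chunk size 2.
-- predictions = fromList [(["a","b"],["c","d"]),(["b","c"],["a"]),(["c","a"],["b"])]
-- CDF for ["a","b"]: [(0.0,""),(0.5,"c"),(1.0,"d")]
--
-- buildSentence tuples keysCDF ["a","b"] [0.4, 0.9]
--   emits ["a"]; getNextKey samples 0.4 from the ["a","b"] CDF -> "c",
--     so the next key is ["b","c"]
--   emits ["b"]; getNextKey samples 0.9 from the ["b","c"] CDF -> "a",
--     so the next key is ["c","a"]
--   samples exhausted -> result [["a"],["b"]], i.e. "a b"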
To utilize these extra building blocks, we first build up the tuples of word chunks and next words, then gather those next words into lists. We run the same sort of processing to generate the CDFs in our mappings. To make sense of it all, read the last line: we build the sentence from our tuples of word chunks and CDFs, together with the CDF of all word chunks, starting from the chunk chosen by the first sample. For each word, we sample from our main mapping of word-chunk keys and CDF values. You can experiment with the chunk size by changing the argument in the call to buildTuples. I found that chunk sizes above three mostly reproduce the source material verbatim, since the number of word choices shrinks as the chunk grows.
markov_chain = do
  ws <- filter (not . null)
        . words
        . fmap Char.toLower
        . cleanup
        <$> readFile fileName
  let predictions = gatherPredictions Map.empty $ buildTuples ws 3
  let tuples = fmap (createCDF . createPDF . (\ws' -> histogram ws' Map.empty)) predictions
  time <- now
  let gen = Random.mkStdGen time
  let keysCDF = createCDF' . createPDF $ histogram' predictions
  let (sample, gen') = Random.randomR zeroToOne gen
  let firstSample = sampleCDF keysCDF sample
  let samples = take 100 $ Random.randomRs zeroToOne gen'
  print $ unwords . concat $ buildSentence tuples keysCDF (snd firstSample) samples
main = markov_chain
Some interesting results when using Dracula as a source text, with the chunk size preceding each line:
1 a wreck and infinite pity me as we find him i could not have always wakes what we have helped
2 of crew petrofsky was missing and we had come with us it seemed to me and i were like me if
3 dear lucys death and all that sort of nonsense you might as well ask a man to eat molecules with