This exercise explores the quality of the $n$-gram model of language.
Find or create a monolingual corpus of 100,000 words or more. Segment it
into words, and compute the frequency of each word. How many distinct
words are there? Also count frequencies of bigrams (two consecutive
words) and trigrams (three consecutive words). Now use those frequencies
to generate language: from the unigram, bigram, and trigram models, in
turn, generate a 100-word text by making random choices according to the
frequency counts. Compare the three generated texts with actual
language. Finally, calculate the perplexity of each model on text held
out from the corpus.
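
One way to carry out the segmentation and counting steps is sketched
below in Python. The regex tokenizer and the file name
\texttt{corpus.txt} are placeholder assumptions, not part of the
exercise; any reasonable segmentation scheme will do.

\begin{verbatim}
import re
from collections import Counter

def tokenize(text):
    # A simple scheme: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, n):
    # Count n-grams as tuples of n consecutive tokens.
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

# "corpus.txt" stands in for whatever monolingual corpus you assemble.
with open("corpus.txt", encoding="utf-8") as f:
    tokens = tokenize(f.read())

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(len(tokens), "tokens;", len(unigrams), "distinct words")
\end{verbatim}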
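Building on that sketch, the generation and perplexity steps might look
as follows. Sampling in proportion to raw frequency counts is exactly
what the exercise asks for; the add-one smoothing in
\texttt{perplexity} is an extra assumption, used here only so that
$n$-grams unseen in training do not assign zero probability to the
held-out text.

\begin{verbatim}
import math
import random

def generate(counts, n, length=100):
    # Sample words one at a time, each drawn from the n-grams whose
    # first n-1 words match the end of the text generated so far.
    grams, weights = list(counts), list(counts.values())
    out = list(random.choices(grams, weights)[0])  # random starting n-gram
    while len(out) < length:
        context = tuple(out[-(n - 1):]) if n > 1 else ()
        cands = [(g, c) for g, c in counts.items() if g[:n - 1] == context]
        if not cands:              # dead end: restart with a fresh n-gram
            out.extend(random.choices(grams, weights)[0])
            continue
        gs, ws = zip(*cands)
        out.append(random.choices(gs, ws)[0][-1])
    return " ".join(out[:length])

def perplexity(test_tokens, counts, context_counts, n, vocab_size):
    # exp of the average negative log probability of each test word
    # given its n-1 predecessors, with add-one smoothing.
    neg_log, preds = 0.0, 0
    for i in range(n - 1, len(test_tokens)):
        gram = tuple(test_tokens[i - n + 1:i + 1])
        num = counts.get(gram, 0) + 1
        den = context_counts.get(gram[:-1], 0) + vocab_size
        neg_log -= math.log(num / den)
        preds += 1
    return math.exp(neg_log / preds)

# Hold out the last 10% of the corpus for measuring perplexity.
split = int(0.9 * len(tokens))
train, test = tokens[:split], tokens[split:]
vocab_size = len(set(train))
for n in (1, 2, 3):
    counts = ngram_counts(train, n)
    ctx = (ngram_counts(train, n - 1) if n > 1
           else Counter({(): len(train)}))
    print(n, "=>", generate(counts, n))
    print("perplexity:", perplexity(test, counts, ctx, n, vocab_size))
\end{verbatim}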