The Combinatorial Analysis of n-Gram Dictionaries, Coverage and Information Entropy based on the Web Corpus of English
We study n-gram dictionaries and estimate their coverage and entropy based on the web corpus of English. We consider a method for estimating the coverage of empirically generated dictionaries and an approach to address the disadvantage of low coverage. Based on the ideas of Kolmogorov’s combinatorial approach, we estimate the n-gram entropy of the English language and use mathematical extrapolation to approximate the marginal entropy. In addition, we approximate the number of all possible legal n-grams in the English language for large n-gram orders.
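
To make the two quantities named above concrete, the following is a minimal sketch and not the procedure used in this work. It assumes that coverage is the fraction of n-gram occurrences in a held-out text that appear in the dictionary, and that the combinatorial entropy per symbol is log2 of the number of distinct n-grams divided by n. The function names and the toy sentences are hypothetical, and the choice of whitespace-separated tokens as the unit is also an assumption.

```python
import math


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def coverage(dictionary, tokens, n):
    """Fraction of n-gram occurrences in `tokens` found in `dictionary`.

    Assumption: coverage is measured over occurrences, not distinct types.
    """
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in dictionary)
    return hits / len(grams)


def combinatorial_entropy(num_distinct, n):
    """Per-symbol combinatorial entropy in bits: log2(N_n) / n.

    Assumption: N_n is the number of distinct n-grams of order n.
    """
    return math.log2(num_distinct) / n


# Toy data (hypothetical): build a bigram dictionary from one sentence
# and measure how well it covers another.
train = "the cat sat on the mat".split()
held_out = "the cat sat on the rug".split()
bigram_dictionary = set(ngrams(train, 2))

print(coverage(bigram_dictionary, held_out, 2))
print(combinatorial_entropy(len(bigram_dictionary), 2))
```

On a real corpus the dictionary holds only the n-grams actually observed, so the count of distinct n-grams underestimates the number of legal ones. The gap grows with n, which is why low coverage and extrapolation matter for large orders.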