Quantitative Evaluation of the Grammatical Ambiguity of Some European Languages
Grammatical ambiguity (multiple sets of grammatical features for one word form, or coinciding surface forms of different words) can be of different types. We describe six classes of grammatical ambiguity: unambiguous, ambiguous by grammatical features, by part of speech, by lemma, by lemma and part of speech, and out-of-vocabulary words. These classes are present in all languages, but the distribution of words among them may vary significantly. We calculate and analyse the statistics of these six ambiguity classes for a number of major European languages.
We find that the distribution of words among the ambiguity classes depends primarily on the linguistic features of a language. Although it is influenced by text style and by the vocabulary considered, the distinctive shape of the distribution is preserved under different conditions and differs significantly from the distributions for other languages. That the shape is primarily defined by linguistic properties is corroborated by our observation that linguistically related languages demonstrate similar properties of ambiguous words. Slavic languages feature a low rate of words ambiguous by part of speech and a high rate of words ambiguous by grammatical features. The former is also true for French and Italian, and the latter holds for German and Swedish, but only Slavic languages exhibit both traits.
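The six classes listed above can be illustrated with a minimal sketch. It assumes that each word form is looked up in a morphological lexicon returning a list of analyses (lemma, part of speech, feature set), and that a word is assigned to a class according to which of these components differ across its analyses. The `Analysis` type, the class labels, the decision order, and the toy English lexicon are all hypothetical and serve only as an illustration; they are not taken from the study itself.

```python
from collections import Counter
from typing import NamedTuple


class Analysis(NamedTuple):
    """One morphological reading of a word form (hypothetical structure)."""
    lemma: str
    pos: str
    feats: frozenset


def ambiguity_class(analyses):
    """Assign a word form to one of six ambiguity classes (assumed criteria)."""
    distinct = set(analyses)
    if not distinct:
        return "oov"                # no analysis in the lexicon
    if len(distinct) == 1:
        return "unambiguous"        # exactly one reading
    lemmas = {a.lemma for a in distinct}
    tags = {a.pos for a in distinct}
    if len(lemmas) > 1 and len(tags) > 1:
        return "lemma_and_pos"
    if len(lemmas) > 1:
        return "lemma"
    if len(tags) > 1:
        return "pos"
    return "features"               # same lemma and POS, different features


def class_distribution(tokens, lexicon):
    """Share of tokens falling into each ambiguity class."""
    counts = Counter(ambiguity_class(lexicon.get(t.lower(), [])) for t in tokens)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


# Toy lexicon, invented for illustration only.
LEXICON = {
    "the": [Analysis("the", "DET", frozenset())],
    "walk": [Analysis("walk", "NOUN", frozenset({"Number=Sing"})),
             Analysis("walk", "VERB", frozenset({"VerbForm=Inf"}))],
    "sheep": [Analysis("sheep", "NOUN", frozenset({"Number=Sing"})),
              Analysis("sheep", "NOUN", frozenset({"Number=Plur"}))],
    "saw": [Analysis("see", "VERB", frozenset({"Tense=Past"})),
            Analysis("saw", "NOUN", frozenset({"Number=Sing"}))],
    "lay": [Analysis("lie", "VERB", frozenset({"Tense=Past"})),
            Analysis("lay", "VERB", frozenset({"Tense=Pres"}))],
}

if __name__ == "__main__":
    print(class_distribution("The sheep saw the walk".split(), LEXICON))
```

Counting tokens rather than distinct word forms is one possible choice; the same function can be applied to a vocabulary list to obtain a type-level distribution instead.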
In our experiments, we found that reducing the grammatical feature set does not change the shape of the distribution and therefore does not create an artificial similarity among languages. On the other hand, we found for all the languages that the 1,000 most frequent words are distributed differently among the ambiguity classes than the rest of the words. At the same time, for the majority of the considered languages, less frequent words are less unambiguous by part of speech. In Romance and Germanic languages, the ambiguity is reduced for less frequent words. We also investigated the differences among statistics for texts of different genres in Russian. We found that fiction texts are more ambiguous by part of speech than newswire texts, which in turn are more ambiguous by grammatical features.
Our results suggest that the quality of multilingual morphological taggers should be measured on ambiguous words only, rather than on all words. Such a comparison could help eliminate differences among languages and give a more objective picture of the performance of linguistic tools.
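As a rough illustration of this proposal, the sketch below compares tagging accuracy over all tokens with accuracy restricted to ambiguous tokens. The function name, the boolean mask marking ambiguous tokens, and the example tags are hypothetical; how a token is judged ambiguous (for instance, by having more than one analysis in a lexicon) is an assumption left outside the sketch.

```python
def accuracy(gold, predicted, mask=None):
    """Share of correct tags, optionally restricted to tokens where mask is True."""
    if mask is None:
        mask = [True] * len(gold)
    pairs = [(g, p) for g, p, keep in zip(gold, predicted, mask) if keep]
    if not pairs:
        return None  # nothing to evaluate
    return sum(g == p for g, p in pairs) / len(pairs)


# Invented example: five tokens, two of which are ambiguous.
gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
predicted = ["DET", "NOUN", "NOUN", "DET", "NOUN"]
is_ambiguous = [False, False, True, False, True]

overall = accuracy(gold, predicted)                  # 4 of 5 correct
ambiguous_only = accuracy(gold, predicted, is_ambiguous)  # 1 of 2 correct
```

In this toy case the overall score is dominated by unambiguous tokens, which a tagger gets right by lookup alone, while the restricted score reflects only the decisions that required disambiguation.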