Working paper

Frequency dictionary of inflectional paradigms: core Russian vocabulary

A new kind of frequency dictionary is a valuable reference for researchers and learners of Russian. It shows the grammatical profiles of nouns, adjectives and verbs, that is, the distribution of grammatical forms within the inflectional paradigm. The dictionary is based on data from the Russian National Corpus (RNC) and covers a core vocabulary (the 5000 most frequently used lexemes). Russian is a morphologically rich language: its noun paradigms contain two dozen case and number forms, and its verb paradigms include up to 160 grammatical forms. The dictionary departs from traditional frequency lexicography in several ways: 1) word forms are arranged in paradigms, so their frequencies can be compared and ranked; 2) the dictionary focuses on the grammatical profiles of individual lexemes rather than on the overall distribution of grammatical features (e.g. the fact that Future forms are used less frequently than Past forms); 3) the grammatical profiles of lexical units can be compared against the mean scores of their lexico-semantic class; 4) within each part of speech or semantic class, lexemes with a particular bias in their grammatical profile can be easily detected (e.g. verbs used mostly in the Imperative or in the Past neuter, or nouns used often in the plural); 5) the distribution of homonymous word forms and grammatical variants can be tracked over time and across genres and registers. The dictionary will serve as a source for research on Russian grammar, paradigm structure, form acquisition, grammatical semantics, and the variation of grammatical forms. The main challenge for this initiative is the intra-paradigm and inter-paradigm homonymy of word forms in corpus data. Manual disambiguation is accurate but covers only ca. 5 million words of the RNC, so the data may be sparse and possibly unreliable. Automatic disambiguation yields slightly worse results; however, a larger corpus provides more reliable data for rare word forms.
A user can switch between a ‘basic’ version, which is based on a smaller collection of manually disambiguated texts, and an ‘expanded’ version, which is based on the main corpus, the newspaper corpus, the corpus of poetry and the spoken corpus (320 million words in total). The article addresses some general issues, such as establishing a common basis of comparison, the level of granularity of the grammatical profile, and the units of measurement. We suggest certain solutions related to the selection of data, corpus data processing and maintenance of the online version of the frequency dictionary.
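
To make the notion of a grammatical profile concrete, the following minimal sketch computes the share of each form in a lexeme's paradigm and compares it against the mean of its class. It is only an illustration under stated assumptions: the function names, the two-form paradigm and the counts are hypothetical and are not taken from the RNC or from the dictionary, and the unweighted class mean is just one possible basis of comparison.

```python
def grammatical_profile(form_counts):
    """Return the share of each grammatical form in a lexeme's paradigm."""
    total = sum(form_counts.values())
    if total == 0:
        # No attested forms: return zero shares rather than dividing by zero.
        return {form: 0.0 for form in form_counts}
    return {form: count / total for form, count in form_counts.items()}


def class_mean_profile(profiles):
    """Unweighted mean of the shares over all lexemes in a class."""
    forms = sorted({form for p in profiles for form in p})
    return {
        form: sum(p.get(form, 0.0) for p in profiles) / len(profiles)
        for form in forms
    }


def bias(profile, mean_profile):
    """Per-form difference between a lexeme's share and the class mean."""
    return {
        form: profile.get(form, 0.0) - mean_profile[form]
        for form in mean_profile
    }


# Toy counts, invented for illustration only (not corpus data).
toy_counts = {
    "lexeme_a": {"sg": 90, "pl": 10},
    "lexeme_b": {"sg": 20, "pl": 80},
}

profiles = {lex: grammatical_profile(c) for lex, c in toy_counts.items()}
mean = class_mean_profile(list(profiles.values()))
deviation = bias(profiles["lexeme_b"], mean)

for form, diff in deviation.items():
    print(f"lexeme_b {form}: {diff:+.2f} relative to class mean")
```

In this toy setting a positive deviation for the plural would flag a noun as biased towards plural use. With real data, the choice between raw counts, shares and other units of measurement would need to follow the decisions discussed in the article.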