The Complexity of Russian Legal Texts: Assessment Methods and Language Data
Our goal is to build a model for automatically assessing the complexity of Russian legal texts. This requires three steps: compiling a text collection, annotating it linguistically, and selecting complexity parameters suited to the chosen annotation format. This paper describes these steps. We briefly describe three corpora of modern Russian legal texts, “CorRIDA”, “CorDes”, and “CorCodex”, with a total size of 8.5 million tokens. We justify the choice of linguistic annotation tools (UDPipe, pymorphy2). We then characterize the linguistic features used for complexity assessment: basic metrics; five readability formulas; lexical complexity parameters (TTR values, Yule’s K, the number of hapaxes, abbreviations, abstract words, etc.); and morphosyntactic and discourse complexity parameters (Noun-Verb Ratio values; the number of genitive, neuter, and passive grammemes; relative clauses, appositive modifiers, lexical devices of discourse connectivity, etc.).
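To make the lexical parameters concrete, the following minimal sketch computes three of them (TTR, the number of hapaxes, and Yule’s K) from an already tokenized text. It is an illustration under stated assumptions, not the pipeline used for the corpora: the function name `lexical_metrics` is hypothetical, tokens are taken as given (no UDPipe or pymorphy2 processing), and Yule’s K uses the common textbook form K = 10^4 · (Σ i²·V_i − N) / N², where V_i is the number of types occurring i times and N is the number of tokens.

```python
from collections import Counter


def lexical_metrics(tokens):
    """Return TTR, hapax count, and Yule's K for a list of tokens.

    Hypothetical helper for illustration; assumes tokens are already
    normalized (e.g. lowercased or lemmatized) by an upstream step.
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("token list is empty")

    # Frequency of each type (distinct token).
    freqs = Counter(tokens)

    # Frequency spectrum: spectrum[i] = number of types occurring exactly i times.
    spectrum = Counter(freqs.values())

    ttr = len(freqs) / n      # type-token ratio
    hapaxes = spectrum[1]     # types that occur exactly once

    # Yule's K in its common form: 10^4 * (sum_i i^2 * V_i - N) / N^2.
    yule_k = 1e4 * (sum(i * i * v for i, v in spectrum.items()) - n) / (n * n)

    return {"ttr": ttr, "hapaxes": hapaxes, "yule_k": yule_k}


if __name__ == "__main__":
    sample = ["суд", "закон", "суд", "право"]
    print(lexical_metrics(sample))
```

TTR depends strongly on text length, which is one reason to report it alongside a measure such as Yule’s K that is intended to be less sensitive to length.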