Assessing the Complexity of Russian Legal Texts: Model Architecture
The paper describes a metrics-based model for assessing the complexity of Russian legal texts. The model uses 130 metrics divided into the following categories: “basic metrics”, “readability formulas”, “words of different part-of-speech classes”, “n-grams of part-of-speech tags”, “frequency of lemmas”, “word-building patterns”, “grammemes”, “lexical and semantic features, multi-word expressions”, “syntactic features”, and “cohesion assessments”. Two further metrics take into account hypertext links and the presence of vague contexts. The model can thus evaluate structural, conceptual, and hypertextual complexity. It combines non-specific metrics traditionally used to predict complexity with style-specific metrics designed around the peculiarities of official texts. To evaluate morphological and syntactic features, the model relies on the annotation layers produced by UDPipe (“ru-syntagrus”) and pymorphy2. The model also uses a number of user dictionaries, including a list of lexical means of text deixis, a list of graphic abbreviations (1.5 thousand entries), a list of acronyms (2 thousand entries), a list of legal terms (10 thousand entries), a list of abstract lemmas (17 thousand entries), a list of lexical indicators of deontic possibility and necessity, and a list of light verb constructions. The values of the complexity metrics were calculated for all documents of the CorCodex law corpus, the CorDec corpus of Constitutional Court decisions, and the CorRIDA corpus of local acts (about 8 million tokens in total). The annotated legal corpora, complexity metrics, and user dictionaries are available for download from plaindocument.org.
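To make the idea of metrics concrete, the following is a minimal sketch of how a few “basic metrics” and one dictionary-based metric might be computed. It is not the model's actual implementation: the function names, the metric names, the regular-expression tokenizer, and the three-word `LEGAL_TERMS` set are all hypothetical. The sketch also matches surface word forms, whereas the model described above works on lemmas and annotation layers from UDPipe and pymorphy2.

```python
import re
from typing import Dict, FrozenSet, List

# Hypothetical stand-in for the list of legal terms (the real list has ~10 thousand entries).
LEGAL_TERMS: FrozenSet[str] = frozenset({"договор", "истец", "ответчик"})

# A word is a run of Unicode letters, optionally joined by hyphens.
_WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")
# Naive sentence boundary: terminal punctuation followed by whitespace or end of text.
_SENT_END_RE = re.compile(r"[.!?]+(?:\s+|$)")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences with a naive punctuation rule (assumption)."""
    return [s for s in _SENT_END_RE.split(text) if s.strip()]


def basic_metrics(text: str, terms: FrozenSet[str] = LEGAL_TERMS) -> Dict[str, float]:
    """Compute a few illustrative surface-level complexity metrics."""
    sentences = split_sentences(text)
    words = [w.lower() for w in _WORD_RE.findall(text)]
    if not words or not sentences:
        return {
            "avg_sentence_len": 0.0,
            "avg_word_len": 0.0,
            "type_token_ratio": 0.0,
            "legal_term_share": 0.0,
        }
    n_words = len(words)
    return {
        # Mean number of words per sentence.
        "avg_sentence_len": n_words / len(sentences),
        # Mean number of characters per word.
        "avg_word_len": sum(len(w) for w in words) / n_words,
        # Share of distinct word forms among all word tokens.
        "type_token_ratio": len(set(words)) / n_words,
        # Share of tokens found in the (hypothetical) term list; surface forms only.
        "legal_term_share": sum(w in terms for w in words) / n_words,
    }


if __name__ == "__main__":
    sample = "Истец подал иск. Ответчик оспорил договор."
    for name, value in basic_metrics(sample).items():
        print(f"{name}: {value:.3f}")
```

Because inflected forms such as “договора” would not match the nominative entry “договор”, a realistic version would lemmatize tokens before the dictionary lookup.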