Proceedings of the Third Workshop "Computational linguistics and language science".
The EPiC Series in Language and Linguistics publishes high-quality collections of papers in language, linguistics, and related areas.
This article addresses the problem of defining genre in computational linguistics and of searching for parameters that could formalize the concept of genre. Existing genre typologies rely on different types of features, whereas in NLP practice modern applications are adapted to learning on big data, and therefore to text features that do not require additional manual markup. Based on such text-internal features, this article focuses on differentiating various genres and grouping them by similarity of their feature distributions. We describe and interpret the contribution of the various feature types to the final result, and analyse how such features can be used for further adaptation of NLP models. The materials of the "Taiga" corpus with genre annotation are used as experimental data.
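Grouping genres by similar feature distributions can be sketched as follows. This is a minimal illustration, not the authors' method: the genre names and three-dimensional feature vectors are invented stand-ins for the text-internal features (e.g. normalized frequencies of parts of speech) computed on a real corpus.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Hypothetical text-internal feature vectors for a few genres.
genres = {
    "news":    [0.10, 0.30, 0.60],
    "fiction": [0.40, 0.35, 0.25],
    "poetry":  [0.45, 0.30, 0.20],
}

# Rank genre pairs by similarity of their feature distributions.
pairs = sorted(
    ((cosine(u, v), g1, g2)
     for g1, u in genres.items()
     for g2, v in genres.items() if g1 < g2),
    reverse=True,
)
print(pairs[0][1], pairs[0][2])  # most similar pair: fiction poetry
```

With these toy vectors, fiction and poetry cluster together while news stands apart; on real data the same ranking step would feed a clustering of genres.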
In this paper, we address the problem of automatic extraction of discourse formulae. By discourse formulae (DF) we mean a special type of construction at the discourse level that has a fixed form and serves as a typical response in dialogue. Unlike traditional constructions [4, 5, 6], they do not contain variables within the sequence; their slots can be found in the left-hand or right-hand statements of the speech act. We have developed a system that extracts DF from drama texts. We have compared token-based and clause-based approaches and found that the latter performs better. The clause-based model involves a uniform-weight vote of four classifiers and currently shows a precision of 0.30 and a recall of 0.73 (F1-score 0.42). The created module was used to extract a list of DF from 420 drama texts of the 19th–21st centuries [1, 7]. The final list contains 3000 DF, 1800 of which are unique. Further development of the project includes enhancing the module by extracting left-context features and applying other models, as well as exploring what the DF concept looks like in other languages.
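The uniform-weight vote described above can be sketched as follows. The four classifiers here are toy stand-ins (the paper's actual classifiers operate on clause-level features); only the voting scheme itself is taken from the abstract.

```python
def is_df(clause, classifiers, threshold=0.5):
    """Uniform-weight vote: a clause is a DF candidate if at least
    `threshold` of the classifiers vote yes."""
    votes = sum(1 for clf in classifiers if clf(clause))
    return votes / len(classifiers) >= threshold

# Hypothetical stand-in classifiers, each a weak signal that a clause
# behaves like a fixed-form dialogue response:
short = lambda c: len(c.split()) <= 3        # DF tend to be short
no_slots = lambda c: "_" not in c            # no variables inside the sequence
capitalized = lambda c: c[:1].isupper()      # starts a turn
final_punct = lambda c: c.endswith(("!", ".", "?"))

clfs = [short, no_slots, capitalized, final_punct]
print(is_df("No way!", clfs))                           # True
print(is_df("the weather was fine yesterday", clfs))    # False
```

Because the weights are uniform, the decision reduces to a simple majority threshold over the four votes.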
Scientific texts contain many special terms, which, together with their definitions, represent an important part of scientific knowledge to be extracted for various applications, such as text summarization and the construction of glossaries and ontologies. The paper reports rule-based methods developed for extracting terminological information, involving recognition of term definitions as well as detection of term occurrences in scientific and technical texts. In contrast to corpus-based terminology extraction, the developed methods are oriented to processing a single text and are based on lexico-syntactic patterns and rules representing specific linguistic information about terms in scientific texts. The formal language LSPL used to specify the patterns and rules, which is supported by programming tools and used for information extraction, is briefly characterized. Two applications of the methods are discussed: formation of a glossary for a given text document and subject index construction. For these applications, both collections of LSPL patterns and extraction strategies are described, and results of their experimental evaluation are given.
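The idea of a lexico-syntactic definition pattern can be illustrated with a plain regular expression. This is only a sketch in the spirit of such rules, not LSPL itself: the actual LSPL formalism is far richer and operates over morphologically analysed words rather than raw character strings.

```python
import re

# A hypothetical definition pattern: "<Term> is/are defined as/called <definition>."
PATTERN = re.compile(
    r"(?P<term>[A-Z][\w -]*?) (?:is|are) (?:defined as|called) "
    r"(?P<definition>[^.]+)\."
)

text = "A lexeme is defined as a unit of lexical meaning."
m = PATTERN.search(text)
print(m.group("term"), "->", m.group("definition"))
# A lexeme -> a unit of lexical meaning
```

A real pattern collection would contain many such rules (one per definitional construction), applied to a single document to collect term–definition pairs for the glossary.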
The paper describes two hybrid neural network models for named entity recognition (NER) in texts, namely Bi-LSTM-CRF and Gated-CNN-CRF, and presents the results of experiments with them.
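Both models share a CRF output layer, which at prediction time is decoded with the Viterbi algorithm. A minimal sketch of that decoding step follows; the emission scores and transition scores are made-up numbers standing in for what a trained Bi-LSTM or Gated-CNN encoder and CRF would produce.

```python
def viterbi(emissions, transitions, tags):
    """Find the highest-scoring tag sequence.
    emissions: list of {tag: score} dicts, one per token;
    transitions: {(prev_tag, cur_tag): score}."""
    # Each trellis column maps tag -> (best path score, backpointer).
    trellis = [{t: (emissions[0][t], None) for t in tags}]
    for em in emissions[1:]:
        prev_col, col = trellis[-1], {}
        for cur in tags:
            p, s = max(
                ((p, prev_col[p][0] + transitions[(p, cur)] + em[cur])
                 for p in tags),
                key=lambda x: x[1],
            )
            col[cur] = (s, p)
        trellis.append(col)
    # Backtrack from the best final tag.
    tag = max(trellis[-1], key=lambda t: trellis[-1][t][0])
    path = [tag]
    for col in reversed(trellis[1:]):
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]

tags = ["O", "PER"]
emissions = [{"O": 0.1, "PER": 0.9},   # e.g. token "John"
             {"O": 0.8, "PER": 0.2}]   # e.g. token "lives"
transitions = {(a, b): 0.0 for a in tags for b in tags}
print(viterbi(emissions, transitions, tags))  # ['PER', 'O']
```

In the full models, the CRF transition scores are learned jointly with the encoder, which lets the tagger penalize invalid tag sequences that a per-token softmax would allow.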