CLLS 2016. Computational Linguistics and Language Science. Proceedings of the Workshop on Computational Linguistics and Language Science. Moscow, Russia, April 26, 2016
As the number of digital texts increases rapidly, there is a pressing need for more advanced and diverse natural language processing tools. While purely statistical approaches have proved powerful and efficient for many NLP tasks, many applications would benefit from the formal models and approaches that traditional language science has to offer. In the hope of facilitating this interaction between theory and practical implementation, we are pleased to announce the workshop on Computational Linguistics and Language Science to be held in Moscow, Russia on April 25, 2016 (11 AM to 6 PM).
The paper presents an unsupervised and knowledge-free approach to compound splitting. Although the research is focused on German compounds, the method is expected to be extensible to other compounding languages. The approach is based on the annotated suffix tree (AST) method proposed and modified by Mirkin et al. To the best of our knowledge, annotated suffix trees have not yet been used for compound splitting. The main idea of the approach is to match all the substrings of a word (suffixes and prefixes separately) against an AST, determining the longest and sufficiently frequent substring to perform a candidate split. A simplification considers only the suffixes (or prefixes) and splits a word at the beginning of the selected suffix (the longest and sufficiently frequent one). The results are evaluated by precision and recall.
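The suffix-only simplification can be illustrated with a minimal sketch. This is not the authors' implementation: a plain dictionary of suffix counts stands in for the annotated suffix tree, and the function names, the thresholds `min_freq` and `min_len`, and the toy vocabulary are all assumptions made for illustration.

```python
from collections import Counter


def build_suffix_counts(vocabulary):
    """Count how many vocabulary words end in each suffix.

    A flat counter is used here as a stand-in for the frequency
    annotations of an annotated suffix tree (an assumption).
    """
    counts = Counter()
    for word in vocabulary:
        for i in range(len(word)):
            counts[word[i:]] += 1
    return counts


def split_compound(word, counts, min_freq=2, min_len=3):
    """Split `word` at the start of its longest sufficiently frequent suffix.

    `min_freq` and `min_len` are hypothetical thresholds; both parts of a
    split must be at least `min_len` characters long.
    Returns a tuple of parts, or a one-element tuple if no split is found.
    """
    # Increasing i means shorter suffixes, so the first hit is the longest.
    for i in range(1, len(word)):
        suffix = word[i:]
        if len(suffix) < min_len:
            break
        if i >= min_len and counts[suffix] >= min_freq:
            return word[:i], suffix
    return (word,)


# Toy vocabulary, invented for this example.
vocabulary = ["haus", "boot", "segel", "hausboot", "segelboot"]
counts = build_suffix_counts(vocabulary)
print(split_compound("hausboot", counts))
print(split_compound("segelboot", counts))
print(split_compound("haus", counts))
```

In this sketch the frequency threshold plays the role of "sufficiently frequent"; a real system would tune it on a corpus and would also handle prefixes and linking elements, which are omitted here.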
This paper describes a pilot study of the problem of detecting singleton mentions in Russian texts. A noun phrase is considered a singleton mention if it is the only expression in the text that refers to some entity. We discuss various morphosyntactic and lexical features, some of which were used for analogous tasks for English, and propose new features derived from discourse analysis. Having tested machine learning classifiers trained on the proposed features, we conclude that although the quality of the classifiers is significantly lower than for English, they still have rather high precision and thus can be helpful in various mention tracking tasks.
In this paper, we present an application of formal concept analysis (FCA) by showing how it can help construct a semantic map for a lexical typological study. We show that FCA captures typological regularities, so that concept lattices automatically built from linguistic data appear to be even more informative than traditional semantic maps. While this informativeness sometimes makes a map unreadable, in other cases it opens up new perspectives in the field, such as the opportunity to analyze the relationship between direct and figurative lexical meanings.
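To make the notion of a concept lattice built from linguistic data concrete, here is a minimal sketch that enumerates the formal concepts of a small binary context. It is a brute-force illustration rather than the procedure used in the paper; the lexeme names, the attribute labels, and the function name `formal_concepts` are hypothetical.

```python
from itertools import combinations


def formal_concepts(context):
    """Enumerate all formal concepts of a binary context.

    `context` maps each object (here: a lexeme) to the set of attributes
    (here: meanings) it has. A concept is a pair (extent, intent) in which
    the extent is exactly the set of objects sharing every attribute of
    the intent, and the intent is exactly the set of attributes shared by
    every object of the extent. Brute force, suitable for toy data only.
    """
    objects = list(context)
    all_attrs = set().union(*context.values())

    def intent(objs):
        # Attributes common to all given objects (all attributes if none).
        sets = [context[o] for o in objs]
        return frozenset(set.intersection(*sets)) if sets else frozenset(all_attrs)

    def extent(attrs):
        # Objects that have every given attribute.
        return frozenset(o for o in objects if attrs <= context[o])

    concepts = set()
    for r in range(len(objects) + 1):
        for objs in combinations(objects, r):
            b = intent(objs)
            concepts.add((extent(b), b))
    return concepts


# Hypothetical lexemes and meanings, invented for this example.
context = {
    "lex_a": {"cut", "pierce"},
    "lex_b": {"cut"},
    "lex_c": {"pierce", "taste"},
}
concepts = formal_concepts(context)
for ext, itt in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(ext), sorted(itt))
```

Ordering these concepts by inclusion of their extents yields the concept lattice; each node groups the lexemes that share a set of meanings, which is the kind of regularity a semantic map is meant to display.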
The paper presents a short summary of the applications of categorical constructions of quantum logic to natural language processing. We give a brief overview of quantum logic in general and of its use in natural language processing in particular. We then discuss the comparison of sentences and their representation in the quantum logic formalism. Examples of quantum diagrams are considered in order to interpret text analysis in terms of quantum logic techniques.
The problem of classifying text based on the deep parsing structure is addressed. An algorithm is proposed for document classification tasks where counts of words or n-grams are insufficient. The parse tree kernel method is considered at the level of paragraphs, based on anaphora, rhetorical structure relations and communicative actions linking phrases in the parse thicket.
The paper presents a new approach to the identification of emerging technologies, which relies strongly on the capabilities of big data analysis, namely text mining augmented by syntactic analysis techniques. The opportunities of the new big-data-augmented methodology are shown in comparison to existing results, both globally and in Russia. An integrated ontology of currently emerging technologies in the A&F sector is introduced. Directions and possible criteria for further enhancement and refinement of the proposed methodology are considered.