Developing a polysynthetic language corpus: problems and solutions

Arkhangelskiy T.A.; Lander Yu.A.

Компьютерная лингвистика и интеллектуальные технологии. 2016. No. 15 (22). P. 40–49.

Although comprehensive morphologically annotated corpora exist for many morphologically rich languages, no such corpus has so far been available for any polysynthetic language. Developing a corpus of a polysynthetic language poses a range of theoretical and practical challenges for corpus linguistics. Some of these challenges have been partly addressed in the development of corpora for languages with extensive morphological inventories and numerous productive derivation models, such as the Turkic or Uralic languages, while others are unique to polysynthetic languages. While working on a corpus of the polysynthetic West Circassian language, we had to identify these challenges and propose theoretical and practical solutions. The challenges include the tokenization problem, which involves delimiting morphology from syntax, the problems of lemmatization and part-of-speech tagging, and a number of glossing and search issues. The solutions proposed in the paper are partly implemented and will be available for public testing when the preliminary version of the corpus is released.

Research target: Philology and Linguistics

Priority areas: humanities

Language: English

Keywords: corpus linguistics, Adyghe, polysynthesis, Circassian languages, West Circassian

Some challenges of the West Circassian polysynthetic corpus

Arkhangelskiy T., Lander Yu. / NRU HSE. Series WP BRP "Linguistics". 2015. No. 37/LNG/2015.

Added: December 15, 2015

Nominal complex in West Circassian: between morphology and syntax

Lander Yu., Studies in Language 2017 Vol. 41 No. 1 P. 76–98

The paper presents a description and an analysis of the nominal complex, a peculiar construction which includes a noun and its modifiers, in West Circassian, a polysynthetic language of the Northwest Caucasian family. The nominal complex shows properties of a single word and tends to follow the template proposed for the word in West Circassian. ...

Added: August 8, 2016

Адыгский аналитический аддитив: порядок слов и диахрония

Bagirokova I., Lander Y., Томский журнал лингвистических и антропологических исследований 2016 № 2 (12) С. 9–19

The paper deals with the functioning of the analytical additive marker əčʼjə̣ / jəčʼjə̣ in the Temirgoy dialect of West Circassian (also known as Adyghe) and the Kuban dialect of Kabardian and analyses some morphosyntactic parameters that serve to differentiate its various functions. According to the hypothesis we propose, the marker əčʼjə̣ / jəčʼjə̣, which ...

Added: July 20, 2016

West Circassian Imperative-Optative System: A Study in a Prototype-Based Organisation of a Grammatical Domain

Lander Y., Bagirokova I., Syntaxe et Sémantique 2022 Vol. 22 No. 1 P. 57–81

West Circassian has no fewer than two imperatives and two optatives. Their distribution depends on various parameters such as the speaker’s control over the situation, the person of the topic and the type of predicate. The whole system arguably can be described with respect to the universal Imperative Prototype, which reflects grammaticalization of a ...

Added: October 26, 2020

Two-faced subordination marker in West Circassian necessity constructions

Lander Yu., Bagirokova I. / NRU HSE. Series WP BRP "Linguistics". 2015. No. 38/LNG/2015.

This paper describes the behavior of the subordination marker ‑n in modal necessity constructions in West Circassian, a polysynthetic language belonging to the Northwest Caucasian family. We show that ‑n functions as a simple suffix in the non-epistemic construction and as a phrasal affix in the epistemic construction. Hence, this morpheme violates the principle ...

Added: December 15, 2015

Corpus, multiple analyses and polysynthesis

Lander Yu., in: Adıge filolojisi: Güncel Konular. Düzce: Düzce University, 2016.

The paper discusses several problems which have been observed during the development of the corpus of West Circassian and proposes that their solutions should involve the possibility of multiple analyses. It is argued that this is related to certain properties of the constructions under discussion which are reflected in variation observed among the speakers of ...

Added: August 16, 2017

Producing polysynthetic verb forms in West Circassian (Adyghe): an experimental study

Lander Yu., Arkhangelskiy T. / NRU HSE. Series WP BRP "Linguistics". 2015. No. 23/LNG/2015.

This paper describes a pilot experiment which was conducted by the authors with speakers of the polysynthetic West Circassian (Adyghe) language and aimed at investigating their ability to use complex verb forms that cross-reference several arguments introduced by applicative morphology. The results of the experiment support the view that complex polysynthetic words can be constructed ...

Added: April 10, 2015

Цилитивы (‘легко’ и ‘трудно’) в адыгейском языке: семантика, аргументная структура и частеречные характеристики

Lander Y., Bagirokova I., Рема 2021 № 1 С. 56–75

West Circassian displays two types of cilitive (facilitive ‘easy’ and difficilitive ‘difficult’) forms, namely noun cilitives, which describe individuals, and secondary cilitives, which describe the state of affairs. Secondary cilitives seemingly originate from noun cilitives, hence the same cilitive suffixes mark forms that are remarkably different from each other in their morphosyntax. While noun cilitives ...

Added: October 26, 2020

Adıge filolojisi: Güncel Konular

Düzce: Düzce University, 2016.

The volume includes papers presented at the international symposium "Adyghe Philology". ...

Added: August 16, 2017

Аспекты полисинтетизма: Очерки по грамматике адыгейского языка

М.: РГГУ, 2009.

The volume includes papers that analyze the structure of the polysynthetic Adyghe language from a typological perspective. ...

Added: February 7, 2013

Актанты и сирконстанты в морфологии и в синтаксисе адыгейского языка

Lander Y., Вестник РГГУ. Серия: История. Филология. Культурология. Востоковедение 2015 № 1 С. 7–31

This paper discusses the morphological and syntactic means of expressing participants in West Circassian (Adyghe), focusing on the argument vs adjunct characteristics of these means. West Circassian provides evidence for the non-discreteness of the argument/adjunct contrast but also shows the necessity to distinguish between argument/adjunct properties in morphological expressions and ...

Added: March 23, 2015

Deriving affix ordering in polysynthesis: Evidence from Adyghe

Korotkova N., Lander Yu., Morphology 2010 Vol. 20 No. 2 P. 299–319

This article deals with the order of verbal suffixes in Adyghe, a polysynthetic language of the Caucasus. Traditionally the structure of the Adyghe word form and the order of its affixes were described in terms of template morphology. However, we present new data demanding another, substantially different approach. We demonstrate that for the most part ...

Added: February 6, 2013

West Caucasian relative pronouns as resumptives

Lander Yu., Daniel M., Linguistics 2019 Vol. 57 No. 6 P. 1239–1270

In polysynthetic West Caucasian languages, the morphological verbal complex amounts to a clause, with all kinds of participants cross-referenced by affixes. Relativization is performed by introducing a relative affix in the cross-reference slot which corresponds to the relativized participant. However, these languages display several cross-linguistically rare features of relativization. Firstly, while under the view of ...

Added: June 28, 2018

Asymmetric word class systems and noun primacy: West Circassian and beyond

Lander Y., Bagirokova I., Journal of Linguistics 2021

In this paper we argue for the existence of an asymmetric parts-of-speech system where nouns constitute a separate word class but do not form any non-privative contrast with other content parts of speech. As a result, in a system of this kind there is no need to distinguish verbs even though there are good reasons ...

Added: October 26, 2020

Non-quantificational distributive quantifiers in Besleney Kabardian

Arkadiev P., Lander Yu., Snippets 2013 No. 27 P. 5–7

The squib discusses certain unexpected properties of nominals containing distributive universal quantifiers in Besleney Kabardian such as their capacity to appear as clausal predicates and their similarities to plural nominals. ...

Added: October 15, 2013

Интенсификатор "до ужаса" в русском языке на пути грамматикализации

Герасимов Д. В., Acta Linguistica Petropolitana. Труды института лингвистических исследований 2016 Т. XII № 1 С. 336–363

The paper presents a corpus-driven study of the Russian PP-based degree modifier do uzhasa (lit. ‘to horror’), suggesting a two-stage grammaticalization path. The first stage (presumably, XVIII–XIX c.) involves subjectification, while during the second stage, subjective readings give rise to intensifier readings through conceptual metonymy. Both stages see a host class expansion. This process is ...

Added: November 27, 2017

Корпусные инструменты в грамматических исследованиях русского языка

Lyashevskaya O., М.: Языки славянской культуры, 2016.

Corpus linguistics can be broadly defined in terms of two partially overlapping research dimensions. On the one hand, corpus linguistics is knowledge of how to compile and annotate linguistic corpora. On the other hand, corpus linguistics is a family of qualitative and quantitative methods of language study based on corpus data. The book presents ...

Added: March 26, 2015

Spatial Meanings and Russian Prosody: a Corpus Study

Khudyakova M. / NRU HSE. Series WP BRP "Linguistics". 2014.

The objective of this paper is to see whether corpus material reveals prosodic features that can express spatial meanings. The two main questions that we try to answer are: 1. What prosodic instruments express spatial meanings? 2. What characteristics of space are coded by prosody in the Russian language? The source of the ...

Added: October 22, 2014

Morphological causatives in Abaza

Koshevoy A. / NRU HSE. Series WP BRP "Linguistics". 2018. No. 75/LNG/2018.

This paper deals with the productive morphological causative r(ə)- in Abaza (Northwest Caucasian), a highly polysynthetic ergative language. We discuss the causativization process in Abaza as well as the semantic properties of the construction and elaborate an analysis of the event structure of the Abaza morphological causatives based on the scope of adverbials. ...

Added: December 16, 2018

Труды международной конференции "Корпусная лингвистика - 2019"

СПб.: Издательство Санкт-Петербургского университета, 2019.

The volume contains the papers presented at the International Scientific Conference "Corpus Linguistics-2019", held on 24-28 June 2019 in St. Petersburg. ...

Added: July 8, 2019

Прогностическая валидность глагольных форм длительного аспекта в корпусной лингвистике английского языка

Popkova E., Социосфера 2010 № 4 С. 74–81

The article discusses the most recent trends in the development of the English progressive. A corpus-based approach to linguistic research is seen as an effective means of determining reliability of the data retrieved and helps track the major diachronic dynamic in the increasing frequency of the progressive aspect that has taken place since the beginning ...

Added: November 6, 2012

Об исследовании бжедугского диалекта адыгейского языка

Lander Y., Arkadiev P., Moroz G., in: Полевые исследования студентов РГГУ: Этнология, фольклористика, лингвистика. Вып. X. М.: РГГУ, 2015. С. 183–201.

The paper contains some basic data as well as certain new facts concerning the Bzhedug dialect of West Circassian (Adyghe) and also includes a glossed text. ...

Added: April 20, 2016

Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue” (2015)

M.: Russian State University for the Humanities, 2015.

Added: April 28, 2015

Phasal polarity in Abaza

Klyagina E., Panova A. / NRU HSE. Series WP BRP "Linguistics". 2019. No. 89/LNG/2019.

Phasal polarity (PhP) is a cross-linguistic category which includes such values as ᴀʟʀᴇᴀᴅʏ, ɴᴏᴛ ʏᴇᴛ, sᴛɪʟʟ and ɴᴏ ʟᴏɴɢᴇʀ. This paper discusses morphologically bound markers of phasal polarity in Abaza, a polysynthetic Northwest Caucasian language. We show that the Abaza PhP affixes ‑χ’a ‘already’, -s (+ negation) ‘not yet’, -rḳʷa ‘still’ and -χ (+ negation) ...

Added: December 14, 2019