Automatic dependency parsing of a learner English corpus REALEC

O. Lyashevskaya

NRU HSE, 2017.

Lyashevskaya O., Panteleeva I. M.

The paper presents a Universal Dependencies (UD) annotation scheme for a learner English corpus. The REALEC dataset consists of essays written in English by Russian-speaking university students in a general English course. The essays are part of the students' preparation for an independent final examination similar to an international English exam. When adapting existing dependency parsing tools to learner data, one has to take into account the extent to which students' mistakes provoke errors in the parser output. Ungrammatical and stylistically inappropriate utterances may challenge parsing algorithms trained on grammatically well-formed written texts. In our experiments, we compared the output of the dependency parser UDPipe (trained on UD-English 2.0) with the results of manual parsing, focusing in particular on parses of ungrammatical English clauses. We show how mistakes made by students influence the work of the parser. Overall, UDPipe performed reasonably well (UAS 92.9, LAS 91.7). Errors in the automatic annotation fall into three cases: (a) incorrect detection of the head, (b) incorrect detection of the relation type, and (c) both. We propose some solutions that could improve the automatic output and thus make the assessment of syntactic complexity more reliable.
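The evaluation described in the abstract can be illustrated with a minimal sketch. It assumes that the manual (gold) and automatic parses are token-aligned and that each token is represented as a (head index, relation label) pair; the function name `attachment_scores`, the category names, and the toy sentence are hypothetical and are not taken from the paper or from REALEC. UAS counts tokens whose head is correct, LAS counts tokens whose head and relation are both correct, and the remaining tokens are sorted into the three error cases (a), (b), and (c).

```python
from collections import Counter


def attachment_scores(gold, pred):
    """Compare token-aligned parses given as lists of (head, deprel) pairs.

    Returns (UAS, LAS, counts), where counts tallies correct tokens and the
    three error cases: wrong head only, wrong relation only, and both wrong.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must be token-aligned")
    if not gold:
        raise ValueError("cannot score an empty parse")

    counts = Counter()
    for (gold_head, gold_rel), (pred_head, pred_rel) in zip(gold, pred):
        head_ok = gold_head == pred_head
        rel_ok = gold_rel == pred_rel
        if head_ok and rel_ok:
            counts["correct"] += 1
        elif rel_ok:
            counts["wrong_head"] += 1      # case (a): head is wrong
        elif head_ok:
            counts["wrong_relation"] += 1  # case (b): relation type is wrong
        else:
            counts["wrong_both"] += 1      # case (c): both are wrong

    total = len(gold)
    # UAS: the head is correct, whether or not the label is.
    uas = 100.0 * (counts["correct"] + counts["wrong_relation"]) / total
    # LAS: both the head and the label are correct.
    las = 100.0 * counts["correct"] / total
    return uas, las, counts


# Toy four-token example (invented for illustration, not REALEC data).
gold = [(2, "nsubj"), (0, "root"), (4, "case"), (2, "obl")]
pred = [(2, "nsubj"), (0, "root"), (2, "case"), (2, "nmod")]

uas, las, counts = attachment_scores(gold, pred)
print(f"UAS {uas:.1f}, LAS {las:.1f}")  # prints "UAS 75.0, LAS 50.0"
print(dict(counts))
```

Keeping the three error cases as separate counters, rather than reporting only the two scores, makes it possible to see whether a parser's losses on learner text come mainly from attachment or from labelling.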

Research target: Computer Science; Philology and Linguistics

Priority areas: humanities

Language: English

Keywords: learner corpus; English as a foreign language; L2 English; universal dependencies; dependency annotation of learner treebank; dependency annotation; syntactic annotation of a corpus; syntactic annotation of a learner corpus; evaluation of parser quality; evaluation of syntactic parsing quality

Publication based on the results of:

Learner corpus REALEC: Lexicological observations (2016)

Text collections for evaluation of Russian morphological taggers

Lyashevskaya O., Bocharov V., Sorokin A. et al., Jazykovedny Casopis 2017 Vol. 68 No. 2 P. 258-267

The paper describes the preparation and development of the text collections within the framework of MorphoRuEval-2017 shared task, an evaluation campaign designed to stimulate development of the automatic morphological processing technologies for Russian. The main challenge for the organizers was to standardize all available Russian corpora with the manually verified high-quality tagging to a single ...

Added: January 30, 2018

Intercultural Space: Linguistic and Didactic Aspects. Part 2. Proceedings of the sections "Intercultural Linguistics" and "Intercultural Translatology" and of the student research forum. Plenary session and the section "Intercultural Didactics".

Scherbakova A., PetrSU Publishing House, 2021

The paper focuses on the task of clustering essays produced by ESL (English as a Second Language) learners. The data was taken from a learner corpus REALEC. The division of texts by certain characteristics can be useful to speed up the analysis of a single corpus or access to the necessary sections of a large ...

Added: April 30, 2021

Universal Dependencies for Russian: A New Syntactic Dependencies Tagset

Lyashevskaya O., Droganova K., Zeman D. et al. / NRU HSE. Series WP BRP "Linguistics". 2016. No. 44.

This paper presents the Universal Dependencies tagset (UD v1) as a new annotation scheme for Russian treebanks. The universal list of dependency relations was adopted and extended to comply with certain language-specific syntactic constructions. The tagset was validated by converting two Russian treebanks into the UD format, UD-Russian-SynTagRus and UD-Russian-Google. ...

Added: December 14, 2016

Using Universal Dependencies in the grammatical parsing of multilingual text (the case of the impersonal predicative)

Lyukina E. V., Vestnik Novosibirskogo Gosudarstvennogo Universiteta. Seriya: Lingvistika i Mezhkul'turnaya Kommunikatsiya 2018 Vol. 16 No. 2 P. 19-33

The paper discusses the Universal Dependencies (UD) initiative, which aims to develop a cross-linguistically consistent annotation scheme for grammatical analysis. The initiative is intended to simplify cross-language research, unify cross-linguistic typology, and lay a foundation for automated multilingual systems and a universal cross-language text parser. In the first part ...

Added: April 21, 2018

REALEC learner treebank: annotation principles and evaluation of automatic parsing

Lyashevskaya O., Panteleeva I. M., in: Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories (TLT 16). Association for Computational Linguistics, 2017. P. 80-87.

The paper presents a Universal Dependencies (UD) annotation scheme for a learner English corpus. The REALEC dataset consists of essays written in English by Russian-speaking university students in a general English course. The original corpus is manually annotated for learners’ errors and gives information on the error span, error type, and the possible correction ...

Added: December 11, 2018

RUSSE2018: a Shared Task on Word Sense Induction for the Russian Language

Panchenko A., Lopukhina A., Ustalov D. et al., Computational Linguistics and Intellectual Technologies 2018 No. 17 P. 547-564

The paper describes the results of the first shared task on word sense induction (WSI) for the Russian language. While similar shared tasks were conducted in the past for some Romance and Germanic languages, we explore the performance of sense induction and disambiguation methods for a Slavic language that shares many features with other Slavic ...

Added: June 7, 2018

Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (Moscow, June 17-20, 2020)

Moscow : RSUH Publishing House, 2020

Papers from the Annual International Conference “Dialogue” (2020). Issue 19 ...

Added: June 26, 2020

TaLC 12 - Teaching and Language Corpora Conference

[s.n.], 2016

Various issues in learner corpus research and its use in teaching are presented. These include the question of a norm in corpora: whether the norm should necessarily be native and what problems a native norm may present. Learners who behave differently from native speakers do not necessarily use language incorrectly as ...

Added: December 10, 2016

Innovative Use of NLP for Building Educational Applications

Stroudsburg, PA : Association for Computational Linguistics, 2019

Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications ...

Added: October 5, 2020

Proceedings of the 21st International Conference on Computational Linguistics "Dialogue"

Moscow : RSUH Publishing House, 2015

The volume contains the proceedings of the 21st International Conference on Computational Linguistics. ...

Added: May 20, 2015

Comparing two “thermometers”: Impact factors of 20 leading economic journals according to Journal Citation Reports and Scopus

Pislyakov V., Scientometrics 2009 Vol. 79 No. 3 P. 541-550

Impact factors for 20 journals ranked first by Journal Citation Reports (JCR) were compared with the same indicator calculated on the basis of citation data obtained from Scopus database. A significant discrepancy was observed as Scopus, though results differed from title to title, found in general more citations than listed in JCR. This also affected ...

Added: January 25, 2013

Using TXM Platform for Research on Language Changes over Time: The Dynamics of Vocabulary and Punctuation in Russian Literary Texts

Lavrentiev A. M., Sherstinova T., Chepovskiy A. et al., Vestnik Tomskogo Gosudarstvennogo Universiteta, Filologiya 2021 Vol. 70 P. 69-89

The purpose of this paper is to test the methodological tools provided by TXM platform for research on dynamics of vocabulary and punctuation marks in diachronic corpora. TXM is a powerful text analysis software which provides both quantitative and qualitative features in a transparent open-source implementation. In this paper, we demonstrate how it can be ...

Added: June 24, 2021

Review of the book: Wilken, Rowan: Teletechnologies, Place, and Community. New York, Routledge, 2011 // Digital Icons: Studies in Russian, Eurasian and Central European New Media, No 9 (2013): 129-133.

Gusejnov G., Digital Icons: Studies in Russian, Eurasian and Central European New Media 2013 No. 9 P. 129-133

In his book, Rowan Wilken, lecturer at the University of Swinburne, Australia, makes an attempt at providing a theoretical frame for a three-dimensional problem: the relation between new technologies, communities and places. His main goal is to sculpt an understanding of the relationship between place and community, both of which are transcended by what he ...

Added: March 24, 2014

A method for automatically generating lexical and grammatical exercises in the wordbank cloze format

Malafeev A., Inostrannye Yazyki v Vysshei Shkole 2015 No. 2 (33) P. 88-95

Language exercises are widely used in teaching foreign languages; yet, manually creating exercises is labor-intensive and time-consuming. This paper describes a method for automatically generating EFL wordbank cloze exercises. These are generated from arbitrary passages in English, which is an important advantage in terms of learner motivation; indeed, the content of the exercises can be ...

Added: September 4, 2015

Digital Russia: The Language, Culture and Politics of New Media Communication

L. : Routledge, 2014

This book provides a comprehensive analysis of the ways in which new media technologies have shaped language and communication in contemporary Russia. It traces the development of the Russian-language internet (Runet) from late-Soviet cybernetics to the advent of Twitter and explores the evolution of web-based communication practices, showing how they have both shaped and been ...

Added: December 11, 2013

Clausal complexity features in professional and student academic writing: A corpus-based analysis of texts in management and economics

Smirnova E. A., Journal of English for Academic Purposes 2020

The study is a quantitative analysis of the use of clausal complexity features in two kinds of corpora: expert corpora which comprise articles published in peer-reviewed journals in management and economics and learner corpora of students’ research papers in the same disciplines. The syntactic constructions selected for the analysis are taken from various guidebooks and ...

Added: October 20, 2019

PR in the Cultural Sphere

Tulchinskii G. L., St. Petersburg : Lan, 2011

The textbook gives a systematic account of PR for organizations and institutions, covering the goals and techniques of this activity and ways of analyzing how effectively its tasks are accomplished. The book is oriented largely toward PR in business activity and especially in the socio-cultural nonprofit sphere. The appendices contain materials and sample documents important for organizing PR in practice. The book can be used both for independent study of ...

Added: October 5, 2012

Proceedings of the Fourth International Conference on Cognitive Science

Tomsk, 2010

Added: November 18, 2013

Predictions, big data, and new measures: on the potential of computational linguistics technologies in theoretical linguistic research

Bonch-Osmolovskaya A. A., Voprosy Jazykoznanija 2016 No. 2 P. 100-120

The article reviews recent studies in which a theoretical research problem is solved using methods or tools from computational linguistics. The review analyzes in detail exactly how applying a given tool or method can yield new knowledge about the nature of language. In particular, it identifies two main directions whose development within ...

Added: April 14, 2015

Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (Moscow, May 29 - June 1, 2019)

Moscow : Publishing Center of the Russian State University for the Humanities, 2019

The book includes 64 papers submitted to the International conference in computer linguistics and intellectual technologies Dialogue 2019 and presents a broad spectrum of theoretical and applied research of natural language description, language simulation, and creation of applied computer technologies. ...

Added: October 16, 2019

23rd Conference of Open Innovations Association FRUCT, FRUCT 2018

IEEE Computer Society, 2018

23rd IEEE FRUCT Conference. ...

Added: November 1, 2020

The 26th International Conference on Computational Linguistics (COLING 2016)

[s.n.], 2016

Added: December 1, 2016

Thirteenth National Conference on Artificial Intelligence with International Participation KII-2012 (October 16-20, 2012, Belgorod, Russia). Volume 2

Belgorod : Belgorod State Technological University named after V.G. Shukhov, 2012

The importance of holding the thirteenth national conference on artificial intelligence (KII-2012) stems from the need to exchange scientific information and the latest achievements in the field. Leading scientists and specialists from academic institutes, research and industrial organizations, and universities in Russia and abroad take part in discussing the fundamental theoretical and applied problems that arise in building intelligent systems. ...

Added: November 13, 2012

Study guide English for Specific Purposes: Computer Security

Baranovskaya T., Klepko E. Y., Reznichenko E. M. et al., Moscow : SU-HSE Publishing House, 2009

This textbook is intended for third-year students of the Faculty of Business Informatics and meets the requirements of the bachelor's program in 080700.62 "Business Informatics". The book is the first part of the course and is designed for use in the first and second modules. In the third year, the curriculum provides for the study of professionally oriented English (English for specific purposes), which determined the choice of topic: computer security. The textbook ...

Added: May 14, 2013