Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories (TLT 16)
This volume contains the papers presented at the 16th International Workshop on Treebanks and Linguistic Theories (TLT), which brings together developers and users of linguistically annotated natural language corpora. We take a ‘treebank’ to be any pairing of natural language data (spoken or written) with annotations of linguistic structure at various levels of analysis, ranging, for example, from morpho-phonology to discourse. The articles address all aspects of treebank design, development, and use, including reflections on the design of linguistic annotations, methodology studies, resource announcements or updates, annotation or conversion tool development, and reports on treebank usage.
The paper presents a Universal Dependencies (UD) annotation scheme for a learner English corpus. The REALEC dataset consists of essays written in English by Russian-speaking university students as part of a general English course. The original corpus is manually annotated for learners’ errors and records, for each error, its span, its type, and a possible correction provided by experts. Syntactic dependency annotation adds value to learner corpora because it makes it possible to explore the interaction between syntax and different types of errors; it also helps to assess the syntactic complexity of learners’ texts. When adapting existing dependency parsing tools, one has to take into account the extent to which students’ mistakes provoke errors in the parser output. Ungrammatical and stylistically inappropriate utterances may challenge parsers trained on grammatically correct academic texts. In our experiments, we compared the output of the dependency parser UDPipe (trained on UD English 2.0) with the results of manual parsing, focusing in particular on parses of ungrammatical English clauses. We show how mistakes made by students affect the parser’s output. Overall, UDPipe performed reasonably well (UAS 92.9, LAS 91.7). We analyse several cases of erroneous parsing that are due either to incorrect detection of a head or to the wrong choice of relation type. We propose some solutions that could improve the automatic output and thus make syntax-based learner corpus research and the assessment of syntactic complexity more reliable. The REALEC treebank is freely available under the CC BY-SA 3.0 licence.
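For readers unfamiliar with the two scores quoted in the abstract, the following is a minimal sketch of how unlabelled and labelled attachment scores (UAS, LAS) are conventionally computed by comparing parser output with a manual parse. It is not the authors' evaluation script: the function name `attachment_scores`, the `(head, relation)` token representation, and the toy sentence are all assumptions made for illustration. The two mismatches in the toy data mirror the two error types discussed in the abstract, a wrongly detected head and a wrongly chosen relation type.

```python
from typing import List, Tuple

# Each token is represented as (head_index, relation_label); 0 denotes the root.
# This representation is an assumption for illustration, not the corpus format.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for two aligned token sequences."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS: the predicted head matches the manually annotated head.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the head and the relation label match.
    correct_arcs = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_arcs / total


# Hypothetical learner sentence "She go to school" (4 tokens), manual parse.
gold = [(2, "nsubj"), (0, "root"), (4, "case"), (2, "obl")]
# Hypothetical parser output: wrong head on token 3, wrong label on token 4.
predicted = [(2, "nsubj"), (0, "root"), (2, "case"), (2, "nmod")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS {uas:.1f}, LAS {las:.1f}")  # prints "UAS 75.0, LAS 50.0"
```

A wrong head lowers both scores, whereas a wrong relation label on a correctly attached token lowers only LAS, which is why LAS can never exceed UAS.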