Improving English-Russian sentence alignment through POS tagging and Damerau-Levenshtein distance

A. B. Kutuzov

P. 63–68.

Kutuzov A. B.

This paper introduces an approach to improving English-Russian sentence alignment, based on POS tagging of source and target texts that have been automatically aligned with HunAlign. The initial hypothesis is tested on a corpus of bitexts. The sequence of POS tags for each sentence (specifically, nouns, adjectives, verbs and pronouns) is treated as a “word”, and the Damerau-Levenshtein distance between the source and target “words” is computed. This distance is then normalized by the length of the target sentence and compared against a threshold that separates supposedly misaligned sentence pairs from “good” ones. The experimental results show a precision of 0.81 and a recall of 0.8, which allows the method to be used as an additional data source in parallel corpora alignment. At the same time, this leaves room for further improvement.
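The scoring step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's implementation: the single-letter tag codes (N, A, V, P), the function names, the use of the optimal string alignment variant of Damerau-Levenshtein distance, normalization by the length of the target tag sequence, and the threshold value of 0.5 are all hypothetical choices that the abstract does not specify.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance, a restricted form of
    Damerau-Levenshtein distance (assumed variant, not confirmed by the paper)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent symbols.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def normalized_pos_distance(source_tags: str, target_tags: str) -> float:
    """Distance between two POS 'words', divided by the target 'word' length.

    Each character is a hypothetical one-letter tag code:
    N = noun, A = adjective, V = verb, P = pronoun.
    """
    if not target_tags:
        # Assumption: an empty target matches only an empty source.
        return 0.0 if not source_tags else float("inf")
    return osa_distance(source_tags, target_tags) / len(target_tags)


def is_suspect_pair(source_tags: str, target_tags: str, threshold: float = 0.5) -> bool:
    """Flag a sentence pair as possibly misaligned (threshold is a placeholder)."""
    return normalized_pos_distance(source_tags, target_tags) > threshold


# Usage: tag sequences for one aligned English-Russian sentence pair.
print(normalized_pos_distance("NVN", "NVNA"))
print(is_suspect_pair("NNNN", "VA"))
```

In practice the threshold would have to be tuned on annotated sentence pairs; the value used here is arbitrary.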

Language: English

Keywords: parallel corpora, sentence alignment

Publication based on the results of:

Corpus technologies in linguistic and interdisciplinary studies (2013)

In book

Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

Association for Computational Linguistics, 2013.

An overview of the family of constructions with the “agent demotion” function in Slavic languages

Plungian V., Подгорная А. Д., Славистика 2023 Vol. 27 No. 2 P. 54–70

This paper presents an overview of constructions that perform the function of “agent demotion” in Slavic languages, including the participial passive, the subject impersonal with a short passive participle (in -no/to), the form with a continuant of Proto-Slavic *sę, which shows the properties of a passive or an impersonal depending on the language, constructions with the verb in the 3rd person plural and singular (neuter), universal uses of the 2nd person singular, 1st ...

Added: June 6, 2024

The parallel corpus as a grammatical database and the New Testament as a parallel corpus (preface)

Plungian V., Acta Linguistica Petropolitana. Труды института лингвистических исследований 2023 Vol. 19 No. 3 P. 15–38

The article serves both as a preface and as a theoretical introduction to the subsequent articles of this special issue of ALP. It gives a general description of a project to build a database of typologically relevant grammatical contexts based on a parallel corpus of New Testament translations. It also provides a brief overview of the composition and content of the articles that follow. ...

Added: February 1, 2024

A corpus study of the competition between constructions with the “agent demotion” function in Slavic languages

Plungian V., Подгорная А. Д., Studia Slavica 2022 Vol. 67 No. 1-2 P. 115–131

The article examines constructions with the “agent demotion” function and their translation equivalents, using a parallel corpus of M. A. Bulgakov's novel “The Master and Margarita” in translations into Polish, Czech, Bulgarian, Serbian and German. This label covers means that deprive the agent of its privileged communicative status, which manifests itself in its realization in an uncharacteristic syntactic position, complete omission or ...

Added: November 8, 2023

On the semantics and typology of acquisitive modality: Swedish orka and its synonyms against the background of Russian

Vladimir Plungian, Åkerman Sarkisian K., Scando-Slavica 2023 Vol. 69 No. 1 P. 3–24

The so-called acquisitive modality or actuality, describing a successful realization of an action, is a largely understudied type of modal values. It is of significant interest both from the viewpoint of its grammaticalization paths and of a wide and rich variety of its lexical expression. The article discusses one of the main representatives of this ...

Added: November 8, 2023

The epistemological potential of translated texts (based on a Russian-Japanese parallel corpus of literary works)

Strizhak U., Вестник Московского университета. Серия 22: Теория перевода 2023 Vol. 16 No. 1 P. 93–109

target texts, parallel corpus, epistemological potential, target language, agentivity, Japanese language ...

Added: September 9, 2023

The Goal of motion in the Gospel of Luke: towards improving the procedure for identifying prototypical contexts

Filatov K., Acta Linguistica Petropolitana. Труды института лингвистических исследований 2023 Vol. 19 No. 3 P. 39–74

This paper aims to describe the sampling procedure for prototypical contexts containing the Goal of translational motion in the Gospel of Luke. The survey was based on five texts of the Gospel of Luke: Koine Greek, English, Russian, Tabassaran and Sahidic Coptic. The main improvement of the sampling technique described are the listing of not ...

Added: November 18, 2022

Quantitative Analysis of Passives with Agent Phrase Based on Multilingual Parallel Data

Нестеренко Л. В., in: Post-Proceedings of the 5th Conference Digital Humanities in the Nordic Countries (DHN 2020). Issue 2865. [s.n.], 2021. P. 5–15.

Added: November 22, 2021

Automatic data collection in lexical typology

Ryzhova D., Melnik A. A., Ершов И. А. et al., in: Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2018”. [s.n.], 2018. P. 619–636.

The paper addresses the issue of automatic data collection for lexical typological studies in the Frame approach paradigm. Research in this framework is based on the analysis of distributional properties of the lexemes in question. Hence, questionnaires for such studies consist of typical contexts where lexical items from a given semantic domain can ...

Added: October 17, 2018

The Poetic Corpus of Russian: Where the Poems are Written

Sichinava D., Orekhov B., in: Proceedings of the Second Workshop on Corpus-Based Research in the Humanities CRH-2, 25-26 January 2018, Vienna, Austria. Wien: Gerastree Proceedings, 2018. P. 201–205.

The paper discusses the marking of the composition location in the Poetic Corpus of Russian that enables customizing subcorpora by these locations and subsequent search by this parameter. The place names indicated by the authors are extracted, tagged and “normalized”, that is, all the different versions of names and minor locations are boiled down to ...

Added: August 30, 2018

On the development of a Latvian-Russian parallel corpus

Perkova N., Sichinava D., Frontiers in Artificial Intelligence and Applications 2016 Vol. 289 P. 130–135

This paper presents the current status of the Latvian-Russian parallel corpus, which is an ongoing project within the Russian National Corpus. It discusses the existing parallel corpora including Latvian texts, availability of sources and the main principles and tools of alignment and morphological annotation, as well as further plans for developing the corpus. ...

Added: August 30, 2018

Corpus analysis tools in foreign language teaching

Gorina O. G., Вестник Томского государственного университета 2018 Vol. 22 No. 435 P. 187–194

As was initially suggested by data-driven teaching pioneers, not only the researcher but also the learner should be given the chance to study language through corpora or to access authentic linguistic data. Working on that assumption, the article elaborates on the potential of corpus analysis for the purpose of L2 teaching. Firstly, a succession of ...

Added: January 21, 2018

Parallel Belarusian-Russian and Russian-Belarusian corpora: a joint project of the Russian National Corpus

Sichinava D., Arkhangelskiy T., in: Корпусы национальных языков: модели и технологии. Труды Казанской школы по компьютерной и когнитивной лингвистике TEL-2012. Kazan: Издательство «Фэн» Академии наук Республики Татарстан, 2012. P. 54–60.

Added: April 23, 2013

Russian Learner Parallel Corpus as a Tool for Translation Studies

Kutuzov A. B., Kunilovskaya M. A., Oschepkov A. et al., in: Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной Международной конференции «Диалог» (Бекасово, 30 мая–3 июня 2012 г.). In 2 vols. Vol. 1: Основная программа конференции. Issue 11. Moscow: Российский государственный гуманитарный университет, 2012. P. 362–369.

The paper presents a project aimed at the development of a Russian Learner Parallel Corpus, discusses the existing analogues, describes the current status and the tasks in which it could be used. The existing parallel corpora contain (comparatively) “correct” translations; whereas the aim of the present project is to create a sufficiently large corpus of ...

Added: February 13, 2013