POS tagger evaluation for the automated text analysis and identification of learner error

Ch. 6. P. 44–49.

Vinogradova O. I., Buzanov A., Generalova S. A., Overnikova D., Smilga V. K., Sigdel E. S.

Working with learner corpora requires elaborate NLP techniques such as POS annotation. In this article, a team of computational linguists presents their experience of choosing a POS tagger for precise and effortless annotation of .txt files with Python 3. The Russian Error-Annotated Learner English Corpus (REALEC) is the underlying corpus, and the POS tagger has to cope with the features of its texts. After identifying the four most promising part-of-speech taggers, our team conducted several sets of tests and applied various criteria to evaluate the taggers' precision, speed and compatibility with the Python scripts already used in the research. The article describes the tests and statistics, evaluates the four taggers (PatternTagger, NLTK, SpaCy and TreeTagger) and presents the conclusion our team arrived at.
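The kind of comparison described in the abstract can be illustrated with a minimal Python 3 sketch. It is not the authors' test suite: the function name evaluate_tagger, the toy baseline_tagger and the two-sentence GOLD sample are all hypothetical, and "precision" is assumed here to mean token-level tag accuracy against a hand-tagged sample.

```python
import time
from typing import Callable, Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]            # [(token, tag), ...]
Tagger = Callable[[List[str]], TaggedSentence]    # tokens -> tagged tokens


def evaluate_tagger(tagger: Tagger, gold: List[TaggedSentence]) -> Dict[str, float]:
    """Return token-level accuracy and wall-clock time for one tagger."""
    correct = 0
    total = 0
    start = time.perf_counter()
    for sentence in gold:
        tokens = [token for token, _ in sentence]
        predicted = tagger(tokens)
        # Compare tags position by position; assumes the tagger keeps tokenization.
        for (_, gold_tag), (_, predicted_tag) in zip(sentence, predicted):
            correct += int(gold_tag == predicted_tag)
            total += 1
    elapsed = time.perf_counter() - start
    accuracy = correct / total if total else 0.0
    return {"accuracy": accuracy, "seconds": elapsed, "tokens": total}


def baseline_tagger(tokens: List[str]) -> TaggedSentence:
    """Hypothetical stand-in tagger: tiny lexicon, everything else is NN."""
    lexicon = {"the": "DT", "a": "DT", "is": "VBZ"}
    return [(token, lexicon.get(token.lower(), "NN")) for token in tokens]


# Hypothetical hand-tagged sample (Penn Treebank style tags), not REALEC data.
GOLD: List[TaggedSentence] = [
    [("The", "DT"), ("student", "NN"), ("writes", "VBZ"), ("essays", "NNS")],
    [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("test", "NN")],
]

if __name__ == "__main__":
    report = evaluate_tagger(baseline_tagger, GOLD)
    print(f"accuracy: {report['accuracy']:.3f} over {report['tokens']} tokens")
```

Any callable that maps a token list to (token, tag) pairs can be passed in place of the toy tagger. For example, nltk.pos_tag has this shape, provided NLTK and its tagger model are installed. Running every candidate over the same sample makes the accuracy and timing figures directly comparable, as long as the gold tags and the tagger share one tagset.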

Language: English

Publication based on the results of:

Automated Detection of Writing Inaccuracies for Students of English in Russia (2019)

In book

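ПРОСТРАНСТВО НАУЧНЫХ ИНТЕРЕСОВ: ИНОСТРАННЫЕ ЯЗЫКИ И МЕЖКУЛЬТУРНАЯ КОММУНИКАЦИЯ - СОВРЕМЕННЫЕ ВЕКТОРЫ РАЗВИТИЯ И ПЕРСПЕКТИВЫ [The Space of Scientific Interests: Foreign Languages and Intercultural Communication - Current Development Vectors and Prospects]

Issue 3. Buki Vedi, 2019.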

Granular computing-based deep learning for text classification

Behzadidoost R., Mahan F., Izadkhah H., Information Sciences 2024 Vol. 652 Article 119746

Granular computing involves a comprehensive process that encompasses theories, methodologies, and techniques to solve complex problems, rather than being just an algorithm. As the volume of generated data continues to grow rapidly, data-driven problems have become increasingly complex. Although deep learning models have outperformed traditional machine learning models in solving complex problems, there is still room for enhancing their performance. ...

Added: March 12, 2026

30th International Conference on Applications of Natural Language to Information Systems, NLDB 2025, Kanazawa, Japan, July 4–6, 2025, Proceedings, Part I. Natural Language Processing and Information Systems. (LNCS, volume 15836)

Springer, 2025.

The two-volume set LNCS 15836 and 15837 constitutes the proceedings of the 30th International Conference on Applications of Natural Language to Information Systems, NLDB 2025, held in Kanazawa, Japan, during July 4–6, 2025. The 33 full papers, 19 short papers and 2 demo papers presented in this volume were carefully reviewed and selected from 120 submissions. ...

Added: February 3, 2026

Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

INCOMA Ltd, 2021.

Added: January 28, 2026

Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Association for Computational Linguistics, 2025.

Added: November 17, 2025

LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers

Anton R., Mikhalchuk M., Rahmatullaev T. et al., in: Findings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics, 2025. P. 7757–7764.

We introduce methods to quantify how Large Language Models (LLMs) encode and store contextual information, revealing that tokens often seen as minor (e.g., determiners, punctuation) carry surprisingly high context. Notably, removing these tokens — especially stopwords, articles, and commas — consistently degrades performance on MMLU and BABILong-4k, even if removing only irrelevant tokens. Our analysis ...

Added: November 6, 2025

Well-Being Research Using Advanced Natural Language Processing (NLP) Methods: Prospects and Limitations

Voevodina E., Sovremennaya zarubezhnaya psikhologiya (Journal of Modern Foreign Psychology) 2025 Vol. 14 No. 3 P. 172–181

Context and relevance. Well-being research faces methodological limitations of conventional psychometric measures, criticized for poor ecological validity, limited information yield, and inadequate capture of multidimensional construct of well-being. Advanced natural language processing (NLP) technologies offer solutions to these constraints. Objective. To evaluate opportunities and challenges of transformer-based NLP for well-being research. Methods and materials. We conducted an analytical review of ...

Added: October 9, 2025

Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

Tartu: University of Tartu Library, 2025.

The third workshop on resources and representations for under-resourced languages and domains was held in Tallinn, Estonia, on March 2nd, 2025. The workshop was conducted in person but also provided an option for online participation. In alignment with the goals of the previous two workshops in 2020 and 2023, RESOURCEFUL-2025 explored the role of resource ...

Added: July 17, 2025

Automation of Forensic Authorship Attribution: Problems and Prospects

Romanova T. V., Khomenko A., Legal Issues in the Digital Age 2022 Vol. 3 No. 2 P. 90–115

The article deals with validation of an integrative attribution algorithm based on the analysis of the author’s idiostyle using methods of interpretative linguistics with objectification of the available data with the help of mathematical statistics. The algorithm addresses the identification problem of the attribution. The choice of parameters describing the individual style of ...

Added: March 12, 2025

HSE NLP Team at MEDIQA-CORR 2024 Task: In-Prompt Ensemble with Entities and Knowledge Graph for Medical Error Correction

Tutubalina E., Valiev A., Association for Computational Linguistics 2024 P. 470–482

This paper presents our LLM-based system designed for the MEDIQA-CORR @ NAACL-ClinicalNLP 2024 Shared Task 3, focusing on medical error detection and correction in medical records. Our approach consists of three key components: entity extraction, prompt engineering, and ensemble. First, we automatically extract biomedical entities such as therapies, diagnoses, and biological species. Next, we explore ...

Added: December 13, 2024

Data-driven approach to curriculum analysis

Iu. Nasu, M.S. Drobinin, M.S. Efanov et al., Proceedings of the Institute for System Programming of the RAS 2024 Vol. 36 No. 2 P. 83–90

The choice of an educational program is momentous in young people's lives. Given the shortage of time after exams, applicants usually do not have time to analyze possible educational tracks. Furthermore, it requires a thorough study of learning plans. This research addresses the problem proposing the algorithm to data-driven curriculum analysis based on natural language ...

Added: December 11, 2024

Bridging Gaps in Russian Language Processing: AI and Everyday Conversations

Tatiana Sherstinova, Nikolay Mikhaylovskiy, Evgenia Kolpashchikova et al., in: Proceedings of the 35th Conference of Open Innovations Association FRUCT, 24-26 April 2024, Tampere, Finland. Issue 1. FRUCT Oy, 2024. P. 253–258.

Contemporary advancements in NLP and neural network techniques are paving the way to enhance and harness traditional linguistic resources and corpora, as well as expand the methods of applying neural networks for complex language material. Thus, a weak point for both theoretical and applied linguistic tasks is the processing of spontaneous everyday speech. Two experiments ...

Added: November 29, 2024

Proceedings of the 3rd Workshop on NLP Applications to Field Linguistics (Field Matters 2024)

Bangkok: Association for Computational Linguistics, 2024.

Added: November 13, 2024

A Language Model for Grammatical Error Correction in L2 Russian

Remnev N., Obiedkov S., Rakhilina E. V. et al. Series Computer Science, arxiv.org, 2023.

Grammatical error correction is one of the fundamental tasks in Natural Language Processing. For the Russian language, most of the spellcheckers available correct typos and other simple errors with high accuracy, but often fail when faced with non-native (L2) writing, since the latter contains errors that are not typical for native speakers. In this paper, ...

Added: October 30, 2024

Language model interpretation as an exploration tool: on the way to understand better

Pozdnyakov D. V., 2025.

Model interpretation is very important when it comes to detecting hidden biases, ensuring model safety and trustworthiness. More and more interpretation methods are emerging. Focusing on the case of a black-box transformer-based NLP model, for each considered interpretation application we provide an overview of existing tools and methods. We conclude that two trends will be central ...

Added: September 30, 2024

Papilusion at DAGPap24: Paper or Illusion? Detecting AI-generated Scientific Papers

Andreev N., Shirnin A., Mikhailov V. et al., in: Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024). Association for Computational Linguistics, 2024. P. 215–219.

Added: September 24, 2024

Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)

Association for Computational Linguistics, 2024.

Welcome to the Fourth Workshop on Scholarly Document Processing (SDP) at ACL 2024. As the body of scholarly literature grows, automated methods in NLP, text mining, information retrieval, document understanding etc. are needed to address issues of information overload, disinformation, reproducibility, and more. Though progress has been made, there are significant unique challenges to processing ...

Added: September 24, 2024

Distractor Generation for Lexical Questions Using Learner Corpus Data

Nikita Login, Jazykovedny Casopis 2023 Vol. 74 No. 1 P. 345–356

Learner corpora with error annotation can serve as a source of data for automated question generation (QG) for language testing. In case of multiple choice gapfill lexical questions, this process involves two steps. The first step is to extract sentences with lexical corrections from the learner corpus. The second step, which is the focus of ...

Added: September 16, 2024

L1 Influence on the Use of the English Present Perfect: A Corpus Analysis of Russian and Spanish Learners’ Essays

Perez-Guerra J., Smirnova E. A., Journal of Language and Education 2024 Vol. 10 No. 1 P. 101–114

Mastering verbal tenses, especially those expressing aspect, in a second language presents a challenge as learners frequently link the semantic nuances of verbal forms in their second language (L2) to the characteristics of the verbal systems in their native languages (L1). This study explores the impact of L1 on the usage of the English Present ...

Added: March 3, 2024

Classification of Short Scientific Texts

I. K. Kusakin, Fedorets O. V., A. Y. Romanov, Scientific and Technical Information Processing 2023 Vol. 50 No. 3 P. 176–183

This paper discusses modern approaches to natural language processing and the application of machine learning models to the task of classifying short scientific texts in Russian. This study is devoted to the analysis of methods for vectorization of textual information, selection of a model for scientific paper classification, and training of linguistic model BERT ...

Added: November 4, 2023