Lemmatisation for under-resourced languages with sequence-to-sequence learning: A case of Early Irish

O. Dereza

doi:10.29007/cxtl

P. 113–124.

Lemmatisation, one of the most important stages of text preprocessing, consists in grouping the inflected forms of a word together so that they can be analysed as a single item. This task is often considered solved for most modern languages regardless of their morphological type, but the situation is dramatically different for ancient languages. A rich inflectional system and a high level of orthographic variation, both common to these languages, together with a lack of resources, make lemmatising historical data a challenging task. The task is becoming increasingly important as manuscripts are being extensively digitised, yet it remains poorly covered in the literature. In this work, I compare a rule-based and a neural-network-based approach to the lemmatisation of Early Irish data.

Keywords: artificial neural networks, NLP, automatic morphological analysis, Early Irish, under-resourced languages, lemmatisation, sequence-to-sequence models

In book

Proceedings of Third Workshop "Computational linguistics and language science"

Wohlgenannt G., von Waldenfels R., Toldova S., Rakhilina E. V., Lyashevskaya O., Loukachevitch N. V., Artemova E. Issue 4. Manchester: EasyChair, 2019.

Hebb-Inspired Low Rank Adapters for Large Language Models Fine-Tuning

Alexander Demidovskij, Artyom Tugaryov, Igor Salnikov et al., in: PRICAI 2025: Trends in Artificial Intelligence: 22nd Pacific Rim International Conference on Artificial Intelligence, PRICAI 2025, Wellington, New Zealand, November 17–21, 2025, Proceedings, Part III. Vol. 16453. Springer, 2026. P. 603–612.

Backpropagation is the predominant method for pre-training and fine-tuning large language models. At the same time, it is considerably demanding in terms of memory and hardware. It therefore makes fine-tuning and pre-training very expensive and harmful to the environment because of the large carbon footprint, and raises barriers to the development of ...

Added: April 21, 2026

PRICAI 2025: Trends in Artificial Intelligence: 22nd Pacific Rim International Conference on Artificial Intelligence, PRICAI 2025, Wellington, New Zealand, November 17–21, 2025, Proceedings, Part III

Springer, 2026.

These proceedings contain the papers presented at the 22nd Pacific Rim International Conference on Artificial Intelligence (PRICAI), held on November 17–21, 2025 in Wellington, New Zealand. PRICAI 2025 was co-hosted with the 40th International Conference on Image and Vision Computing New Zealand (IVCNZ 2025) and the annual conference of the New Zealand Artificial Intelligence Researchers ...

Added: April 21, 2026

Granular computing-based deep learning for text classification

Behzadidoost R., Mahan F., Izadkhah H., Information Sciences 2024 Vol. 652 Article 119746

Granular computing involves a comprehensive process that encompasses theories, methodologies, and techniques to solve complex problems, rather than being just an algorithm. As the volume of generated data continues to grow rapidly, data-driven problems have become increasingly complex. Although deep learning models have outperformed traditional machine learning models in solving complex problems, there is still room for enhancing their performance. ...

Added: March 12, 2026

Semi-automatic annotation of brain vessels in magnetic resonance angiography images

Bernadotte A., Elfimov N., Menshikov I., Scientific Data 2025 Vol. 13 No. 41

Accurate segmentation of brain vessels in magnetic resonance angiography (MRA) is essential for surgical procedures. Neural networks are powerful tools for medical image segmentation, but their development requires well-annotated datasets. However, publicly available MRA datasets with detailed vessel annotations are scarce. We present a dataset of 100 manually annotated brain MRA images from the IXI ...

Added: February 25, 2026

30th International Conference on Applications of Natural Language to Information Systems, NLDB 2025, Kanazawa, Japan, July 4–6, 2025, Proceedings, Part I. Natural Language Processing and Information Systems. (LNCS, volume 15836)

Springer, 2025.

The two-volume set LNCS 15836 and 15837 constitutes the proceedings of the 30th International Conference on Applications of Natural Language to Information Systems, NLDB 2025, held in Kanazawa, Japan, during July 4–6, 2025. The 33 full papers, 19 short papers and 2 demo papers presented in this volume were carefully reviewed and selected from 120 submissions. ...

Added: February 3, 2026

Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

INCOMA Ltd, 2021.

Added: January 28, 2026

Tests as assessment tools in universities: challenges and solutions

Antipkina I., Иванущенко А. В., Калабина И. А. et al., Мир психологии. Научно-методический журнал 2025 No. 4(123) P. 295–316

Low-quality test items pose significant risks of biased and inaccurate assessment in higher education. In this study, multi-disciplinary test banks were examined, first, using classical test theory and then using a Large Language Model (Grok). Our findings reveal a number of problems in university test items due to methodological shortcomings rather than content inaccuracies. Based ...

Added: January 22, 2026

Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Association for Computational Linguistics, 2025.

Added: November 17, 2025

LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers

Anton R., Mikhalchuk M., Rahmatullaev T. et al., in: Findings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics, 2025. P. 7757–7764.

We introduce methods to quantify how Large Language Models (LLMs) encode and store contextual information, revealing that tokens often seen as minor (e.g., determiners, punctuation) carry surprisingly high context. Notably, removing these tokens — especially stopwords, articles, and commas — consistently degrades performance on MMLU and BABILong-4k, even if removing only irrelevant tokens. Our analysis ...

Added: November 6, 2025

Well-being research using advanced natural language processing (NLP) methods: prospects and limitations

Voevodina E., Современная зарубежная психология 2025 Vol. 14 No. 3 P. 172–181

Context and relevance. Well-being research faces methodological limitations of conventional psychometric measures, criticized for poor ecological validity, limited information yield, and inadequate capture of multidimensional construct of well-being. Advanced natural language processing (NLP) technologies offer solutions to these constraints. Objective. To evaluate opportunities and challenges of transformer-based NLP for well-being research. Methods and materials. We conducted an analytical review of ...

Added: October 9, 2025

Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

Tartu: University of Tartu Library, 2025.

The third workshop on resources and representations for under-resourced languages and domains was held in Tallinn, Estonia, on March 2nd, 2025. The workshop was conducted in person but also provided an option for online participation. In alignment with the goals of the previous two workshops in 2020 and 2023, RESOURCEFUL-2025 explored the role of resource ...

Added: July 17, 2025

Deriving requirements for the technological parameters of serial production using a neural network approach

Yasnitsky L., Голдобин М. А., Прикладная информатика 2025 Vol. 20 No. 3(117) P. 85–100

Currently, artificial intelligence methods are widely used in the practice of serial production enterprises. They are used to detect defects, classify and eliminate them, identify the causes of defects, predict the quality and properties of the resulting product, select optimal parameters of the production process, and identify and study its patterns. However, outside the field ...

Added: July 10, 2025

Economic and social aspects of nuclear energy amid the development of artificial intelligence technologies

Podchufarov A., Galkina A. N., Ванина С. С. et al., Экономика и управление: проблемы, решения 2025 Vol. 5 No. 4 P. 61–74

Under modern conditions, the introduction of artificial intelligence technologies is becoming a significant factor in the development of high-tech industries. The article presents the results of a study of the prospects for the use of intelligent analytical systems in nuclear energy. The experience of foreign countries is analyzed and the features of successful projects using ...

Added: June 5, 2025

Where Do Large Learning Rates Lead Us?

Sadrtdinov I., Kodryan M., Pokonechny E. et al., in: 38th Conference on Neural Information Processing Systems (NeurIPS 2024). [s.n.], 2024. P. 58445–58479.

Added: February 19, 2025

Big Data Analytics Approach with Multiple Text Types: The Case of the Computer Gaming

Aleksandr Belov, Zakharov F., Litvinenko E. et al., in: International IoT, Electronics and Mechatronics Conference, Volume 2. Proceedings of IEMTRONICS 2024. LNEE, Vol. 1228. Springer Publishing Company, 2025. P. 275–287.

Added: January 26, 2025

Afanasev I., Lyashevskaya O., in: Structuring Lexical Data and Digitising Dictionaries: Grammatical Theory, Language Processing and Databases in Historical Linguistics. Boston, Leiden: Brill, 2024. P. 13–35.

Added: January 7, 2025

HSE NLP Team at MEDIQA-CORR 2024 Task: In-Prompt Ensemble with Entities and Knowledge Graph for Medical Error Correction

Tutubalina E., Valiev A., Association for Computational Linguistics 2024 P. 470–482

This paper presents our LLM-based system designed for the MEDIQA-CORR @ NAACL-ClinicalNLP 2024 Shared Task 3, focusing on medical error detection and correction in medical records. Our approach consists of three key components: entity extraction, prompt engineering, and ensemble. First, we automatically extract biomedical entities such as therapies, diagnoses, and biological species. Next, we explore ...

Added: December 13, 2024

Data-driven approach to curriculum analysis

Iu. Nasu, M.S. Drobinin, M.S. Efanov et al., Proceedings of the Institute for System Programming of the RAS 2024 Vol. 36 No. 2 P. 83–90

The choice of an educational program is a momentous one in young people's lives. Given the shortage of time after exams, applicants usually do not have time to analyze possible educational tracks, which also requires a thorough study of learning plans. This research addresses the problem by proposing an algorithm for data-driven curriculum analysis based on natural language ...

Added: December 11, 2024