TAPE: Assessing Few-shot Russian Language Understanding

Taktasheva E., Shavrina T., Fenogenova A., Shevelev D., Katricheva N., Tikhonova M., Akhmetgareeva A., Zinkevich O., Bashmakova A., Iordanskaia S., Spiridonova A., Kurenshchikova V., Artemova E., Mikhailov V.

doi:10.18653/v1/2022.findings-emnlp.183

P. 2472–2497.

Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this gap, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic, and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistically oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. A detailed analysis of the autoregressive baselines indicates that simple spelling-based perturbations affect performance the most, while paraphrasing the input has a smaller effect. At the same time, the results demonstrate a significant gap between the neural and human baselines for most tasks. We publicly release TAPE (https://tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.

Keywords: NLP, language modeling, benchmark

In book

Findings of the Association for Computational Linguistics: EMNLP 2022

Association for Computational Linguistics, 2022.

RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark

Shavrina T., Fenogenova A., Emelyanov A. et al., in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2020. P. 4717–4726.

In this paper, we introduce an advanced Russian general language understanding evaluation benchmark – RussianSuperGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and testing for general intellectual skills - detection of natural language inference, commonsense reasoning, ability to perform simple logical ...

Added: June 14, 2026

Granular computing-based deep learning for text classification

Behzadidoost R., Mahan F., Izadkhah H., Information Sciences 2024 Vol. 652 Article 119746

Granular computing involves a comprehensive process that encompasses theories, methodologies, and techniques to solve complex problems, rather than being just an algorithm. As the volume of generated data continues to grow rapidly, data-driven problems have become increasingly complex. Although deep learning models have outperformed traditional machine learning models in solving complex problems, there is still room for enhancing their performance. ...

Added: March 12, 2026

HoTPP benchmark: Are we good at the long horizon events forecasting?

Karpukhin I., Shipilov F., Savchenko A., Neurocomputing 2026 Vol. 672 Article 132771

Forecasting multiple future events within a given time horizon is essential for applications in finance, retail, social networks, and healthcare. This problem is typically addressed using Marked Temporal Point Processes (MTPP), which provide a principled framework for modeling both event timing and event labels. While most existing research focuses on predicting only the next event, forecasting distant future ...

Added: February 25, 2026

30th International Conference on Applications of Natural Language to Information Systems, NLDB 2025, Kanazawa, Japan, July 4–6, 2025, Proceedings, Part I. Natural Language Processing and Information Systems. (LNCS, volume 15836)

Springer, 2025.

The two-volume set LNCS 15836 and 15837 constitutes the proceedings of the 30th International Conference on Applications of Natural Language to Information Systems, NLDB 2025, held in Kanazawa, Japan, during July 4–6, 2025. The 33 full papers, 19 short papers and 2 demo papers presented in this volume were carefully reviewed and selected from 120 submissions. ...

Added: February 3, 2026

Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

INCOMA Ltd, 2021.

Added: January 28, 2026

ComputAgeBench: Epigenetic Aging Clocks Benchmark

Kriukov D., Efimov E., Kuzmina E. et al., in: KDD '25: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Volume 2. Association for Computing Machinery (ACM), 2025. P. 5560–5570.

The success of clinical trials of longevity drugs relies heavily on identifying integrative health and aging biomarkers, such as biological age. Epigenetic aging clocks predict the biological age of individuals using their DNA methylation profiles, commonly retrieved from blood samples. However, there is no standardized methodology to validate and compare epigenetic clock models. We propose ComputAgeBench, ...

Added: January 12, 2026

Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Association for Computational Linguistics, 2025.

Added: November 17, 2025

LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers

Anton R., Mikhalchuk M., Rahmatullaev T. et al., in: Findings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics, 2025. P. 7757–7764.

We introduce methods to quantify how Large Language Models (LLMs) encode and store contextual information, revealing that tokens often seen as minor (e.g., determiners, punctuation) carry surprisingly high context. Notably, removing these tokens — especially stopwords, articles, and commas — consistently degrades performance on MMLU and BABILong-4k, even if removing only irrelevant tokens. Our analysis ...

Added: November 6, 2025

Well-being research using advanced natural language processing (NLP) methods: prospects and limitations

Voevodina E., Современная зарубежная психология 2025 Vol. 14 No. 3 P. 172–181

Context and relevance. Well-being research faces methodological limitations of conventional psychometric measures, criticized for poor ecological validity, limited information yield, and inadequate capture of multidimensional construct of well-being. Advanced natural language processing (NLP) technologies offer solutions to these constraints. Objective. To evaluate opportunities and challenges of transformer-based NLP for well-being research. Methods and materials. We conducted an analytical review of ...

Added: October 9, 2025

Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

Tartu: University of Tartu Library, 2025.

The third workshop on resources and representations for under-resourced languages and domains was held in Tallinn, Estonia, on March 2nd, 2025. The workshop was conducted in person but also provided an option for online participation. In alignment with the goals of the previous two workshops in 2020 and 2023, RESOURCEFUL-2025 explored the role of resource ...

Added: July 17, 2025

Evaluating the Pragmatic Competence of Large Language Models in Detecting Mitigated and Unmitigated Types of Disagreement

Shulginov V., Hasan Berkcan Şimşek, Sergei Kudriashov et al., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (2025). Issue 23. [s.n.], 2025. P. 345–360.

This study presents a framework for evaluating the effectiveness of large language models (LLMs) in detecting disagreement across a wide range of pragmatic strategies, from mitigated forms to overt verbal aggression. Special attention is given to complex cases of implicit manifestations of irony and sarcasm, which pose significant challenges for both automated analysis and interpersonal communication. ...

Added: April 30, 2025

Bi-objective Workflow Scheduling in the Cloud: What is the Real State-of-the-Art?

Yury Semenov, Oleg Sukhoroslov, in: Supercomputing. 10th Russian Supercomputing Days, RuSCDays 2024, Moscow, Russia, September 23–24, 2024, Revised Selected Papers, Part II. Springer, 2025. P. 20–31.

Workflow scheduling in the cloud is a challenging multi-objective optimization problem where an efficient scheduling algorithm is required to optimize both performance and cost. Despite the huge body of work on designing workflow scheduling algorithms, the differences in the experiment settings, VM instances, sets of baseline algorithms, and the choice of reference point for hypervolume ...

Added: April 25, 2025

OmniDialog: A Multimodal Benchmark for Generalization Across Text, Visual, and Audio Modalities

Razzhigaev A., Kurkin M., Goncharova E. et al., in: Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP. Association for Computational Linguistics, 2024. P. 183–195.

We introduce OmniDialog — the first trimodal comprehensive benchmark grounded in a knowledge graph (Wikidata) to evaluate the generalization of Large Multimodal Models (LMMs) across three modalities. Our benchmark consists of more than 4,000 dialogues, each averaging 10 turns, all annotated and cross-validated by human experts. The dialogues in our dataset are designed to prevent ...

Added: February 21, 2025

MERA: A Comprehensive LLM Evaluation in Russian

Fenogenova A., Chervyakov A., Martynov N. et al., in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok: Association for Computational Linguistics, 2024. P. 9920–9948.

Over the past few years, one of the most notable advancements in AI research has been in foundation models (FMs), headlined by the rise of language models (LMs). However, despite researchers’ attention and the rapid growth in LM application, the capabilities, limitations, and associated risks still need to be better understood. To address these issues, ...

Added: February 17, 2025

RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs

Taktasheva E., Bazhukov M., Koncha K. et al., in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2024. P. 9268–9299.

Minimal pairs are a well-established approach to evaluating the grammatical knowledge of language models. However, existing resources for minimal pairs address a limited number of languages and lack diversity of language-specific grammatical phenomena. This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP), which includes 45k pairs of sentences that differ in grammaticality and ...

Added: January 2, 2025