LM-Polygraph: Uncertainty Estimation for Language Models

Fadeeva E.; Vashurin R.; Tsvigun A.; Vazhentsev A.; Petrakov S.; Fedyanin K.; Vasilev D.; Goncharova E.; Panchenko A.; Panov M.; Baldwin T.; Shelmanov A.

doi:10.18653/v1/2023.emnlp-demo.41

P. 446–461.

Recent advancements in the capabilities of large language models (LLMs) have paved the way for a myriad of groundbreaking applications in various fields. However, a significant challenge arises as these models often “hallucinate”, i.e., fabricate facts without providing users an apparent means to discern the veracity of their statements. Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of LLMs. However, to date, research on UE methods for LLMs has been focused primarily on theoretical rather than engineering contributions. In this work, we tackle this issue by introducing LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python. Additionally, it introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses. LM-Polygraph is compatible with the most recent LLMs, including BLOOMz, LLaMA-2, ChatGPT, and GPT-4, and is designed to support future releases of similarly-styled LMs.
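To make the idea of an uncertainty score concrete, here is a minimal sketch of two simple baselines commonly used in uncertainty estimation for text generation: the negative log-likelihood of the generated sequence and the mean entropy of the per-step token distributions. This is not LM-Polygraph's interface. The function names and the toy probability values are hypothetical and chosen only for illustration.

```python
import math


def sequence_uncertainty(token_logprobs):
    """Negative log-likelihood of the generated sequence (higher = less certain)."""
    return -sum(token_logprobs)


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) of the per-step next-token distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


# Toy per-step distributions over a 3-token vocabulary (hypothetical numbers).
confident = [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]]
uncertain = [[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]]

# Log-probability of the token chosen at each step (index 0 in this toy setup).
confident_lp = [math.log(dist[0]) for dist in confident]
uncertain_lp = [math.log(dist[0]) for dist in uncertain]

print(sequence_uncertainty(confident_lp), sequence_uncertainty(uncertain_lp))
print(mean_token_entropy(confident), mean_token_entropy(uncertain))
```

Under both scores the flatter distributions come out as more uncertain, which is the kind of signal a chat interface could show next to a response as a confidence indicator.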

Language: English

Keywords: Uncertainty Estimation, LLM

In book

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Singapore: Association for Computational Linguistics, 2023.

Optimizing Computational Infrastructure for Large Language Models in Bioinformatics: A Case Study

Beknazarov N., in: Parallel Computational Technologies, 19th International Conference, PCT 2025, Moscow, Russia, April 8–10, 2025, Revised Selected Papers (CCIS, Vol. 2891). Springer, 2026. P. 3–16.

This paper addresses the challenge of efficiently training Large Language Models (LLMs) on large-scale, sparse omics datasets in high-performance computing (HPC) environments. Using over 1000 BED tracks as a representative data source, we propose a method combining interval-based chunked storage, sparse matrix transformation, and parallel data loading, integrated within a PyTorch Lightning training framework. Our ...

Added: May 19, 2026

When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs

Seleznyov M., Chaichuk M., Ershov G. et al., in: Findings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Linguistics, 2025. P. 20370–20385.

Large Language Models (LLMs) are highly sensitive to subtle, non-semantic variations in prompt phrasing and formatting. In this work, we present the first systematic evaluation of 4 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from Llama, Qwen and Gemma families across 52 tasks from Natural ...

Added: February 3, 2026

Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework

Ganeeva V., Khrabrov K., Kadurin A. et al., Journal of Cheminformatics 2025 No. 17 Article 164

The recent integration of natural language processing into chemistry has advanced drug discovery. Molecule representations in language models (LMs) are crucial to enhance chemical understanding. We explored the ability of models to match the same chemical structures despite their different representations. Recognizing the same substance in different representations is an important component of emulating the ...

Added: February 3, 2026

Aspect-Based Sentiment Analysis Using Large Language Models on Museum Visitor Reviews

Anastasia V. Kolmogorova, Elizaveta R. Kulikova, Vladislav V. Lobanov, Supercomputing Frontiers and Innovations 2025 Vol. 12 No. 3 P. 121–140

Museum reviews provide rich insight into visitor preferences and can drive useful change within institutions, yet they have attracted little attention in sentiment research owing to limited commercial interest and the multi-thematic nature of reviews. In this study we analysed over 12 000 reviews in Russian for 15 museum sites collected from nine different platforms. ...

Added: November 30, 2025

AutoJudge: Judge Decoding Without Manual Annotation

Roman Garipov, Fedor Velikonivtsev, Ivan Ermakov et al., in: 39th Conference on Neural Information Processing Systems (NeurIPS 2025). NeurIPS, 2025. P. 94605–94642.

We introduce AutoJudge, a method that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify the generated tokens that affect the downstream quality of the response, relaxing the distribution match guarantee so that the "unimportant" tokens can be generated faster. Our approach relies ...

Added: November 6, 2025

Strategizing with AI: Insights from a Beauty Contest Experiment

Iuliia Alekseenko, Dagaev D., Sofiia Paklina et al., Journal of Economic Behavior and Organization 2025 Vol. 240 Article 107330

Added: November 6, 2025

LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers

Anton R., Mikhalchuk M., Rahmatullaev T. et al., in: Findings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics, 2025. P. 7757–7764.

We introduce methods to quantify how Large Language Models (LLMs) encode and store contextual information, revealing that tokens often seen as minor (e.g., determiners, punctuation) carry surprisingly high context. Notably, removing these tokens — especially stopwords, articles, and commas — consistently degrades performance on MMLU and BABILong-4k, even if removing only irrelevant tokens. Our analysis ...

Added: November 6, 2025

Well-Being Research Using Advanced Natural Language Processing (NLP) Methods: Prospects and Limitations

Voevodina E., Современная зарубежная психология 2025 Vol. 14 No. 3 P. 172–181

Context and relevance. Well-being research faces methodological limitations of conventional psychometric measures, criticized for poor ecological validity, limited information yield, and inadequate capture of multidimensional construct of well-being. Advanced natural language processing (NLP) technologies offer solutions to these constraints. Objective. To evaluate opportunities and challenges of transformer-based NLP for well-being research. Methods and materials. We conducted an analytical review of ...

Added: October 9, 2025

Evaluating LLMs by Their Readiness to Solve ESG Management Tasks

Storchevoy M., Mylnikov L., Чернышев В. В. et al. / SSRN. Series "Working Papers". 2025.

Attention to environmental protection is becoming increasingly important for business, on the one hand because environmental legislation is tightening, and on the other because ESG ratings are used in decisions about companies' commercial activities. Compiling a ranking of LLM systems capable of providing advisory services in the field of environmental protection and ESG makes it possible to choose such a system for ...

Added: September 18, 2025

The Digital Theater of the Absurd: Can Neural Networks Pose a New Scientific Problem for Psychology? A Case Comparison of ChatGPT and DeepSeek

Хашутогова У. П., Berezner T., Poddiakov A., Новые психологические исследования 2025 No. 3 P. 100–125

The rapid advancement of artificial intelligence technologies has drawn increasing attention from psychological researchers. While neural networks are being integrated into nearly all domains of human activity, the boundaries of their applicability remain unclear — particularly regarding the originality and practical value of the content they generate. Proponents advocate for their widespread adoption, whereas skeptics ...

Added: September 4, 2025

Interpreting Metaphorical Language: A Challenge to Artificial Intelligence

Skrynnikova I.V., Вестник Волгоградского государственного университета. Серия 2: Языкознание 2025 Vol. 23 No. 5 P. 99–107

In recent years, numerous studies have pointed to the ability of artificial intelligence (AI) to generate and analyze expressions of natural language. However, the question of whether AI is capable of actually interpreting human language, rather than imitating its understanding, remains open. Metaphors, being an integral part of human language, as both a common figure ...

Added: August 1, 2025

Comparative Study of LoRA and Full Fine-Tuning in Large Language Models

E.V. Surikova, E.A. Sabidaeva, in: Parallel Computational Technologies: XIX All-Russian Conference with International Participation, PCT'2025, Moscow, April 8–10, 2025. Short Papers and Poster Descriptions. Chelyabinsk: Издательский центр ЮУрГУ, 2025. P. 90–98.

Added: July 3, 2025

HR-Tech Automation: A Case Study of Resume Design using GenAI Technologies

Suleykin, A., Babenko, R., Panfilov, P., in: Proceedings of the 35th International DAAAM Virtual Symposium "Intelligent Manufacturing & Automation", Vol. 1. NY: DAAAM International Vienna, 2024. Ch. 20. P. 0157–0164.

Added: April 5, 2025

OmniDialog: A Multimodal Benchmark for Generalization Across Text, Visual, and Audio Modalities

Razzhigaev A., Kurkin M., Goncharova E. et al., in: Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP. Association for Computational Linguistics, 2024. P. 183–195.

We introduce OmniDialog — the first trimodal comprehensive benchmark grounded in a knowledge graph (Wikidata) to evaluate the generalization of Large Multimodal Models (LMMs) across three modalities. Our benchmark consists of more than 4,000 dialogues, each averaging 10 turns, all annotated and cross-validated by human experts. The dialogues in our dataset are designed to prevent ...

Added: February 21, 2025

MERA: A Comprehensive LLM Evaluation in Russian

Fenogenova A., Chervyakov A., Martynov N. et al., in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok: Association for Computational Linguistics, 2024. P. 9920–9948.

Over the past few years, one of the most notable advancements in AI research has been in foundation models (FMs), headlined by the rise of language models (LMs). However, despite researchers’ attention and the rapid growth in LM application, the capabilities, limitations, and associated risks still need to be better understood. To address these issues, ...

Added: February 17, 2025

Your Transformer is Secretly Linear

Razzhigaev A., Mikhalchuk M., Goncharova E. et al., in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok: Association for Computational Linguistics, 2024. P. 5376–5384.

This paper reveals a novel linear characteristic exclusive to transformer decoders, including models like GPT, LLaMA, OPT, BLOOM and others. We analyze embedding transformations between sequential layers, uncovering an almost perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed, due to a consistently low transformer layer output ...

Added: February 17, 2025

The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models

Razzhigaev A., Mikhalchuk M., Goncharova E. et al., in: Findings of the Association for Computational Linguistics: EACL 2024. Association for Computational Linguistics, 2024. P. 868–874.

Added: February 17, 2025

ChatGPT, Text, Information: A Critical Analysis

Komashko M. N., Труды по интеллектуальной собственности 2024 Vol. 50 No. 3 P. 118–128

The paper deals with theory and practice issues related to such type of artificial intelligence as large language models, in particular, ChatGPT. The main attention is paid to spheres of human activity, in which the exchange of information stated in the form of text is of the greatest importance: science, education and journalism (media sphere). The ...

Added: December 29, 2024

Automated Speech Act Annotation in a Russian Spoken Corpus Using Large Language Models: A Comparative Study

Sherstinova T., Viktoria Firsanova, in: Proceeding of the 36th Conference of FRUCT Association. [s.n.], 2024. P. 912–920.

The research focuses on the automatic annotation of a linguistic corpus using large language models (LLMs). Annotating a corpus is a crucial step in its creation, as it determines the practical scope and applications of the resource being developed. This study explores the annotation of oral speech transcripts at the pragmatic level using speech acts ...

Added: November 29, 2024

A Novel Psychometrics-Based Approach to Developing Professional Competency Benchmark for Large Language Models

Kardanova E., Ivanova A., Tarasova K. et al. / Series cs.CL "Computation and Language (cs.CL); Artificial Intelligence (cs.AI)". 2024.

The era of large language models (LLM) raises questions not only about how to train models, but also about how to evaluate them. Despite numerous existing benchmarks, insufficient attention is often given to creating assessments that test LLMs in a valid and reliable manner. To address this challenge, we accommodate the Evidence-centered design (ECD) methodology ...

Added: November 5, 2024

Automatic generation of physics items with Large Language Models (LLMs)

Moses Oluoke Omopekunola, Elena Yu. Kardanova, REID (Research and Evaluation in Education) 2024 Vol. 10 No. 2 P. 168–185

High-quality items are essential for producing reliable and valid assessments, offering valuable insights for decision-making processes. As the demand for items with strong psychometric properties increases for both summative and formative assessments, automatic item generation (AIG) has gained prominence. Research highlights the potential of large language models (LLMs) in the AIG process, noting the positive ...

Added: October 14, 2024

GPT3RecBot: a universal chatbot recommender of movies, books and music in Telegram

Lashinin O., Bykov K., Ananyeva M. et al., in: Proceedings of the Fifth Knowledge-aware and Conversational Recommender Systems Workshop co-located with 17th ACM Conference on Recommender Systems (RecSys 2023), Vol. 3560. CEUR Workshop Proceedings, 2023. P. 35–43.

Recent advances in large language models have extended their potential use cases to different domains. Models such as ChatGPT have an extensive internal knowledge base that enables them to provide answers to various domain-specific queries. In this paper, we explore the potential use of OpenAI’s GPT3.5 model as a conversational recommender system. We designed a ...

Added: December 2, 2023