National Research University Higher School of Economics

 


Primal-Dual Stochastic Mirror Descent for MDPs

P. 9723–9740.
Tiapkin D., Gasnikov A.

We consider the problem of learning the optimal policy for infinite-horizon Markov decision processes (MDPs). For this purpose, we propose a variant of Stochastic Mirror Descent for convex programming problems with Lipschitz-continuous functionals. An important feature is that the method can work with inexact values of the functional constraints and can compute the values of the dual variables. We analyze this algorithm in the general case and obtain a convergence-rate estimate in which errors do not accumulate over the course of the method. Using this algorithm, we obtain the first parallel algorithm for mixing average-reward MDPs with a generative model that does not rely on a reduction to discounted MDPs. One of the main features of the presented method is its low communication cost in a distributed centralized setting, even for very large networks.
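The abstract does not give the update rules, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It applies a generic primal-dual stochastic mirror descent to the standard linear-programming saddle-point formulation of an average-reward MDP. The occupancy measure `mu` (a distribution over state-action pairs) is updated by entropic mirror ascent, and the bias vector `v` is updated by projected stochastic gradient descent on the box [-M, M]^S. All gradients are estimated from generative-model samples. The function name `primal_dual_smd`, the box bound `M`, the 1/sqrt(T) step sizes and the toy MDP are illustrative assumptions. The sketch omits the inexact functional constraints, the dual-variable recovery and the distributed setting discussed in the abstract.

```python
import numpy as np


def primal_dual_smd(P, r, T, M=1.0, seed=0):
    """Hypothetical sketch: stochastic mirror descent-ascent for
        min_{v in [-M, M]^S} max_{mu in simplex(S*A)}
            sum_{s,a} mu[s,a] * (r[s,a] + P[s,a] @ v - v[s]).
    P has shape (S, A, S) (transition probabilities), r has shape (S, A).
    Returns the averaged occupancy measure, the induced policy and the final v.
    """
    S, A = r.shape
    rng = np.random.default_rng(seed)
    cdf = np.cumsum(P, axis=2)  # used for inverse-CDF sampling of next states

    # Step sizes of order 1/sqrt(T) (assumed, from the generic SMD analysis).
    g_mu = 1.0 + 2.0 * M  # bound on each entry of the mu-gradient estimate
    eta_mu = np.sqrt(np.log(S * A) / T) / g_mu
    eta_v = 2.0 * M * np.sqrt(S) / np.sqrt(2.0 * T)

    log_w = np.zeros((S, A))  # mu is proportional to exp(log_w)
    v = np.zeros(S)
    mu_sum = np.zeros((S, A))
    for _ in range(T):
        mu = np.exp(log_w - log_w.max())
        mu /= mu.sum()
        mu_sum += mu

        # mu-gradient r + P v - v(s): one sample s' ~ P(.|s,a) per pair (s,a).
        u = rng.random((S, A, 1))
        s_next = np.minimum((u > cdf).sum(axis=2), S - 1)
        grad_mu = r + v[s_next] - v[:, None]

        # v-gradient: with (s,a) ~ mu and s' ~ P(.|s,a), e_{s'} - e_s is unbiased.
        s, a = divmod(int(rng.choice(S * A, p=mu.ravel())), A)
        s2 = rng.choice(S, p=P[s, a])
        grad_v = np.zeros(S)
        grad_v[s2] += 1.0
        grad_v[s] -= 1.0

        log_w += eta_mu * grad_mu  # entropic mirror ascent step on mu
        v = np.clip(v - eta_v * grad_v, -M, M)  # projected descent step on v

    mu_bar = mu_sum / T
    policy = mu_bar / mu_bar.sum(axis=1, keepdims=True)
    return mu_bar, policy, v


if __name__ == "__main__":
    # Toy MDP (assumed): both actions jump to a uniformly random state,
    # action 0 always pays 1, action 1 pays 0, so the optimal gain is 1.
    P = np.full((2, 2, 2), 0.5)
    r = np.array([[1.0, 0.0], [1.0, 0.0]])
    mu_bar, policy, v = primal_dual_smd(P, r, T=10_000)
    print("policy:\n", policy)
    print("estimated average reward:", (mu_bar * r).sum())
```

On this toy MDP the averaged occupancy measure should put most of its mass on action 0. The policy is read off the averaged iterate `mu_bar` rather than the last one, because convergence guarantees for mirror descent on saddle-point problems are usually stated for averaged iterates.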

Language: English
Keywords: reinforcement learning, stochastic optimization

In book

International Conference on Artificial Intelligence and Statistics, 28–30 March 2022, virtual conference
Vol. 151: Proceedings of The 25th International Conference on Artificial Intelligence and Statistics. PMLR, 2022.
Similar publications
Development of an ADP Microservice for Identifying Emission Sources Based on Reinforcement Learning
Kychkin A., Chernitsin I., Прикладная информатика [Applied Informatics] 2026 No. 1(121) P. 40–58
The results of the development of a software microservice embedded in atmospheric air quality monitoring systems to support the identification of industrial pollution sources are presented. The emission and subsequent spread of harmful substances in the lower layers of the atmosphere is dynamic and characterized by high uncertainty due to the specific features of technological ...
Added: April 23, 2026
Artificial Neural Networks and Machine Learning. ICANN 2025 International Workshops and Special Sessions: 34th International Conference on Artificial Neural Networks, Kaunas, Lithuania, September 9–12, 2025, Proceedings, Part V
Cham: Springer, 2025.
This book constitutes the refereed proceedings of the 34th International Workshops held in conjunction with the 34th International Conference on Artificial Neural Networks and Machine Learning, ICANN 2025, held in Kaunas, Lithuania, September 9–12, 2025. The 20 full papers and 8 abstracts included in this workshop volume were carefully reviewed and selected from 42 submissions. ...
Added: September 29, 2025
Analysis of a Company Model in Conditions of Unstable Demand Using Reinforcement Learning Methods
Delev A., Semakov S., in: 2025 8th International Conference on Artificial Intelligence and Big Data (ICAIBD). IEEE, 2025. P. 318–322.
Profit is one of the most important economic indicators of a company’s performance, and for every company it is necessary to allocate resources in such a way as to obtain the maximum possible profit. The profit maximization problem is usually a dynamic optimization problem. This article discusses an approach to solving the production expansion problem ...
Added: August 25, 2025
Pseudo-collusion in a centralized algorithmic financial market
Pastushkov A., Boulatov A., Finance Research Letters 2025 Vol. 83 Article 107671
Recent studies have increasingly explored whether reinforcement learning algorithms can give rise to cooperative behavior that results in non-competitive pricing across various market settings. In financial markets, Cartea et al. (2022) show that market makers using multi-armed bandit (MAB) algorithms generally converge to competitive pricing in quote-driven over-the-counter (OTC) markets, barring some unlikely exceptions where ...
Added: June 19, 2025
The beer game bullwhip effect mitigation: a deep reinforcement learning approach
Rozhkov M., Alyamovskaya N., Zakhodiakin G., International Journal of Production Research 2025 Vol. 63 No. 18 P. 6630–6647
This article investigates the application of reinforcement learning (RL) methods to optimise a four-echelon linear supply chain model with stochastic demand. The proposed supply chain configuration is largely based on the production-distribution supply chain of the MIT Supply Chain Beer Game. We show that RL can significantly improve ordering efficiency and overall supply chain performance. ...
Added: March 24, 2025
Gradient-free methods for non-smooth convex stochastic optimization with heavy-tailed noise on convex compact
Kornilov N., Gasnikov A., Dvurechensky P. et al., Computational Management Science 2023 Article 37
We present two easy-to-implement gradient-free/zeroth-order methods to optimize a stochastic non-smooth function accessible only via a black-box. The methods are built upon efficient first-order methods in the heavy-tailed case, i.e., when the gradient noise has infinite variance but a bounded (1 + κ)-th moment for some κ ∈ (0, 1]. The first algorithm is based ...
Added: February 7, 2025
Deep Reinforcement Learning-Based Congestion Control for File Transfer over QUIC
Blokhin A., Kalev V., Pusev R. et al., in: 2024 IEEE International Multi-Conference on Engineering, Computer and Information Sciences (SIBIRCON). Novosibirsk: IEEE, 2024. P. 25–30.
Congestion control is one of the key communication mechanisms of the QUIC protocol: it controls how much data, and at what rate, can be sent to an endpoint at a particular moment in time, so that shared network resources are used better and congestive collapse is avoided. In this work we tackle the problem of ...
Added: December 18, 2024
Vaidya’s method for convex stochastic optimization problems in small dimension
Gladin E., Gasnikov A., Ermakova E., Mathematical Notes 2022 Vol. 112 No. 1 P. 183–190
The paper deals with a general problem of convex stochastic optimization in a space of small dimension (for example, 100 variables). It is known that for deterministic convex optimization problems in small dimensions, center-of-gravity-type methods (for example, Vaidya's method) provide the best convergence. For stochastic optimization problems, the question ...
Added: November 29, 2024
The Ellipsoid Method for Low-Dimensional Convex Stochastic Optimization Problems
Gladin E., Зайнуллина К. Э., Компьютерные исследования и моделирование [Computer Research and Modeling] 2021 Vol. 13 No. 6 P. 1137–1147
The article considers minimization of the expectation of a convex function. Problems of this type often arise in machine learning and a variety of other applications. In practice, stochastic gradient descent (SGD) and similar procedures are usually used to solve such problems. We propose to use the ellipsoid method with mini-batching, which converges linearly and can ...
Added: November 29, 2024
Generative Flow Networks as Entropy-Regularized RL
Tiapkin D., Morozov N., Naumov A. et al., in: Proceedings of The 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024), 2–4 May 2024, Palau de Congressos, Valencia, Spain. PMLR, Vol. 238. Valencia: PMLR, 2024. P. 4213–4221.
The recently proposed generative flow networks (GFlowNets) are a method of training a policy to sample compositional discrete objects with probabilities proportional to a given reward via a sequence of actions. GFlowNets exploit the sequential nature of the problem, drawing parallels with reinforcement learning (RL). Our work extends the connection between RL and GFlowNets to ...
Added: June 22, 2024
Gradient-free Federated Learning Methods with l1 and l2-randomization for Non-smooth Convex Stochastic Optimization Problems
Alashqar B., Gasnikov A., Dvinskikh D. et al., Computational Mathematics and Mathematical Physics 2023 Vol. 63 P. 1600–1653
This paper studies non-smooth problems of convex stochastic optimization. Using a smoothing technique that replaces the function value at a given point with its average over a ball (in the l1-norm or l2-norm) of small radius centered at that point, the original problem is reduced to a smooth problem (whose ...
Added: March 27, 2024
Accelerated zeroth-order method for non-smooth stochastic convex optimization problem with infinite variance
Kornilov N., Shamir O., Lobanov A. et al., in: Advances in Neural Information Processing Systems 36 (NeurIPS 2023). Curran Associates, Inc., 2023. P. 64083–64102.
Added: March 26, 2024
Model-free Posterior Sampling via Learning Rate Randomization
Tiapkin D., Belomestny D., Calandriello D. et al., in: Advances in Neural Information Processing Systems 36 (NeurIPS 2023). Curran Associates, Inc., 2023. P. 73719–73774.
Added: February 17, 2024
Reinforcement Procedure for Randomized Machine Learning
Popkov Y. S., Dubnov Y. A., Popkov A. Y., Mathematics 2023 Vol. 11 No. 17 Article 3651
This paper is devoted to problem-oriented reinforcement methods for the numerical implementation of Randomized Machine Learning. We have developed a scheme of the reinforcement procedure based on the agent approach and Bellman’s optimality principle. This procedure ensures strictly monotonic properties of a sequence of local records in the iterative computational procedure of the learning process. ...
Added: February 5, 2024
Orthogonal Directions Constrained Gradient Method: from non-linear equality constraints to Stiefel manifold
Schechtman S., Tiapkin D., Muehlebach M. et al., in: Proceedings of Machine Learning Research, Vol. 195: The Thirty Sixth Annual Conference on Learning Theory, 12–15 July 2023, Bangalore, India. PMLR, 2023. P. 1228–1258.
We consider the problem of minimizing a non-convex function over a smooth manifold M. We propose a novel algorithm, the Orthogonal Directions Constrained Gradient Method (ODCGM), which only requires computing a projection onto a vector space. ODCGM is infeasible but the iterates are constantly pulled towards the manifold, ensuring the convergence of ODCGM towards M. ...
Added: December 1, 2023
Fast Rates for Maximum Entropy Exploration
Tiapkin D., Belomestny D., Calandriello D. et al., in: Proceedings of the 40th International Conference on Machine Learning, Vol. 202: International Conference on Machine Learning, 23–29 July 2023, Honolulu, Hawaii, USA. PMLR, 2023. P. 34161–34221.
Added: December 1, 2023
Sharp Deviations Bounds for Dirichlet Weighted Sums with Application to analysis of Bayesian algorithms
Tiapkin D., Belomestny D., Naumov A. et al., Working papers by Cornell University. Series math "arxiv.org" 2023 Article 2304.03056
In this work, we derive sharp non-asymptotic deviation bounds for weighted sums of Dirichlet random variables. These bounds are based on a novel integral representation of the density of a weighted Dirichlet sum. This representation allows us to obtain a Gaussian-like approximation for the sum distribution using geometry and complex analysis methods. Our results generalize ...
Added: June 28, 2023
Variance Reduction for Policy-Gradient Methods via Empirical Variance Minimization
Belomestny D., Kaledin M., Golubev A., 2022.
Policy-gradient methods in Reinforcement Learning (RL) are very general and widely applied in practice, but their performance suffers from the high variance of the gradient estimate. Several procedures have been proposed to reduce it, including actor-critic (AC) and advantage actor-critic (A2C) methods. Recently, these approaches have gained a new perspective due to the introduction of Deep RL: both new control ...
Added: April 14, 2023