Linearly Converging Error Compensated SGD

Eduard Gorbunov; Kovalev D.; Makarenko D.; Richtarik P.




P. 20889–20900.


Language: English


Keywords: quantization, stochastic optimization, variance reduction, convex optimization, distributed optimization, error compensation, delayed updates

In book

Advances in Neural Information Processing Systems 33 (NeurIPS 2020)

Curran Associates, Inc., 2020.

Local SGD: Unified Theory and New Efficient Methods

Gorbunov E., Hanzely F., Richtarik P., in: International Conference on Artificial Intelligence and Statistics, 13-15 April 2021, Virtual. Vol. 130. PMLR, 2021. Ch. 130. P. 3556–3564.

Added: October 25, 2021

Low-Variance Black-Box Gradient Estimates for the Plackett-Luce Distribution

Gadetsky A., Struminsky K., Robinson C. et al., in: Thirty-Fourth AAAI Conference on Artificial Intelligence. Vol. 34. AAAI Press, 2020. P. 10126–10135.

Added: October 11, 2020

Solving Convex Min-Min Problems with Smoothness and Strong Convexity in One Group of Variables and Low Dimension in the Other

Gladin E., Alkousa M., Gasnikov A., Automation and Remote Control 2021 Vol. 82 P. 1679–1691

The article deals with some approaches to solving convex problems of the min-min type with smoothness and strong convexity in only one of the two groups of variables. It is shown that the proposed approaches based on Vaidya’s method, the fast gradient method, and the accelerated gradient method with variance reduction have linear convergence. It ...

Added: November 29, 2024

On the Complexity of Approximating Wasserstein Barycenters

Kroshnin A., Tupitsa Nazarii, Dvinskikh D. et al., in: Proceedings of Machine Learning Research. Vol. 97: International Conference on Machine Learning, 9-15 June 2019, Long Beach, California, USA. PMLR, 2019. P. 3530–3540.

We study the complexity of approximating the Wasserstein barycenter of m discrete measures, or histograms of size n, by contrasting two alternative approaches that use entropic regularization. The first approach is based on the Iterative Bregman Projections (IBP) algorithm for which our novel analysis gives a complexity bound proportional to $m n^2 / \epsilon^2$ to approximate the original non-regularized barycenter. ...

Added: June 11, 2019

Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108

Eduard Gorbunov, Hanzely F., Richtarik P., PMLR, 2020.

In this paper we introduce a unified analysis of a large family of variants of proximal stochastic gradient descent (SGD) which so far have required different intuitions, convergence analyses, have different applications, and which have been developed separately in various communities. We show that our framework includes methods with and without the following tricks, and ...

Added: December 7, 2020

Vaidya’s method for convex stochastic optimization problems in small dimension

Gladin E., Gasnikov A., Ermakova E., Mathematical notes 2022 Vol. 112 No. 1 P. 183–190

The paper deals with a general problem of convex stochastic optimization in a space of small dimension (for example, 100 variables). It is known that for deterministic problems of convex optimization in small dimensions, the methods of centers of gravity type (for example, Vaidya’s method) provide the best convergence. For stochastic optimization problems, the question ...

Added: November 29, 2024

Near-Optimal High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise

Gorbunov E., Danilova M., Shibaev I. et al. arXiv:2106.05958. 2021.

Thanks to their practical efficiency and random nature of the data, stochastic first-order methods are standard for training large-scale machine learning models. Random behavior may cause a particular run of an algorithm to result in a highly suboptimal objective value, whereas theoretical guarantees are usually proved for the expectation of the objective value. Thus, it ...

Added: October 25, 2021

Accelerated zeroth-order method for non-smooth stochastic convex optimization problem with infinite variance

Kornilov N., Shamir O., Lobanov A. et al., in: Advances in Neural Information Processing Systems 36 (NeurIPS 2023). Curran Associates, Inc., 2023.

In this paper, we consider non-smooth stochastic convex optimization with two function evaluations per round under infinite noise variance. In the classical setting when noise has finite variance, an optimal algorithm, built upon the batched accelerated gradient method, was proposed in (Gasnikov et al., 2022). This optimality is defined in terms of iteration and oracle ...

Added: March 26, 2024

Ellipsoid method for convex stochastic optimization in small dimension

Gladin E., Zainullina K. E., Computer Research and Modeling 2021 Vol. 13 No. 6 P. 1137–1147

The article considers minimization of the expectation of a convex function. Problems of this type often arise in machine learning and a variety of other applications. In practice, stochastic gradient descent (SGD) and similar procedures are usually used to solve such problems. We propose to use the ellipsoid method with mini-batching, which converges linearly and can ...

Added: November 29, 2024

Decentralized and parallel primal and dual accelerated methods for stochastic convex programming problems

Dvinskikh D., Gasnikov A., Journal of Inverse and Ill-posed problems 2021 Vol. 29 No. 3 P. 385–405

We introduce primal and dual stochastic gradient oracle methods for decentralized convex optimization problems. Both for primal and dual oracles, the proposed methods are optimal in terms of the number of communication steps. However, for all classes of the objective, the optimality in terms of the number of oracle calls per node takes place only ...

Added: October 29, 2021

Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping

Gorbunov E., Danilova M., Gasnikov A., in: Advances in Neural Information Processing Systems 33 (NeurIPS 2020). Curran Associates, Inc., 2020. P. 15042–15053.

Added: December 7, 2020

Empirical Variance Minimization with Applications in Variance Reduction and Optimal Control

Belomestny Denis, Iosipoi L., Paris Q. et al., Bernoulli: a journal of mathematical statistics and probability 2022 Vol. 28 No. 2 P. 1382–1407

We study the problem of empirical minimization for variance-type functionals over functional classes. Sharp non-asymptotic bounds for the excess variance are derived under mild conditions. In particular, it is shown that under some restrictions imposed on the functional class fast convergence rates can be achieved including the optimal non-parametric rates for expressive classes in the ...

Added: April 17, 2022

Optimal Tensor Methods in Smooth Convex and Uniformly Convex Optimization

Gasnikov A., in: Proceedings of Machine Learning Research. Vol. 99: Conference on Learning Theory, 25-28 June 2019, Phoenix, AZ, USA. PMLR, 2019.

We consider convex optimization problems with the objective function having Lipschitz-continuous p-th order derivative, where p ≥ 1. We propose a new tensor method, which closes the gap between the lower $O\left(\varepsilon^{-2/(3p+1)}\right)$ and upper $O\left(\varepsilon^{-1/(p+1)}\right)$ iteration complexity bounds for this class of optimization ...

Added: June 13, 2019

Decentralized personalized federated learning: Lower bounds and optimal algorithm for all personalization modes

Sadiev A., Borodich E., Beznosikov A. et al., EURO Journal on Computational Optimization 2022 Vol. 10 Article 100041

This paper considers the problem of decentralized, personalized federated learning. For centralized personalized federated learning, a penalty that measures the deviation from the local model and its average, is often added to the objective function. However, in a decentralized setting this penalty is expensive in terms of communication costs, so here, a different penalty — ...

Added: October 28, 2022

Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems

Puchkin N., Gorbunov E., Kutuzov N. et al., in: Proceedings of The 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024), 2-4 May 2024, Palau de Congressos, Valencia, Spain. Vol. 238. Valencia: PMLR, 2024. P. 856–864.

We consider stochastic optimization problems with heavy-tailed noise with structured density. For such problems, we show that it is possible to get faster rates of convergence than $O(K^{-2(\alpha-1)/\alpha})$, when the stochastic gradients have finite $\alpha$-th moment, $\alpha \in (1, 2]$. In particular, our analysis allows the noise norm to have an unbounded expectation. To achieve these results, we stabilize stochastic gradients, ...

Added: April 22, 2024

Optimal Tensor Methods in Smooth Convex and Uniformly Convex Optimization

Gasnikov A., Dvurechensky P., Gorbunov E. et al., in: Conference on Learning Theory, 25-28 June 2019, Phoenix, USA. Vol. 99. [s.n.], 2019. P. 1374–1391.

We consider convex optimization problems with the objective function having Lipschitz-continuous p-th order derivative, where p ≥ 1. We propose a new tensor method, which closes the gap between the lower $\Omega\left(\varepsilon^{-2/(3p+1)}\right)$ and upper $O\left(\varepsilon^{-1/(p+1)}\right)$ iteration complexity bounds for this class of optimization problems. We also consider uniformly convex functions, and show how the proposed ...

Added: October 31, 2020

Dual Approaches to the Minimization of Strongly Convex Functionals with a Simple Structure under Affine Constraints

Anikin A., Gasnikov A., Dvurechensky P. et al., Computational Mathematics and Mathematical Physics 2017 Vol. 57 No. 8 P. 1262–1276

A strongly convex function of simple structure (for example, separable) is minimized under affine constraints. A dual problem is constructed and solved by applying a fast gradient method. The necessary properties of this method are established relying on which, under rather general conditions, the solution of the primal problem can be recovered with the same ...

Added: November 29, 2018

Towards accelerated rates for distributed optimization over time-varying networks

Rogozin A., Lukoshkin V., Gasnikov A. et al. / Series arXiv "math". 2020.

We study the problem of decentralized optimization over time-varying networks with strongly convex smooth cost functions. In our approach, nodes run a multi-step gossip procedure after making each gradient update, thus ensuring approximate consensus at each iteration, while the outer loop is based on the accelerated Nesterov scheme. The algorithm achieves precision $\varepsilon > 0$ in $O(\sqrt{\kappa_g}\,\chi \log^2(1/\varepsilon))$ communication ...

Added: October 7, 2020

Stochastic saddle-point optimization for the Wasserstein barycenter problem

Tiapkin D., Gasnikov A., Dvurechensky P., Optimization Letters 2022 Vol. 16 No. 7 P. 2145–2175

We consider the population Wasserstein barycenter problem for random probability measures supported on a finite set of points and generated by an online stream of data. This leads to a complicated stochastic optimization problem where the objective is given as an expectation of a function given as a solution to a random optimization problem. We ...

Added: October 16, 2022

Numerical methods for the resource allocation problem in networks

Ivanova A., Pasechnyuk D., Dvurechensky P. et al. / Cornell University. Series "Working papers by Cornell University". 2019.

In this paper, we consider the resource allocation problem in a network with a large number of connections which are used by a huge number of users. The resource allocation problem we consider is a maximization problem with linear inequality constraints. To solve this problem we construct the dual problem and propose to use ...

Added: October 23, 2020

Neural Networks Compression for Language Modeling

Grachev A., Ignatov D. I., Savchenko A., in: Pattern Recognition and Machine Intelligence. 7th International Conference, PReMI 2017, Kolkata, India, December 5-8, 2017, Proceedings. Lecture Notes in Computer Science book series (LNCS, volume 10597). Springer, 2017. P. 351–357.

In this paper, we consider several compression techniques for the language modeling problem based on recurrent neural networks (RNNs). It is known that conventional RNNs, e.g., LSTM-based networks in language modeling, are characterized by either high space complexity or substantial inference time. This problem is especially crucial for mobile applications, in which the constant interaction with ...

Added: October 14, 2018

Variance reduction for additive functionals of Markov chains via martingale representations

Belomestny D., Moulines E., Samsonov S., Statistics and Computing 2022 Vol. 32 No. 1 Article 16

In this paper, we propose an efficient variance reduction approach for additive functionals of Markov chains relying on a novel discrete-time martingale representation. Our approach is fully non-asymptotic and does not require the knowledge of the stationary distribution (and even any type of ergodicity) or specific structure of the underlying density. By rigorously analyzing the ...

Added: August 31, 2020

Optimal distributed convex optimization on slowly time-varying graphs

Rogozin A., Uribe C., Gasnikov A. et al., IEEE Transactions on Control of Network Systems 2020 Vol. 7 No. 2 P. 829–841

We study optimal distributed first-order optimization algorithms when the network (i.e., communication constraints between the agents) changes with time. This problem is motivated by scenarios where agents experience network malfunctions. We provide a sufficient condition that guarantees a convergence rate with optimal (up to logarithmic terms) dependencies on the network and function parameters if the ...

Added: October 7, 2020

Extensions of vertex algebras. Constructions and applications

Feigin B. L., Russian Mathematical Surveys 2017 Vol. 72 No. 4 P. 707–763

This paper discusses the main known constructions of vertex operator algebras. The starting point is the lattice algebra. Screenings distinguish subalgebras of lattice algebras. Moreover, one can construct extensions of vertex algebras. Combining these constructions gives most of the known examples. A large class of algebras with big centres is constructed. Such algebras have applications ...

Added: November 5, 2020