On Structured Prediction Theory with Calibrated Convex Surrogate Losses

A. Osokin; Bach F.; Lacoste-Julien S.

P. 302–313.

We provide novel theoretical insights on structured prediction in the context of efficient convex surrogate loss minimization with consistency guarantees. For any task loss, we construct a convex surrogate that can be optimized via stochastic gradient descent and we prove tight bounds on the so-called "calibration function" relating the excess surrogate risk to the actual risk. In contrast to prior related work, we carefully monitor the effect of the exponential number of classes in the learning guarantees as well as on the optimization complexity. As an interesting consequence, we formalize the intuition that some task losses make learning harder than others, and that the classical 0-1 loss is ill-suited for structured prediction.
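As a rough illustration of the kind of surrogate the abstract describes, the sketch below assumes a quadratic surrogate of the form Φ(f, y) = ‖f + L[:, y]‖² / (2k) for a k×k task-loss matrix L, with prediction by argmax over the scores f. This form, the function names, and the toy 0-1 loss example are assumptions made for illustration and are not stated in the abstract. Under that assumption, stochastic gradient descent drives f toward −Lq, whose argmax is the label with the smallest expected task loss under the label distribution q.

```python
import numpy as np


def quadratic_surrogate(f, y, L):
    """Assumed surrogate: squared distance between scores f and -L[:, y], scaled by 1/(2k)."""
    k = L.shape[0]
    return np.sum((f + L[:, y]) ** 2) / (2 * k)


def surrogate_grad(f, y, L):
    """Gradient of the assumed quadratic surrogate with respect to the scores f."""
    k = L.shape[0]
    return (f + L[:, y]) / k


def sgd_scores(q, L, steps=20000, seed=0):
    """Run SGD on the expected surrogate, sampling labels y from q (toy setting, no inputs x)."""
    rng = np.random.default_rng(seed)
    k = L.shape[0]
    ys = rng.choice(k, size=steps, p=q)
    f = np.zeros(k)
    for t, y in enumerate(ys, start=1):
        # Step size k / t matches the 1/k strong convexity of the surrogate.
        f -= (k / t) * surrogate_grad(f, y, L)
    return f


# Toy example: 0-1 loss over three labels and a fixed label distribution q.
L01 = 1.0 - np.eye(3)
q = np.array([0.2, 0.5, 0.3])
f_hat = sgd_scores(q, L01)
prediction = int(np.argmax(f_hat))
bayes_optimal = int(np.argmin(L01 @ q))
```

In this toy setting the learned scores approximate −Lq, so `prediction` and `bayes_optimal` coincide; a different loss matrix can be substituted for `L01` to compare task losses.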

Language: English

Keywords: theory of computation and machine learning; structured prediction; convex optimization

In book

Advances in Neural Information Processing Systems 30 (NIPS 2017)

Montreal: Curran Associates, 2017.

Quantifying Learning Guarantees for Convex but Inconsistent Surrogates

Struminsky K., Lacoste-Julien S., Osokin A., in: Advances in Neural Information Processing Systems 31 (NIPS 2018). [s.n.], 2018. P. 1–9.

We study consistency properties of machine learning methods based on minimizing convex surrogates. We extend the recent framework of Osokin et al. (2017) for the quantitative analysis of consistency properties to the case of inconsistent surrogates. Our key technical contribution consists in a new lower bound on the calibration function for the quadratic surrogate, which ...

Added: October 29, 2018

Minding the Gaps for Block Frank-Wolfe Optimization of Structured SVMs

Osokin A., Alayrac J., Lukasewitz I. et al., in: Proceedings of Machine Learning Research. Proceedings of the International Conference on Machine Learning (ICML 2016). Vol. 48. NY: [s.n.], 2016. P. 885–925.

In this paper, we propose several improvements on the block-coordinate Frank-Wolfe (BCFW) algorithm from Lacoste-Julien et al. (2013) recently used to optimize the structured support vector machine (SSVM) objective in the context of structured prediction, though it has wider applications. The key intuition behind our improvements is that the estimates of block gaps maintained by ...

Added: October 19, 2017

Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems

Puchkin N., Gorbunov E., Kutuzov N. et al., in: Proceedings of The 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024), 2-4 May 2024, Palau de Congressos, Valencia, Spain. Vol. 238. Valencia: PMLR, 2024. P. 856–864.

We consider stochastic optimization problems with heavy-tailed noise with structured density. For such problems, we show that it is possible to get faster rates of convergence than 𝑂(𝐾^{−2(𝛼−1)/𝛼}), when the stochastic gradients have finite 𝛼-th moment, 𝛼∈(1,2]. In particular, our analysis allows the noise norm to have an unbounded expectation. To achieve these results, we stabilize stochastic gradients, ...

Added: April 22, 2024

Optimization of the fluid model of scheduling: local predictions

Bogachev T. / Cornell University. Series math "arxiv.org". 2022.

In this research a continuous model for resource allocations in a queuing system is considered and a local prediction on the system behavior is developed. As a result we obtain a set of possible cases, some of which lead to quite clear optimization problems. Currently, the main result of this research direction is an algorithm ...

Added: October 21, 2022

Proceedings of Machine Learning Research

Kovalev D., Shulgin E., Richtarik P. et al., PMLR, 2021.

We propose ADOM – an accelerated method for smooth and strongly convex decentralized optimization over time-varying networks. ADOM uses a dual oracle, i.e., we assume access to the gradient of the Fenchel conjugate of the individual loss functions. Up to a constant factor, which depends on the network structure only, its communication complexity is the ...

Added: October 31, 2021

Linearly Converging Error Compensated SGD

Gorbunov E., Kovalev D., Makarenko D. et al., in: Advances in Neural Information Processing Systems 33 (NeurIPS 2020). Curran Associates, Inc., 2020. P. 20889–20900.

Added: December 7, 2020

Self-concordant analysis of Frank-Wolfe algorithms

Dvurechensky P., Ostroukhov P., Safin K. et al., in: International Conference on Machine Learning (ICML 2020). Vol. 119. PMLR, 2020.

Projection-free optimization via different variants of the Frank-Wolfe (FW), a.k.a. Conditional Gradient method has become one of the cornerstones in optimization for machine learning since in many cases the linear minimization oracle is much cheaper to implement than projections and some sparsity needs to be preserved. In a number of applications, e.g. Poisson inverse problems ...

Added: October 31, 2020

Improved Complexity Bounds in Wasserstein Barycenter Problem

Dvinskikh D., Tiapkin D., in: Proceedings of Machine Learning Research Volume 130: International Conference on Artificial Intelligence and Statistics. [s.n.], 2021. P. 1738–1746.

In this paper, we focus on computational aspects of the Wasserstein barycenter problem. We propose two algorithms to compute Wasserstein barycenters of discrete measures. The first algorithm, based on mirror prox with a specific norm, meets the complexity of celebrated accelerated iterative Bregman projections (IBP), however, with no limitations in contrast to the (accelerated) IBP, which is ...

Added: November 2, 2022

On Primal and Dual Approaches for Distributed Stochastic Convex Optimization over Networks

Dvinskikh D., Gorbunov E., Gasnikov A. et al., in: 2019 IEEE 58th Conference on Decision and Control (CDC). IEEE, 2019. P. 7435–7440.

We introduce primal and dual stochastic gradient oracle methods for distributed convex optimization problems over networks. We show that the proposed methods are optimal (in terms of communication steps) for primal and dual oracles. Additionally, for a dual stochastic oracle, we propose a new analysis method for the rate of convergence in terms of duality ...

Added: February 5, 2021

Oracle Complexity Separation in Convex Optimization

Ivanova A., Dvurechensky P., Vorontsova E. et al., Journal of Optimization Theory and Applications 2022 Vol. 193 No. 1-3 P. 462–490

Many convex optimization problems have structured objective functions written as a sum of functions with different oracle types (e.g., full gradient, coordinate derivative, stochastic gradient) and different arithmetic operations complexity of these oracles. In the strongly convex case, these functions also have different condition numbers that eventually define the iteration complexity of first-order methods and ...

Added: October 28, 2022

SEARNN: Training RNNs with global-local losses

Leblond R., Alayrac J., Osokin A. et al., in: Proceedings of the 6th International Conference on Learning Representations (ICLR 2018). [s.n.], 2018. P. 1–16.

We propose SEARNN, a novel training algorithm for recurrent neural networks (RNNs) inspired by the "learning to search" (L2S) approach to structured prediction. RNNs have been widely successful in structured prediction applications such as machine translation or parsing, and are commonly trained using maximum likelihood estimation (MLE). Unfortunately, this training loss is not always an ...

Added: October 29, 2018

Lower and upper bounds for the largest Lyapunov exponent of matrices

Protasov V., Jungers R., Linear Algebra and its Applications 2013 Vol. 438 No. 11 P. 4448–4468

We introduce a new approach to evaluate the largest Lyapunov exponent of a family of nonnegative matrices. The method is based on using special positive homogeneous functionals on , which gives iterative lower and upper bounds for the Lyapunov exponent. They improve previously known bounds and converge to the real value. The rate of convergence ...

Added: February 23, 2016

Stochastic Spectral and Conjugate Descent Methods

Kovalev D., Gorbunov E., Gasanov E. et al., in: Advances in Neural Information Processing Systems 31 (NeurIPS 2018). Neural Information Processing Systems Foundation, 2018. P. 3358–3367.

The state-of-the-art methods for solving optimization problems in big dimensions are variants of randomized coordinate descent (RCD). In this paper we introduce a fundamentally new type of acceleration strategy for RCD based on the augmentation of the set of coordinate directions by a few spectral or conjugate directions. As we increase the number of extra ...

Added: December 7, 2020

Optimal Tensor Methods in Smooth Convex and Uniformly Convex Optimization

Gasnikov A., Dvurechensky P., Gorbunov E. et al., in: Conference on Learning Theory, 25-28 June 2019, Phoenix, USA. Vol. 99. [s.n.], 2019. P. 1374–1391.

We consider convex optimization problems with the objective function having Lipschitz-continuous p-th order derivative, where p≥1. We propose a new tensor method, which closes the gap between the lower Ω(ε^{−2/(3p+1)}) and upper O(ε^{−1/(p+1)}) iteration complexity bounds for this class of optimization problems. We also consider uniformly convex functions, and show how the proposed ...

Added: October 31, 2020

Application of the nested convex programming to the optimal power flow in MT-HVDC grids

Garces A., Azhmyakov V., IFAC-PapersOnLine 2020 Vol. 53 No. 2 P. 13173–13177

This paper deals with an application of the nested convex programming to the optimal power flow (OPF) in multi-terminal high-voltage direct-current grids (MT-HVDC). The real-world optimization problem under consideration is non-convex. This fact implies some possible inconsistencies of the conventional numerical minimization algorithms (such as interior point method). Moreover, the constructive numerical treatment of this ...

Added: October 30, 2021

Accelerated Gradient-Free Optimization Methods with a Non-Euclidean Proximal Operator

Vorontsova E., Gasnikov A., Dvurechensky P. et al., Automation and Remote Control 2019 Vol. 80 No. 8 P. 1487–1501

We propose an accelerated gradient-free method with a non-Euclidean proximal operator associated with the p-norm (1 ⩽ p ⩽ 2). We obtain estimates for the rate of convergence of the method under low noise arising in the calculation of the function value. We present the results of computational experiments. ...

Added: December 10, 2019

Universal intermediate gradient method for convex problems with inexact oracle

Kamzolov D., Dvurechensky P., Gasnikov A., Optimization Methods and Software 2021 Vol. 36 No. 6 P. 1289–1316

In this paper, we propose new first-order methods for minimization of a convex function on a simple convex set. We assume that the objective function is a composite function given as a sum of a simple convex function and a convex function with inexact Hölder-continuous subgradient. We propose Universal Intermediate Gradient Method. Our method enjoys ...

Added: August 4, 2020

Mapping of enclosed buildings using mobile radio tomography

Ingacheva A., Kokhan V., Osipov D., in: Proceedings of the 32nd European Conference on Modelling and Simulation (ECMS 2018), Wilhelmshaven, Germany, 22–25 May 2018. NY: Curran Associates, Inc., 2018. P. 183–189.

In this paper we consider the task of mapping the objects inside a building using a group of autonomous agents that move around it and emit a narrow beam of radio waves at WiFi frequency (2.4 GHz). A linear model of pixel-wise radio wave attenuation is considered. The SIRT algorithm with TV and Tikhonov regularizations is used for the ...

Added: April 6, 2019

Accelerated zeroth-order method for non-smooth stochastic convex optimization problem with infinite variance

Kornilov N., Shamir O., Lobanov A. et al., in: Advances in Neural Information Processing Systems 36 (NeurIPS 2023). Curran Associates, Inc., 2023.

In this paper, we consider non-smooth stochastic convex optimization with two function evaluations per round under infinite noise variance. In the classical setting when noise has finite variance, an optimal algorithm, built upon the batched accelerated gradient method, was proposed in (Gasnikov et al., 2022). This optimality is defined in terms of iteration and oracle ...

Added: March 26, 2024

Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau

Moulines E., Pereyra M., Durmus A., SIAM Journal on Imaging Sciences 2018 Vol. 11 No. 1 P. 473–506

Modern imaging methods rely strongly on Bayesian inference techniques to solve challenging imaging problems. Currently, the predominant Bayesian computation approach is convex optimization, which scales very efficiently to high-dimensional image models and delivers accurate point estimation results. However, in order to perform more complex analyses, for example, image uncertainty quantification or model selection, it is ...

Added: December 11, 2018

Decentralize and randomize: Faster algorithm for Wasserstein barycenters

Dvurechensky P., Dvinskikh D., Gasnikov A. et al., in: Advances in Neural Information Processing Systems 31 (NeurIPS 2018). Neural Information Processing Systems Foundation, 2018. P. 10760–10770.

We study the decentralized distributed computation of discrete approximations for the regularized Wasserstein barycenter of a finite set of continuous probability measures distributedly stored over a network. We assume there is a network of agents/machines/computers, and each agent holds a private continuous probability measure and seeks to compute the barycenter of all the measures in ...

Added: October 31, 2020

Numerical methods for the resource allocation problem in networks

Ivanova A., Pasechnyuk D., Dvurechensky P. et al. / Cornell University. Series "Working papers by Cornell University". 2019.

In this paper, we consider the resource allocation problem in a network with a large number of connections which are used by a huge number of users. The resource allocation problem that we consider is a maximization problem with linear inequality constraints. To solve this problem we construct the dual problem and propose to use ...

Added: October 23, 2020

Oracle Complexity Separation in Convex Optimization

Ivanova A., Gasnikov A., Dvurechensky P. et al. / Working papers by Cornell University. Series "Optimization and Control". 2020.

Regularized empirical risk minimization problems, ubiquitous in machine learning, are often composed of several blocks which can be treated using different types of oracles, e.g., full gradient, stochastic gradient or coordinate derivative. Optimal oracle complexity is known and achievable separately for the full gradient case, the stochastic gradient case, etc. We propose a generic framework ...

Added: October 25, 2020

A Stochastic Derivative Free Optimization Method with Momentum

Gorbunov E., Bibi A., Sener O. et al., in: Proceedings of the 8th International Conference on Learning Representations (ICLR 2020). ICLR, 2020. P. 1–28.

We consider the problem of unconstrained minimization of a smooth objective function in $\mathbb{R}^d$ in a setting where only function evaluations are possible. We propose and analyze a stochastic zeroth-order method with heavy ball momentum. In particular, we propose SMTP, a momentum version of the stochastic three-point method (STP) of Bergou et al. (2019). We show new complexity ...

Added: December 7, 2020