### Abstract

Classical stochastic gradient methods are well suited for minimizing expected-value objective functions. However, they do not apply to the minimization of a nonlinear function of an expected value, or of a composition of two expected-value functions, i.e., the problem min_x E_v[f_v(E_w[g_w(x)])]. To solve this stochastic composition problem, we propose a class of stochastic compositional gradient descent (SCGD) algorithms that can be viewed as stochastic versions of the quasi-gradient method. SCGD updates the solution based on noisy sample gradients of f_v and g_w, and uses an auxiliary variable to track the unknown quantity E_w[g_w(x)]. We prove that SCGD converges almost surely to an optimal solution for convex optimization problems, as long as such a solution exists. The convergence involves the interplay of two iterations with different time scales. For nonsmooth convex problems, SCGD achieves a convergence rate of O(k^{-1/4}) in the general case and O(k^{-2/3}) in the strongly convex case, after taking k samples. For smooth convex problems, SCGD can be accelerated to converge at a rate of O(k^{-2/7}) in the general case and O(k^{-4/5}) in the strongly convex case. For nonconvex problems, we prove that any limit point generated by SCGD is a stationary point, for which we also provide a convergence rate analysis. The stochastic setting in which one wants to optimize compositions of expected-value functions is common in practice, and the proposed SCGD methods find wide application in learning, estimation, dynamic programming, and beyond.
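The two-time-scale update described in the abstract — a fast auxiliary average tracking E_w[g_w(x)] and a slower quasi-gradient step on x — can be sketched on a toy problem. Everything below is an illustrative assumption (the choice of g_w, f, and the step-size constants are not from the paper), though the k^{-3/4} / k^{-1/2} step-size pairing mirrors the basic two-time-scale schedule behind the O(k^{-1/4}) rate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy composition: g_w(x) = x + w with E[w] = 0, and f(y) = ||y||^2 (no outer
# noise), so the objective E_v[f_v(E_w[g_w(x)])] reduces to ||x||^2, minimized
# at x* = 0.
def g_sample(x):      # noisy inner-function sample g_w(x)
    return x + rng.normal(scale=0.1, size=x.shape)

def g_jac_sample(x):  # sample Jacobian of g_w at x (identity for this g_w)
    return np.eye(x.size)

def f_grad(y):        # gradient of the outer function f(y) = ||y||^2
    return 2.0 * y

x = np.ones(2)        # initial point
y = g_sample(x)       # auxiliary variable tracking E_w[g_w(x)]

for k in range(1, 5001):
    alpha = 0.5 * k ** -0.75  # step size for x (slow time scale)
    beta = 1.0 * k ** -0.5    # averaging weight for y (fast time scale)
    y = (1 - beta) * y + beta * g_sample(x)        # track E_w[g_w(x)]
    x = x - alpha * g_jac_sample(x).T @ f_grad(y)  # quasi-gradient step

print(np.linalg.norm(x))  # distance to the minimizer x* = 0
```

The auxiliary variable y is essential: a naive stochastic gradient would need an unbiased sample of f'(E_w[g_w(x)]), which a single sample g_w(x) cannot provide when f is nonlinear.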

| Original language | English (US) |
|---|---|
| Pages (from-to) | 419-449 |
| Number of pages | 31 |
| Journal | Mathematical Programming |
| Volume | 161 |
| Issue number | 1-2 |
| DOIs | https://doi.org/10.1007/s10107-016-1017-3 |
| State | Published - Jan 1 2017 |

### All Science Journal Classification (ASJC) codes

- Software
- Mathematics (all)

## Fingerprint

Research topics of 'Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions'.

## Cite this

Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. *Mathematical Programming*, *161*(1-2), 419-449. https://doi.org/10.1007/s10107-016-1017-3