In this series of posts, I will present our recent work with Francis Bach on the optimal rates for $M$-estimators with self-concordant-like losses. The term “$M$-estimator” is more commonly used in the statistical community; in the learning theory community one more often talks about empirical risk minimization. I will mostly use the statistical terminology to stress the connections to the classical asymptotic results.

## Setting

We work in the familiar statistical learning setting: the goal is to predict the label $Y \in \mathcal{Y}$ as some function of the dot product $\eta = X^\top \theta$ between the predictor $X \in \mathbb{R}^d$ and parameter $\theta \in \mathbb{R}^d$. This is formalized as minimization of expected risk $L(\theta) := \mathbb{E}[\ell(Y,X^\top\theta)],$ where $\ell: \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$ is some loss function, and $\mathbb{E}[\cdot]$ is expectation over the distribution $\mathcal{P}$ of the observation $Z = (X,Y)$. Let $\theta_*$ be a minimizer of $L(\theta)$, which will prove to be unique in our situation. Since the distribution $\mathcal{P}$ is unknown, we cannot find $\theta_*$ directly. Instead we observe an i.i.d. sample $(X_1,Y_1), …, (X_n, Y_n) \sim \mathcal{P},$ and compute the empirical counterpart of $\theta_*$, a minimizer $\widehat \theta_n$ of empirical risk $L_n(\theta) := \frac{1}{n}\sum_{i=1}^n \ell(Y_i,X_i^\top\theta).$ This setting includes regression in the case $\mathcal{Y} = \mathbb{R}$ (usually one also takes $\ell(y,y’) = \varphi(y-y’)$ for some function $\varphi: \mathbb{R} \to \mathbb{R}$, e.g., the squared loss) and classification in the case $\mathcal{Y} = \{0,1\}$.

To understand our somewhat unusual loss parametrization, note that when $\ell(y,\eta) = - \log p_{\eta}(y)$ for some probability distribution $p_{\eta}(\cdot)$ on $\mathcal{Y}$ parametrized by $\eta \in \mathbb{R}$, we have a conditional likelihood model for $Y$ given $X^{\top} \theta$, and then $\widehat \theta_n$ becomes the maximum likelihood estimator (MLE). This includes, in particular, least squares and logistic regression. More generally, this is the case in generalized linear models (GLMs) with canonical parametrization, where the loss takes the form \begin{equation} \label{def:glm} \ell(y,\eta) = - y\eta + a(\eta) - b(y), \end{equation} where $a(\eta): \mathbb{R} \to \mathbb{R}$, called the cumulant, normalizes $\ell(y,\eta)$ to be a valid negative log-likelihood: $a(\eta) = \log \int_{\mathcal{Y}} \exp(y\eta + b(y)) \, \text{d}y.$

More generally, we can talk about quasi-MLE where one removes the implicit assumption that $\mathcal{Y}$ is indeed generated by one of the distributions $p_{\eta}(\cdot)$ for some value $\eta_*$. In this case, the asymptotic properties of $\widehat\theta_n$ have long been studied in the statistical community. Our goal in this work was to extend this classical theory to the finite-sample setting, and beyond the quasi-MLE setup, that is, obtain results for general $M$-estimators. I will now give a brief overview of this theory.

## Background: asymptotic theory

The classical theory of local asymptotic normality (LAN) considers the limit $n \to \infty$ with fixed $d$, and relies on the local regularity assumptions, requiring that $L(\theta)$ is sufficiently smooth at $\theta_*$, so that $\nabla L(\theta_*) = 0$, and the excess risk $L(\theta) - L(\theta_*)$ can be well-approximated by its 2nd-order Taylor expansion, or the squared prediction distance $\frac{1}{2} \Vert \theta - \theta_* \Vert_{\mathbf{H}}^2 := \frac{1}{2} \Vert \mathbf{H}^{1/2}(\theta - \theta_*) \Vert^2,$ where $\mathbf{H}$ is the risk Hessian at the optimum: $\mathbf{H} := \nabla^2 L(\theta_*),$ and one usually assumes that $\mathbf{H} \succ 0$. Note that in the case of least squares, we simply have $L(\theta) - L(\theta_*) = \frac{1}{2} \Vert\theta - \theta_*\Vert_{\mathbf{H}}.$ These assumptions allow to derive the local asymptotic normality of quasi MLE, or Fisher’s theorem: \begin{equation} \label{eq:lan-fisher} \sqrt{n}\mathbf{H}^{1/2}(\widehat\theta_n - \theta_*) \rightsquigarrow \mathcal{N}(0, \mathbf{H}^{-1/2} \mathbf{G} \mathbf{H}^{-1/2}), \tag{1} \end{equation} where $\rightsquigarrow$ is convergence in the law, and $\mathbf{G}$ is the covariance matrix of the loss gradient at $\theta_*$: $\mathbf{G} := \mathbb{E} \left[ \nabla_\theta \ell(Y,X^\top\theta_*) \otimes \nabla_\theta \ell(Y,X^\top\theta_*) \right].$ In well-specified MLE, one has the Bartlett identity $\mathbf{G} = \mathbf{H}$; then, $\mathbf{H}^{-1/2} \mathbf{G} \mathbf{H}^{-1/2}$ is the identity, and $\text{Tr}[\mathbf{H}^{-1/2} \mathbf{G} \mathbf{H}^{-1/2}] = d.$ In the general case, we can define the effective dimension $d_{eff} := \text{Tr} [\mathbf{H}^{-1/2} \mathbf{G} \mathbf{H}^{-1/2}],$ and hope that it is not much larger than $d$, i.e., the model is only “moderately” misspecified. In some cases this is known to be true: for example, in least-squares regression one has $d_{eff} \le \kappa_X \cdot \kappa_Y \cdot d,$ irrespectively of the true distribution of the noise, whenever $X$ and $Y$ have bounded kurtoses $\kappa_X, \kappa_Y$: $\mathbb{E}[(Y-\mathbb{E}[Y])^4]^{1/4} \le \kappa_Y \cdot \mathbb{E}[(Y-\mathbb{E}[Y])^2]^{1/2},$ $\mathbb{E}[\left\langle u, X-\mathbb{E}[X] \right\rangle^4]^{1/4} \le \kappa_X \cdot \mathbb{E}[\left\langle u, X-\mathbb{E}[X] \right\rangle^2]^{1/2}, \quad \forall u \in \mathbb{R}^d.$

Returning to the results of the LAN theory, \eqref{eq:lan-fisher} implies that the scaled prediction error $n\Vert \widehat\theta_n-\theta_*\Vert_{\mathbf{H}}^2$ asymptotically has the generalized chi-square distribution – namely, it distributed as the square of $\mathcal{N}(0, \mathbf{H}^{-1/2} \mathbf{G} \mathbf{H}^{-1/2})$. Moreover, it can also be obtained that $2n[L(\widehat\theta_n) - L(\theta_*)]$ has the same asymptotic distribution. Using the standard chi-square deviation bound from (Laurent and Massart, 2000), we can summarize this result in terms of asymptotic confidence bounds for the excess risk and prediction distance: \begin{equation} \label{eq:crb-prob} \boxed{ \begin{aligned} L(\widehat\theta_n) - L(\theta_*) &\approx \frac{1}{2} \Vert\theta - \theta_*\Vert_{\mathbf{H}}^2 \approx \frac{d_{eff} (1 + \sqrt{2\log(1/\delta)})^2}{2n}, \end{aligned} } \tag{$\star$} \end{equation}

where $\approx$ hides the $o(1/n)$ factor when $n \to \infty$.

The analysis leading to \eqref{eq:crb-prob} can be done in three steps:

1. First, one can easily control the squared “natural” norm of the score, $\Vert\nabla L_n(\theta_*)\Vert_{\mathbf{H}^{-1}}^2$, by the central limit theorem, since it is the average of i.i.d. quantities – squared norms of the gradients $\Vert \nabla_{\theta} \ell(Y_i,X_i^\top \theta_*) \Vert_{\mathbf{H}}^2, \quad 1 \le i \le n.$

2. Then one can prove that as $n \to \infty$, the empirical risk can be approximated by its 2nd-order Taylor expansion $L_n(\theta_*) + \left\langle \nabla L_n(\theta_*),\theta - \theta_*\right\rangle + \frac{1}{2} \Vert \widehat\theta_n - \theta_* \Vert^2_{\mathbf{H}_n}$ where $\mathbf{H}_n = \nabla^2 L_n(\theta_*)$ is the empirical Hessian at $\theta_*$. Since $\mathbf{H}_n$ converges to $\mathbf{H}$ in the positive-semidefinite sense, this allows to localize the estimate: \begin{align} 0 &\ge L_n(\widehat\theta_n) - L_n(\theta_*) \\ &\approx \left\langle \nabla L_n(\theta_*),\widehat \theta_n - \theta_*\right\rangle + \frac{1}{2} \Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}_n}^2 \\ &\approx \left\langle \nabla L_n(\theta_*),\widehat \theta_n - \theta_*\right\rangle + \frac{1}{2} \Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}}^2 \label{eq:localization-chain}\tag{2}\\ &= \left\langle \mathbf{H}^{-1/2} \nabla L_n(\theta_*), \mathbf{H}^{1/2}(\widehat \theta_n - \theta_*)\right\rangle + \frac{1}{2} \Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}}^2 \\ &\ge -\Vert \nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}} \Vert\widehat \theta_n - \theta_* \Vert_{\mathbf{H}} + \frac{1}{2} \Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}}^2. \end{align} where the transition to the third line uses that $\mathbf{H}_n$ converges to $\mathbf{H}$, and the final transition is by Cauchy-Schwarz. As a result, we arrive at $\frac{1}{2} \Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}}^2 \le 2 \Vert\nabla L_n(\theta_*)\Vert_{\mathbf{H}^{-1}}^2.$

3. Once the localization is achieved, one can similarly control the excess risk $L(\widehat\theta_n) - L(\theta_*)$ through its second-order Taylor approximation $\frac{1}{2} \Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}}^2$. For this, one shows that the Hessian at $\theta$, $\mathbf{H}(\theta) := \nabla^2 L(\theta),$ remains nearly constant in the Dikin ellipsoid $\Theta_{r}(\theta_*) = \left\{ \theta \in \mathbb{R}^d: \Vert\theta - \theta_*\Vert_{\mathbf{H}} \le r \right\}$ with a constant radius $r = O(1)$, where by near-constant’’ we mean that $c \mathbf{H}(\theta_*) \preccurlyeq \mathbf{H}(\theta) \preccurlyeq C\mathbf{H}(\theta_*)$ in the positive-semidefinite sense for some constants $0 < c \le 1$ and $C \ge 1$, concisely written as $\mathbf{H}(\theta) \asymp \mathbf{H}(\theta_*).$ This classical fact, whose proof we omit here, can be obtained from the relation between the second and third moments of the calibrated design – the vector $\tilde X$ satisfying $\mathbb{E}[\tilde X \tilde X^{\top}] = \mathbf{H}$: $\tilde X(\theta_*) := [\ell_{\eta}''(Y,X^\top \theta_*)]^{1/2} X.$

## Finite-sample setup: the challenge

When we want to prove a non-asymptotic analogue of \eqref{eq:crb-prob} in the finite-sample setup, the first step of the asymptotic ‘‘recipe’’ remains more or less the same: one can simply use Hoeffding’s inequality instead of the central limit theorem to control $\Vert \nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}}^2$ under subgaussian assumptions on the loss gradients. Also, the third step relies to a non-statistical argument since the sample size does not figure in it at all, so there are no changes here as well. Thus, the challenge is in the second (localization) step: now we must prove that $L_n(\widehat\theta_n) - L_n(\theta_*)$ is close to its second-order Taylor approximation centered at $\theta_*$, without taking the limit. As we will see a bit later, this could be reduced to showing that $\mathbf{H}_n(\theta)$ is near-constant, with high probability, uniformly over the Dikin ellipsoid $\Theta_{c}(\theta_*)$ with constant radius. Since the same property is known to be for the non-random Hessian $\mathbf{H}(\theta)$, our task boils down to bounding the uniform deviations of $\mathbf{H}_n(\theta)$ from $\mathbf{H}(\theta)$ on $\Theta_{c}(\theta_*)$. Generally, this task is rather complicated, and requires the advanced theory of empirical processes together with some global assumptions on the whole domain of $\theta$, see, e.g., (Spokoiny, 2012). However, in some cases it can be made simpler through the delicate use of self-concordance. This concept was introduced by Nemirovski and Nesterov (1994) in the context of interior-point methods, and brought to the attention of the statistical learning community by Bach (2010).

Our next goal is to understand the role of this concept in the finite-sample analysis.

## Simple case: least-squares

In the simplest case of least-squares, that is, when $\ell(Y,X^\top \theta) = \frac{1}{2} (Y - X^\top\theta)^2,$ the analysis is rather straightforward since $L_n(\theta)$ and $L(\theta)$ are quadratic forms. All that is needed is to apply a concentration inequality to $\Vert\nabla L_n(\theta_*)\Vert_{\mathbf{H}}$, and make sure that $\mathbf{H}_n \asymp \mathbf{H}$ to guarantee that the first transition in the chain \eqref{eq:localization-chain} remains valid. Specifically, recalling the definition of calibrated design $\tilde X(\theta)$, we see that In least-squares it coincides with $X$ at any point $\theta$, hence, $\mathbf{H} \equiv \mathbb{E}[X X^{\top}], \quad \mathbf{H}_n \equiv \frac{1}{n}\sum_{i=1}^n X_i X_i^{\top}.$ Thus, the analysis is thus reduced to controlling the deviations of a single sample covariance matrix – that of $X$ – from its expectation. This can be done using the well-known result from (Vershynin, 2012): assuming that the decorrelated design $\mathbf{H}^{-1/2} X$ is $K$-subgaussian, that is, for any direction $u \in \mathbb{R}^d$ it holds $\mathbb{E}\left[\exp \left\langle u, \mathbf{H}^{-1/2} X \right\rangle \right] \le \exp(K^2\Vert u\Vert_2^2/2),$ we have $\mathbf{H}_n \asymp \mathbf{H}$ with probability at least $1-\delta$ as soon as $\boxed{ n \gtrsim K^4 (d + \log(1/\delta)), }$ where $a \gtrsim b$ is a shorthand for $b = O(a)$. When combined together, this gives $O(d/n)$ rate in the regime $n = \Omega(d)$. Next we will show how to extend this result, first obtained in (Hsu et al., 2012), beyond the case of least squares.

## Localization lemma

Before we embark on self-concordance, I will demonstrate that the localization of $\widehat\theta_n$ in the Dikin ellipsoid $\Theta_r(\theta_*)$ of some radius $r$ is guaranteed if we show the uniform approximation bound $\mathbf{H}_n(\theta) \asymp \mathbf{H}_n(\theta_*), \quad \theta \in \Theta_r(\theta_*).$ In what follows, we assume that the loss is convex. Also, we assume that the (decorrelated) calibrated design $\mathbf{H}^{-1/2} \tilde X(\theta_*)$, is $K_2$-subgaussian, and for some $\delta \in (0,1)$ it holds $n \gtrsim K_2^4 (d + \log(1/\delta)),$ so that we can apply the result of (Vershynin, 2014), and with probability $1-\delta$ identify the empirical and true Hessians at a single point $\theta_*$: $\mathbf{H}_n \asymp \mathbf{H}$ (recall that we use $\mathbf{H}$ and $\mathbf{H}_n$, without parentheses, as shorthands for $\mathbf{H}(\theta_*)$ and $\mathbf{H}_n(\theta_*)$). We then have an auxiliary result called the Localization Lemma.

Localization lemma. Suppose that $n \gtrsim K_2^4 (d + \log(1/\delta))$, and for some $r \ge 0$ it holds $\mathbf{H}_n(\theta) \asymp \mathbf{H}_n(\theta_*), \, \forall \theta \in \Theta_{r}(\theta_*).$ Then, for any $r_0 \le r$, the following holds: whenever $\Vert\nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}}^2 \lesssim r_0^2,$ we have that $\widehat\theta_n$ belongs to $\Theta_{r_0}(\theta_*)$, and moreover, $L(\widehat \theta_n) - L(\theta_*) \lesssim \Vert\widehat \theta_n - \theta_*\Vert_{\mathbf{H}}^2 \lesssim \Vert\nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}}^2.$

Proof sketch. By definition, $L_n(\widehat\theta_n) \le L_n(\theta_*)$. Assume that $\widehat\theta_n$ is not in $\Theta_{r_0}(\theta_*)$, and choose the point $\bar \theta_n$ on the segment $[\theta_*,\widehat\theta_n]$ such that $\bar \theta_n$ is precisely on the boundary of $\Theta_{r_0}(\theta_*)$, so that $\Vert\bar \theta_n - \theta_*\Vert_{\mathbf{H}} = r_0.$ Note that by convexity of the level sets of $L_n(\theta)$, we still have $L_n(\bar\theta_n) \le L_n(\theta_*).$ On the other hand, by the intermediate value theorem, for some $\theta’_n$ belonging to the segment $[\theta_*, \bar\theta_n]$, and hence to $\Theta_{r_0}(\theta_*)$, it holds \begin{align} 0 &\ge L_n(\bar\theta_n) - L_n(\theta_*) \\ &= \left\langle \nabla L_n(\theta_*), \bar\theta_n - \theta_* \right\rangle + \frac{1}{2} \Vert \bar\theta_n - \theta_*\Vert_{\mathbf{H}_n(\theta’_n)}^2 \\ &\approx \left\langle \nabla L_n(\theta_*), \bar\theta_n - \theta_* \right\rangle + \frac{1}{2} \Vert \bar\theta_n - \theta_* \Vert_{\mathbf{H}_n}^2 \\ &\approx \left\langle \nabla L_n(\theta_*), \bar\theta_n - \theta_* \right\rangle + \frac{1}{2} \Vert \bar\theta_n - \theta_*\Vert_{\mathbf{H}}^2 \\ &\ge - \Vert \nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}} \Vert\bar\theta_n - \theta_* \Vert_{\mathbf{H}} + \frac{1}{2} \Vert \bar\theta_n - \theta_*\Vert_{\mathbf{H}}^2 \\ &= - r_0 \Vert \nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}} + \frac{r_0^2}{2}, \end{align} where we use $a \approx b$ as a shorthand for saying that $a$ and $b$ are within a multiplicative constant factor from each other, and used that~$\mathbf{H}_n(\theta’_n) \asymp \mathbf{H}_n(\theta_*)$ in the second line. Rearranging the terms, we arrive at a contradiction, so in fact $\widehat\theta_n$ must belong to $\Theta_{r_0}(\theta_*)$. This proves the first claim of the lemma. Now that we know that $\widehat\theta_n \in \Theta_{r_0}(\theta_*)$, we can also prove that $\Vert\widehat \theta_n - \theta_*\Vert_{\mathbf{H}}^2 \lesssim \Vert\nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}}^2$ by replacing $\bar \theta_n$ with $\widehat\theta_n$ in the above chain of inequalities. Finally, $L(\widehat \theta_n) - L(\theta_*) \lesssim \Vert\widehat \theta_n - \theta_*\Vert_{\mathbf{H}}^2$ also follows from the intermediate value theorem. using that $\widehat \theta_n$ belongs to the ellipsoid $\Theta_r(\theta_*)$ in which $\mathbf{H}_n(\theta) \asymp \mathbf{H}_n(\theta_*)$, and applying the sample covariance matrix concentration result to $\mathbf{H}_n(\theta_*)$. $\blacksquare$

Recall that the gradient of the empirical risk is the average of i.i.d. random vectors, $\nabla L_n(\theta_*) = \frac{1}{n} \sum_{i = 1}^n \nabla_{\theta} \ell(Y_i,X_i^\top \theta_*),$ and each $\nabla_{\theta} \ell(Y_i,X_i^\top \theta)$ has covariance $\mathbf{G}$. Assuming that decorrelated gradients $\mathbf{G}^{-1/2}\nabla_{\theta} \ell(Y_i,X_i^\top \theta_*)$ are $K_1$-subgaussian, Bernstein inequality implies, with probability $\ge 1-\delta$, $\Vert\nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}}^2 \lesssim \frac{K_1^2 d_{eff} \log(1/\delta)}{n}.$ Hence, the localization lemma implies that if we can guarantee that $\mathbf{H}_n(\theta)$ is near-constant over the Dikin ellipsoid of radius $r$, the sample size sufficient to guarantee a finite-sample analogue of \eqref{eq:crb-prob} is $\boxed{ n \gtrsim \max \left\{ K_2^4 (d+ \log(1/\delta)), \; \color{red}{r^2} K^2 K_1^2 {\color{blue}{d_{eff}}} \log(1/\delta) \right\}. }$ The first bound guarantees reliable estimation of the risk curvature at the optimum, and is the same as in linear regression, so we have reasons to suggest that it is unavoidable. On the other hand, the second bound is related to the fact that the loss is not quadratic, and dominates the first one, assuming $d_{eff} = O(d)$, unless $r$ – the radius of the Dikin ellipsoid in which $\mathbf{H}_n(\theta) \asymp \mathbf{H}_n(\theta_*)$ with high probability – is constant. In the next post, we will see how self-concordance leads to Hessian approximation bounds of this type with $r = O(\sqrt{d})$ which results in the $O(d \cdot d_{eff})$ sample size, and how this can be improved to $r = O(1)$ and $n = O(\max(d,d_{eff}))$ with a more subtle argument.

1. B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Ann. Statist., 28:5(2000), 1302-1338.

2. V. Spokoiny. Parametric estimation. Finite sample theory. Ann. Statist., 40:6(2012), 2877-2909.

3. A. Nemirovski and Yu. Nesterov. Interior-point polynomial algorithms in convex programming. Society for Industrial and Applied Mathematics, Philadelphia, 1994.

4. F. Bach. Self-concordant analysis for logistic regression. Electron. J. Stat., 4(2010), 384-414.

5. R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. Compressed Sensing: Theory and Applications, 210–268. Cambridge University Press, 2012.

6. D. Hsu, S. Kakade, and T. Zhang. Random design analysis of ridge regression. COLT, 2012.