Look at the Corners: Research blog of Dmitrii M. Ostrovskii
http://ostrodmit.github.io/blog/
<h1>Finite-sample analysis of $M$-estimators via self-concordance, II</h1>
<p>This is the second of two posts where I present <a href="https://arxiv.org/abs/1810.06838"><strong>our recent work with Francis Bach</strong></a> on the optimal finite-sample rates for $M$-estimators.
Recall that in the previous post, we proved the Localization Lemma, which states the following: stability of the empirical risk Hessian $\mathbf{H}_n(\theta)$ on the Dikin ellipsoid of radius $r$,
\[
\mathbf{H}_n(\theta) \asymp \mathbf{H}_n(\theta_*), \, \forall \theta \in \Theta_{r}(\theta_*),
\]
guarantees that once the <em>score</em> satisfies $\Vert\nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}}^2 \lesssim r^2,$
one has the desired excess risk bound:
\[
L(\widehat \theta_n) - L(\theta_*) \lesssim \Vert\widehat \theta_n - \theta_*\Vert_{\mathbf{H}}^2 \lesssim \Vert\nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}}^2.
\]
I will now show how self-concordance allows us to obtain such guarantees for $\mathbf{H}_n(\theta)$.</p>
<h2 id="self-concordance">Self-concordance</h2>
<p>Recall that $\ell(y,\eta)$ is the loss of predicting a label $y \in \mathcal{Y}$ with $\eta = X^\top \theta \in \mathbb{R}$. We assume that $\ell(y,\eta)$ is convex in its second argument, and introduce
the following definition:</p>
<blockquote>
<p><strong>Definition 1</strong>. The loss $\ell(y,\eta)$ is <strong>self-concordant (SC)</strong> if for any $(y,\eta) \in \mathcal{Y} \times \mathbb{R}$ it holds
\[
\boxed{
|\ell'''_\eta(y,\eta)| \le [\ell''_\eta(y,\eta)]^{3/2}.
}
\]</p>
</blockquote>
<p>While the above definition is homogeneous in $\eta$, the next one is not, since the power $3/2$ is removed:</p>
<blockquote>
<p><strong>Definition 2</strong>. The loss $\ell(y,\eta)$ is <strong>pseudo self-concordant (PSC)</strong> if instead it holds
\[
\boxed{
|\ell'''_\eta(y,\eta)| \le \ell''_\eta(y,\eta).
}
\]</p>
</blockquote>
<p>The first definition is inspired by <a href="https://epubs.siam.org/doi/book/10.1137/1.9781611970791?mobileUi=0">[3]</a> where similar functions (in $\mathbb{R}^d$) were introduced in the context of interior-point algorithms.
As we are about to see, the homogeneity of this definition in $\eta$ allows us to quantify the precision of local quadratic approximations of the empirical risk in an affine-invariant manner, which is a natural requirement if we recall that $M$-estimators are <em>themselves</em> affine-invariant.
The second definition violates this property; we will see that this leads to an increased sample size needed to guarantee the optimal excess risk for PSC losses, under an extra assumption.
On the other hand, PSC losses are somewhat more common.
I will now give several examples of SC and PSC losses.</p>
<ol>
<li>
<p>One family of examples arises in generalized linear models (GLMs). In particular, <strong><em>the logistic loss</em></strong>
\[
\ell(y,\eta) = \frac{\log(1 + e^{-y\eta})}{\log 2},
\]
is PSC (see <a href="https://projecteuclid.org/euclid.ejs/1271941980">[4]</a>).
In fact, one can consider other GLM losses with canonical parameter, given by
\[
\ell(y,\eta) = - y\eta + a(\eta) - b(y),
\]
where the <em>cumulant</em> $a(\eta)$ satisfies
\[
a'(\eta) = \mathbb{E}_{p_\eta}[Y],
\]
where
$
p_\eta(y) \propto e^{y\eta + b(y)}
$
is the density that the model imposes on $Y$. Thus, for GLMs self-concordance can be interpreted in terms of this model distribution of $Y$: for $s \in \{2,3\}$ we have
\[
\ell_\eta^{(s)}(y,\eta) = a^{(s)}(\eta) = \mathbb{E}_{p_\eta} [(Y - \mathbb{E}_{p_\eta}Y)^s].
\]
Then, (pseudo) self-concordance specifies the relation between the central moments of $p_\eta(\cdot)$.
In particular, we see that the logistic loss is PSC simply because $\mathcal{Y} = \{0,1\}$, and hence
\[
|a'''(\eta)| \le \mathbb{E}_{p_\eta} |(Y - \mathbb{E}_{p_\eta}[Y])^3| \le \mathbb{E}_{p_\eta} [(Y - \mathbb{E}_{p_\eta}[Y])^2] = a''(\eta).
\]
Another example of a PSC loss arises in <em>Poisson regression</em> where the model assumption is that $Y \sim \text{Poisson}(e^\eta)$ which corresponds to $a(\eta) = \exp(\eta)$.
Unfortunately, this implies a heavy-tailed distribution of the calibrated design $\tilde X(\theta)$ under the model distribution, which creates additional difficulties; see our paper for more details.</p>
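<p>As a quick numerical sanity check of the above (the derivative formulas below follow from the moment identities; the snippet itself is illustrative and not from the paper), one can verify on a grid that the logistic cumulant is PSC but not SC:</p>

```python
import numpy as np

# Sanity check (illustrative): the logistic cumulant a(eta) = log(1 + e^eta)
# satisfies the PSC inequality |a'''| <= a'' (Definition 2) everywhere, but
# the SC inequality |a'''| <= (a'')^{3/2} (Definition 1) fails for large
# |eta|, where a'' -> 0 and (a'')^{3/2} << a''.
eta = np.linspace(-20.0, 20.0, 2001)
p = 1.0 / (1.0 + np.exp(-eta))      # sigmoid: mean of Bernoulli(p) model
a2 = p * (1.0 - p)                  # a''(eta): variance of Bernoulli(p)
a3 = a2 * (1.0 - 2.0 * p)           # a'''(eta): third central moment
psc_holds = bool(np.all(np.abs(a3) <= a2 + 1e-12))
sc_holds = bool(np.all(np.abs(a3) <= a2 ** 1.5 + 1e-12))
```

<p>Since $|1 - 2p| \le 1$, the PSC bound holds with room to spare, while the SC bound is lost in the tails.</p>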
</li>
<li>
<p>Another family of PSC losses arises in <strong><em>robust estimation</em></strong>. Here, $\ell(y,\eta) = \varphi(y-\eta)$ with $\varphi(\cdot)$ being convex, even, 1-Lipschitz, and satisfying $\varphi''(0) = 1$.
The prototypic example of a robust loss is the <em>Huber loss</em> corresponding to
\[
\varphi(t) =
\left\{
\begin{align}
&{t^2}/{2},
&\quad|t| &\le 1, \\
&|t| - 1/2,
&\quad |t| &> 1.
\end{align}
\right.
\]
However, $\varphi'''(t)$ does not exist at $t = \pm 1$, so the Huber loss is neither SC nor PSC.
Fortunately, the Huber loss can be well-approximated by some <em>pseudo-Huber losses</em> that turn out to be PSC (see the figure below):
\begin{align}
\label{def:pseudo-huber}
\varphi(t) &= \log\left(\frac{\exp(t) + \exp(-t)}{2}\right), \\
\varphi(t) &= \sqrt{1 + {t^2}}-1.
\end{align}</p>
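<p>For the second pseudo-Huber loss, a small numerical sketch (the derivative formulas are my own, not from the post) confirms that the ratio $|\varphi'''|/\varphi''$ is uniformly bounded, so the PSC inequality holds after a fixed rescaling of $\varphi$:</p>

```python
import numpy as np

# Sketch: for phi(t) = sqrt(1 + t^2) - 1 one computes
#   phi''(t)  = (1 + t^2)^{-3/2},
#   phi'''(t) = -3 t (1 + t^2)^{-5/2},
# so |phi'''| / phi'' = 3|t| / (1 + t^2) <= 3/2, attained at t = +-1.
t = np.linspace(-50.0, 50.0, 100001)
phi2 = (1.0 + t**2) ** (-1.5)
phi3 = -3.0 * t * (1.0 + t**2) ** (-2.5)
max_ratio = float(np.max(np.abs(phi3) / phi2))
```

<p>A bounded ratio $|\varphi'''| \le M \varphi''$ becomes the PSC inequality for the rescaled loss $\psi(t) = M\varphi(t/M)$.</p>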
</li>
<li>
<p>Finally, we can construct SC losses using the classical result from <a href="https://epubs.siam.org/doi/book/10.1137/1.9781611970791?mobileUi=0">[3]</a> that self-concordance is preserved under taking the Fenchel conjugate.</p>
<blockquote>
<p>The Fenchel conjugate $\varphi^*$ of a self-concordant function $\varphi: D \to \mathbb{R}$, where $D \subseteq \mathbb{R}$, is also self-concordant.</p>
</blockquote>
</li>
</ol>
<p>This fact allows us to construct SC losses with derivatives ranging over a given interval, by computing the convex conjugate of the logarithmic barrier of a given subset of $\mathbb{R}$. In particular, taking the log-barrier of $[-1,1]$, we obtain an analogue of the Huber loss, and the log-barrier of $[-1,0]$ results in an analogue of the logistic loss (see the figure below).
Note that in the latter case, our loss does not upper-bound the 0-1 loss. However, its negative part grows only logarithmically, and using the well-known calibration theory from (Bartlett et al., 2006), one can show that the associated expected risk still well-approximates the probability of misclassification.</p>
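<p>The conjugate construction is easy to try out numerically. Below is an illustrative sketch (grid sizes and the specific barrier $f(x) = -\log(1-x^2)$ for $[-1,1]$ are my own choices) that computes the conjugate on a grid and checks that the resulting Huber-like loss is convex with derivatives confined to $(-1,1)$:</p>

```python
import numpy as np

# Sketch: approximate the Fenchel conjugate of the log-barrier of [-1,1],
#   f(x) = -log(1 - x^2),   f*(s) = sup_x { s*x - f(x) },
# by a maximum over a fine grid of x, and inspect the resulting loss.
x = np.linspace(-0.999999, 0.999999, 100001)
f = -np.log(1.0 - x**2)

s = np.linspace(-10.0, 10.0, 201)
phi = np.array([np.max(si * x - f) for si in s])  # grid approximation of f*(s)

slopes = np.diff(phi) / np.diff(s)                # secant slopes of f*
is_convex = bool(np.all(np.diff(slopes) >= -1e-9))
max_slope = float(np.max(np.abs(slopes)))         # derivatives stay in (-1,1)
```

<p>Since the grid approximation is a maximum of linear functions of $s$, it is exactly convex, and its slopes are grid values of $x$, hence bounded by $1$ in absolute value, matching the Lipschitz behavior expected of a Huber-type loss.</p>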
<center>
<figure>
<img src="/blog/figs/robust-for-blog.png" alt="Robust regression loss" width="360" /> <img src="/blog/figs/class-for-blog.png" alt="Robust classification loss" width="360" />
<figcaption>
Our self-concordant analogues of the Huber and logistic losses.
</figcaption>
</figure>
</center>
<h2 id="integration-argument-and-the-basic-result">Integration argument and the basic result</h2>
<p>By the Localization Lemma, our task reduces to proving that the empirical Hessian is stable, with high probability, in the Dikin ellipsoid $\Theta_r(\theta_*)$ with a <em>constant</em> radius $r$.
On the other hand, self-concordance readily allows us to get
\[
r = O\left(\frac{1}{\sqrt{d}}\right)
\]
using a simple integration technique sketched below. This leads to the bound
\[
n \gtrsim d \cdot d_{eff},
\]
up to the dependency on $\delta$ and subgaussian constants.</p>
<p>1.
Indeed, let the loss be SC, and recall that the Hessian of the empirical risk at point $\theta$ is
\[
\mathbf{H}_n(\theta) = \frac{1}{n} \sum_{i=1}^n \ell''(Y_i,X_i^\top\theta) X_i X_i^\top.
\]
Hence, we can compare $\mathbf{H}_n(\theta)$ with $\mathbf{H}_n(\theta_*)$ by comparing $\ell''(Y_i,X_i^\top\theta)$ with $\ell''(Y_i,X_i^\top\theta_*)$.</p>
<p>2.
Integrating $|\ell'''(y,\eta)| \le [\ell''(y,\eta)]^{3/2}$ from $\eta_* = X^\top \theta_*$ to $\eta = X^\top \theta$, we arrive at
\[
\frac{1}{(1+[\ell''(y,\eta_*)]^{1/2}|\eta - \eta_*|)^2} \le \frac{\ell''(y,\eta)}{\ell''(y,\eta_*)} \le \frac{1}{(1-[\ell''(y,\eta_*)]^{1/2}|\eta - \eta_*|)^2},
\]
or equivalently,
\[
\frac{1}{(1+|\langle \tilde X(\theta_*), \theta - \theta_* \rangle|)^2} \le \frac{\ell''(Y,X^\top \theta)}{\ell''(Y,X^\top \theta_*)} \le \frac{1}{(1- |\langle \tilde X(\theta_*), \theta - \theta_* \rangle|)^2}.
\]
3.
The ratio is bounded if $|\langle \tilde X(\theta_*), \theta-\theta_* \rangle| \le c < 1$. By Cauchy-Schwarz, it suffices that
\[
\Vert\tilde X(\theta_*)\Vert_{\mathbf{H}(\theta_*)^{-1}} \Vert(\theta-\theta_*)\Vert_{\mathbf{H}(\theta_*)} \le c.
\]
On the other hand, $\mathbf{H}(\theta_*)^{-1/2}\tilde X(\theta_*)$ is a $K_2$-subgaussian vector in $\mathbb{R}^d$, therefore,
\[
\Vert\tilde X(\theta_*)\Vert_{\mathbf{H}(\theta_*)^{-1}} = \Vert\mathbf{H}(\theta_*)^{-1/2} \tilde X(\theta_*)\Vert_{2} = O(K_2\sqrt{d}).
\]
Thus, we can guarantee that with high probability, $\mathbf{H}_n(\theta) \asymp \mathbf{H}_n(\theta_*)$ for any $\theta$ such that
\[
\Vert(\theta-\theta_*)\Vert_{\mathbf{H}(\theta_*)} \le \frac{c}{K_2 \sqrt{d}},
\]
i.e., in the Dikin ellipsoid $\Theta_r(\theta_*)$ with radius $r \lesssim \frac{1}{K_2 \sqrt{d}}$.
Combining this with the Localization Lemma and some union bounds, we obtain the following result:</p>
<blockquote>
<p><strong>Theorem 1.</strong> Assume that the loss is self-concordant, and that $\mathbf{G}^{-1/2}\nabla \ell_\theta(Y,X^\top \theta_*)$ and $\mathbf{H}^{-1/2}\tilde X(\theta_*)$ are, correspondingly, $K_1$- and $K_2$-subgaussian. Then, with probability at least $1-\delta$ it holds
\[
\label{eq:main-ineq}
\boxed{
L(\widehat\theta_n) - L(\theta_*) \lesssim \Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}}^2 \lesssim \frac{K_1^2 d_{eff} \log(1/\delta)}{n}
}
\]
provided that
\begin{equation}
\label{eq:n-mine-complex}
\boxed{
n \gtrsim \max\left\{K_2^4 \left( {\color{blue} d} + \log\left(1/\delta\right)\right), \; {\color{red} d} K_1^2 K_2^2 {\color{blue}{d_{eff}}}\log\left(d/{\delta}\right)\right\}.
}
\end{equation}</p>
</blockquote>
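<p>Before discussing the theorem, here is a quick numerical sanity check of the integration step above, on a single SC function of my own choosing (not taken from the paper): $f(\eta) = -4\log\eta$ satisfies Definition 1 with equality, and the two-sided sandwich on the ratio of second derivatives can be verified on a grid.</p>

```python
import numpy as np

# Sanity check (illustrative): f(eta) = -4*log(eta) has
#   f''(eta) = 4/eta^2,   f'''(eta) = -8/eta^3,
# so |f'''| = [f'']^{3/2} exactly (Definition 1 with equality). Verify the
# two-sided bound on f''(eta)/f''(eta_*) while [f''(eta_*)]^{1/2}|eta-eta_*| < 1.
f2 = lambda e: 4.0 / e**2
eta_star = 1.0
for eta in np.linspace(0.55, 1.45, 181):
    gap = np.sqrt(f2(eta_star)) * abs(eta - eta_star)   # stays below 1 here
    ratio = f2(eta) / f2(eta_star)
    assert 1.0 / (1.0 + gap) ** 2 <= ratio <= 1.0 / (1.0 - gap) ** 2
ok = True
```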
<p>We see that the critical sample size grows as the <strong>product</strong> of $d_{eff}$ and $d$, since we can only guarantee that $\mathbf{H}_n(\theta)$ is stable in the small Dikin ellipsoid with $O(1/\sqrt{d})$ radius.
In fact, this result can be improved. But before showing how this can be achieved, let me consider the case of PSC losses.</p>
<h2 id="pseudo-self-concordant-losses">Pseudo self-concordant losses</h2>
<p>In the case of PSC losses, an analogous integration argument (this time integrating $|\ell'''(y,\eta)| \le \ell''(y,\eta)$) implies that in order to bound the ratio of the second derivatives, we must ensure that
\[
\Vert X\Vert_{\mathbf{H}^{-1}} \Vert(\theta-\theta_*)\Vert_{\mathbf{H}}\le c,
\]
that is, the calibrated design $\tilde X(\theta_*) = [\ell''_\eta(Y,X^\top \theta_*)]^{1/2}X$ gets replaced with $X$.
Intuitively, this is due to the lack of the extra square root of $\ell''_\eta(y,\eta)$ in <strong>Definition 2</strong>.
To control $\Vert X\Vert_{\mathbf{H}(\theta_*)^{-1}}$, we can introduce the standard assumption from <a href="https://projecteuclid.org/euclid.ejs/1271941980">[4]</a>, relating $\mathbf{H}(\theta_*)$ and the second-moment matrix of the design, $\boldsymbol{\Sigma} := \mathbb{E}[X X^\top]$, as follows:
\[
\boldsymbol{\Sigma} \preccurlyeq \rho \mathbf{H}.
\]
By a similar argument, we show that under this additional assumption, and assuming that $\boldsymbol{\Sigma}^{-1/2}X$ is $K$-subgaussian, the critical sample size from <strong>Theorem 1</strong> gets replaced with
\begin{equation}
\label{eq:n-fake-complex}
\boxed{
n \gtrsim \max\left\{K_2^4 \left(\color{blue}{d} + \log\left({1}/{\delta}\right)\right), \; \color{red}{\rho d} K^2 K_1^2 \color{blue}{d_{eff}} \log\left(d/{\delta}\right)\right\};
}
\tag{PSC}
\end{equation}
essentially, it becomes $\rho$ times larger.
Note that the only <em>generic</em> upper bound available for $\rho$ is
\[
\rho \le \frac{1}{\inf_{\eta} \ell''_{\eta}(y,\eta)},
\]
where $\eta$ ranges over the set $\left\{\eta(\theta) = X^{\top} \theta, \;\theta \in \Theta \right\},$ and $\Theta$ is the set of possible predictors. In particular, in logistic regression with $\Vert X \Vert_2 \le D$ and $\Theta = \{\theta \in \mathbb{R}^d: \Vert\theta\Vert_2 \le R\}$, this gives
\[
\rho \lesssim \exp({RD}).
\]
Moreover, this bound is achievable on a certain (quite artificial) distribution of $X$ as shown in (Hazan et al., 2014).
However, the <em>actual</em> value of $\rho$ depends on the data distribution, and is moderate when this distribution is not chosen adversarially as discussed in (Bach and Moulines, 2013).
In fact, in the paper we show that in logistic regression with Gaussian design $X \sim \mathcal{N}(0, \boldsymbol{\Sigma})$ one has
\[
\rho \lesssim 1+\Vert\theta_*\Vert_{\boldsymbol{\Sigma}}^3.
\]
Still, $\rho \gg 1$ when $\Vert\theta_*\Vert_{\boldsymbol{\Sigma}} \gg 1$, and our construction of self-concordant analogues of the Huber and logistic losses remains useful.</p>
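<p>The growth of $\rho$ with $\Vert\theta_*\Vert_{\boldsymbol{\Sigma}}$ is easy to observe in simulation. Here is a Monte Carlo sketch in a one-dimensional toy setup of my own choosing (standard Gaussian design, so $\rho$ reduces to a ratio of two scalar expectations):</p>

```python
import numpy as np

# Monte Carlo sketch: 1-d logistic regression with X ~ N(0,1). Here
#   Sigma = E[X^2],   H = E[a''(theta_* X) X^2],   rho = Sigma / H,
# with a''(eta) = sigma(eta)(1 - sigma(eta)). At theta_* = 0 we get
# a'' = 1/4 identically, so rho = 4; rho then grows with |theta_*|.
rng = np.random.default_rng(0)
X = rng.standard_normal(1_000_000)
rhos = []
for theta in [0.0, 1.0, 3.0]:
    p = 1.0 / (1.0 + np.exp(-theta * X))
    H = np.mean(p * (1.0 - p) * X**2)
    rhos.append(float(np.mean(X**2) / H))
```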
<h2 id="near-linear-bounds-via-a-covering-argument">Near-linear bounds via a covering argument</h2>
<p>In fact, we can get rid of the extra $O(d)$ factor in the previous bounds for the critical sample size, and obtain bounds that scale near-linearly in $\max(d, d_{eff})$.
In particular, for SC losses the critical sample size can be reduced to
\begin{equation}
\label{eq:n-mine-improved}
\boxed{
n \gtrsim \max\left\{\bar K_2^4 d\log\left({d}/\delta\right), \; K_1^2 \bar K_2^6 d_{eff} \log\left(1/\delta\right)\right\}.
}
\tag{SC$^*$}
\end{equation}
This is done via a more delicate argument, as explained below, and under the mild extra assumption that the calibrated design
$
\tilde X(\theta)
$
is $\bar K_2$-subgaussian, when multiplied by $\mathbf{H}(\theta)^{-1/2}$, at every point $\theta$ of the unit Dikin ellipsoid $\Theta_1(\theta_*)$, with $\bar K_2$ independent of $\theta$.
For pseudo self-concordant losses, the second bound gets inflated by $\rho$; on the other hand, the radius of the Dikin ellipsoid in which $\tilde X(\theta)$ is required to be uniformly subgaussian shrinks by a factor of $\sqrt{\rho}$. In fact, the extra assumption appears to be essential.
For example, in the paper we show that in logistic regression with $X \sim \mathcal{N}(0,\boldsymbol{\Sigma})$, one has
\[
\bar K_2 \lesssim K_2 + 1,
\]
where $K_2$ is the subgaussian constant of $\mathbf{H}(\theta_*)^{-1/2}\tilde X(\theta_*)$, and the two assumptions are equivalent.</p>
<center>
<figure>
<img src="/blog/figs/covering-cropped.png" alt="Covering the Dikin ellipsoid" width="600" />
<figcaption>
Covering the unit Dikin ellipsoid with smaller ones.
</figcaption>
</figure>
</center>
<p>Let us briefly explain the main ideas behind the proof of the improved bounds (see the figure above).
First of all, recall that the extra $d$ factor in the previous bounds appeared because we used self-concordance of the <em>individual losses</em>, which only allowed us to prove stability of the empirical Hessian in a small Dikin ellipsoid with radius $O(1/\sqrt{d})$.
This factor would be eliminated if we managed to show a probabilistic bound
\[
\mathbf{H}_n(\theta) \asymp \mathbf{H}_n(\theta_*),
\]
uniformly over the Dikin ellipsoid $\Theta_{c}(\theta_*)$ with a constant radius.
Such a bound could be obtained via an integration argument if the empirical risk $L_n(\theta)$ itself were self-concordant.
However, unlike the individual losses, <em>the empirical risk is not self-concordant</em>, so we have to come up with a more subtle argument.
In a nutshell, this argument combines the following ingredients:</p>
<ol>
<li>
<p><em>Self-concordance of the expected risk</em> $L(\theta)$ on $\Theta_c(\theta_*)$, with $c = 1/\bar K_2^{3/2}$, which follows from the subgaussian assumption about $\tilde X(\theta)$ and the relation between the moments of a subgaussian vector. This guarantees that for any $\theta \in \Theta_c(\theta_*)$ it holds
\[
\mathbf{H}(\theta) \asymp \mathbf{H}(\theta_*).
\]</p>
</li>
<li>
<p><em>Self-concordance of the individual losses</em> which guarantees, for any $\theta_0 \in \mathbb{R}^d$ and $\theta \in \Theta_{1/\sqrt{d}}(\theta_0)$, that
\[
\mathbf{H}_n(\theta) \asymp \mathbf{H}_n(\theta_0).
\]</p>
</li>
<li>
<p>Finally, a <em>covering argument</em>, in which $\Theta_c(\theta_*)$ is covered with smaller $O(1/\sqrt{d})$-ellipsoids. The problem then reduces to the control of the supremum
\[
\sup_{\theta_0 \in \mathcal{N}} \Vert \mathbf{H}(\theta_0)^{-1/2} \mathbf{H}_n(\theta_0) \mathbf{H}(\theta_0)^{-1/2} - \mathbf{I}\Vert,
\]
where $\Vert\cdot\Vert$ is the operator norm, and $\mathcal{N}$ is the $\varepsilon$-net corresponding to the covering of $\Theta_c(\theta_*)$.
This gives the extra $O(\log d)$ factor in \eqref{eq:n-mine-improved} since $\log(|\mathcal{N}|) = O(d \log d)$.</p>
</li>
</ol>
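<p>The conclusion of this covering argument is easy to probe numerically. Below is an illustrative simulation in a toy logistic model with parameters of my own choosing (standing in for a generic $M$-estimation problem): the normalized empirical Hessian stays close to the identity simultaneously at several points near $\theta_*$, for moderate $n$.</p>

```python
import numpy as np

# Simulation sketch: check that H(theta)^{-1/2} H_n(theta) H(theta)^{-1/2}
# stays close to the identity at several points theta near theta_* at once.
# The "net" here is just a handful of random points, and the population
# Hessian is approximated on a large fresh sample.
rng = np.random.default_rng(1)
d, n = 5, 20_000
theta_star = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
X_pop = rng.standard_normal((400_000, d))

def hessian(theta, Z):
    p = 1.0 / (1.0 + np.exp(-Z @ theta))
    w = p * (1.0 - p)                     # logistic ell''(y, eta)
    return (Z * w[:, None]).T @ Z / len(Z)

devs = []
for _ in range(10):                       # crude stand-in for the net N
    theta = theta_star + 0.1 * rng.standard_normal(d)
    L = np.linalg.cholesky(hessian(theta, X_pop))
    Linv = np.linalg.inv(L)
    M = Linv @ hessian(theta, X) @ Linv.T # same spectrum as H^{-1/2} H_n H^{-1/2}
    devs.append(np.linalg.norm(M - np.eye(d), 2))
max_dev = float(max(devs))
```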
<h2 id="conclusion">Conclusion</h2>
<p>We have demonstrated how to obtain fast rates for the excess risk of $M$-estimators in finite-sample regimes.
Our analysis can handle $M$-estimators with losses satisfying certain self-concordance-type assumptions that hold in some generalized linear models, notably logistic regression, and in robust estimation.
These assumptions allow us to control the precision of local quadratic approximations of the empirical risk.</p>
<p>One topic that remained beyond the scope of this post is $\ell_1$-regularized estimators in sparse high-dimensional regimes. There, we have not yet established the improved rates that should scale near-linearly with the sparsity of $\theta_*$. Another possible extension is to matrix-parametrized models that arise in covariance matrix estimation and independent component analysis, as well as the application of our techniques to algorithmically efficient procedures such as stochastic approximation.</p>
<h2 id="references">References</h2>
<ol>
<li>
<p>B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection.
<em>Ann. Statist.</em>, 28:5(2000), 1302-1338.</p>
</li>
<li>
<p>V. Spokoiny. Parametric estimation. Finite sample theory.
<em>Ann. Statist.</em>, 40:6(2012), 2877-2909.</p>
</li>
<li>
<p>A. Nemirovski and Yu. Nesterov. Interior-point polynomial algorithms in convex programming.
<em>Society for Industrial and Applied Mathematics, Philadelphia, 1994.</em></p>
</li>
<li>
<p>F. Bach. Self-concordant analysis for logistic regression.
<em>Electron. J. Stat.</em>, 4(2010), 384-414.</p>
</li>
<li>
<p>R. Vershynin. Introduction to the non-asymptotic analysis of random matrices.
<em>Compressed Sensing: Theory and Applications, 210–268</em>. Cambridge University Press, 2012.</p>
</li>
<li>
<p>D. Hsu, S. Kakade, and T. Zhang. Random design analysis of ridge regression. <em>COLT, 2012.</em></p>
</li>
<li>
<p>P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds.
<em>J. Am. Stat. Assoc.</em>, 101:473(2006), 138–156.</p>
</li>
<li>
<p>E. Hazan, T. Koren, and K. Levy. Logistic regression: tight bounds for stochastic and online optimization.
<em>COLT, 2014.</em></p>
</li>
<li>
<p>F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate $O(1/n)$.
<em>NIPS, 2013.</em></p>
</li>
</ol>
Tue, 13 Nov 2018 00:00:00 +0000
http://ostrodmit.github.io/blog/2018/11/13/self-concordance-part-2/

<h1>Finite-sample analysis of $M$-estimators via self-concordance, I</h1>
<p>In this series of posts, I will present our <a href="https://arxiv.org/abs/1810.06838"><strong>recent work with Francis Bach</strong></a> on the optimal rates for $M$-estimators with self-concordant-like losses. The term “$M$-estimator” is more commonly used in the statistical community; in the learning theory community one more often talks about empirical risk minimization.
I will mostly use the statistical terminology to stress the connections to the classical asymptotic results.</p>
<h2 id="setting">Setting</h2>
<p>We work in the familiar statistical learning setting: the goal is to predict the <em>label</em> $Y \in \mathcal{Y}$ as some function of the dot product $\eta = X^\top \theta$ between the <em>predictor</em> $X \in \mathbb{R}^d$ and <em>parameter</em> $\theta \in \mathbb{R}^d$. This is formalized as minimization of <strong>expected risk</strong>
\[
L(\theta) := \mathbb{E}[\ell(Y,X^\top\theta)],
\]
where $\ell: \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$ is some <em>loss function</em>, and $\mathbb{E}[\cdot]$ is expectation over the distribution $\mathcal{P}$ of the observation $Z = (X,Y)$. Let $\theta_*$ be a minimizer of $L(\theta)$, which will prove to be unique in our situation.
Since the distribution $\mathcal{P}$ is unknown, we cannot find $\theta_*$ directly. Instead we observe an i.i.d. sample
\[
(X_1,Y_1), \ldots, (X_n, Y_n) \sim \mathcal{P},
\]
and compute the empirical counterpart of $\theta_*$, a minimizer $\widehat \theta_n$ of <strong>empirical risk</strong>
\[
L_n(\theta) := \frac{1}{n}\sum_{i=1}^n \ell(Y_i,X_i^\top\theta).
\]
This setting includes regression in the case $\mathcal{Y} = \mathbb{R}$ (usually one also takes $\ell(y,y') = \varphi(y-y')$ for some function $\varphi: \mathbb{R} \to \mathbb{R}$, e.g., the squared loss) and classification in the case $\mathcal{Y} = \{0,1\}$.</p>
<p>To understand our somewhat unusual loss parametrization, note that when
$
\ell(y,\eta) = - \log p_{\eta}(y)
$
for some probability distribution $p_{\eta}(\cdot)$ on $\mathcal{Y}$ parametrized by $\eta \in \mathbb{R}$, we have a conditional likelihood model for $Y$ given $X^{\top} \theta$, and then $\widehat \theta_n$ becomes the maximum likelihood estimator (MLE). This includes, in particular, <em>least squares</em> and <em>logistic regression</em>. More generally, this is the case in <em>generalized linear models</em> (GLMs) with canonical parametrization, where the loss takes the form
\begin{equation}
\label{def:glm}
\ell(y,\eta) = - y\eta + a(\eta) - b(y),
\end{equation}
where $a(\eta): \mathbb{R} \to \mathbb{R}$, called <em>the cumulant</em>, normalizes $\ell(y,\eta)$ to be a valid negative log-likelihood:
\[
a(\eta) = \log \int_{\mathcal{Y}} \exp(y\eta + b(y)) \, \text{d}y.
\]</p>
<p>More generally, we can talk about the <em>quasi-MLE</em>, where one removes the implicit assumption that $Y$ is indeed generated by one of the distributions $p_{\eta}(\cdot)$ for some value $\eta_*$. In this case, the asymptotic properties of $\widehat\theta_n$ have long been studied in the statistical community. Our goal in this work was to extend this classical theory to the <em>finite-sample</em> setting, and beyond the quasi-MLE setup, that is, to obtain results for general $M$-estimators. I will now give a brief overview of this theory.</p>
<h2 id="background-asymptotic-theory">Background: asymptotic theory</h2>
<p>The classical theory of local asymptotic normality (LAN) considers the limit $n \to \infty$ with fixed $d$, and relies on the local regularity assumptions, requiring that $L(\theta)$ is sufficiently smooth at $\theta_*$, so that $\nabla L(\theta_*) = 0$, and <em>the excess risk</em>
$
L(\theta) - L(\theta_*)
$
can be well-approximated by its 2nd-order Taylor expansion, or the squared <em>prediction distance</em>
\[
\frac{1}{2} \Vert \theta - \theta_* \Vert_{\mathbf{H}}^2 := \frac{1}{2} \Vert \mathbf{H}^{1/2}(\theta - \theta_*) \Vert^2,
\]
where $\mathbf{H}$ is the risk Hessian at the optimum:
\[
\mathbf{H} := \nabla^2 L(\theta_*),
\]
and one usually assumes that $\mathbf{H} \succ 0$.
Note that in the case of least squares, we simply have
\[
L(\theta) - L(\theta_*) = \frac{1}{2} \Vert\theta - \theta_*\Vert_{\mathbf{H}}^2.
\]
These assumptions allow one to derive the <em>local asymptotic normality</em> of the quasi-MLE, or Fisher’s theorem:
\begin{equation}
\label{eq:lan-fisher}
\sqrt{n}\mathbf{H}^{1/2}(\widehat\theta_n - \theta_*) \rightsquigarrow \mathcal{N}(0, \mathbf{H}^{-1/2} \mathbf{G} \mathbf{H}^{-1/2}),
\tag{1}
\end{equation}
where $\rightsquigarrow$ is convergence in the law, and $\mathbf{G}$ is the covariance matrix of the loss gradient at $\theta_*$:
\[
\mathbf{G} := \mathbb{E} \left[ \nabla_\theta \ell(Y,X^\top\theta_*) \otimes \nabla_\theta \ell(Y,X^\top\theta_*) \right].
\]
In well-specified MLE, one has the <em>Bartlett identity</em> $\mathbf{G} = \mathbf{H}$; then, $\mathbf{H}^{-1/2} \mathbf{G} \mathbf{H}^{-1/2}$ is the identity, and
\[
\text{Tr}[\mathbf{H}^{-1/2} \mathbf{G} \mathbf{H}^{-1/2}] = d.
\]
In the general case, we can define the <em>effective dimension</em>
\[
d_{eff} := \text{Tr} [\mathbf{H}^{-1/2} \mathbf{G} \mathbf{H}^{-1/2}],
\]
and hope that it is not much larger than $d$, i.e., the model is only “moderately” misspecified.
In some cases this is known to be true: for example, in least-squares regression one has
$
d_{eff} \le \kappa_X \cdot \kappa_Y \cdot d,
$
irrespectively of the true distribution of the noise, whenever $X$ and $Y$ have bounded kurtoses $\kappa_X, \kappa_Y$:
\[
\mathbb{E}[(Y-\mathbb{E}[Y])^4]^{1/4} \le \kappa_Y \cdot \mathbb{E}[(Y-\mathbb{E}[Y])^2]^{1/2},
\]
\[
\mathbb{E}[\left\langle u, X-\mathbb{E}[X] \right\rangle^4]^{1/4} \le \kappa_X \cdot \mathbb{E}[\left\langle u, X-\mathbb{E}[X] \right\rangle^2]^{1/2}, \quad \forall u \in \mathbb{R}^d.
\]</p>
<p>Returning to the results of the LAN theory, \eqref{eq:lan-fisher} implies that the scaled prediction error $n\Vert \widehat\theta_n-\theta_*\Vert_{\mathbf{H}}^2$ asymptotically has a generalized chi-square distribution: namely, it is distributed as the squared Euclidean norm of a $\mathcal{N}(0, \mathbf{H}^{-1/2} \mathbf{G} \mathbf{H}^{-1/2})$ vector.
Moreover, it can also be obtained that $2n[L(\widehat\theta_n) - L(\theta_*)]$ has the same asymptotic distribution.
Using the standard chi-square deviation bound from (Laurent and Massart, 2000), we can summarize this result in terms of asymptotic confidence bounds for the excess risk and prediction distance:
\begin{equation}
\label{eq:crb-prob}
\boxed{
\begin{aligned}
L(\widehat\theta_n) - L(\theta_*)
&\approx \frac{1}{2} \Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}}^2 \approx \frac{d_{eff} (1 + \sqrt{2\log(1/\delta)})^2}{2n},
\end{aligned}
}
\tag{$\star$}
\end{equation}</p>
<p>where $\approx$ hides $o(1/n)$ terms as $n \to \infty$.</p>
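<p>The claim behind \eqref{eq:crb-prob} can be probed by simulation. Here is a Monte Carlo sketch for well-specified least squares (so that $\mathbf{G} = \mathbf{H}$ and $d_{eff} = d$), with toy sizes of my own choosing: the scaled prediction error $n\Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}}^2$ should average close to $d$.</p>

```python
import numpy as np

# Monte Carlo sketch: well-specified least squares with standard Gaussian
# design (H = I) and unit noise, theta_* = 0. Asymptotically,
# n * ||theta_hat - theta_*||_H^2 is ~ chi-square with d degrees of freedom,
# so its mean should be close to d.
rng = np.random.default_rng(3)
d, n, trials = 10, 2000, 200
vals = []
for _ in range(trials):
    X = rng.standard_normal((n, d))   # H = E[X X^T] = I
    y = rng.standard_normal(n)        # y = X @ theta_* + noise, theta_* = 0
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    vals.append(n * float(np.sum(theta_hat**2)))
mean_val = float(np.mean(vals))
```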
<p>The analysis leading to \eqref{eq:crb-prob} can be done in three steps:</p>
<ol>
<li>
<p>First, one can easily control the squared “natural” norm of the score, $\Vert\nabla L_n(\theta_*)\Vert_{\mathbf{H}^{-1}}^2$, by the central limit theorem, since it is the average of i.i.d. quantities – squared norms of the gradients
\[
\Vert \nabla_{\theta} \ell(Y_i,X_i^\top \theta_*) \Vert_{\mathbf{H}^{-1}}^2, \quad 1 \le i \le n.
\]</p>
</li>
<li>
<p>Then one can prove that as $n \to \infty$, the empirical risk can be approximated by its 2nd-order Taylor expansion
\[
L_n(\theta_*) + \left\langle \nabla L_n(\theta_*),\theta - \theta_*\right\rangle + \frac{1}{2} \Vert \theta - \theta_* \Vert^2_{\mathbf{H}_n}
\]
where
\[
\mathbf{H}_n = \nabla^2 L_n(\theta_*)
\]
is the empirical Hessian at $\theta_*$. Since $\mathbf{H}_n$ converges to $\mathbf{H}$ in the positive-semidefinite sense, this allows us to <em>localize</em> the estimate:
\begin{align}
0 &\ge L_n(\widehat\theta_n) - L_n(\theta_*) \\
&\approx \left\langle \nabla L_n(\theta_*),\widehat \theta_n - \theta_*\right\rangle + \frac{1}{2} \Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}_n}^2 \\
&\approx \left\langle \nabla L_n(\theta_*),\widehat \theta_n - \theta_*\right\rangle + \frac{1}{2} \Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}}^2 \label{eq:localization-chain}\tag{2}\\
&= \left\langle \mathbf{H}^{-1/2} \nabla L_n(\theta_*), \mathbf{H}^{1/2}(\widehat \theta_n - \theta_*)\right\rangle + \frac{1}{2} \Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}}^2 \\
&\ge -\Vert \nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}} \Vert\widehat \theta_n - \theta_* \Vert_{\mathbf{H}} + \frac{1}{2} \Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}}^2.
\end{align}
where the transition to the third line uses that $\mathbf{H}_n$ converges to $\mathbf{H}$, and the final transition is by Cauchy-Schwarz.
As a result, rearranging gives $\Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}} \le 2\Vert\nabla L_n(\theta_*)\Vert_{\mathbf{H}^{-1}}$, and squaring, we arrive at
\[
\frac{1}{2} \Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}}^2 \le 2 \Vert\nabla L_n(\theta_*)\Vert_{\mathbf{H}^{-1}}^2.
\]</p>
</li>
<li>
<p>Once the localization is achieved, one can similarly control the excess risk $L(\widehat\theta_n) - L(\theta_*)$ through its second-order Taylor approximation $\frac{1}{2} \Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}}^2$.
For this, one shows that the Hessian at $\theta$,
\[
\mathbf{H}(\theta) := \nabla^2 L(\theta),
\]
remains nearly constant in the <strong>Dikin ellipsoid</strong>
\[
\Theta_{r}(\theta_*) = \left\{ \theta \in \mathbb{R}^d: \Vert\theta - \theta_*\Vert_{\mathbf{H}} \le r \right\}
\]
with a constant radius $r = O(1)$, where by “near-constant” we mean that
\[
c \mathbf{H}(\theta_*) \preccurlyeq \mathbf{H}(\theta) \preccurlyeq C\mathbf{H}(\theta_*)
\]
in the positive-semidefinite sense for some constants $0 < c \le 1$ and $C \ge 1$, concisely written as
\[
\mathbf{H}(\theta) \asymp \mathbf{H}(\theta_*).
\]
This classical fact, whose proof we omit here, can be obtained from the relation between the second and third moments of the <em>calibrated design</em> – the vector $\tilde X$ satisfying $\mathbb{E}[\tilde X \tilde X^{\top}] = \mathbf{H}$:
\[
\tilde X(\theta_*) := [\ell_{\eta}''(Y,X^\top \theta_*)]^{1/2} X.
\]</p>
</li>
</ol>
<h2 id="finite-sample-setup-the-challenge">Finite-sample setup: the challenge</h2>
<p>When we want to prove a non-asymptotic analogue of \eqref{eq:crb-prob} in the finite-sample setup, the first step of the asymptotic “recipe” remains more or less the same: one can simply use Hoeffding’s inequality instead of the central limit theorem to control $\Vert \nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}}^2$ under subgaussian assumptions on the loss gradients.
Also, the third step relies on a non-statistical argument in which the sample size does not figure at all, so nothing changes there either.
Thus, the challenge is in the second (localization) step: now we must prove that $L_n(\widehat\theta_n) - L_n(\theta_*)$ is close to its second-order Taylor approximation centered at $\theta_*$, without taking the limit. As we will see a bit later, this can be reduced to showing that $\mathbf{H}_n(\theta)$ is near-constant, with high probability, uniformly over the Dikin ellipsoid $\Theta_{c}(\theta_*)$ with constant radius. Since the same property is known to hold for the non-random Hessian $\mathbf{H}(\theta)$, our task boils down to bounding the uniform deviations of $\mathbf{H}_n(\theta)$ from $\mathbf{H}(\theta)$ on $\Theta_{c}(\theta_*)$. Generally, this task is rather complicated, and requires the advanced theory of empirical processes together with some global assumptions on the whole domain of $\theta$, see, e.g., (Spokoiny, 2012).
However, in some cases it can be made simpler through the delicate use of <strong>self-concordance</strong>.
This concept was introduced by Nemirovski and Nesterov (1994) in the context of interior-point methods, and brought to the attention of the statistical learning community by Bach (2010).</p>
<p>Our next goal is to understand the role of this concept in the finite-sample analysis.</p>
<h2 id="simple-case-least-squares">Simple case: least-squares</h2>
<p>In the simplest case of <em>least-squares,</em> that is, when
$\ell(Y,X^\top \theta) = \frac{1}{2} (Y - X^\top\theta)^2,$
the analysis is rather straightforward since $L_n(\theta)$ and $L(\theta)$ are quadratic forms. All that is needed is to apply a concentration inequality to $\Vert\nabla L_n(\theta_*)\Vert_{\mathbf{H}}$, and make sure that $\mathbf{H}_n \asymp \mathbf{H}$ to guarantee that the first transition in the chain \eqref{eq:localization-chain} remains valid.
Specifically, recalling the definition of the calibrated design $\tilde X(\theta)$, we see that in least-squares it coincides with $X$ at any point $\theta$; hence,
\[
\mathbf{H} \equiv \mathbb{E}[X X^{\top}], \quad \mathbf{H}_n \equiv \frac{1}{n}\sum_{i=1}^n X_i X_i^{\top}.
\]
Thus, the analysis is reduced to controlling the deviations of a <em>single</em> sample covariance matrix, that of $X$, from its expectation. This can be done using the well-known result from (Vershynin, 2012): assuming that the decorrelated design $\mathbf{H}^{-1/2} X$ is $K$-<em>subgaussian</em>, that is, for any direction $u \in \mathbb{R}^d$ it holds
\[
\mathbb{E}\left[\exp \left\langle u, \mathbf{H}^{-1/2} X \right\rangle \right] \le \exp(K^2\Vert u\Vert_2^2/2),
\]
we have $\mathbf{H}_n \asymp \mathbf{H}$
with probability at least $1-\delta$ as soon as
\[
\boxed{
n \gtrsim K^4 (d + \log(1/\delta)),
}
\]
where $a \gtrsim b$ is a shorthand for $b = O(a)$.
Combining these bounds gives the $O(d/n)$ excess risk rate in the regime $n = \Omega(d)$. Next we will show how to extend this result, first obtained in (Hsu et al., 2012), beyond the case of least squares.</p>
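This covariance concentration phenomenon is easy to observe numerically. A minimal sketch (my illustration, not from the paper), assuming a standard Gaussian design, so that $\mathbf{H} = \mathbb{E}[XX^\top] = I$ and the design is $K$-subgaussian with $K = O(1)$:

```python
# Minimal sketch (assumed setup): standard Gaussian design, so H = I.
# For n >> d, every eigenvalue of H^{-1/2} H_n H^{-1/2} = H_n should be
# within a constant factor of 1, i.e. H_n and H are equivalent.
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 4000                       # the regime n >> d
X = rng.standard_normal((n, d))
H_n = X.T @ X / n                     # sample covariance matrix
eigs = np.linalg.eigvalsh(H_n)
print(eigs.min(), eigs.max())         # both concentrate around 1
```

With $d = 20$ and $n = 4000$, the extreme eigenvalues deviate from $1$ by roughly $2\sqrt{d/n} \approx 0.14$, consistent with the $n \gtrsim K^4 (d + \log(1/\delta))$ requirement.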
<h2 id="localization-lemma">Localization lemma</h2>
<p>Before we embark on self-concordance, I will demonstrate that the localization of $\widehat\theta_n$ in the Dikin ellipsoid $\Theta_r(\theta_*)$ of some radius $r$ is guaranteed if we show the uniform approximation bound
\[
\mathbf{H}_n(\theta) \asymp \mathbf{H}_n(\theta_*), \quad \theta \in \Theta_r(\theta_*).
\]
In what follows, we assume that the loss is convex.
Also, we assume that the (decorrelated) calibrated design $\mathbf{H}^{-1/2} \tilde X(\theta_*)$, is $K_2$-subgaussian, and for some $\delta \in (0,1)$ it holds
\[
n \gtrsim K_2^4 (d + \log(1/\delta)),
\]
so that we can apply the result of (Vershynin, 2012), and with probability $1-\delta$ identify the empirical and true Hessians at a <em>single point</em> $\theta_*$: $\mathbf{H}_n \asymp \mathbf{H}$ (recall that we use $\mathbf{H}$ and $\mathbf{H}_n$, without parentheses, as shorthands for $\mathbf{H}(\theta_*)$ and $\mathbf{H}_n(\theta_*)$).
We then have an auxiliary result called the Localization Lemma.</p>
<blockquote>
<p><strong>Localization lemma.</strong>
Suppose that $n \gtrsim K_2^4 (d + \log(1/\delta))$, and for some $r \ge 0$ it holds
\[
\mathbf{H}_n(\theta) \asymp \mathbf{H}_n(\theta_*), \, \forall \theta \in \Theta_{r}(\theta_*).
\]
Then, for any $r_0 \le r$, the following holds: whenever
\[
\Vert\nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}}^2 \lesssim r_0^2,
\]
we have that $\widehat\theta_n$ belongs to $\Theta_{r_0}(\theta_*)$, and moreover,
\[
L(\widehat \theta_n) - L(\theta_*) \lesssim \Vert\widehat \theta_n - \theta_*\Vert_{\mathbf{H}}^2 \lesssim \Vert\nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}}^2.
\]</p>
</blockquote>
<p><strong>Proof sketch.</strong>
By definition, $L_n(\widehat\theta_n) \le L_n(\theta_*)$. Assume that $\widehat\theta_n$ is not in $\Theta_{r_0}(\theta_*)$, and choose the point $\bar \theta_n$ on the segment $[\theta_*,\widehat\theta_n]$ such that $\bar \theta_n$ is precisely on the boundary of $\Theta_{r_0}(\theta_*)$, so that
\[
\Vert\bar \theta_n - \theta_*\Vert_{\mathbf{H}} = r_0.
\]
Note that by convexity of the level sets of $L_n(\theta)$, we still have $L_n(\bar\theta_n) \le L_n(\theta_*).$
On the other hand, by the intermediate value theorem, for some $\theta'_n$ belonging to the segment $[\theta_*, \bar\theta_n]$, and hence to $\Theta_{r_0}(\theta_*)$, it holds
\begin{align}
0
&\ge L_n(\bar\theta_n) - L_n(\theta_*) \\
&= \left\langle \nabla L_n(\theta_*), \bar\theta_n - \theta_* \right\rangle + \frac{1}{2} \Vert \bar\theta_n - \theta_*\Vert_{\mathbf{H}_n(\theta'_n)}^2 \\
&\approx \left\langle \nabla L_n(\theta_*), \bar\theta_n - \theta_* \right\rangle + \frac{1}{2} \Vert \bar\theta_n - \theta_* \Vert_{\mathbf{H}_n}^2 \\
&\approx \left\langle \nabla L_n(\theta_*), \bar\theta_n - \theta_* \right\rangle + \frac{1}{2} \Vert \bar\theta_n - \theta_*\Vert_{\mathbf{H}}^2 \\
&\ge - \Vert \nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}} \Vert\bar\theta_n - \theta_* \Vert_{\mathbf{H}} + \frac{1}{2} \Vert \bar\theta_n - \theta_*\Vert_{\mathbf{H}}^2 \\
&= - r_0 \Vert \nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}} + \frac{r_0^2}{2},
\end{align}
where $a \approx b$ is a shorthand for saying that $a$ and $b$ are within a multiplicative constant factor of each other; we used $\mathbf{H}_n(\theta'_n) \asymp \mathbf{H}_n(\theta_*)$ in the first $\approx$, and $\mathbf{H}_n \asymp \mathbf{H}$ in the second.
Rearranging the terms, we arrive at a contradiction, so in fact $\widehat\theta_n$ must belong to $\Theta_{r_0}(\theta_*)$.
This proves the first claim of the lemma.
Now that we know that $\widehat\theta_n \in \Theta_{r_0}(\theta_*)$, we can also prove that
\[
\Vert\widehat \theta_n - \theta_*\Vert_{\mathbf{H}}^2 \lesssim \Vert\nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}}^2
\]
by replacing $\bar \theta_n$ with $\widehat\theta_n$ in the above chain of inequalities. Finally,
\[
L(\widehat \theta_n) - L(\theta_*) \lesssim \Vert\widehat \theta_n - \theta_*\Vert_{\mathbf{H}}^2
\]
also follows from the intermediate value theorem, using that $\widehat \theta_n$ belongs to the ellipsoid $\Theta_r(\theta_*)$ in which $\mathbf{H}_n(\theta) \asymp \mathbf{H}_n(\theta_*)$, and applying the sample covariance concentration result to $\mathbf{H}_n(\theta_*)$. $\blacksquare$</p>
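The lemma's conclusion can be checked in a small simulation (my illustration, not from the paper): for a well-specified logistic model, we compute the $M$-estimator by Newton's method and verify that $\Vert\widehat\theta_n - \theta_*\Vert_{\mathbf{H}}^2$ is within a constant factor of the squared score $\Vert\nabla L_n(\theta_*)\Vert_{\mathbf{H}^{-1}}^2$, using the empirical Hessian $\mathbf{H}_n(\theta_*)$ as a proxy for $\mathbf{H}$ (legitimate in the regime $n \gg d$):

```python
# Numerical sketch (my illustration): well-specified logistic model,
# M-estimator computed via Newton's method.  We check that
# ||theta_hat - theta_*||_H^2 <= C * ||grad L_n(theta_*)||_{H^{-1}}^2,
# with the empirical Hessian H_n(theta_*) standing in for H.
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 20000
theta_star = np.full(d, 0.3)
X = rng.standard_normal((n, d))
p = 1.0 / (1.0 + np.exp(-X @ theta_star))
Y = (rng.random(n) < p).astype(float)

def grad_hess(theta):
    q = 1.0 / (1.0 + np.exp(-X @ theta))
    g = X.T @ (q - Y) / n                        # gradient of L_n
    H = (X * (q * (1 - q))[:, None]).T @ X / n   # Hessian of L_n
    return g, H

theta = np.zeros(d)
for _ in range(20):                              # Newton's method
    g, H = grad_hess(theta)
    theta -= np.linalg.solve(H, g)

g0, H0 = grad_hess(theta_star)
score = g0 @ np.linalg.solve(H0, g0)             # squared score
dist = (theta - theta_star) @ H0 @ (theta - theta_star)
print(dist / score)                              # a constant of order 1
```

The ratio is close to $1$ here, since asymptotically $\widehat\theta_n - \theta_* \approx -\mathbf{H}^{-1}\nabla L_n(\theta_*)$, in which case the two quantities coincide.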
<h2 id="why-constant-radius">Why constant radius?</h2>
<p>Recall that the gradient of the empirical risk is the average of i.i.d. random vectors,
\[
\nabla L_n(\theta_*) = \frac{1}{n} \sum_{i = 1}^n \nabla_{\theta} \ell(Y_i,X_i^\top \theta_*),
\]
and each $\nabla_{\theta} \ell(Y_i,X_i^\top \theta_*)$ has covariance $\mathbf{G}$. Assuming that the decorrelated gradients $\mathbf{G}^{-1/2}\nabla_{\theta} \ell(Y_i,X_i^\top \theta_*)$ are $K_1$-subgaussian, Bernstein's inequality implies that, with probability $\ge 1-\delta$,
\[
\Vert\nabla L_n(\theta_*) \Vert_{\mathbf{H}^{-1}}^2 \lesssim \frac{K_1^2 d_{eff} \log(1/\delta)}{n}.
\]
Hence, the Localization Lemma implies that if we can guarantee that $\mathbf{H}_n(\theta)$ is near-constant over the Dikin ellipsoid of radius $r$, the sample size sufficient to guarantee a finite-sample analogue of \eqref{eq:crb-prob} is
\[
\boxed{
n \gtrsim \max \left\{ K_2^4 (d+ \log(1/\delta)), \; \color{red}{r^2} K^2 K_1^2 {\color{blue}{d_{eff}}} \log(1/\delta) \right\}.
}
\]
The first bound guarantees reliable estimation of the risk curvature at the optimum, and is the same as in linear regression, so we have reason to believe it is unavoidable.
On the other hand, the second bound is related to the fact that the loss is not quadratic, and dominates the first one, assuming $d_{eff} = O(d)$, unless $r$ – the radius of the Dikin ellipsoid in which $\mathbf{H}_n(\theta) \asymp \mathbf{H}_n(\theta_*)$ with high probability – is <em>constant</em>.
In the next post, we will see how self-concordance leads to Hessian approximation bounds of this type with $r = O(\sqrt{d})$, which results in the $O(d \cdot d_{eff})$ sample size, and how this can be improved to $r = O(1)$ and
$
n = O(\max(d,d_{eff}))
$
with a more subtle argument.</p>
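<p>As a sanity check of the score bound above, here is a small simulation (my illustration, assuming a well-specified logistic model, so that $\mathbf{G} = \mathbf{H}$ and $d_{eff} = d$), showing that the squared score indeed concentrates at scale $d_{eff}/n$:</p>

```python
# Sanity check (my simulation): well-specified logistic model, so
# G = H and d_eff = d.  The squared score ||grad L_n(theta_*)||_{H^{-1}}^2
# should concentrate at scale d/n; we average over a few repetitions.
import numpy as np

rng = np.random.default_rng(2)
d, n, reps = 10, 2000, 50
theta_star = np.full(d, 0.2)

def score_sq():
    X = rng.standard_normal((n, d))
    p = 1.0 / (1.0 + np.exp(-X @ theta_star))
    Y = (rng.random(n) < p).astype(float)
    g = X.T @ (p - Y) / n                        # grad L_n(theta_*)
    H = (X * (p * (1 - p))[:, None]).T @ X / n   # H_n(theta_*), proxy for H
    return g @ np.linalg.solve(H, g)

avg = np.mean([score_sq() for _ in range(reps)])
print(avg * n / d)                               # close to 1
```

<p>Here the expected squared score is $\mathrm{tr}(\mathbf{H}^{-1}\mathbf{G})/n = d_{eff}/n$, so the printed ratio is close to $1$, up to the small bias from using $\mathbf{H}_n$ in place of $\mathbf{H}$.</p>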
<h2 id="references">References</h2>
<ol>
<li>
<p>B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection.
<em>Ann. Statist.</em>, 28:5(2000), 1302-1338.</p>
</li>
<li>
<p>V. Spokoiny. Parametric estimation. Finite sample theory.
<em>Ann. Statist.</em>, 40:6(2012), 2877-2909.</p>
</li>
<li>
<p>A. Nemirovski and Yu. Nesterov. Interior-point polynomial algorithms in convex programming.
<em>Society for Industrial and Applied Mathematics, Philadelphia, 1994.</em></p>
</li>
<li>
<p>F. Bach. Self-concordant analysis for logistic regression.
<em>Electron. J. Stat., 4(2010), 384-414.</em></p>
</li>
<li>
<p>R. Vershynin. Introduction to the non-asymptotic analysis of random matrices.
<em>Compressed Sensing: Theory and Applications, 210–268</em>. Cambridge University Press, 2012.</p>
</li>
<li>
<p>D. Hsu, S. Kakade, and T. Zhang. Random design analysis of ridge regression.
<em>COLT, 2012.</em></p>
</li>
</ol>
Mon, 12 Nov 2018 00:00:00 +0000
http://ostrodmit.github.io/blog/2018/11/12/self-concordance-part-1/