Generalized linear models

CSE 599, Autumn 2023

Empirical risk minimization (ERM) is a widely studied problem in learning theory and statistics. A prominent special case is the problem of optimizing a generalized linear model (GLM), i.e.,

$$\begin{equation}\label{eq:glm} \min_{x \in \mathbb R^n} F(x) \quad \textrm{ for } \quad F(x) := \sum_{i=1}^m \varphi_i(\langle a_i, x \rangle - b_i)\,, \end{equation}$$

where the total loss $F : \mathbb R^n \to \mathbb R$, is defined by vectors $a_1,\ldots,a_m \in \mathbb R^n$, $b \in \mathbb R^m$, and loss functions $f_1, \dots, f_m: \mathbb R\to \mathbb R$. Different choices of the loss functions ${ \varphi_i }$ capture a variety of important problems, including linear regression, logistic regression, and $\ell_p$ regression.

When $m \gg n$, a natural approach for fast algorithms is to apply sparsification techniques to reduce the value of $m$, while maintaining a good multiplicative approximation of the objective value. This corresponds to our usual sparsification model where $f_i(x) = \varphi_i(\langle a_i,x\rangle - b_i)$. As mentioned previously, we will can assume that $b_1=\cdots=b_m=0$ by lifting to one dimension higher. In previous lectures, we solved the sparsification problem when $\varphi_i(z)=|z|^p$ for some $1 \leqslant p \leqslant 2$.

However, $\ell_p$ losses are the only class of natural loss functions for which linear-size sparsification results are known for GLMs. For instance, for the widely-studied class of Huber-type loss functions, e.g., $\varphi_i(z) = \min(|z|, |z|^2)$, the best known sparsity bounds obtained by previous methods were of the form is $\tilde{O}(n^{4-2\sqrt{2}} \varepsilon^{-2})$.

We will often think of the case $\varphi_i(z) = h_i(z)^2$ for some $h_i : \mathbb R\to \mathbb R_+$, as the assumptions we need are stated more naturally in terms of $\sqrt{\varphi_i}$. To that end, consider a function $h : \mathbb R^n \to \mathbb R_+$ and the following two properties, where $L, c, \theta> 0$ are some positive constants.

Assumptions:
1. ($L$-auto-Lipschitz) $|h(z)-h(z’)| \leqslant L\,h(z-z’)$ for all $z,z’ \in \mathbb R$.
2. (Lower $\theta$-homogeneous) $h(\lambda z) \geqslant c \lambda^{\theta} h(z)$ for all $z \in \mathbb R$ and $\lambda \geqslant 1$.
Theorem [Jambulapati-Lee-Liu-Sidford 2023]: Consider $\varphi_1, \dots, \varphi_m: \mathbb R\to \mathbb R_{+}$, and suppose there are numbers $L,c,\theta> 0$ such that each $\sqrt{\varphi_i}$ is $L$-auto-Lipschitz and lower $\theta$-homogeneous (with constant $c$). Then for any $a_1,\ldots,a_m \in \mathbb R^n$, and numbers $0 < \varepsilon< \tfrac12$ and $s_{\max}> s_{\min}\geqslant 0$, there are nonnegative weights $w_1,\ldots,w_m \geqslant 0$ such that
\[\left| F(x) - \sum_{i=1}^m w_i \varphi_i(\langle a_i,x\rangle) \right| \leqslant\varepsilon F(x)\,,\quad \forall x \in \mathbb R^n \ \textrm{ s.t. }\ s_{\min}\leqslant F(x) \leqslant s_{\max}\,,\]
where $F(x) \mathrel{\mathop:}=\varphi_1(\langle a_1,x\rangle) + \cdots + \varphi_m(\langle a_m,x\rangle)$, and
\[\left | i \in \{1,\ldots,m\} : w_i > 0 \right | \lesssim_{L,c,\theta} \frac{n}{\varepsilon^2} \log \left(\frac{n}{\varepsilon} \frac{s_{\max}}{s_{\min}}\right) \left(\log \frac{n}{\varepsilon}\right)^3\,.\]
Moreover, the weights ${w_i}$ can be computed in time $\tilde{O}_{L,c,\theta}\!\left((\mathsf{nnz}(a_1,\ldots,a_m) + n^{\omega} + m \mathcal{T}_{\mathrm{eval}}) \log(ms_{\max}/s_{\min})\right)$, with high probability.
Here, $\mathcal{T}_{\mathrm{eval}}$ is the maximum time needed to evaluate each $\varphi_i$, $\mathsf{nnz}(a_1,\ldots,a_m)$ is the total number of non-zero entries in the vectors $a_1,\ldots,a_m$, and $\omega$ is the matrix multiplication exponent. “High probability” means that the failure probability can be made less than $n^{-\ell}$ for any $\ell > 1$ by increasing the running time by an $O(\ell)$ factor.

It is not difficult to see that for $0 < p \leqslant 2$, the function $\varphi_i(z) = |z|^p$ satisfies the required hypotheses of . Additionally, the $\gamma_p$ functions, defined as

$$\begin{equation}\label{eq:gammap} \gamma_p(z) := \begin{cases} \frac{p}{2}z^2 & \enspace \text{ for } \enspace |z| \leqslant 1 \\ |z|^p - (1-\frac{p}{2}) & \enspace \text{ for } \enspace |z| \geqslant 1, \end{cases}\end{equation}$$

for $p \in (0, 2]$, also satisfy the conditions. The special case of $\gamma_1$ is known as the Huber loss.

To show that $\sqrt{\gamma_p}$ is $1$-auto-Lipschitz, it suffices to prove that $\sqrt{\gamma_p}$ is concave. The lower growth bound is immediate for $p > 0$.

Importance sampling and multiscale weights

Given $F(x) = \varphi_1(\langle a_1,x\rangle) + \cdots + \varphi_m(\langle a_m,x\rangle)$, our approach to sparsification is via importance sampling. Given a probability vector $\rho \in \mathbb R^m$ with $\rho_1,\ldots,\rho_m > 0$ and $\rho_1 + \cdots + \rho_m = 1$, we sample $M \geqslant 1$ coordinates $\nu_1,\ldots,\nu_M$ i.i.d. from $\rho$, and define our potential approximator by

$$\tilde{F}(x) \seteq \frac{1}{M} \sum_{j=1}^M \frac{\varphi_{\nu_j}(\langle a_{\nu_j},x\rangle)}{\rho_{\nu_j}}\,.$$

Since we want an approximation guarantee to hold simultaneously for many $x \in \mathbb R^n$, it is natural to analyze expressions of the form

$$\mathop{\mathrm{\mathbb{E}}}\max_{F(x) \leqslant s} \left|F(x)-\tilde{F}(x)\right|.$$

As we have seen, analysis of this expression involves the size of discretizations of the set $B_F(s) \mathrel{\mathop:}=\{ x \in \mathbb R^n : F(x) \leqslant s \}$ at various granularities. The key consideration (via Dudley’s entropy inequality) is how well $B_F(s)$ can be covered by cells on which we have uniform control on how much the terms $\varphi_i(\langle a_i,x\rangle)/\rho_i$ vary within each cell.

The $\ell_2$ case

Let’s consider the case $\varphi_i(z) = |z|^2$ so that $F(x) = |\langle a_1,x\rangle|^2 + \cdots + |\langle a_m,x\rangle|^2$. Here, $B_F(s) = \{ x \in \mathbb R^n : \norm{Ax}_2^2 \leqslant s \}$, where $A$ is the matrix with $a_1,\ldots,a_m$ as rows.

A cell at scale $2^{j}$ looks like

$$\mathsf K_j \mathrel{\mathop:}=\left\{ x \in \mathbb R^n : \max_{i \in [m]} \frac{|\langle a_i,x\rangle|^2}{\rho_i} \leqslant 2^{j} \right\},$$

and the pertinent question is how many translates of $\mathsf K_j$ it takes to cover $B_F(s)$. In the $\ell_2$ case, this is the well-studied problem of covering Euclidean balls by $\ell_{\infty}$ balls.

If $N_j$ denotes the minimum number of such cells required, then the dual-Sudakov inequality tells us that

$$\log N_j \lesssim \frac{s}{2^j} \log (m) \max_{i \in [m]} \frac{\|(A^{\top} A)^{-1/2} a_i\|^2_2}{\rho_i}\,.$$

Choosing $\rho_i \mathrel{\mathop:}=\frac{1}{n} |(A^\top A)^{-1/2} a_i|_2^2$ to be the normalized leverage scores yields uniform control on the size of the coverings:

$$\log N_j \lesssim \frac{s}{2^j} n \log m\,.$$

The $\ell_p$ case

Now we take $\varphi_i(z) = |z|^p$ so that $F(x) = \norm{Ax}_p^p$. A cell at scale $2^j$ now looks like

$$\mathsf K_j \mathrel{\mathop:}=\left\{ x \in \mathbb R^n : \max_{i \in [m]} \frac{|\langle a_i,x\rangle|^p}{\rho_i} \leqslant 2^j \right\},$$

To cover $B_F(s)$ by translates of $\mathsf K_j$, we will again employ Euclidean balls, and one can use $\ell_p$ Lewis weights to relate the $\ell_p$ structure to an $\ell_2$ structure.

A classical result of Lewis establishes that there are nonnegative weights $w_1,\ldots,w_m \geqslant 0$ such that if $W = \mathrm{diag}(w_1,\ldots,w_m)$ and $U \mathrel{\mathop:}=(A^{\top} W A)^{1/2}$, then

$$\begin{equation}\label{eq:wis} w_i = \frac{\norm{U^{-1} a_i}_2^p}{\norm{U^{-1} a_i}_2^2} = \frac{\varphi_i(\norm{U^{-1} a_i}_2)}{\norm{U^{-1} a_i}_2^2}\,. \end{equation}$$

Assuming that $A$ has full rank, a straightforward calculation gives $\sum_{i=1}^m w_i \norm{U^{-1} a_i}_2^2 = \mathop{\mathrm{tr}}(U^2 U^{-2}) = n$.

Therefore we can take $\rho_i \mathrel{\mathop:}=\frac{1}{n} w_i \norm{U^{-1} a_i}_2^2$ for $i=1,\ldots,m$, and our cells become

$$\mathsf K_j \mathrel{\mathop:}=\left\{ x \in \mathbb R^n : \max_{i \in [m]} |\langle a_i,x\rangle|^p \leqslant\frac{2^j}{n} w_i \norm{U^{-1} a_i}_2^2 \right\}.$$

If we are trying to use $\ell_2$-$\ell_{\infty}$ covering bounds, we face an immediate problem: Unlike in the $\ell_2$ case, we don’t have prior control on $\norm{U x}_2$ for $x \in B_F(s)$. One can obtain an initial bound using the structure of $U = (A^{\top} W A)^{1/2}$:

$$ \begin{align} \norm{U x}_2^2 = \sum_{i=1}^m w_i \langle a_i,x\rangle^2 &= \sum_{i=1}^m \norm{U^{-1} a_i}_2^{p-2} \langle a_i,x\rangle^2 \nonumber \\ &= \sum_{i=1}^m \left(\frac{|\langle a_i,x\rangle|}{\norm{U^{-1} a_i}_2}\right)^{2-p} |\langle a_i,x\rangle|^p \leqslant\norm{U x}_2^{2-p} \sum_{i=1}^m |\langle a_i,x\rangle|^p\,, \label{eq:intro-nc0} \end{align}$$

where the last inequality is Cauchy-Schwarz: $|\langle a_i, x\rangle| = |\langle U^{-1} a_i, U x\rangle| \leqslant\norm{U^{-1} a_i}_2 \norm{U x}_2$. This gives the bound $\norm{U x}_2 \leqslant\norm{Ax}_p \leqslant s^{1/p}$ for $x \in B_F(s)$.

Problematically, this uniform $\ell_2$ bound is too weak, but there is a straightforward solution: Suppose we cover $B_F(s)$ by translates of $\mathsf K_{j_0}$. This gives an $\ell_{\infty}$ bound on the elements of each cell, meaning that we can apply \eqref{eq:intro-nc0} and obtain a better upper bound on $\norm{Ux}_2$ for $x \in \mathsf K_{j_0}$. Thus to cover $B_F(s)$ by translates of $\mathsf K_j$ with $j < j_0$, we will cover first by translates of $\mathsf K_{j_0}$, then cover each translate $(x + \mathsf K_{j_0}) \cap B_F(s)$ by translates of $\mathsf K_{j_0-1}$, and so on.

The standard approach in this setting is to use interpolation inequalities and duality of covering numbers for a cleaner analytic version of such an iterated covering bound. But the iterative covering argument can be adapted to the non-homogeneous setting, as we discuss next.

Non-homogeneous loss functions

When we move to more general loss functions $\varphi_i : \mathbb R\to \mathbb R$, we lose the homogeneity property $\varphi_i(\lambda x) = \lambda^p \varphi_i(x)$, $\lambda > 0$ that holds for $\ell_p$ losses. Because of this, we need to replace the single Euclidean structure present in \eqref{eq:wis} (given by the linear operator $U$) with a family of structures, one for every relevant scale.

Definition (Approximate weight at scale $s$): Fix $a_1,\ldots,a_m \in \mathbb R^n$ and loss functions $\varphi_1,\ldots,\varphi_m~:~\mathbb R\to \mathbb R_+$. Say that a vector $w \in \mathbb R_+^m$ is an $\alpha$-approximate weight at scale $s$ if
\[\label{eq:approx-weight} \frac{s}{\alpha} \leqslant\frac{\varphi_i(\norm{M_w^{-1/2} a_i}_2)}{w_i \norm{M_w^{-1/2} a_i}_2^2} \leqslant\alpha s\,,\quad i=1,\ldots,m\,,\text{ where }M_w \mathrel{\mathop:}=\sum_{j=1}^m w_j a_j a_j^{\top}\,.\]

Moreover, in order to generalize the iterated covering argument for $\ell_p$ losses, we need there to be a relationship between weights at different scales.

Definition (Approximate weight scheme): Let $\mathcal J\subseteq \mathbb Z$ be a contiguous interval. A family ${w^{(j)} \in \mathbb R_+^m : j \in \mathcal J}$ is an $\alpha$-approximate weight scheme if each $w^{(j)}$ is an $\alpha$-approximate weight at scale $2^j$ and, furthermore, for every pair $j,j+1 \in \mathcal J$ and $i \in {1,\ldots,m}$,
\[\label{eq:smooth-weights} w_i^{(j+1)} \leqslant\alpha w_i^{(j)}\,.\]

Given a weight scheme, we choose sampling probabilities

$$\rho_i \propto \max_{j \in \mathcal J} w_i^{(j)} \norm{M_{w^{(j)}}^{-1/2} a_i}_2^2 = \max_{j \in \mathcal{J}} \sigma_i(W_j^{1/2} A)\,,\quad i=1,\ldots,m\,,$$

where $W_j = \mathrm{diag}(w_1^{(j)},\ldots,w_m^{(j)})$.

In our setting, $|\mathcal J| \leqslant O(\log(ms_{\max}/s_{\min}))$, which results in the sparsity increasing by a corresponding factor.