Importance sampling and leverage scores

Intro to sparsification

Unbiased estimators and importance sampling

\(\ell_2\) regression and statistical leverage scores

Given \(F(x) = f_1(x) + \cdots + f_m(x)\) with nonnegative terms \(f_i\), one way to choose an importance for each \(f_i\) is to bound its relative contribution to \(F\) uniformly over all \(x\). In other words, we consider the quantity

\[\alpha_i \seteq \max_{0 \neq x \in \R^n} \frac{f_i(x)}{F(x)}\,.\]

If we assume that \(F\) and \(f_i\) scale homogeneously (e.g., in the case of \(\ell_2\) regression, \(F(\lambda x) = |\lambda|^2 F(x)\) and similarly for each \(f_i\)), then we can write this as

\[\begin{equation}\label{eq:uniform-con} \alpha_i = \max_{F(x) \leq 1} f_i(x)\,. \end{equation}\]

Let’s try to derive good sampling probabilities for the \(\ell_2\) regression problem where

\[F(x) = |\langle a_1,x\rangle|^2 + |\langle a_2,x\rangle|^2 + \cdots + |\langle a_m,x\rangle|^2\,,\]

for \(a_1,\ldots,a_m \in \R^n\), and we define \(A\) as the matrix with \(a_1,\ldots,a_m\) as rows. We’ll derive our importance scores in a few different ways. Each of these perspectives will be useful for moving later to more general settings.
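As a numerical sanity check on \(\alpha_i\) in this setting, here is a minimal NumPy sketch. It assumes \(A\) has full column rank and uses the standard closed form \(\max_{x \neq 0} |\langle a_i,x\rangle|^2 / F(x) = \langle a_i, (A^{\top} A)^{-1} a_i\rangle\), which follows from Cauchy-Schwarz and is attained at \(x = (A^{\top} A)^{-1} a_i\). The function name `leverage_scores` and the random test matrix are illustrative choices, not part of the notes.

```python
import numpy as np

def leverage_scores(A):
    """Return a_i^T (A^T A)^{-1} a_i for every row a_i of A.

    Assumes A has full column rank (an assumption of this sketch).
    """
    # Reduced QR: A = Q R with orthonormal columns in Q, so the projection
    # A (A^T A)^{-1} A^T equals Q Q^T and its diagonal is the squared row norms of Q.
    Q, _ = np.linalg.qr(A)
    return np.sum(Q**2, axis=1)

rng = np.random.default_rng(0)
m, n = 200, 5
A = rng.standard_normal((m, n))   # rows play the role of a_1, ..., a_m
sigma = leverage_scores(A)

# Ratios f_i(x) / F(x) over many random directions x: none should exceed sigma_i.
X = rng.standard_normal((n, 1000))
AX = A @ X                                # entry (i, j) is <a_i, x_j>
ratios = AX**2 / np.sum(AX**2, axis=0)    # entry (i, j) is f_i(x_j) / F(x_j)

# The direction x = (A^T A)^{-1} a_i attains the maximum for term i.
i = 0
x_star = np.linalg.solve(A.T @ A, A[i])
attained = (A[i] @ x_star) ** 2 / np.sum((A @ x_star) ** 2)

print("sum of scores:", sigma.sum())
print("largest sampled ratio for i=0:", ratios[i].max(), "closed form:", sigma[i])
```

Under the full-rank assumption the scores sum to \(n\), which is why dividing them by \(n\) yields a probability distribution.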

Concentration

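Once we’ve chosen our sampling probabilities \(\rho_i = \sigma_i(A)/n\), where \(\sigma_i(A)\) denotes the \(i\)th leverage score of \(A\) (the scores sum to \(n\) when \(A\) has rank \(n\)), we can independently sample \(M\) indices \(i_1,\ldots,i_M\) from the distribution \(\rho\), and we are left to analyze the random sum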

\[\tilde{A} = \frac{1}{M}\left(\frac{a_{i_1} a_{i_1}^{\top}}{\rho_{i_1}} + \cdots + \frac{a_{i_M} a_{i_M}^{\top}}{\rho_{i_M}}\right).\]

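We have already argued that \(\E[\tilde{A}] = A^{\top} A\), since each term satisfies \(\E\left[a_{i_j} a_{i_j}^{\top}/\rho_{i_j}\right] = \sum_{i=1}^m a_i a_i^{\top} = A^{\top} A\). So the real question is about concentration, which we will begin to cover in the next lecture.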
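The sampling step can be tried out numerically. The sketch below is only an illustration under stated assumptions: \(A\) has full column rank, the leverage scores are computed by a QR factorization, and the helper name `sample_gram` is hypothetical. It draws \(M\) indices from \(\rho\), forms the reweighted sum \(\tilde{A}\), and reports its relative spectral-norm distance from the Gram matrix \(A^{\top} A = \sum_i a_i a_i^{\top}\).

```python
import numpy as np

def sample_gram(A, M, rng):
    """Return (1/M) * sum_j a_{i_j} a_{i_j}^T / rho_{i_j} with i_j drawn i.i.d. from rho.

    rho is proportional to the leverage scores of A; A is assumed to have
    full column rank, in which case rho_i = sigma_i / n.
    """
    Q, _ = np.linalg.qr(A)                 # reduced QR, Q has orthonormal columns
    sigma = np.sum(Q**2, axis=1)           # leverage scores
    rho = sigma / sigma.sum()              # normalise to a probability vector
    idx = rng.choice(A.shape[0], size=M, p=rho)
    rows = A[idx] / np.sqrt(rho[idx])[:, None]   # rescale each sampled row by 1/sqrt(rho)
    return rows.T @ rows / M

rng = np.random.default_rng(1)
m, n = 2000, 5
A = rng.standard_normal((m, n))
G = A.T @ A                                # the target of the unbiased estimator

for M in (50, 500, 5000):
    G_tilde = sample_gram(A, M, rng)
    rel_err = np.linalg.norm(G_tilde - G, 2) / np.linalg.norm(G, 2)
    print(f"M = {M:5d}   relative spectral error = {rel_err:.4f}")
```

The error should shrink as \(M\) grows, but how quickly is exactly the concentration question; this experiment does not answer it.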

There are actually two ways to approach this question. One is to think about proving concentration of the sum along every direction \(x \in \R^n\) simultaneously. This leads naturally to entropy bounds, covering numbers, and the generic chaining theory. The second approach is to think about \(\tilde{A}\) as a random operator and try to prove concentration of sums of independent random matrices. The former will be far more general, while the latter appears substantially more powerful in certain special cases.