<h1>Projection onto the L1-Norm Ball</h1> <p>Josh Nguyen (joshtn@seas.upenn.edu), 2023-02-10, https://joshnguyen.net/posts/l1-proj</p> <p>In this post we will discuss a frequently visited problem in convex optimization: projection onto the $\ell_1$-norm ball. I think this is a great example to understand the concepts of duality and piecewise functions and the phenomenon called <em>sparsity</em>, where the solution to a problem contains mostly zeros, and only some remain “activated.”</p> <h2 id="the-problem">The Problem</h2> <p>Suppose we have a vector $a \in \mathbb{R}^n$ and want to find another vector $x$ in an $\ell_1$-norm ball such that the distance between $a$ and $x$ is the smallest possible. In other words, we seek to solve the following problem</p> \begin{align} \begin{aligned} \min_{x} \quad &amp; \left\{ f(x) = \frac{1}{2} \lVert x - a \rVert_2^2 \right\} \\ \text{such that} \quad &amp; \lVert x \rVert_1 \leq \kappa. \end{aligned} \label{eq:primal} \end{align} <p>The constraint $\lVert x \rVert_1 \leq \kappa$ denotes the $\ell_1$-norm ball of radius $\kappa &gt; 0$ (that is, all the points whose $\ell_1$ norm is at most $\kappa$) and $f(x)$ is the primal objective function.</p> <p>Does this problem have a closed-form solution? Yes, but not always. Like all projection problems, if $a$ is already in the convex set ($\lVert a \rVert_1 \leq \kappa$) then the solution is simply $x = a$, and we are done. Otherwise, we will need to approximate the solution.</p> <p align="center"> <img src="/files/l1_proj_figs/l1_gif.gif" title="L1 projection" /> </p> <p>In the above figure, we see an example in $n=2$ dimensions. The black square (containing all points on and within its boundaries) represents the $\ell_1$-norm ball of radius $\kappa = 1$. 
The center of the blue circle, in red, is the vector $a$ and the radius of the circle is the smallest distance from $a$ to the $\ell_1$-norm ball.</p> <h2 id="duality">Duality</h2> <p>It is easy to see that $\eqref{eq:primal}$ is a convex optimization problem as its objective and constraint are both convex. Note that in this case, $x = 0$ is a strictly feasible solution, which means, by <a href="https://en.wikipedia.org/wiki/Slater%27s_condition">Slater’s condition</a>, strong duality holds. We therefore will aim to solve $\eqref{eq:primal}$ by maximizing its dual function. Let $\gamma \geq 0$ be the dual variable; the Lagrangian is</p> \begin{align} L(x, \gamma) = \frac{1}{2} \lVert x - a \rVert_2^2 + \gamma (\lVert x \rVert_1 - \kappa). \label{eq:lagrangian} \end{align} <p>Finding the dual objective function requires us to minimize $L(x, \gamma)$ with respect to $x$. Notice that we can rewrite the Lagrangian as</p> \begin{align*} L(x, \gamma) = - \kappa \gamma + \sum_{i=1}^{n} \left( \frac{1}{2} (x_i - a_i)^2 + \gamma \lvert x_i \rvert \right), \label{eq:lagrangian_as_sum} \end{align*} <p>where the subscript $i$ denotes the $i$th element of a vector. If we let</p> \begin{align*} s_i(x_i, \gamma) = \frac{1}{2} (x_i - a_i)^2 + \gamma \lvert x_i \rvert, \label{eq:si} \end{align*} <p>then minimizing $L(x, \gamma)$ with respect to $x$ means minimizing each $s_i(x_i, \gamma)$ with respect to $x_i$. Fortunately, the problem $\min_{x_i} s_i(x_i, \gamma)$ has a unique and closed-form solution:</p> \begin{align} x_i(\gamma) = \begin{cases} a_i - \gamma &amp; \text{if} \quad a_i &gt; \gamma \\ 0 &amp; \text{if} \quad - \gamma \leq a_i \leq \gamma \\ a_i + \gamma &amp; \text{if} \quad a_i &lt; - \gamma. \end{cases} \label{eq:threshold_si} \end{align} <p>This is called the <em>soft thresholding operator</em> with threshold $\gamma$. Equation $\eqref{eq:threshold_si}$ shows us how to convert a dual solution $\gamma$ to a primal solution $x$. 
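</p>

<p>As a quick sanity check, soft thresholding (shrink every entry toward zero by $\gamma$, and zero out the entries with $\lvert a_i \rvert \leq \gamma$) can be compared against a brute-force minimization of $s_i$ over a fine grid. The sketch below is illustrative only: the helper names <code class="language-plaintext highlighter-rouge">soft_threshold</code> and <code class="language-plaintext highlighter-rouge">brute_force_min</code>, the test vector and the grid are assumptions made for this example.</p>

```python
import numpy as np


def soft_threshold(a, gamma):
    """Elementwise minimizer of 0.5 * (x - a)^2 + gamma * |x| (hypothetical helper)."""
    return np.sign(a) * np.maximum(np.abs(a) - gamma, 0.0)


def brute_force_min(a_i, gamma, grid):
    """Minimize s_i over a grid of candidate x values (for checking only)."""
    values = 0.5 * (grid - a_i) ** 2 + gamma * np.abs(grid)
    return grid[np.argmin(values)]


a = np.array([2.0, 0.3, -1.5, -0.1])
gamma = 0.5
grid = np.linspace(-3.0, 3.0, 60001)  # grid spacing of 1e-4

closed_form = soft_threshold(a, gamma)
numerical = np.array([brute_force_min(a_i, gamma, grid) for a_i in a])

# Entries with |a_i| <= gamma are set to zero by the closed form
print(closed_form)
# The gap to the brute-force answer should be on the order of the grid spacing
print(np.max(np.abs(closed_form - numerical)))
```

<p>The two entries with $\lvert a_i \rvert \leq \gamma$ are mapped to zero, which is the mechanism behind the sparsity discussed later in the post.</p>

<p>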
Now, if we let $s_i^*(\gamma) = \min_{x_i} s_i(x_i, \gamma) = s_i(x_i(\gamma), \gamma)$, the dual objective is</p> \begin{align*} g(\gamma) = \min_{x} L(x, \gamma) = - \kappa \gamma + \sum_{i=1}^{n} s_i^*(\gamma). \end{align*} <p>We know that $g$ is concave by construction, since it is a pointwise minimum of functions that are affine in $\gamma$. Furthermore, since the solution to $\min_{x_i} s_i(x_i, \gamma)$ is unique for every $\gamma \geq 0$, by <a href="https://en.wikipedia.org/wiki/Danskin%27s_theorem">Danskin’s theorem</a>, each $s_i^*$ is differentiable, which makes $g$ differentiable as well. We can verify that the derivative of $g$ is</p> \begin{align} g'(\gamma) = - \kappa + \sum_{i=1}^{n} \max(\lvert a_i \rvert - \gamma, 0). \label{eq:dual_derivative} \end{align} <details> <summary><b>Proof.</b></summary> <div style="padding-left:2em; padding-right:2em"> <br /> It only remains to be shown that $\frac{d}{d\gamma} s_i^*(\gamma) = \max(\lvert a_i \rvert - \gamma, 0)$. To see why, note that \begin{align*} s_i^*(\gamma) = s_i(x_i(\gamma), \gamma) = \frac{1}{2} (x_i(\gamma) - a_i)^2 + \gamma \lvert x_i(\gamma) \rvert. \end{align*} It is easy to show that, by $\eqref{eq:threshold_si}$, \begin{align*} (x_i(\gamma) - a_i)^2 = \min(\lvert a_i \rvert, \gamma)^2, \end{align*} and \begin{align*} \lvert x_i(\gamma) \rvert = \max(\lvert a_i \rvert - \gamma, 0). \end{align*} Now we consider two cases of $\gamma$. First, if $\gamma \leq |a_i|$, we have $s_i^*(\gamma) = \frac{1}{2} \gamma^2 + \gamma (|a_i| - \gamma) = - \frac{1}{2} \gamma^2 + |a_i| \gamma$, whose derivative is $|a_i| - \gamma$. Second, if $\gamma &gt; |a_i|$, then $x_i(\gamma) = 0$ and $s_i^*(\gamma) = \frac{1}{2} a_i^2$, which is constant in $\gamma$, so its derivative is $0$. Either way, $\frac{d}{d\gamma} s_i^*(\gamma) = \max(|a_i| - \gamma, 0)$. </div> </details> <p>So far we have been able to find the dual function $g(\gamma)$ and its derivative $g'(\gamma)$. 
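</p>

<p>The derivative can also be checked numerically. The sketch below evaluates the dual function directly from its definition and compares a central finite difference against $-\kappa + \sum_i \max(\lvert a_i \rvert - \gamma, 0)$. The function names, the test vector and the step size are assumptions chosen for illustration.</p>

```python
import numpy as np


def dual_value(gamma, kappa, a):
    """g(gamma) = -kappa * gamma + sum_i min_x s_i(x, gamma), using soft thresholding."""
    x = np.sign(a) * np.maximum(np.abs(a) - gamma, 0.0)
    return -kappa * gamma + np.sum(0.5 * (x - a) ** 2 + gamma * np.abs(x))


def dual_derivative(gamma, kappa, a):
    """Closed-form derivative: -kappa + sum_i max(|a_i| - gamma, 0)."""
    return -kappa + np.sum(np.maximum(np.abs(a) - gamma, 0.0))


a = np.array([1.0, -2.0, 3.0])
kappa = 2.0
h = 1e-6  # finite-difference step (illustrative choice)

for gamma in [0.25, 1.5, 2.5, 3.5]:
    # Central difference approximation of g'(gamma)
    numeric = (dual_value(gamma + h, kappa, a) - dual_value(gamma - h, kappa, a)) / (2 * h)
    exact = dual_derivative(gamma, kappa, a)
    print(f"gamma={gamma:.2f}  finite difference={numeric:.6f}  formula={exact:.6f}")
```

<p>The two columns should agree up to floating-point error, and the derivative settles at $-\kappa$ once $\gamma$ exceeds the largest $\lvert a_i \rvert$.</p>

<p>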
Now we will explore a method to maximize $g(\gamma)$ and recover the primal optimal solution.</p> <h2 id="optimizing-the-dual-function">Optimizing the Dual Function</h2> <p>As a reminder, we will aim to solve the problem</p> \begin{align} \max_{\gamma} g(\gamma) \quad \text{such that} \quad \gamma \geq 0. \label{eq:dual} \end{align} <p>Since $g$ is concave and differentiable, we can aim to maximize it by using a hill-climbing algorithm such as gradient ascent with backtracking line search. Below is an example dual function and its derivative at various values of $\gamma$.</p> <p align="center"> <img src="/files/l1_proj_figs/l1_dual_gif.gif" title="L1 projection, dual function" /> </p> <p>In this post we will solve this problem using a different method called <a href="https://en.wikipedia.org/wiki/Bisection_method">bisection</a>. The aim here is to set the derivative to zero and solve for $\gamma$. In other words, we seek the solution to $g'(\gamma) = 0$. The bisection method requires us to have a range $[\gamma_{\min}, \gamma_{\max}]$ in which we are sure the optimal solution $\gamma^*$ lies.</p> <p>First, since $\gamma^*$ must be feasible, we set $\gamma_{\min} = 0$. To find an upper bound, note that since the optimal objective value for Problem $\eqref{eq:primal}$ must be non-negative, and strong duality holds, the optimal value for Problem $\eqref{eq:dual}$ is also non-negative. This implies that</p> \begin{align*} - \kappa \gamma^* + \sum_{i=1}^{n} s_i(x_i(\gamma^*), \gamma^*) \geq 0. \end{align*} <p>Therefore,</p> \begin{align*} \gamma^* &amp; \leq \frac{1}{\kappa} \sum_{i=1}^{n} s_i(x_i(\gamma^*), \gamma^*) \leq \frac{1}{\kappa} \sum_{i=1}^{n} s_i(0, \gamma^*) = \frac{1}{\kappa} \sum_{i=1}^{n} \frac{a_i^2}{2} = \frac{1}{2 \kappa} \lVert a \rVert_2^2, \end{align*} <p>where the second inequality is due to the fact that $$x_i(\gamma^*)$$ is the minimizer of $s_i(\cdot, \gamma^*)$. 
So an upper bound we can set for $$\gamma^*$$ is $$\gamma_{\max} = \frac{1}{2 \kappa} \lVert a \rVert_2^2$$.</p> <p>Now that we know $$\gamma^*$$ is in between $\gamma_{\min} = 0$ and $$\gamma_{\max} = \frac{1}{2 \kappa} \lVert a \rVert_2^2$$, the bisection method works as follows. First, let $\gamma = (\gamma_{\min} + \gamma_{\max}) / 2$. If the sign of $g'(\gamma)$ is the same as that of $g'(\gamma_{\min})$, then $\gamma_{\min}$ is updated to $\gamma$. Otherwise, $\gamma_{\max}$ is updated to $\gamma$. It is as simple as that! The method is also guaranteed to converge, as after each iteration, the length of the interval $[\gamma_{\min}, \gamma_{\max}]$ is reduced by half.</p> <p>Another point to note is that $$g'$$ is a monotonically non-increasing function. On $\gamma \geq 0$, it attains its maximum of $$-\kappa + \lVert a \rVert_1$$ at $\gamma = 0$ and its minimum of $$-\kappa$$ whenever $$\gamma \geq \max_i \left\{ \lvert a_i \rvert \right\}$$. When $$\lVert a \rVert_1 \leq \kappa$$, $g'$ is non-positive on $\gamma \geq 0$, so $g'(\gamma) = 0$ has no positive root and bisection has nothing meaningful to find. In this special case, the dual maximizer is $\gamma^* = 0$ and we can directly conclude that the solution is $x = a$ without having to solve anything else.</p> <p align="center"> <img src="/files/l1_proj_figs/l1_dual_grad.png" title="L1 projection" /> </p> <p>The figures above show the derivative of a dual function where $a = [1,2,3]^\top$. The dark green horizontal line depicts $y = 0$. We can see that $g'$ is non-increasing and piecewise linear. In the left plot, $\kappa$ is set to $2 &lt; \lVert a \rVert_1 = 6$, which allows $g'$ to cross the $y = 0$ line and so the solution to $g'(\gamma) = 0$ exists. On the other hand, in the right plot where $\kappa$ exceeds $\lVert a \rVert_1$, no solution exists. In this case one can simply output $a$ as the solution.</p> <h2 id="implementation">Implementation</h2> <p>Here is a simple Python implementation of the bisection method for maximizing the dual objective. 
First we define a few functions.</p> <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np


def primal_fn(x, a):
    # Primal objective f(x) = 0.5 * ||x - a||_2^2
    return 0.5 * np.sum((x - a) ** 2)


# Vectorize the computation of s_i(x, gamma)
def s(x, gamma, a):
    return 0.5 * (x - a) ** 2 + gamma * np.abs(x)


def x_gamma(gamma, a):
    # Soft thresholding: convert a dual solution gamma to a primal solution x.
    # Use a float array so that integer inputs are not truncated.
    sol = np.zeros_like(a, dtype=float)
    idx = a &gt; gamma
    sol[idx] = a[idx] - gamma
    idx = a &lt; -gamma
    sol[idx] = a[idx] + gamma
    return sol


def dual_fn(gamma, kappa, a):
    x = x_gamma(gamma, a)
    return -kappa * gamma + np.sum(s(x, gamma, a))


def dual_grad(gamma, kappa, a):
    return -kappa + np.sum(np.maximum(np.abs(a) - gamma, 0))
</code></pre></div></div> <p>Then, the bisection method is straightforward. We can let the iterations run until the difference $\gamma_{\max} - \gamma_{\min}$ falls below a pre-defined tolerance $\varepsilon$, at which point the derivative should be close enough to $0$.</p> <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def bisection(a, kappa, eps=1e-5):
    # Special case: a is already inside the ball, so gamma = 0 and x = a
    if np.sum(np.abs(a)) &lt;= kappa:
        return 0.0

    gamma_min, gamma_max = 0.0, (1 / (2 * kappa)) * np.sum(a ** 2)
    gamma = gamma_min

    # Run until gamma_max and gamma_min are within eps of each other
    while gamma_max - gamma_min &gt; eps:
        gamma = (gamma_max + gamma_min) / 2
        grad = dual_grad(gamma, kappa, a)
        if grad &lt; 0:
            gamma_max = gamma
        else:
            gamma_min = gamma

    return gamma
</code></pre></div></div> <p>The first two plots in this post are produced using the following code.</p> <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Point to be projected
a = np.array([1.1, 1.2])

# Radius of the ell_1 norm ball
kappa = 1

# Find approximate solution to the dual problem
dual_solution = bisection(a, kappa, eps=1e-5)

# Convert to the primal solution
primal_solution = x_gamma(dual_solution, a)
</code></pre></div></div> <h2 id="sparseness-of-the-solution">Sparseness of the Solution</h2> <p>In the figure at the top of this post, you probably have observed that as $a$ moves, there seems to be a “region” of $a$ in which the solution stays at a vertex of the square. Projection onto the $\ell_1$-norm ball has an interesting characteristic: in high dimensions, the optimal solution has a tendency to be <em>sparse</em>, which means most of its elements are driven to zero. To see how, let’s try an example.</p> <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt;&gt;&gt; np.random.seed(100)
&gt;&gt;&gt; a = np.random.randn(100)
&gt;&gt;&gt; dual_solution = bisection(a, kappa=1, eps=1e-10)
&gt;&gt;&gt; primal_solution = x_gamma(dual_solution, a)
&gt;&gt;&gt; print(np.count_nonzero(primal_solution))
5
</code></pre></div></div> <p>In this example, I generated a $100$-dimensional vector $a$ by drawing $100$ independent samples from the standard normal distribution. After projection, only $5$ out of $100$ elements remain non-zero! (Be careful: I set $\varepsilon$ to be very small, but when comparing floating point numbers it is probably better to use a tolerance-based check such as <code class="language-plaintext highlighter-rouge">np.isclose</code> rather than exact equality with zero.)</p> <p>You may ask, “What’s the significance of this?” The tendency to drive most variables to zero is behind the success of the <a href="https://en.wikipedia.org/wiki/Lasso_(statistics)">lasso</a> method. Imagine you are performing a regression analysis with many, many variables. The lasso, or $\ell_1$ regularized, problem is</p> \begin{align} \min_{w} \frac{1}{2} \lVert Xw - y \rVert_2^2 + \lambda \lVert w \rVert_1, \label{eq:lasso} \end{align} <p>where $X$ is the design matrix, $y$ is the vector of ground-truth labels and $\lambda$ is the regularization strength. Note that the formulation of lasso regression is not exactly the problem discussed in this post: the variable has to go through a linear transformation in lasso. 
However, the observation carries over: the solution $w^*$ to this problem tends to be sparse, with most weights driven to zero.</p> <p>While $\ell_2$ is a more popular regularizer, $\ell_1$ may be preferred if you would like to assess feature importance: the features with non-zero coefficients tend to be very few and represent the most important ones you may want to keep during feature selection.</p> <h2 id="conclusion">Conclusion</h2> <p>In this post we explore the problem of projection onto an $\ell_1$-norm ball. We formalize the primal problem and see how the dual problem can be expressed and optimized. We also observe that the solution tends to be sparse in high dimensions.</p> <p>Several things deserve mention in these concluding remarks. First, we have yet to talk about the <em>asymptotic complexity</em> of solving $\ell_1$ projection. That is, given some tolerance $\epsilon$, how much time do we need to achieve an approximate solution $x$ to Problem $\eqref{eq:primal}$ such that $f(x) - f(x^*) &lt; \epsilon$? Second, you may be interested in a variant of $\ell_1$ projection called simplex projection, where the variable $x$ is also constrained to be non-negative. In this case the Lagrangian in $\eqref{eq:lagrangian}$ must involve another set of variables for the constraints $x_i \geq 0, i = 1, \ldots, n$, and a different optimization algorithm is needed. Third, minimizing the lasso objective as in $\eqref{eq:lasso}$ deserves some discussion, too. 
The resources below should offer some answers to these questions.</p> <h2 id="resources">Resources</h2> <ol> <li>Ryan Tibshirani’s <a href="https://www.stat.cmu.edu/~ryantibs/convexopt/">lectures on convex optimization</a>, specifically those on duality and proximal gradient descent.</li> <li><a href="https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf">Convex Optimization</a> by Boyd and Vandenberghe.</li> <li><a href="https://stanford.edu/~jduchi/projects/DuchiShSiCh08.pdf">Efficient Projections onto the $\ell_1$-Ball for Learning in High Dimensions</a> by John Duchi, Shai Shalev-Shwartz, Yoram Singer and Tushar Chandra.</li> </ol> <h1>A Simple, but Detailed, Example of Backpropagation</h1> <p>Josh Nguyen, 2022-12-07, https://joshnguyen.net/posts/backprop</p> <p>In this post, we will go through an exercise involving backpropagation for a fully connected feed-forward neural network. Though the example is simple, I observe that many “Introduction to Machine Learning” courses do not explain it thoroughly enough. In fact, a common way students are taught about optimizing a neural network is that the gradients can be calculated using an algorithm called <em>backpropagation</em> (a.k.a. the chain rule), and the parameters are updated using gradient descent. However, the chain rule, apart from its formula, is typically swept under the rug and replaced by a “black box” operation called <em>autograd</em>.</p> <p>I think being able to implement backpropagation, at least in the simplest case, is quite important for understanding it conceptually. 
Hopefully this will benefit the students who stumble upon this page after a while of searching for “How to implement backprop.”</p> <h2 id="the-exercise">The Exercise</h2> <p>Below is a simple fully connected neural network.</p> <p align="left"> <img src="/files/backprop_nn_example.svg" title="Simple neural network" width="1000px" /> </p> <p>Let’s decompose this architecture:</p> <ul> <li>The first layer has 5 neurons. This network accepts inputs that are 5-dimensional.</li> <li>The final layer has 1 neuron. It represents the loss function, which is a scalar.</li> <li>The second-last layer, which has 12 neurons, is what is typically called the last layer. If this layer is followed by softmax, you can think of this network as a 12-class classifier.</li> <li>There are two hidden layers, one with 10 neurons and the other with 4.</li> </ul> <p>Here’s the computation in a forward pass through this network:</p> <ol> <li>Start with the input, which is 5-dimensional.</li> <li>Compute the first hidden output: <ul> <li>Apply a linear transformation: $t_0 = W_0 x$</li> <li>Apply a non-linear activation: $z_0 = \tanh(t_0)$, where we apply $\tanh$ to every element of $t_0$</li> </ul> </li> <li>Compute the second hidden output: <ul> <li>Apply a linear transformation: $t_1 = W_1 z_0$</li> <li>Apply a non-linear activation: $z_1 = \sigma(t_1)$, element-wise as well</li> </ul> </li> <li>Compute the second-last layer (classification output): <ul> <li>Apply a linear transformation: $t_2 = W_2 z_1$</li> <li>No activation: $z_2 = \text{Id}(t_2)$</li> </ul> </li> <li>Compute the loss: <ul> <li>$\ell = \frac{1}{2} \lVert z_2 \rVert^2 = \frac{1}{2} \sum_{i} [z_2]_i^2$</li> </ul> </li> </ol> <p>Note that the dimensions of $W_0$, $W_1$ and $W_2$ are $10 \times 5$, $4 \times 10$ and $12 \times 4$, respectively.</p> <p>There is nothing special about choosing $\tanh$ and $\sigma$ (sigmoid) as the activation functions in steps 2 and 3; we can simply choose others such as ReLU. 
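</p>

<p>Before introducing any classes, the forward pass listed above can be sketched in a few lines of plain numpy. This is only an illustration of the shapes involved: the random weights, the seed and the variable names are assumptions made for this example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Weight shapes follow the architecture: 5 -> 10 -> 4 -> 12
W0 = rng.standard_normal((10, 5))
W1 = rng.standard_normal((4, 10))
W2 = rng.standard_normal((12, 4))
x = rng.standard_normal(5)

t0 = W0 @ x                      # linear transformation, shape (10,)
z0 = np.tanh(t0)                 # element-wise tanh
t1 = W1 @ z0                     # linear transformation, shape (4,)
z1 = 1.0 / (1.0 + np.exp(-t1))   # element-wise sigmoid
t2 = W2 @ z1                     # linear transformation, shape (12,)
z2 = t2                          # identity activation
loss = 0.5 * np.sum(z2 ** 2)     # scalar loss

print(z0.shape, z1.shape, z2.shape)
print(loss)
```

<p>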
Likewise, we can use an activation function in Step 4 as well.</p> <p>Now our job is to find the gradient of $\ell$ with respect to the model parameters, that is, $\nabla_{W_0}\ell, \nabla_{W_1} \ell$ and $\nabla_{W_2}\ell$.</p> <p>First, let’s define our network in <code class="language-plaintext highlighter-rouge">numpy</code>. To make things a bit easier, we will define a few reusable classes.</p> <h3 id="tensor">Tensor</h3> <p>The first class we define is a tensor. It is basically a <code class="language-plaintext highlighter-rouge">numpy</code> array paired with a second array of the same shape that stores its gradient. The array is stored in <code class="language-plaintext highlighter-rouge">.data</code> and its gradient in <code class="language-plaintext highlighter-rouge">.grad</code>.</p> <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Tensor:
    def __init__(self, arr, name=None):
        self.data = arr
        self.grad = None

        # Optionally store the name of this tensor
        self.name = name
</code></pre></div></div> <h3 id="activation-functions">Activation functions</h3> <p>Activation functions are functions that will be applied element-wise to tensors. 
For example, $z_0 = \tanh(t_0)$ means that $z_0$ and $t_0$ have the same dimensions, and every element in $z_0$ is the hyperbolic tangent transformation of the corresponding element in $t_0$.</p> <p>We will have a base class called <code class="language-plaintext highlighter-rouge">Activation</code>, which implements two methods:</p> <ul> <li><code class="language-plaintext highlighter-rouge">__call__</code> will apply the function to an input.</li> <li><code class="language-plaintext highlighter-rouge">grad</code> will apply the gradient function to an input.</li> </ul> <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Activation:
    def __call__(self, x):
        pass

    def grad(self, x):
        pass
</code></pre></div></div> <p>Let’s implement the $\tanh$ activation function. We can simply use <code class="language-plaintext highlighter-rouge">np.tanh</code> for the forward pass. The derivative of this function is</p> \begin{align*} \tanh'(x) = 1 - \tanh^2(x). 
\end{align*} <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Tanh(Activation):
    def __call__(self, x):
        return np.tanh(x)

    def grad(self, x):
        return 1 - np.tanh(x) ** 2
</code></pre></div></div> <p>Similarly, we can implement the sigmoid function based on its formulas:</p> \begin{align*} \sigma(x) &amp;= \frac{1}{1 + e^{-x}}\\ \sigma'(x) &amp;= \sigma(x) (1 - \sigma(x)). 
\end{align*} <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class Sigmoid(Activation):
    def __call__(self, x):
        # Matches the formula above; avoids inf / inf = nan for large positive x
        return 1.0 / (1.0 + np.exp(-x))

    def grad(self, x):
        sx = self(x)
        return sx * (1 - sx)
</code></pre></div></div> <p>Another function we used above is the identity function, which simply returns the input. 
Its derivative is $1$.</p> <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Identity</span><span class="p">(</span><span class="n">Activation</span><span class="p">):</span> <span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span> <span class="k">return</span> <span class="n">x</span> <span class="k">def</span> <span class="nf">grad</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span> <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">ones_like</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> </code></pre></div></div> <h3 id="loss-function">Loss function</h3> <p>A loss function takes an input vector and returns a scalar (number). We will also implement the <code class="language-plaintext highlighter-rouge">grad</code> method for this function; it should return a vector of the same shape as the input.</p> <p>The example loss function above is the (half) squared norm, which squares every element of the input, sums them together, and divides the result by two. 
Calculus tells us that the gradient of such a function is the input itself.</p> <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">HalfSumSq</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span> <span class="k">return</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">x</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span> <span class="k">def</span> <span class="nf">grad</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span> <span class="k">return</span> <span class="n">x</span> </code></pre></div></div> <h2 id="network-and-forward-propagation">Network and Forward Propagation</h2> <p>Now we are ready to put things together and create our neural net. For the sake of simplicity, we will only define one additional method for our class, which is <code class="language-plaintext highlighter-rouge">loss_and_grad</code>. 
It will (1) take an input $x$ and perform a forward pass to get the loss, and (2) perform a backward pass to calculate the gradient of the loss with respect to its parameters.</p> <p>Since the forward pass was explained above, we can already define most of our network.</p> <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="k">class</span> <span class="nc">Net</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="c1"># Weight matrices. We will initialize them randomly </span> <span class="bp">self</span><span class="p">.</span><span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">Tensor</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">output_dim</span><span class="p">,</span> <span class="n">input_dim</span><span class="p">))</span> <span class="k">for</span> <span class="n">input_dim</span><span class="p">,</span> <span class="n">output_dim</span> <span class="ow">in</span> <span class="p">[(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">10</span><span class="p">),</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">12</span><span class="p">)]]</span> <span class="c1"># Register t_0, t_1,... The default value (np.zeros) doesn't matter, as we </span> <span class="c1"># populate them in the forward pass later. 
</span> <span class="bp">self</span><span class="p">.</span><span class="n">linear_outputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">Tensor</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">dim</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">float</span><span class="p">))</span> <span class="k">for</span> <span class="n">dim</span> <span class="ow">in</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">12</span><span class="p">)]</span> <span class="c1"># Register z_0, z_1,... similarly </span> <span class="bp">self</span><span class="p">.</span><span class="n">nonlinear_outputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">Tensor</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">dim</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">float</span><span class="p">))</span> <span class="k">for</span> <span class="n">dim</span> <span class="ow">in</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">12</span><span class="p">)]</span> <span class="c1"># Activation and loss functions </span> <span class="bp">self</span><span class="p">.</span><span class="n">activations</span> <span class="o">=</span> <span class="p">[</span><span class="n">Tanh</span><span class="p">(),</span> <span class="n">Sigmoid</span><span class="p">(),</span> <span class="n">Identity</span><span class="p">()]</span> <span class="bp">self</span><span class="p">.</span><span class="n">loss</span> <span class="o">=</span> <span class="n">HalfSumSq</span><span 
class="p">()</span> <span class="k">def</span> <span class="nf">loss_and_grad</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span> <span class="n">curr_output</span> <span class="o">=</span> <span class="n">Tensor</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># Forward prop </span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">nonlinear_outputs</span><span class="p">)):</span> <span class="c1"># Linear transformation </span> <span class="bp">self</span><span class="p">.</span><span class="n">linear_outputs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">data</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">weights</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">data</span> <span class="o">@</span> <span class="n">curr_output</span><span class="p">.</span><span class="n">data</span> <span class="n">curr_output</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">linear_outputs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="c1"># Activation function </span> <span class="bp">self</span><span class="p">.</span><span class="n">nonlinear_outputs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">data</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">activations</span><span class="p">[</span><span class="n">i</span><span class="p">](</span><span class="n">curr_output</span><span class="p">.</span><span 
class="n">data</span><span class="p">)</span> <span class="n">curr_output</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">nonlinear_outputs</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="c1"># Loss function </span> <span class="n">l</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">loss</span><span class="p">(</span><span class="n">curr_output</span><span class="p">.</span><span class="n">data</span><span class="p">)</span> <span class="c1"># We will implement backprop later </span> <span class="c1"># TODO: backprop </span> <span class="k">return</span> <span class="n">l</span> </code></pre></div></div> <h2 id="backpropagation">Backpropagation</h2> <p>The forward propagation above creates a <em>computation graph</em>, which shows us the flow of signals from input to output. To find the gradients, we need to traverse this graph <em>backwards</em>, that is, from output to input, hence the name.</p> <p>Recall that this is an application of the chain rule in multivariate calculus. Suppose we have a scalar function $h(v) = (f \circ g)(v) = f(g(v))$. To find the gradient of $h$ with respect to $v$, we follow the chain rule \begin{align*} J_{h}(v) = J_{f \circ g} (v) = J_{f}(g(v)) J_{g}(v), \end{align*} where $J$ denotes the <em>Jacobian</em>, which is a matrix of partial derivatives. Since $h$ is a scalar function, $J_{h}(v)$ is a row vector. Transposing it will give us the gradient with respect to $v$.</p> <p>The computation of $h$ looks familiar. First, we have an input $v$. Then we transform $v$ to another (vector or scalar) value $g(v)$. Then we use $g(v)$ as the input to $f$. 
The chain rule says that to find the gradient for $v$, we need to go backwards: first differentiate $f$ with respect to $g(v)$, then differentiate $g$ with respect to $v$, then multiply the two together.</p> <h3 id="in-our-neural-network-example">In our neural network example</h3> <p>Back to our example. As we have seen, the order of computation in a forward propagation is</p> \begin{align*} x \rightarrow t_0 \rightarrow z_0 \rightarrow t_1 \rightarrow z_1 \rightarrow t_2 \rightarrow z_2 \rightarrow \ell. \end{align*} <p>It should be clear to us now that finding gradients means we have to traverse the network backwards. Start from the loss $\ell$. Differentiate that with respect to $z_2$. Then with respect to $t_2$. Then with respect to $z_1$. And so on.</p> <p>These intermediate gradients are not our end goal. What we actually want is the gradient with respect to $W_0, W_1$ and $W_2$, which are the matrices that transform a $z$ in one layer to a $t$ in the next layer. However, in calculating these gradients, the chain rule requires us to compute the above intermediate gradients as well.</p> <p>Below is a step-by-step procedure of backpropagation.</p> <h3 id="gradients-for-z_2-and-t_2">Gradients for $z_2$ and $t_2$</h3> <p>First, let’s start with $z_2$, the most immediate signal. Since we’re using the half sum of squares loss, the gradient is just $z_2$ itself:</p> \begin{align*} \nabla_{z_2} \ell = z_2. \end{align*} <p>Now to $t_2$. Since $z_2$ is an element-wise identity transformation of $t_2$, using the chain rule we have</p> \begin{align*} \nabla_{t_2} \ell = \nabla_{z_2} \ell \odot \text{Id}'(t_2), \end{align*} <p>where $\odot$ denotes element-wise multiplication. 
The reason why we have an element-wise multiplication here is that the Jacobian of $z_2$ with respect to $t_2$ is a diagonal matrix, $\text{diag}(\text{Id}'(t_2))$, and multiplying $J_\ell(z_2)$ with this matrix is the same as performing an element-wise product.</p> <h3 id="gradients-for-z_1-w_2-and-t_1">Gradients for $z_1$, $W_2$, and $t_1$</h3> <p>Now let’s move back one layer. Recall that \begin{align*} t_2 = W_2 z_1. \end{align*}</p> <p>We need to find the gradient for both $z_1$ and $W_2$. First, since this is a linear operation, differentiating $t_2$ with respect to $z_1$ will simply give us $W_2$. Using the chain rule again, we have</p> \begin{align*} \nabla_{z_1} \ell = W_2^\top (\nabla_{t_2} \ell). \end{align*} <p>Now to $W_2$. Applying the chain rule, we have</p> \begin{align*} \nabla_{W_2} \ell = (\nabla_{t_2} \ell) z_1^\top. \end{align*} <p>Note that this is an outer product.</p> <p>In computing the gradients for both $z_1$ and $W_2$, we used $\nabla_{t_2} \ell$ from the previous step. This is why the previous gradient signal needs to be stored for backpropagation, and why we need to calculate the gradient for variables we’re not interested in (remember, we only need the gradients for $W$).</p> <p>Finally to $t_1$. Since $z_1$ is an element-wise sigmoid transformation of $t_1$, we apply the same formula as that for $t_2$, this time replacing $\text{Id}$ with $\sigma$:</p> \begin{align*} \nabla_{t_1} \ell = \nabla_{z_1} \ell \odot \sigma'(t_1). \end{align*} <h3 id="remaining-gradients">Remaining gradients</h3> <p>There is no need to repeat ourselves when finding the gradients for the rest of the variables. This is because the procedure for $(z_0, W_1, t_0)$ is identical to that for $(z_1, W_2, t_1)$. Once we have the gradient signal $\nabla_{t_1}\ell$, we’re good to go.</p> <p>One final note is that when we have traversed all the way to the beginning of the network, we only need to find the gradient with respect to $W_0$. 
This will require $z_{-1}$, which is just $x$. 
The gradient for $x$ (the input) is not used for anything.</p> <h3 id="implementing-backpropagation">Implementing backpropagation</h3> <p>We are now ready to fill in the TODO in the <code class="language-plaintext highlighter-rouge">loss_and_grad</code> method in <code class="language-plaintext highlighter-rouge">Net</code> above.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Paste this code at the end of loss_and_grad </span> <span class="c1"># Diff the loss w.r.t. final layer (This is nabla_{z2}) </span><span class="bp">self</span><span class="p">.</span><span class="n">nonlinear_outputs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">grad</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">loss</span><span class="p">.</span><span class="n">grad</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">nonlinear_outputs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">data</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">nonlinear_outputs</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">):</span> <span class="c1"># Gradient from z to t. 
The "*" below is the element-wise product </span> <span class="bp">self</span><span class="p">.</span><span class="n">linear_outputs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">grad</span> <span class="o">=</span> \ <span class="bp">self</span><span class="p">.</span><span class="n">activations</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">grad</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">linear_outputs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">data</span><span class="p">)</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">nonlinear_outputs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">grad</span> <span class="c1"># Gradient w.r.t. weights matrix. This is nabla_{W}. </span> <span class="n">prev_output</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">nonlinear_outputs</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">data</span> <span class="k">if</span> <span class="n">i</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">else</span> <span class="n">x</span> <span class="bp">self</span><span class="p">.</span><span class="n">weights</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">grad</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">outer</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">linear_outputs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">grad</span><span class="p">,</span> <span 
class="n">prev_output</span><span class="p">)</span> <span class="c1"># Check if we have traversed to the first layer </span> <span class="k">if</span> <span class="n">i</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span> <span class="c1"># If not at the first layer, continue finding nabla_{z} </span> <span class="bp">self</span><span class="p">.</span><span class="n">nonlinear_outputs</span><span class="p">[</span><span class="n">i</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">grad</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">weights</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">data</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="bp">self</span><span class="p">.</span><span class="n">linear_outputs</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">grad</span> <span class="k">return</span> <span class="n">l</span> </code></pre></div></div> <h2 id="finishing-network-with-backpropagation">Finishing Network with Backpropagation</h2> <p>Let’s try an input $x$ and find the gradients of $\ell$ with respect to the parameters. 
After we call <code class="language-plaintext highlighter-rouge">loss_and_grad</code>, the gradients of all eligible tensors will be stored in their <code class="language-plaintext highlighter-rouge">.grad</code> attributes.</p> <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># For reproducibility </span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span> <span class="n">np_net</span> <span class="o">=</span> <span class="n">Net</span><span class="p">()</span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">float</span><span class="p">)</span> <span class="n">loss</span> <span class="o">=</span> <span class="n">np_net</span><span class="p">.</span><span class="n">loss_and_grad</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># Get the gradients for all parameters </span><span class="n">np_grads</span> <span class="o">=</span> <span class="p">{</span><span class="s">"W"</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">):</span> <span class="n">g</span><span class="p">.</span><span class="n">grad</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">g</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">np_net</span><span class="p">.</span><span class="n">weights</span><span class="p">)}</span> </code></pre></div></div> <p>Now we are ready to take a gradient descent step!</p> <h2 id="autograd-with-pytorch">Autograd with 
PyTorch</h2> <p>To verify that our computation is correct, let’s use <code class="language-plaintext highlighter-rouge">autograd</code> in PyTorch and find the gradients for the parameters.</p> <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span> <span class="n">pt_net</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">()</span> <span class="n">pt_net</span><span class="p">.</span><span class="n">add_module</span><span class="p">(</span><span class="s">"W0"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span> <span class="n">pt_net</span><span class="p">.</span><span class="n">add_module</span><span class="p">(</span><span class="s">"A0"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Tanh</span><span class="p">())</span> <span class="n">pt_net</span><span class="p">.</span><span class="n">add_module</span><span class="p">(</span><span class="s">"W1"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">out_features</span><span 
class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span> <span class="n">pt_net</span><span class="p">.</span><span class="n">add_module</span><span class="p">(</span><span class="s">"A1"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sigmoid</span><span class="p">())</span> <span class="n">pt_net</span><span class="p">.</span><span class="n">add_module</span><span class="p">(</span><span class="s">"W2"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">in_features</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span> <span class="n">out_features</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span> <span class="n">pt_net</span><span class="p">.</span><span class="n">add_module</span><span class="p">(</span><span class="s">"A2"</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Identity</span><span class="p">())</span> <span class="c1"># Copy the weights in our NumPy network to this new network </span><span class="k">for</span> <span class="n">param</span><span class="p">,</span> <span class="n">np_param</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">pt_net</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">np_net</span><span class="p">.</span><span class="n">weights</span><span class="p">):</span> <span class="n">param</span><span 
class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">np_param</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">float</span><span class="p">)</span> <span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">float</span><span class="p">)</span> <span class="n">output</span> <span class="o">=</span> <span class="n">pt_net</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">loss</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">output</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="s">"Loss ="</span><span class="p">,</span> <span class="n">loss</span><span class="p">.</span><span class="n">detach</span><span class="p">().</span><span class="n">item</span><span class="p">())</span> <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span> <span class="c1"># Get the gradients for all parameters </span><span class="n">pt_grads</span> <span class="o">=</span> <span class="p">{</span><span class="n">name</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"."</span><span class="p">)[</span><span class="mi">0</span><span class="p">]:</span> <span class="n">x</span><span class="p">.</span><span class="n">grad</span><span class="p">.</span><span 
class="n">numpy</span><span class="p">()</span> <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">pt_net</span><span class="p">.</span><span class="n">named_parameters</span><span class="p">()}</span> <span class="n">pt_grads</span> </code></pre></div></div> <p>Check that the gradients computed by both versions match.</p> <div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">np_grads</span><span class="p">.</span><span class="n">keys</span><span class="p">():</span> <span class="k">assert</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">pt_grads</span> <span class="k">print</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="s">"gradients match?"</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">allclose</span><span class="p">(</span><span class="n">np_grads</span><span class="p">[</span><span class="n">name</span><span class="p">],</span> <span class="n">pt_grads</span><span class="p">[</span><span class="n">name</span><span class="p">]))</span> </code></pre></div></div> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>W0 gradients match? True W1 gradients match? True W2 gradients match? True </code></pre></div></div> <h2 id="conclusion">Conclusion</h2> <p>We have learned how backpropagation works in a feed-forward neural network. 
Here are some things you can try on your own:</p> <ul> <li>Add more layers to the network.</li> <li>Try more activation functions, e.g., ReLU, leaky ReLU, GELU, etc.</li> <li>Add bias to each Linear layer and find the gradient with respect to the bias.</li> </ul> <p>The example we just went through is very simple. 
You may have seen other, more complicated, architectures in which the computation graph is not sequential. An example is ResNet with skip connections. The chain rule still applies, but backpropagation requires you to perform a topological sorting of the nodes in this graph, and traverse backwards. In fact, in our example, going back from output to input is basically this traversal, as our network is sequential.</p> <p>Finally, you can download a Jupyter notebook version of this post <a href="/files/backprop_tutorial.ipynb">here</a>.</p>Josh Nguyenjoshtn@seas.upenn.eduIn this post, we will go through an exercise involving backpropagation for a fully connected feed-forward neural network. Though simple, I observe that a lot of “Introduction to Machine Learning” courses don’t tend to explain this example thoroughly enough. In fact, a common way students are taught about optimizing a neural network is that the gradients can be calculated using an algorithm called backpropagation (a.k.a. the chain rule), and the parameters are updated using gradient descent. However, the chain rule, apart from its formula, is typically swept under the rug and replaced by a “black box” operation called autograd.PageRank, Stochastic Matrices and the Power Iteration2022-10-26T00:00:00-07:002022-10-26T00:00:00-07:00https://joshnguyen.net/posts/pagerank<p>In this post, we will revisit a popular algorithm called <a href="https://en.wikipedia.org/wiki/PageRank">PageRank</a>, which is used by Google to rank webpages for its search engine. Surprising to some but not so to others, PageRank is simple enough that first-year undergraduate linear algebra is all that is required to understand it.</p> <h2 id="the-web-as-a-graph">The Web as a Graph</h2> <p>Consider the web as a collection of <em>pages</em>, some of which are connected to each other using <em>hyperlinks</em>. 
For example, the Wikipedia article on <a href="https://en.wikipedia.org/wiki/General_relativity">general relativity</a> contains a hyperlink to another article on <a href="https://en.wikipedia.org/wiki/Albert_Einstein">Albert Einstein</a>. By clicking the link, we move from the former webpage to the latter.</p> <p>We can model this as a graph $$G = (V, E)$$, where the set of nodes (or vertices) $$V$$ contains the webpages and the set of edges $$E$$ contains binary relations $$(v_i, v_j)$$, indicating that the page $$v_i \in V$$ contains a hyperlink to $$v_j \in V$$. Since $$v_i$$ may lead to $$v_j$$ but not the other way around, the edge $$(v_i, v_j)$$ may be in $$E$$ while $$(v_j, v_i)$$ may not. In this case we call the graph $$G$$ a <em>directed</em> graph.</p> <h2 id="the-importance-of-a-page">The Importance of a Page</h2> <p>PageRank defines a score for each webpage where more “important” pages have high scores. This is particularly useful in <em>information retrieval</em>, where a system is asked to return pages relevant to a query. An assumption is that higher-ranked pages should be returned first, as they are more important and therefore have a higher chance of being what the user wants. Some other intuitions on building a ranking system are:</p> <ul> <li>If many pages have a hyperlink to page $$i$$, then $$i$$ should be important.</li> <li>If a highly ranked page links to page $$i$$, then $$i$$ should also be highly ranked.</li> </ul> <p>Let $$r \in \mathbb{R}^n$$ be the rank vector—that is, $$r_i$$ is the numerical value denoting the importance of page $$i$$. We will propose a method for finding $$r$$ such that if $$r_i &gt; r_j$$, then page $$i$$ is more important than page $$j$$. 
Note that since the rankings are ordinal, we can just scale $$r$$ by a positive number and the relative ordering of the pages based on importance will not change at all.</p> <h3 id="the-importance-matrix">The importance matrix</h3> <p>To find the importance of every page, we will need to exploit the structure of the graph, specifically the in- and out-links of every node. Suppose that page $$i$$ with importance $$r_i$$ has $$d_i$$ out-neighbors, that is, pages that $$i$$ links to. In graph theory, $$d_i$$ is also called the <em>out-degree</em> of $$i$$. Based on the intuitions above, we want these out-neighbors to enjoy $$i$$’s importance. To do so, we assume that each out-neighbor of $$i$$ will get an equal amount of importance from $$i$$. In other words, each out-neighbor will get an amount $$\frac{r_i}{d_i}$$ of importance from $$i$$.</p> <p>In this setting, the importance of a page $$j$$ will be the sum of all importance flowing into it from its in-neighbors:</p> \begin{align} \label{eq:importance_flow} r_j = \sum_{i \rightarrow j} \frac{r_i}{d_i}. \end{align} <p>Notice that we have a recursive structure: Every page influences the pages it leads to. But the importance of that page itself flows from the pages leading to it.</p> <p>Define a matrix $$A$$, called the <em>importance matrix</em>, where $$A_{j, i} = \frac{1}{d_i}$$ if page $$i$$ leads to page $$j$$, and $$A_{j, i} = 0$$ otherwise. In other words, column $$i$$ of $$A$$ is a vector containing either $$0$$ (where there is no out-going edge) or $$\frac{1}{d_i}$$ (where there is). Since the out-degree of $$i$$ is exactly $$d_i$$, it must be the case that column $$i$$ sums to $$1$$.</p> <p>The product $$A r$$ gives us the importance flowing into every page. 
To see why, consider the $$j$$th component of this product:</p> \begin{align*} (A r)_j = \sum_{i=1}^{n} A_{j, i} r_{i} = \sum_{i \rightarrow j} \frac{r_i}{d_i}, \end{align*} <p>where we have the last equality because $$A_{j, i}$$ is non-zero (and equal to $$\frac{1}{d_i}$$) only when there is an edge from $$i$$ to $$j$$. This equation exactly matches $$\eqref{eq:importance_flow}$$.</p> <h3 id="the-random-surfer">The random surfer</h3> <p>One can think of $$A$$ as an adjacency matrix of $$G$$, but instead of $$A_{j, i} = 1$$ when there is an edge from $$i$$ to $$j$$, we have $$A_{j, i} = \frac{1}{d_i}$$. There is a nice interpretation of $$A$$ called the random surfer model.</p> <p>Suppose we have a web surfer who is currently on page $$i$$. To visit a new page, the surfer will randomly choose one of the out-neighbors of $$i$$. Since the out-degree of $$i$$ is $$d_i$$, if we assume that all out-neighbors are equally likely to be chosen, the probability that the surfer will choose any particular out-neighbor is $$\frac{1}{d_i}$$. This is exactly captured in the matrix $$A$$.</p> <h3 id="pagerank-as-a-fixed-point-problem">PageRank as a fixed-point problem</h3> <p>Since $$(A r)_j$$ gives us the importance of page $$j$$, which is also equal to $$r_j$$, we have:</p> \begin{align} \label{eq:fixed_point} A r = r. \end{align} <p>The solution $$r$$ to this linear system is the vector containing the ranks of our webpages. Note that we can scale $$r$$ by a positive number and it would still satisfy this equation, achieving our goal of preserving the order from positive scaling stated above.</p> <p>Such an $$r$$ satisfying $$\eqref{eq:fixed_point}$$ is called a <em>fixed point</em> of $$A$$, because applying $$A$$ to $$r$$ (that is, multiplying $$A$$ by $$r$$) will not change the values of $$r$$ at all. I have another post on solving for a fixed point in the context of machine learning, which can be found <a href="/posts/anderson-acceleration">here</a>. 
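</p> <p>As a small illustration of the importance matrix and the fixed-point equation, here is a minimal NumPy sketch. The four-page graph, the edge list <code class="language-plaintext highlighter-rouge">edges</code>, the helper name <code class="language-plaintext highlighter-rouge">importance_matrix</code> and the candidate vector <code class="language-plaintext highlighter-rouge">r</code> are all hypothetical, chosen only for this example, and the sketch assumes every page has at least one out-link.</p>

```python
import numpy as np


def importance_matrix(n, edges):
    """Build the importance matrix A for a directed graph on n pages.

    `edges` is a list of (i, j) pairs meaning "page i links to page j".
    Assumption: every page has at least one out-link (no dead ends).
    """
    A = np.zeros((n, n))
    for i, j in edges:
        A[j, i] = 1.0              # edge i -> j is recorded in column i
    out_degree = A.sum(axis=0)     # d_i = number of out-neighbors of page i
    return A / out_degree          # broadcasting divides column i by d_i


# Hypothetical 4-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 0, 3 -> 2
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 0), (3, 2)]
A = importance_matrix(4, edges)

r = np.array([0.4, 0.2, 0.4, 0.0])   # a candidate rank vector
flow = A @ r                          # importance flowing into each page

print(A.sum(axis=0))                  # column sums of A
print(np.allclose(flow, r))           # whether r satisfies A r = r
```

<p>For this toy graph, the candidate vector is left unchanged by $$A$$, so it is a fixed point; page $$3$$ receives zero importance because no page links to it.</p> <p>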
In this post, we will revisit a method to solve for $$r$$.</p> <h2 id="solving-pagerank">Solving PageRank</h2> <p>If we look again at equation $$\eqref{eq:fixed_point}$$, we can recognize that this is an eigenvector problem. Specifically, if $$\eqref{eq:fixed_point}$$ holds, then $$r$$ must be an eigenvector of $$A$$ corresponding to an eigenvalue of $$1$$. There are two important questions to answer.</p> <p>First, is it guaranteed that $$A$$ has $$1$$ as an eigenvalue? After all, $$A$$ is just a non-negative matrix with each column summing to $$1$$. It turns out that this is true, and we will see the proof below.</p> <p>Second, given that $$1$$ is an eigenvalue, then we can solve $$A r = r$$ using a row-reduction algorithm such as <a href="https://en.wikipedia.org/wiki/Gaussian_elimination">Gaussian elimination</a>. Is that it? The answer is no, because Gaussian elimination has the time complexity of $$O(n^3)$$, where $$n$$ is the number of pages. This does not scale well with our page collection, as $$n$$ could be in the billions, if not more. Therefore, we need to find another way to solve $$\eqref{eq:fixed_point}$$.</p> <h3 id="stochastic-matrices">Stochastic matrices</h3> <p>To answer the first question above, notice that the matrix $$A$$ is an example of a <em>stochastic matrix</em>, which is a square matrix with non-negative entries and having every column sum to 1. In the context of PageRank, $$A$$ is also called the <em>stochastic adjacency matrix</em>.</p> <p>What is interesting about a stochastic matrix is that it accepts $$1$$ as an eigenvalue, and all other eigenvalues (real or complex) of $$A$$ are less than or equal to $$1$$ in absolute value.</p> <div style="padding-left:2em; padding-right:2em"> <b>Proof.</b> <br /> Since $A$ is a square matrix, $A$ and $A^\top$ share the same eigenvalues. We need to prove that $1$ is an eigenvalue of $A^\top$. 
Because every row of $A^\top$ sums to $1$, we have $A^\top \mathbf{1}_n = \mathbf{1}_n$, where $\mathbf{1}_n$ is a column vector of $n$ ones. So, $1$ is an eigenvalue of $A^\top$ and, therefore, of $A$. <br /><br /> To show why all other eigenvalues of $A$ are less than or equal to $1$ in absolute value, let $\lambda$ be an eigenvalue of $A$. So $\lambda$ is also an eigenvalue of $A^\top$, associated with an eigenvector $x = [x_1,\ldots,x_n]^\top$. In other words, $A^\top x = \lambda x$. Let $j$ be the index of the largest element of $x$ in absolute value, that is, $|x_i| \leq |x_j| ~ \text{for all} ~ i=1,\ldots,n$. Looking at the $j$th component of $A^\top x = \lambda x$, we have \begin{align*} |\lambda| |x_j| = |\lambda x_j| = \left| \sum_{i=1}^{n} A_{i, j} x_i \right| \leq \sum_{i=1}^{n} A_{i, j} |x_j| = |x_j| \sum_{i=1}^{n} A_{i, j} = |x_j|, \end{align*} where the inequality uses the triangle inequality and the definition of $x_j$, and the last equality uses the fact that column $j$ of $A$ sums to 1. Since $x_j \neq 0$, this implies that $|\lambda| \leq 1$. </div> <h3 id="the-fixed-point-iteration">The fixed-point iteration</h3> <p>To answer the second question, we use the fact we just proved above, which is that $$1$$ is the largest eigenvalue of $$A$$ in absolute value. In linear algebra, it is also called the <a href="https://en.wikipedia.org/wiki/Spectral_radius"><em>spectral radius</em></a> of $$A$$. 
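</p> <p>The fact proved above can also be checked numerically. The sketch below is only a sanity check under stated assumptions: the $$5 \times 5$$ matrix <code class="language-plaintext highlighter-rouge">S</code> is a hypothetical random example (not a web graph), built by normalizing the columns of a random positive matrix so that each sums to $$1$$.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: a random positive matrix whose columns are
# rescaled to sum to 1, which makes it a (column-)stochastic matrix.
M = rng.random((5, 5))
S = M / M.sum(axis=0)

eigenvalues = np.linalg.eigvals(S)            # may be complex
spectral_radius = np.abs(eigenvalues).max()   # largest absolute value
has_eigenvalue_one = np.isclose(eigenvalues, 1.0).any()

print(spectral_radius, has_eigenvalue_one)
```

<p>Up to floating-point error, the largest absolute value among the eigenvalues should be $$1$$, and $$1$$ itself should appear among them.</p> <p>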
As an alternative to Gaussian elimination, a popular algorithm to find the spectral radius and its corresponding eigenvector is the <a href="https://en.wikipedia.org/wiki/Power_iteration">power iteration</a>.</p> <div style="padding-left:2em; padding-bottom:1em;"> Procedure: Power Iteration <br /> Input: A diagonalizable $n \times n$ matrix $A$ <br /> Let $b_0$ be some non-zero vector <br /> For $k = 0, \ldots, K-1$ do <br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Apply $A$ to $b_k$: $\tilde{b}_{k+1} = A b_{k}$ <br /> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Normalize: $b_{k+1} = \frac{\tilde{b}_{k+1}}{\lVert \tilde{b}_{k+1} \rVert}$ <br /> Output: $b_{K}$ </div> <p>The sequence $$\left(\frac{\lVert A b_k \rVert}{\lVert b_k \rVert}\right)_k$$ is guaranteed to converge to the spectral radius of $$A$$ (which is $$1$$ in our case), and the sequence $$(b_k)_k$$ converges to the corresponding eigenvector with unit norm.</p> <p>An analysis of how fast this sequence converges can be found <a href="https://en.wikipedia.org/wiki/Power_iteration#Analysis">here</a>. In practice, one would run the power iteration until the difference between two consecutive iterates falls below some pre-defined tolerance $$\epsilon$$. For example, we can run until $$\lVert b_{k+1} - b_{k} \rVert \leq 10^{-3}$$.</p> <h3 id="conclusion-temporary">Conclusion (temporary)</h3> <p>We have learned how to find the importance scores of webpages in order to rank them. First, we construct the importance matrix from the structure of the graph. Then, we use the power iteration to solve for the fixed point of this matrix, which is the eigenvector corresponding to the largest eigenvalue in absolute value. This solution $$r$$ now contains the importance of the pages, and we are ready to use $$r$$ to rank them!</p> <p>However, there are two potential problems with this approach. 
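</p> <p>Before turning to those problems, the power iteration procedure described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not a production implementation: the function name <code class="language-plaintext highlighter-rouge">power_iteration</code>, its parameters <code class="language-plaintext highlighter-rouge">tol</code> and <code class="language-plaintext highlighter-rouge">max_iter</code>, and the four-page matrix are hypothetical, and the $$\ell_1$$ norm is my choice for the normalization step so that the scores sum to $$1$$.</p>

```python
import numpy as np


def power_iteration(A, tol=1e-10, max_iter=1000):
    """Approximate the eigenvector of A for its largest eigenvalue.

    Assumption: iterates are normalized in the l1 norm, and the loop
    stops once two consecutive iterates differ by at most `tol`.
    """
    n = A.shape[0]
    b = np.full(n, 1.0 / n)                 # b_0: uniform starting vector
    for _ in range(max_iter):
        b_next = A @ b                      # apply A to the current iterate
        b_next = b_next / np.linalg.norm(b_next, 1)   # normalize
        if np.linalg.norm(b_next - b, 1) <= tol:
            return b_next
        b = b_next
    return b


# Hypothetical importance matrix of a 4-page web with edges
# 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 0, 3 -> 2 (column i = page i).
A = np.array([
    [0.0, 0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
])

r = power_iteration(A)
ranking = np.argsort(-r)    # page indices, most important first
print(r, ranking)
```

<p>Each iteration costs one matrix-vector product, which is what makes this approach attractive compared with Gaussian elimination when $$A$$ is large and sparse.</p> <p>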
We explore these two problems and propose a solution to each below.</p> <h2 id="two-problems-with-pagerank">Two Problems with PageRank</h2> <h3 id="problem-1-dead-ends">Problem 1: dead ends</h3> <p>In the previous section, we have learned to use the power iteration to solve for the importance vector $$r$$. However, the power iteration works under an assumption that the matrix $$A$$ is <a href="https://en.wikipedia.org/wiki/Diagonalizable_matrix"><em>diagonalizable</em></a>. This assumption can fail, and $$A$$ is no longer a stochastic matrix, if a column of $$A$$ contains all zeros. This case happens when a webpage has no outgoing links. In other words, the page is a <em>dead end</em>.</p> <p>How do we solve this? Let’s go back to the random surfer model above. If the surfer is at a dead end, meaning there is no hyperlink on the page the surfer can click to go to, we will assume that they will randomly jump to any other page in our collection. In addition, all pages are assumed to be equally likely to be chosen. So, if a page $$i$$ is a dead end, we will replace the all-zeros column for $$i$$ with a column of all $$\frac{1}{n}$$’s, where $$n$$ is the number of pages in our collection.</p> <p>Therefore, we can transform the matrix $$A$$ into one without dead ends. Let us call this matrix $$A'$$. Every column of $$A'$$ now sums to 1.</p> <h3 id="problem-2-spider-traps">Problem 2: spider traps</h3> <p>The matrix $$A'$$ is now guaranteed to be a stochastic matrix, and we are ready to use the power iteration to find its fixed point. However, the result might not be what we want. Consider the following scenario: In our web graph, there is a set of at least one node such that there are no links coming out of this set. There can be links between nodes in this set, but there are no links to any other outside node.</p> <p>We call such a set of nodes a <em>spider trap</em>. But what is the problem? 
If we use the power iteration for a graph with a spider trap, the algorithm will cause all importance scores to be captured within the nodes in this spider trap, and the rest of the nodes will have zero importance. Such sets of pages can be constructed intentionally or unintentionally, but their existence will cause PageRank to output an undesirable result.</p> <p>So how do we deal with spider traps? Once the random surfer is in a spider trap, they will never be able to leave it. We will assume that, when the surfer is at page $$i$$, they will flip a coin. If the coin comes up heads, the surfer will follow a link at random, and the probability of choosing a page is found by looking up the $$i$$th column of $$A'$$. If the coin comes up tails, the surfer will jump to a page in our collection uniformly at random. So, if page $$i$$ is in a spider trap, the surfer has some chance of jumping outside the trap when the coin comes up tails.</p> <p>To formalize this, let $$p$$ be the probability of the coin coming up heads. The probability that the surfer, currently at page $$i$$, will go to page $$j$$ is</p> \begin{align*} p A'_{j, i} + (1 - p) \frac{1}{n}. \end{align*} <h3 id="the-google-matrix">The Google matrix</h3> <p>In 1998, Larry Page and Sergey Brin, the founders of Google, proposed a matrix combining the solutions to these two problems. It is now widely called the <em>Google matrix</em>:</p> \begin{align*} \mathscr{G} = pA' + (1-p) \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top. \end{align*} <p>By using the power iteration on $$\mathscr{G}$$, we can find the importance scores of the pages in our collection. This is the algorithm that Google uses to rank webpages.</p> <h2 id="resources">Resources</h2> <ol> <li><a href="https://textbooks.math.gatech.edu/ila">Interactive Linear Algebra</a> by Dan Margalit and Joseph Rabinoff. Specifically Chapter 5.</li> <li><a href="http://www.mmds.org/">Mining of Massive Datasets</a> by Jure Leskovec, Anand Rajaraman, and Jeffrey D. 
Ullman. Specifically Chapter 5.</li> <li><a href="http://web.stanford.edu/class/cs224w/">CS224W - Machine Learning with Graphs</a> by Jure Leskovec, Fall 2021 edition. Specifically Lecture 4.</li> <li><a href="https://research.google/pubs/pub334.pdf">The Anatomy of a Large-Scale Hypertextual Web Search Engine</a> by Sergey Brin and Lawrence Page.</li> </ol>Josh Nguyenjoshtn@seas.upenn.eduIn this post, we will revisit a popular algorithm called PageRank, which is used by Google to rank webpages for its search engine. Surprising to some but not so to others, PageRank is simple enough that only a level of first-year undergraduate linear algebra is required to understand it.A Gentle Introduction to Computational Optimal Transport2022-10-18T00:00:00-07:002022-10-18T00:00:00-07:00https://joshnguyen.net/posts/ot-anu-aml<p>This is a lecture I gave to the <a href="/teaching/2022-AML">COMP4680/8650 Advanced Topics in Machine Learning</a> 2022S2 class at ANU.</p> <p>The slides can be found <a href="/files/ANU_OT_Slides.pdf">here</a>. The lecture recording is on <a href="https://www.youtube.com/watch?v=3YFmaoCYSlc">YouTube</a>.</p>Josh Nguyenjoshtn@seas.upenn.eduThis is a lecture I gave to the COMP4680/8650 Advanced Topics in Machine Learning 2022S2 class at ANU.Variational Bayes for Latent Dirichlet Allocation2022-08-01T00:00:00-07:002022-08-01T00:00:00-07:00https://joshnguyen.net/posts/lda<p>In this post we will learn about a widely-used topic model called Latent Dirichlet Allocation (LDA), proposed by <a href="https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf">Blei, Ng and Jordan in 2003</a>. Although research in probabilistic topic modeling has been long-standing, approaching it from a perspective of a newcomer can be quite challenging. Also, there is a lot of literature on the applications of topic models, especially LDA and in many disciplines; I therefore would need to dedicate at least a series of 10 posts to reasonably cover these applications. 
As such, I constrain myself to the following desiderata when writing this post:</p> <ul> <li>Explain what probabilistic topic modeling is, and what assumptions it makes.</li> <li>Recognize the observable and latent variables in a topic model, and specifically in LDA.</li> <li>Explain the generative process of LDA, and derive the complete probability.</li> <li>Explain what inference means in a mixture model, and why it is hard in LDA.</li> <li>Find the approximate posterior distribution of LDA using variational inference, and explain the procedure to find the optimal variational parameters.</li> <li>Explain what it means to “fit” an LDA model to a corpus, and describe how this procedure works.</li> <li>Be able to write code for an LDA model, including training and inference.</li> </ul> <h2 id="introduction">Introduction</h2> <p>Being able to describe a large collection of documents is an important task in many disciplines. This task is often called “describe the haystack,” and the idea is to find the common <em>themes</em> that appear in the documents. For example, given a corpus of abstracts from papers published to <a href="https://www.pnas.org/">PNAS</a>, can we find the common scientific topics (such as “cellular biology,” “genetics” or “evolution”) that are covered in these abstracts? Another example is when you collect many tweets in a specific period, and want to find out what common topics people tweet about during this period, in the hope of predicting what topics will be trending in the near future. To help us approach this, there are three discussions worth noting here.</p> <p>First, identifying topics by manually reading a collection of documents is probably the best way to characterize its themes, but the sheer size of a corpus makes this impossible; we are looking at tens of thousands of abstracts, hundreds of thousands of Reddit posts, millions of Wikipedia articles, and tens of millions of tweets. 
Coming up with a way in which a computer can help us <em>automatically</em> identify the topics is much more desirable.</p> <p>Second, what do we mean by <em>topics</em>, or themes? Put simply, a topic is a probability distribution over the vocabulary. For example, a topic about natural language processing is a distribution, with (much) higher probabilities for words such as “machine,” “token,” “vector” and “likelihood” than for words such as “mechanic,” “torts,” “cell” and “chemical.” Typically, we describe a topic by a list of most-likely words of size, say, 10 or 15. A human can look at this list and give the topic a representative name if necessary.</p> <p>Third, it is quite evident that a document is rarely exclusively about one topic. (Well, this depends on how fine-grained you define each topic to be, but note that the more fine-grained, the harder it is to generalize.) In fact, we often associate a document with a <em>mixture</em> of topics, perhaps with a higher weight to some than others. For example, a research paper in machine learning can be a mixture of topics such as optimization, statistics, statistical physics, and so on, and a human reader can probably tell which topic is weighed higher than others after reading the paper. A solution to modeling this is to have a probability distribution over topics, given a document.</p> <h3 id="probabilistic-topic-models">Probabilistic topic models</h3> <p>The two types of probability distributions described above are the main ingredients of probabilistic topic models such as LDA. If we are able to model them, we can do many useful things. First, using the topic-word distributions allows us to characterize the topics present in a corpus, thereby summarizing it in a meaningful way. And using the document-topic distributions allows us to draw inference on the topics that a document is about, also helping with summarization. 
The applications of these models are quite boundless, which is why they are so popular in many fields such as computational social science, psychology, cognitive science, and so on.</p> <p>However, in order to use them correctly and to identify their pros and cons when making modeling decisions, one should not stop at only calling <code class="language-plaintext highlighter-rouge">sklearn.decomposition.LatentDirichletAllocation</code> arbitrarily, but should be able to understand the model, its assumptions, and how to tune its hyperparameters. To demonstrate this, let us dive into the details of the model.</p> <h2 id="latent-dirichlet-allocation">Latent Dirichlet Allocation</h2> <p>A probabilistic topic model, LDA still remains one of the most popular choices for topic modeling today. It is an example of a <em>mixture model</em> whose structure contains two types of random variables:</p> <ul> <li>The <em>observable variables</em> are the words you observe in each document.</li> <li>The <em>latent variables</em> are those you do not observe, but which describe some internal <em>structure</em> of your data, in particular, the “topics”.</li> </ul> <p>You can readily see the assumption here, which is that there <em>is</em> some internal structure to your data, and our job is to model that structure using the latent variables.</p> <h3 id="generative-process">Generative process</h3> <p>In specifying a mixture model like LDA, we need to describe how data can be generated using this model. Before we do that, let us set up the notation carefully. Note that in this blog post, I have chosen the notation used in Hoffman, Blei and Bach’s <a href="https://papers.nips.cc/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf">paper on online learning for LDA</a>. 
This post is intended to follow the “batch variational Bayes” part of the paper, with some more detail to help you read more easily.</p> <p>Suppose we have a collection of $D$ documents, where each document $d$ is of length $N_d$. Also suppose that we have a fixed vocabulary of $W$ words. We wish to discover $K$ topics in this collection, where each topic $k$ is specified by the probability distribution $\beta_k$ over all words. The generative process works as follows. For document $d$, sample a probability distribution $\theta_d$ over the topics $1, \ldots, K$. For each word position $i$ in document $d$, sample a topic $z_{di}$ from the distribution $\theta_d$. With the chosen topic $z_{di}$, sample a word $w_{di}$ from the probability distribution $\beta_{z_{di}}$. In other words,</p> <ul> <li>Draw a topic-word distribution $\beta_k \sim \text{Dir}(\eta)$ for $k = 1, \ldots, K$.</li> <li>For each document $d = 1, \ldots, D$: <ul> <li>Draw document-topic distribution for document $d$: $\theta_d \sim \text{Dir}(\alpha)$.</li> <li>For each word $i$ in document $d$: <ul> <li>Draw a topic $z_{di} \sim \theta_d$.</li> <li>Draw a word $w_{di} \sim \beta_{z_{di}}$.</li> </ul> </li> </ul> </li> </ul> <p>The notation is summarized in the following table.</p> <table> <thead> <tr> <th style="text-align: center">Notation</th> <th style="text-align: center">Dimensionality</th> <th style="text-align: left">Meaning</th> <th style="text-align: left">Notes</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">$D$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Number of documents</td> <td style="text-align: left">Positive integer</td> </tr> <tr> <td style="text-align: center">$W$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Number of words in the vocabulary</td> <td style="text-align: left">Positive integer</td> </tr> <tr> <td style="text-align: center">$K$</td> <td style="text-align: center">Scalar</td> <td style="text-align: 
left">Number of topics</td> <td style="text-align: left">Positive integer, typically much smaller than $D$</td> </tr> <tr> <td style="text-align: center">$N_d$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Number of words in document $d$</td> <td style="text-align: left">Positive integer</td> </tr> <tr> <td style="text-align: center">$\beta_k$</td> <td style="text-align: center">$W$</td> <td style="text-align: left">Word distribution for topic $k$</td> <td style="text-align: left">$\beta_k$ ($k = 1, \ldots, K$) are mutually independent. Each $\beta_k$ is a non-negative vector and $\sum_{w=1}^{W} \beta_{kw} = 1$.</td> </tr> <tr> <td style="text-align: center">$\eta$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Dirichlet prior parameter for $\beta_k$</td> <td style="text-align: left">All $\beta_k$ share the same parameter $\eta$.</td> </tr> <tr> <td style="text-align: center">$\theta_d$</td> <td style="text-align: center">$K$</td> <td style="text-align: left">Topic distribution for document $d$</td> <td style="text-align: left">$\theta_d$ ($d = 1, \ldots, D$) are mutually independent. 
Each $\theta_d$ is a non-negative vector and $\sum_{k=1}^{K} \theta_{dk} = 1$.</td> </tr> <tr> <td style="text-align: center">$\alpha$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Dirichlet prior parameter for $\theta_d$</td> <td style="text-align: left">All $\theta_d$ share the same parameter $\alpha$.</td> </tr> <tr> <td style="text-align: center">$w_{di}$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Word $i$ in document $d$</td> <td style="text-align: left">$w_{di} \in \{1, 2, \ldots, W\}$</td> </tr> <tr> <td style="text-align: center">$z_{di}$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Topic assignment for word $w_{di}$</td> <td style="text-align: left">$z_{di} \in \{1, 2, \ldots, K\}$</td> </tr> </tbody> </table> <h3 id="complete-model">Complete model</h3> <p>The types of variables should be clear to us now. The only observables we have are $w$, the words in the documents. On the other hand, the latent variables are $z$, $\theta$ and $\beta$. The generative process allows us to specify the complete model (i.e., the joint distribution of both observable and latent variables) as follows</p> \begin{align} p(w, z, \theta, \beta \mid \alpha, \eta) &amp; = p(\beta \mid \eta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) p(z_d \mid \theta_d) p(w_d \mid \theta_d, z_d, \beta) \label{eq:joint_prob}\\ &amp; = \prod_{k=1}^{K} p(\beta_k \mid \eta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{i=1}^{N_d} p(z_{di} \mid \theta_d) p(w_{di} \mid \theta_d, z_{di}, \beta). \nonumber \end{align} <h3 id="dirichlet-and-categorical-distributions">Dirichlet and categorical distributions</h3> <p>Note that there are two probability distributions used in this process. The first is the <a href="https://en.wikipedia.org/wiki/Dirichlet_distribution">Dirichlet</a> used to sample $\beta_k$ and $\theta_d$. 
For example, the probability of the topic distribution for document $d$ is</p> $p(\theta_d \mid \alpha) = \frac{\Gamma\left( K \alpha \right)}{\Gamma(\alpha)^K} \prod_{k=1}^K \theta_{dk}^{\alpha-1}.$ <p>The second is the <a href="https://en.wikipedia.org/wiki/Categorical_distribution">categorical distribution</a>, used to sample $z_{di}$ and $w_{di}$. For example, to find the probability of the word $w_{di}$ given all other variables, we first need to find the value of $z_{di}$. Suppose $z_{di} = 2$. Then the distribution we need to use is $\beta_2$, the word distribution of the second topic. Then the probability that $w_{di}$ equals some $w$ is</p> $p(w_{di} = w \mid z_{di} = 2, \beta, \theta_d) = \beta_{2, w},$ <p>that is, the $w$-th entry of $\beta_{2}$.</p> <h2 id="inference-approximate-inference-and-parameter-estimation">Inference, Approximate Inference and Parameter Estimation</h2> <p>Inference refers to the task of finding the probability of latent variables given observable variables. In our LDA example, the quantity we want to calculate is</p> \begin{align} p(z, \theta, \beta \mid w, \alpha, \eta) = \frac{p(w, z, \theta, \beta \mid \alpha, \eta)}{\int_{z, \theta, \beta} p(w, z, \theta, \beta \mid \alpha, \eta) dz d\theta d\beta}. \label{eq:bayes-infr} \end{align} <p>What is this quantity? Imagine you see a new document: how do you know what topics it belongs to, along with the topic weights? The probability in $\eqref{eq:bayes-infr}$ helps us do just that: use Bayes’ theorem to find the <em>posterior</em> distribution on the latent variables, enabling us to draw inference on the structure of the document.</p> <p>But there is a catch. The integral in the denominator of $\eqref{eq:bayes-infr}$, which is equal to $p(w \mid \alpha, \eta)$ and often called the <em>evidence</em>, is very hard to evaluate. This is mainly because of the coupling of the latent variables, and exactly calculating this will take exponential time. 
Instead, we will use a method called <em>variational inference</em> to approximate it.</p> <h3 id="variational-inference-vi">Variational inference (VI)</h3> <p>(To keep this blog post short enough, I will not explain the details of VI. You are encouraged to check out Chapter 10 in <a href="https://probml.github.io/pml-book/book2.html">Kevin Murphy’s textbook on probabilistic machine learning</a> for an introduction to VI.)</p> <p>Basically, the goal of VI is to approximate the distribution $p(z, \theta, \beta \mid w, \alpha, \eta)$ using a simpler distribution $q(z, \theta, \beta)$ that is “the closest” to $p$. Here “closeness” is defined by the Kullback-Leibler divergence between $q$ and $p$. In other words, we aim to solve the following optimization problem:</p> $\min_{q} \left\{ \text{KL}(q(z, \theta, \beta) \| p(z, \theta, \beta \mid w, \alpha, \eta)) = \mathbb{E}_q \left[ \log \frac{q(z, \theta, \beta)}{p(z, \theta, \beta \mid w, \alpha, \eta)} \right] \right\}.$ <h3 id="evidence-lower-bound-elbo-and-variational-bayes-vb">Evidence lower bound (ELBO) and variational Bayes (VB)</h3> <p>Interestingly, minimizing this KL divergence is equivalent to maximizing the <em>evidence lower bound</em> (ELBO) of the data, where the ELBO $\mathcal{L}(w, \phi, \gamma, \lambda)$, written in terms of the parameters $\phi$, $\gamma$ and $\lambda$ of $q$ introduced below, is defined as</p> \begin{align} \mathcal{L}(w, \phi, \gamma, \lambda) = \mathbb{E}_q\left[ \log p(w, z, \theta, \beta \mid \alpha, \eta) \right] - \mathbb{E}_q\left[ \log q(z, \theta, \beta) \right]. \label{eq:elbo:def} \end{align} <p>As the name suggests, the ELBO is a lower bound on the log-likelihood of our data. The maximum ELBO gives us the “closest” approximation to the likelihood. Check Section 10.1.2 in <a href="https://probml.github.io/pml-book/book2.html">Murphy’s textbook</a> for a full derivation.</p> <p>To “fit” the data in the Bayesian sense, we will aim to approximate the true posterior as well as possible. 
Applying VI to this task is called <em>variational Bayes</em> (VB).</p> <h3 id="choosing-variational-parameters">Choosing variational parameters</h3> <p>We have mentioned the “simpler” distribution $q(z, \theta, \beta)$ above, but what exactly is it? In using VI for LDA inference, we assume that $q(z, \theta, \beta)$ factorizes into three marginal distributions:</p> <ul> <li>$q(z_{di} = k) = \phi_{d w_{di} k}$. The dimensionality of $\phi$ is $D \times W \times K$, and $\sum_{k=1}^{K} \phi_{d w k} = 1, \forall d, w$;</li> <li>$\theta_d \sim \text{Dir}(\gamma_d)$, where $\gamma_d$ is a vector of length $K$. Note that $\gamma_d$ is <em>not</em> symmetric;</li> <li>$\beta_k \sim \text{Dir}(\lambda_k)$, where $\lambda_k$ is a vector of length $W$. Similarly, $\lambda_k$ is <em>not</em> symmetric.</li> </ul> <p>This is an application of the <em>mean-field assumption</em>, which says that variational distributions for each set of latent variables are mutually independent, allowing the joint to be factorized into marginals.</p> <p>In summary,</p> \begin{align} q(z_d, \theta_d,\beta) = q(z_d) q(\theta_d)q(\beta), \label{eq:mean_field} \end{align} <p>and we have three types of variational parameters: $\phi$ of size $D \times W \times K$; $\gamma_d$ of size $K$, for $d = 1, \ldots, D$; and $\lambda_k$ of size $W$, for $k = 1, \ldots, K$.</p> <h3 id="factorizing-elbo">Factorizing ELBO</h3> <!-- $\log p(w, z, \theta, \beta \mid \alpha, \eta) = \log p(\beta \mid \eta) + \sum_{d=1}^{D} \left[ \log p(\theta_d \mid \alpha) + \log p(z_d \mid \theta_d) + \log p(w_d \mid z_d, \theta_d, \beta) \right]$ --> <!-- $\log q(z_d, \theta_d,\beta) = \log q(z_d) + \log q(\theta_d) + \log q(\beta)$. 
--> <p>Given the complete model in $\eqref{eq:joint_prob}$ and the variational distribution in $\eqref{eq:mean_field}$, we can decompose the ELBO as follows: \begin{align} \mathcal{L}(w, \phi, \gamma, \lambda) &amp; = \sum_{d=1}^{D} \left\{ \mathbb{E}_q\left[ \log p(w_d \mid \theta_d, z_d, \beta) \right] + \mathbb{E}_q\left[ \log p(z_d \mid \theta_d) \right] + \mathbb{E}_q\left[ \log p(\theta_d \mid \alpha) \right] \right\} \nonumber \\ &amp;~~~~ - \sum_{d=1}^{D} \left\{ \mathbb{E}_q\left[ \log q(z_d) \right] + \mathbb{E}_q\left[ \log q(\theta_d) \right] \right\} \nonumber \\ &amp;~~~~ + \mathbb{E}_q\left[ \log p(\beta \mid \eta) \right] - \mathbb{E}_q\left[ \log q(\beta) \right] \nonumber \\ &amp; = \sum_{d=1}^{D} \left\{ \mathbb{E}_q\left[ \log p(w_d \mid \theta_d, z_d, \beta) \right] + \mathbb{E}_q\left[ \log p(z_d \mid \theta_d) \right] - \mathbb{E}_q\left[ \log q(z_d) \right] \right. \nonumber\\ &amp;\quad \quad \quad ~ +\left.\mathbb{E}_q\left[ \log p(\theta_d \mid \alpha) \right] - \mathbb{E}_q\left[ \log q(\theta_d) \right] \right\} \nonumber \\ &amp; ~~~~ + (\mathbb{E}_q\left[ \log p(\beta \mid \eta) \right] - \mathbb{E}_q\left[ \log q(\beta) \right]). \label{eq:elbo} \end{align}</p> <h3 id="elbo-as-a-function-of-variational-parameters">ELBO as a function of variational parameters</h3> <p>We now analyze each term in the sum. The first term is \begin{align} \mathbb{E}_q\left[ \log p(w_d \mid \theta_d, z_d, \beta) \right] &amp; = \sum_{i=1}^{N_d} \mathbb{E}_q\left[ \log p(w_{di} \mid \theta_d, z_{di}, \beta) \right] \nonumber \\ &amp; = \sum_{i=1}^{N_d} \sum_{k=1}^{K} q(z_{di} = k) \mathbb{E}_q\left[ \log p(w_{di} \mid \theta_d, z_{di} = k, \beta) \right] \nonumber \\ &amp; = \sum_{i=1}^{N_d} \sum_{k=1}^{K} \phi_{d w_{di} k} \mathbb{E}_q\left[ \log \beta_{k w_{di}} \right], \nonumber \end{align}</p> <p>where the expectation on the last row is with respect to $q(\beta_k)$. 
We can see that in this formula, the contribution of each word $w$ to the term is $\sum_{k=1}^{K} \phi_{d w k} \mathbb{E} \left[ \log \beta_{k w} \right]$, which is the same regardless of the position of word $w$ in document $d$. Therefore, we can simply count the number of times $w$ appears in $d$, and then multiply it by this contribution to get the contribution of all occurrences of $w$. This gives us the equivalent expression: \begin{align} \mathbb{E}_q\left[ \log p(w_d \mid \theta_d, z_d, \beta) \right] = \sum_{w=1}^{W} n_{dw} \sum_{k=1}^{K} \phi_{d w k} \mathbb{E}_q\left[ \log \beta_{k w} \right], \label{eq:elbo:1} \end{align}</p> <p>where $n_{dw}$ is the number of occurrences of word $w$ in document $d$. Using the same trick, we have \begin{align} \mathbb{E}_q\left[ \log p(z_d \mid \theta_d) \right] &amp; = \sum_{w=1}^{W} n_{dw} \sum_{k=1}^{K} \phi_{d w k} \mathbb{E}_q\left[ \log \theta_{dk} \right], \text{and} \label{eq:elbo:2} \\ \mathbb{E}_q\left[ \log q(z_d) \right] &amp; = \sum_{w=1}^{W} n_{dw} \sum_{k=1}^{K} \phi_{d w k} \log \phi_{d w k}. \label{eq:elbo:3} \end{align}</p> <p>For the last two terms inside the sum, first note that $p(\theta_d \mid \alpha)$ is a Dirichlet distribution with symmetric parameter $\alpha$, i.e., $p(\theta_d \mid \alpha) = \frac{\Gamma(K \alpha)}{\Gamma(\alpha)^K} \prod_{k=1}^{K} \theta_{dk}^{\alpha-1}$. Therefore, \begin{align} \mathbb{E}_q\left[ \log p(\theta_d \mid \alpha) \right] = \log \Gamma(K \alpha) - K \log \Gamma(\alpha) + (\alpha - 1) \sum_{k=1}^{K} \mathbb{E}_q\left[ \log \theta_{dk} \right]. \label{eq:elbo:4} \end{align}</p> <p>Similarly, because $q(\theta_d)$ is a Dirichlet distribution with asymmetric parameter $\gamma_d$, we have \begin{align} \mathbb{E}_q\left[ \log q(\theta_d) \right] = \log \Gamma\left(\sum_{k=1}^{K} \gamma_{dk} \right) - \sum_{k=1}^{K} \log \Gamma(\gamma_{dk}) + \sum_{k=1}^{K} (\gamma_{dk} - 1) \mathbb{E}_q\left[ \log \theta_{dk} \right]. 
\label{eq:elbo:5} \end{align}</p> <p>For the last two terms of the ELBO, note that $p(\beta_k \mid \eta)$ is also Dirichlet, with symmetric parameter $\eta$. Therefore, \begin{align} \mathbb{E}_q\left[ \log p(\beta \mid \eta) \right] &amp;= \sum_{k=1}^{K} \mathbb{E}_q\left[ \log p(\beta_k \mid \eta) \right] \nonumber \\ &amp;= K [\log \Gamma(W \eta) - W \log \Gamma(\eta)] + \sum_{k=1}^{K} \sum_{w=1}^{W} (\eta - 1) \mathbb{E}_q\left[ \log \beta_{k w} \right]. \label{eq:elbo:6} \end{align}</p> <p>Similarly, the final term is \begin{align} \mathbb{E}_q\left[ \log q(\beta) \right] &amp;= \sum_{k=1}^{K} \mathbb{E}_q\left[ \log q(\beta_k) \right] \nonumber \\ &amp;= \sum_{k=1}^{K} \left( \log \Gamma \left( \sum_{w=1}^{W} \lambda_{kw} \right) - \sum_{w=1}^{W} \log \Gamma(\lambda_{kw}) + \sum_{w=1}^{W} (\lambda_{kw} - 1) \mathbb{E}_q\left[ \log \beta_{k w} \right] \right). \label{eq:elbo:7} \end{align}</p> <p>Plugging $\eqref{eq:elbo:1}, \eqref{eq:elbo:2}, \eqref{eq:elbo:3}, \eqref{eq:elbo:4}, \eqref{eq:elbo:5}, \eqref{eq:elbo:6}, \eqref{eq:elbo:7}$ into $\eqref{eq:elbo}$, we have the ELBO as a function of variational parameters:</p> \begin{align} \mathcal{L} &amp;= \sum_{d=1}^{D} \left\{ \sum_{w=1}^{W} n_{dw} \sum_{k=1}^{K} \phi_{dwk} \left( \mathbb{E}_q\left[ \log \theta_{dk} \right] + \mathbb{E}_q\left[ \log \beta_{k w} \right] - \log \phi_{dwk} \right) \right. \nonumber\\ &amp; \left. \quad \quad \quad ~ - \log \Gamma\left( \sum_{k=1}^{K} \gamma_{dk} \right) + \sum_{k=1}^{K}\left( \log \Gamma(\gamma_{dk}) + (\alpha - \gamma_{dk}) \mathbb{E}_q\left[ \log \theta_{dk} \right] \right) \right\} \nonumber \\ &amp;~~~~ + \sum_{k=1}^{K} \left( - \log \Gamma\left( \sum_{w=1}^{W} \lambda_{kw} \right) + \sum_{w=1}^{W} \left( \log \Gamma(\lambda_{kw}) + (\eta - \lambda_{kw}) \mathbb{E}_q\left[ \log \beta_{k w} \right] \right) \right) \nonumber \\ &amp;~~~~ + D [\log \Gamma(K \alpha) - K \log \Gamma(\alpha)] + K [\log \Gamma(W \eta) - W \log \Gamma(\eta)]. 
\label{eq:elbo:var} \end{align} <h2 id="variational-bayes-for-lda">Variational Bayes for LDA</h2> <p>The main objective here is to maximize the ELBO $\mathcal{L}$ with respect to the variational parameters $\phi$, $\gamma$ and $\lambda$. To do so, we will use a procedure called <em>coordinate ascent</em>, in which we maximize $\mathcal{L}$ with respect to one set of parameters, keeping the others fixed. We then move on to another set of parameters, again keeping the others fixed, and so on. In our LDA example, we first keep $\gamma$ and $\lambda$ fixed, and maximize $\mathcal{L}$ as a function of $\phi$ only. Then we do the same for $\gamma$ and $\lambda$.</p> <h3 id="maximizing-with-respect-to-phi">Maximizing with respect to $\phi$</h3> <p>Only keeping the terms involving $\phi_{dwk}$ in $\eqref{eq:elbo:var}$, and treating everything else as constants, we have the objective function w.r.t. $\phi_{dwk}$ as</p> $\mathcal{L}_{[\phi_{dwk}]} = n_{dw} \phi_{dwk} \left( \mathbb{E}_q\left[ \log \theta_{dk} \right] + \mathbb{E}_q\left[ \log \beta_{k w} \right] - \log \phi_{dwk} \right) + \text{const},$ <p>which gives the gradient:</p> $\frac{\partial \mathcal{L}}{\partial \phi_{dwk}} = n_{dw} \left( \mathbb{E}_q\left[ \log \theta_{dk} \right] + \mathbb{E}_q\left[ \log \beta_{k w} \right] - \log \phi_{dwk} - 1 \right).$ <p>Setting the gradient to zero and solving for $\phi_{dwk}$, we get the update rule for $\phi_{dwk}$:</p> \begin{align} \phi_{dwk} \propto \exp \left\{ \mathbb{E}_q\left[ \log \theta_{dk} \right] + \mathbb{E}_q\left[ \log \beta_{k w} \right] \right\}, \label{eq:update:phi} \end{align} <p>where we have suppressed all multiplicative constants by using $\propto$. After this update for all $\phi_{dwk}$, we can simply rescale them so that $\sum_{k=1}^{K} \phi_{dwk} = 1, \forall d, w$.</p> <p>The final thing to handle is the expectations inside $\exp$. How do we calculate them exactly? 
Luckily, both of them can be calculated using the <a href="https://en.wikipedia.org/wiki/Digamma_function"><em>digamma function</em></a> $\Psi$ (the first derivative of the logarithm of the gamma function) as follows:</p> \begin{align*} \mathbb{E}_q\left[ \log \theta_{dk} \right] &amp; = \Psi(\gamma_{dk}) - \Psi\left(\sum_{i=1}^{K} \gamma_{di}\right), \\ \mathbb{E}_q\left[ \log \beta_{k w} \right] &amp; = \Psi(\lambda_{kw}) - \Psi\left(\sum_{i=1}^{W} \lambda_{ki}\right). \end{align*} <h3 id="maximizing-with-respect-to-gamma">Maximizing with respect to $\gamma$</h3> <p>Similarly, the objective function w.r.t. $\gamma_{dk}$ is</p> \begin{align*} \mathcal{L}_{[\gamma_{dk}]} &amp; = \sum_{w=1}^{W} n_{dw} \phi_{dwk} \mathbb{E}_q \left[ \log \theta_{dk} \right] - \log \Gamma\left( \sum_{i=1}^{K} \gamma_{di} \right) \\ &amp; ~~~~+ \log \Gamma(\gamma_{dk}) + (\alpha - \gamma_{dk}) \mathbb{E}_q \left[ \log \theta_{dk} \right] + \text{const} \\ &amp; = \left( \alpha + \sum_{w=1}^{W} n_{dw} \phi_{dwk} - \gamma_{dk} \right) \left( \Psi(\gamma_{dk}) - \Psi\left(\sum_{i=1}^{K} \gamma_{di}\right) \right) \\ &amp; ~~~~ - \log \Gamma\left( \sum_{i=1}^{K} \gamma_{di} \right) + \log \Gamma(\gamma_{dk}) + \text{const}, \end{align*} <p>where we have used the digamma function $\Psi$ as in the previous section. A simple manipulation gives the gradient:</p> \begin{align*} \frac{\partial \mathcal{L}}{\partial \gamma_{dk}} = \left( \Psi'(\gamma_{dk}) - \Psi'\left(\sum_{i=1}^{K} \gamma_{di}\right) \right) \left( \alpha + \sum_{w=1}^{W} n_{dw} \phi_{dwk} - \gamma_{dk} \right). \end{align*} <p>Setting this gradient to zero and solving for $\gamma_{dk}$, we get the update rule for $\gamma_{dk}$:</p> \begin{align} \gamma_{dk} = \alpha + \sum_{w=1}^{W} n_{dw} \phi_{dwk}. \label{eq:update:gamma} \end{align} <p>The variational Bayes estimate of $\gamma$ has an intuitive explanation. 
The number of times document $d$ is assigned to topic $k$ is the weighted sum of the times each word in $d$ is assigned to topic $k$, where the weight $\phi_{dwk}$ is the probability that word $w$ in document $d$ belongs to topic $k$, plus the Dirichlet prior $\alpha$.</p> <h3 id="maximizing-with-respect-to-lambda">Maximizing with respect to $\lambda$</h3> <p>Similar to $\gamma$, we can use the digamma function $\Psi$ in the objective function w.r.t. $\lambda_{kw}$ as follows</p> \begin{align*} \mathcal{L}_{[\lambda_{kw}]} &amp; = \left( \eta + \sum_{d=1}^{D} n_{dw} \phi_{dwk} - \lambda_{kw} \right) \left( \Psi(\lambda_{kw}) - \Psi\left(\sum_{i=1}^{W} \lambda_{ki} \right) \right) \\ &amp; ~~~~ - \log \Gamma\left(\sum_{i=1}^{W} \lambda_{ki} \right) + \log \Gamma(\lambda_{kw}) + \text{const}, \end{align*} <p>which gives the gradient:</p> \begin{align*} \frac{\partial \mathcal{L}}{\partial \lambda_{kw}} = \left( \Psi'(\lambda_{kw}) - \Psi'\left(\sum_{i=1}^{W} \lambda_{ki} \right) \right) \left( \eta + \sum_{d=1}^{D} n_{dw} \phi_{dwk} - \lambda_{kw} \right). \end{align*} <p>Setting the gradient to zero and solving for $\lambda_{kw}$, we get the update rule:</p> \begin{align} \lambda_{kw} = \eta + \sum_{d=1}^{D} n_{dw} \phi_{dwk}. \label{eq:update:lambda} \end{align} <p>Similar to $\gamma_{dk}$, the variational Bayes estimate of $\lambda$ has an intuitive explanation. The count of word $w$ in topic $k$ is the weighted sum of the word counts for $w$ in each document $d$, where the weight $\phi_{dwk}$ is the probability that word $w$ in document $d$ belongs to topic $k$, plus the Dirichlet prior $\eta$.</p> <h3 id="putting-everything-together">Putting everything together</h3> <p>We have shown the update rules for the variational parameters: $\phi_{dwk}$ in $\eqref{eq:update:phi}$, $\gamma_{dk}$ in $\eqref{eq:update:gamma}$, and $\lambda_{kw}$ in $\eqref{eq:update:lambda}$. The variational Bayes algorithm is complete. 
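</p> <p>As a rough illustration of how the three update rules fit together, here is a minimal NumPy sketch of batch coordinate ascent. It is not the paper's Algorithm 1: the function names <code class="language-plaintext highlighter-rouge">digamma</code> and <code class="language-plaintext highlighter-rouge">fit_lda_vb</code>, the default hyperparameters, the random initialization and the fixed iteration count are all assumptions made for this example. The digamma function is approximated with a standard recurrence followed by an asymptotic series, so that the code depends only on NumPy.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np


def digamma(x):
    """Approximate digamma for positive x using NumPy only.

    Uses psi(x) = psi(x + 6) - sum_{j=0}^{5} 1 / (x + j), then the
    asymptotic series for psi at the shifted argument.
    """
    x = np.asarray(x, dtype=float)
    shift = sum(1.0 / (x + j) for j in range(6))
    y = x + 6.0
    series = (np.log(y) - 1.0 / (2.0 * y) - 1.0 / (12.0 * y**2)
              + 1.0 / (120.0 * y**4) - 1.0 / (252.0 * y**6))
    return series - shift


def fit_lda_vb(n, K, alpha=0.1, eta=0.01, n_iters=50, seed=0):
    """Coordinate ascent on (phi, gamma, lambda).

    n is the D x W matrix of word counts n_dw. All defaults are
    illustrative assumptions, not values from the post or the paper.
    """
    rng = np.random.default_rng(seed)
    D, W = n.shape
    gamma = rng.gamma(100.0, 0.01, size=(D, K))  # D x K
    lam = rng.gamma(100.0, 0.01, size=(K, W))    # K x W
    for _ in range(n_iters):
        # E_q[log theta_dk] and E_q[log beta_kw] from the digamma identities.
        e_log_theta = digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))
        e_log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
        # phi_dwk is proportional to exp(E[log theta_dk] + E[log beta_kw]).
        log_phi = e_log_theta[:, None, :] + e_log_beta.T[None, :, :]  # D x W x K
        log_phi = log_phi - log_phi.max(axis=2, keepdims=True)  # for stability
        phi = np.exp(log_phi)
        phi = phi / phi.sum(axis=2, keepdims=True)  # rescale so sum_k phi_dwk = 1
        # gamma_dk = alpha + sum_w n_dw phi_dwk
        weighted = n[:, :, None] * phi  # D x W x K
        gamma = alpha + weighted.sum(axis=1)
        # lambda_kw = eta + sum_d n_dw phi_dwk
        lam = eta + weighted.sum(axis=0).T
    return phi, gamma, lam


counts = np.array([[4.0, 2.0, 0.0, 0.0],
                   [0.0, 0.0, 3.0, 5.0],
                   [2.0, 1.0, 1.0, 0.0]])
phi, gamma, lam = fit_lda_vb(counts, K=2)
print(gamma.sum(axis=1))  # row d equals K * alpha + number of words in document d
</code></pre></div></div> <p>The printed row sums give a quick sanity check: because $\sum_{k} \phi_{dwk} = 1$, the update for $\gamma$ implies $\sum_{k} \gamma_{dk} = K \alpha + \sum_{w} n_{dw}$. The sketch stores $\phi$ as a dense $D \times W \times K$ array, which is only reasonable for toy corpora.</p> <p>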
There is one final thing to note, taken from Section 2.1 of the <a href="https://papers.nips.cc/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf">original paper</a>.</p> <p>We can actually partition these updates into two steps, analogous to the two steps in the EM algorithm. In the “E”-step, we keep updating $\gamma$ and $\phi$ until convergence, keeping $\lambda$ fixed. In the “M”-step, we update $\lambda$, holding $\gamma$ and $\phi$ fixed.</p> <p>Now you can understand the paper’s <a href="https://papers.nips.cc/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf">Algorithm 1</a> fully and can start implementing it in your favorite language.</p>Josh Nguyenjoshtn@seas.upenn.eduIn this post we will learn about a widely-used topic model called Latent Dirichlet Allocation (LDA), proposed by Blei, Ng and Jordan in 2003. Although research in probabilistic topic modeling has been long-standing, approaching it from a perspective of a newcomer can be quite challenging. Also, there is a lot of literature on the applications of topic models, especially LDA and in many disciplines; I therefore would need to dedicate at least a series of 10 posts to reasonably cover these applications. As such, I constrain myself to the following desiderata when writing this post: Explain what probabilistic topic modeling is, and what assumptions it makes. Recognize the observable and latent variables in a topic model, and specifically in LDA. Explain the generative process of LDA, and derive the complete probability. Explain what inference means in a mixture model, and why it is hard in LDA. Find the approximate posterior distribution of LDA using variational inference, and explain the procedure to find the optimal variational parameters. Explain what it means to “fit” an LDA model to a corpus, and describe how this procedure works. 
Be able to write code for an LDA model, including training and inference.Anderson Acceleration for Fixed-Point Iteration2022-04-13T00:00:00-07:002022-04-13T00:00:00-07:00https://joshnguyen.net/posts/anderson-acceleration<p>In the first ever post on my website (yay!), I will introduce you to the Anderson acceleration method in fixed-point iteration. It accompanies our paper <a href="https://joshnguyen.net/publication/2021-08-FedAA"><em>“Accelerating Federated Edge Learning”</em></a>. The code can be found in <a href="https://github.com/joshnguyen99/anderson_acceleration">this repository</a>.</p> <h2 id="fixed-point-iteration">Fixed-Point Iteration</h2> <p>Let $g: \mathbb{R}^d \rightarrow \mathbb{R}^d$ be an affine function of the form $g(x) = Ax + b$, where $A \in \mathbb{R}^{d \times d}$ and $b \in \mathbb{R}^d$. We would like to find a <em>fixed point</em> of $g$, which is a vector $x^\ast$ such that $g(x^\ast) = x^\ast$. It is called a fixed point because applying $g$ to $x^\ast$ leaves it unchanged.</p> <p>The analytical solution to this problem is $x^\ast = -(A - I)^{-1} b$, but there are several issues with this. First, $A - I$ may not be invertible, in which case we need to use least squares to find its pseudoinverse. Second, even if it is invertible, the cost of solving for $x^\ast$ is $O(d^3)$, where $d$ is the dimensionality, which is very costly in high dimensions.</p> <p>The common numerical method to solve for a fixed point of $g$ is the <em>fixed-point iteration</em>. Start with a randomly chosen $x_0$ and iteratively apply $g$ to it:</p> $\label{eqn:fixed_point} x_{t+1} = g(x_t),$ <p>until $\lVert g(x_{t+1}) - x_{t+1} \rVert \lt \epsilon$ for some predetermined precision $\epsilon$. In order for this to converge, we want to ensure that $g$ is a contraction mapping, that is, there exists an $L \in [0, 1)$ such that $\forall x, x' \in \mathbb{R}^d, \lVert g(x) - g(x') \rVert \leq L \lVert x - x' \rVert$. 
This can be achieved when the <em>spectral radius</em> of $A$ is less than $1$.</p> <p>We can prove that to achieve a precision of $\epsilon$, we need to apply $O\left(\kappa \log \frac{1}{\epsilon} \right)$ iterations, where $\kappa$ is the <em>condition number</em> of $A$, which is the ratio between $A$’s largest and smallest singular values.</p> <h2 id="anderson-acceleration">Anderson Acceleration</h2> <p>Fixed-point iteration could converge very slowly. The reason is that the condition number of $A$ could be large. (In real datasets, $\kappa$ could be greater than $10^6$.) Anderson acceleration (AA) can speed up convergence considerably. Here’s how it works.</p> <p>Define $f_t = g(x_t) - x_t$ to be the <em>residual</em> at iteration $t$. To find $x_{t+1}$, consider the space spanned by the previous $m_t+1$ iterates $\{x_{t - m_t}, x_{t - m_t + 1}, \ldots, x_t \}$, where $m_t$ is the <em>window size</em> you can choose. To find the next iterate, we consider a linear combination of these previous vectors:</p> $\label{eqn:linear_comb} \bar{x}_t = \sum_{i=0}^{m_t} \alpha_i^{(t)} x_{t - m_t + i},$ <p>and find $\alpha^{(t)} \in \mathbb{R}^{m_t + 1}$, subject to $\boldsymbol{1}^\top \alpha^{(t)} = 1$, such that $$\| g(\bar{x}_t) - \bar{x}_t \|$$ is minimized. In other words, we use the previous iterates to better guide us to the solution. 
You can check the paper for a full derivation, but the $\alpha^{(t)}$ we should choose is</p> $\label{eqn:alpha} \alpha^{(t)} = \frac{(F_t^\top F_t)^{-1}\boldsymbol{1}}{\boldsymbol{1}^\top (F_t^\top F_t)^{-1} \boldsymbol{1}},$ <p>where $F_t = \left[ f_{t- m_t},\ldots, f_{t} \right] \in \mathbb{R}^{d \times (m_t + 1)}$ is the matrix of all residuals and $\boldsymbol{1}$ is the $(m_t + 1)$-dimensional column vector of all ones.</p> <p>After finding $\alpha^{(t)}$, we set the new iterate to</p> $\label{eqn:extrapolate} x_{t+1} = \beta \sum_{i=0}^{m_t} \alpha_i^{(t)} g(x_{t - m_t + i}) + (1 - \beta) \sum_{i=0}^{m_t} \alpha_i^{(t)} x_{t - m_t + i},$ <p>where $\beta \in [0, 1]$ is a predetermined <em>mixing parameter</em>.</p> <h3 id="regularization">“Regularization”</h3> <p>You can see in the paper that in Algorithm 1, we actually set $\alpha^{(t)}$ as</p> $\label{eqn:alpha_reg} \alpha^{(t)} = \frac{(F_t^\top F_t + \lambda I)^{-1}\boldsymbol{1}}{\boldsymbol{1}^\top (F_t^\top F_t + \lambda I)^{-1} \boldsymbol{1}},$ <p>which is slightly different from \eqref{eqn:alpha}. The reason is that we want to solve the regularized version of the problem</p> $\underset{\alpha^{(t)}: \boldsymbol{1}^\top \alpha^{(t)} = 1}{\min} \| g(\bar{x}_t) - \bar{x}_t \|^2 + \lambda \| \alpha^{(t)} \|^2$ <p>for stability (Section II). Without regularization ($\lambda = 0$), we recover \eqref{eqn:alpha}.</p> <h3 id="the-algorithm">The algorithm</h3> <p>Anderson acceleration is very similar to the vanilla fixed-point iteration: start with some $x_0$. In each iteration, find $\alpha^{(t)}$ as above, and <em>extrapolate</em> from the $m_t + 1$ previous iterates to find the next iterate $x_{t+1}$. 
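</p> <p>To make the update concrete, here is a minimal NumPy sketch of the regularized weights in \eqref{eqn:alpha_reg} followed by the extrapolation in \eqref{eqn:extrapolate}. It is an illustration written for this post's formulas, not the code in the repository: the function names <code class="language-plaintext highlighter-rouge">anderson_step</code> and <code class="language-plaintext highlighter-rouge">anderson_fixed_point</code>, their arguments and the toy problem are all assumptions. A small positive <code class="language-plaintext highlighter-rouge">reg</code> is used by default because, with $d = 2$ and three stored residuals, $F_t^\top F_t$ is singular.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np


def anderson_step(xs, gs, reg=1e-10, beta=1.0):
    """One AA update from the stored iterates xs and their images gs.

    xs[i] and gs[i] = g(xs[i]) are 1-D arrays; the lists hold the last
    m_t + 1 entries. The names and defaults here are assumptions made
    for this sketch, not the API of aa.py.
    """
    X = np.stack(xs, axis=1)  # d x (m_t + 1) iterates
    G = np.stack(gs, axis=1)  # d x (m_t + 1) images under g
    F = G - X                 # residual matrix F_t
    ones = np.ones(F.shape[1])
    y = np.linalg.solve(F.T @ F + reg * np.eye(F.shape[1]), ones)
    alpha = y / y.sum()       # regularized weights, summing to one
    return beta * (G @ alpha) + (1.0 - beta) * (X @ alpha)


def anderson_fixed_point(g, x0, window_size=2, n_iters=15, **kwargs):
    """Run AA for a fixed number of iterations, keeping a sliding window."""
    xs, gs = [], []
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        xs.append(x)
        gs.append(g(x))
        xs, gs = xs[-(window_size + 1):], gs[-(window_size + 1):]
        x = anderson_step(xs, gs, **kwargs)
    return x


# Toy affine map g(x) = A x + b whose spectral radius is 0.9.
A = np.diag([0.9, 0.5])
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(np.eye(2) - A, b)
x_aa = anderson_fixed_point(lambda x: A @ x + b, np.zeros(2))
print(np.linalg.norm(x_aa - x_star))  # distance to the true fixed point
</code></pre></div></div> <p>On this toy problem the sketch should end up far closer to $x^\ast$ than the same number of plain fixed-point iterations, which only shrink the error along the slow direction by a factor of $0.9$ per step.</p> <p>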
In other words, in each iteration $t$:</p> <ul> <li>Calculate $g(x_t)$.</li> <li>Compute the residual: $f_t = g(x_t) - x_t$.</li> <li>Form the residual matrix: $F_t = \left[ f_{t- m_t},\ldots, f_{t} \right]$.</li> <li>Solve for $\alpha^{(t)}$ according to \eqref{eqn:alpha_reg}.</li> <li>Extrapolate from $m_t + 1$ previous iterates according to \eqref{eqn:extrapolate}.</li> </ul> <h2 id="python-implementation-of-aa">Python Implementation of AA</h2> <p>You can find the implementation in the <a href="https://github.com/joshnguyen99/anderson_acceleration/blob/main/aa.py">aa.py</a> file. The <code class="language-plaintext highlighter-rouge">AndersonAcceleration</code> class should be instantiated with the <code class="language-plaintext highlighter-rouge">window_size</code> ($m_t$, defaulted to $5$) and <code class="language-plaintext highlighter-rouge">reg</code> ($\lambda$, defaulted to 0). Here’s an example.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="o">&gt;&gt;&gt;</span> <span class="kn">from</span> <span class="nn">aa</span> <span class="kn">import</span> <span class="n">AndersonAcceleration</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">acc</span> <span class="o">=</span> <span class="n">AndersonAcceleration</span><span class="p">(</span><span class="n">window_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">reg</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span> <span class="c1"># some iterate
</span><span class="o">&gt;&gt;&gt;</span> <span class="n">x_acc</span> <span class="o">=</span> <span class="n">acc</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># accelerated from x
</span></code></pre></div></div> <p>You will need to apply $g$ to $x_t$ first. The result $g(x_t)$ should be the input to <code class="language-plaintext highlighter-rouge">acc.apply</code>, which will solve for $\alpha^{(t)}$ and extrapolate to find $x_{t+1}$. See <a href="https://github.com/joshnguyen99/anderson_acceleration">the repository</a> for more detail.</p> <h2 id="some-numerical-examples">Some numerical examples</h2> <h3 id="minimizing-a-convex-quadratic-objective">Minimizing a convex quadratic objective</h3> <p>We will minimize a strictly convex quadratic objective. Check <a href="https://github.com/joshnguyen99/anderson_acceleration/blob/main/quadratic_example.ipynb"><code class="language-plaintext highlighter-rouge">quadratic_example.ipynb</code></a> for more detail. The plot below shows the <em>optimality gap</em> between $f(x_t)$ and $f(x^\ast)$ over $t$. AA with a window size of 2 converges much faster than vanilla gradient descent (GD).</p> <p align="center"> <img src="https://github.com/joshnguyen99/anderson_acceleration/raw/main/AA_GD_quadratic.png" title="Comparing GD to AA on a quadratic objective with very high condition number" /> </p> <h3 id="minimizing-a-convex-non-quadratic-objective">Minimizing a convex non-quadratic objective</h3> <p>We will minimize the $\ell_2$-regularized cross entropy loss function for logistic regression. Check <a href="logistic_regression_example.ipynb"><code class="language-plaintext highlighter-rouge">logistic_regression_example.ipynb</code></a> for more detail. 
Similarly, AA is much more favorable than the vanilla GD when optimizing this objective.</p> <p align="center"> <img src="https://github.com/joshnguyen99/anderson_acceleration/raw/main/AA_GD_logistic_regression.png" title="Comparing GD to AA on a non-quadratic objective with very high condition number" /> </p>Josh Nguyenjoshtn@seas.upenn.edu