<p>In the first ever post on my website (yay!), I will introduce you to the Anderson acceleration method for fixed-point iteration. It accompanies our paper <a href="https://joshnguyen.net/publication/2021-08-FedAA"><em>“Accelerating Federated Edge Learning”</em></a>. The code can be found in <a href="https://github.com/joshnguyen99/anderson_acceleration">this repository</a>.</p>
<h2 id="fixed-point-iteration">Fixed-Point Iteration</h2>
<p>Let $g: \mathbb{R}^d \rightarrow \mathbb{R}^d$ be an affine function of the form $g(x) = Ax + b$, where $A \in \mathbb{R}^{d \times d}$ and $b \in \mathbb{R}^d$. We would like to find a <em>fixed point</em> of $g$, which is a vector $x^\ast$ such that $g(x^\ast) = x^\ast$. It is called a fixed point because applying $g$ to $x^\ast$ leaves it unchanged.</p>
<p>The analytical solution to this problem is $x^\ast = -(A - I)^{-1} b$, but there are two issues with it. First, $A - I$ may not be invertible, in which case we have to fall back on a least-squares solution via the pseudoinverse. Second, even when it is invertible, solving for $x^\ast$ costs $O(d^3)$, where $d$ is the dimensionality, which is prohibitive in high dimensions.</p>
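<p>To make the analytical route concrete, here is a minimal NumPy sketch (the matrix $A$, the vector $b$, and the dimension are made up for illustration). It solves $(A - I)x = -b$ directly and falls back on least squares when $A - I$ is singular.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

d = 5
rng = np.random.default_rng(0)
A = 0.5 * rng.standard_normal((d, d)) / np.sqrt(d)  # random A, scaled so its spectral radius is likely below 1
b = rng.standard_normal(d)

# Solve (A - I) x = -b for the fixed point x*.
try:
    x_star = np.linalg.solve(A - np.eye(d), -b)
except np.linalg.LinAlgError:
    # A - I is singular: use the least-squares (pseudoinverse) solution instead.
    x_star, *_ = np.linalg.lstsq(A - np.eye(d), -b, rcond=None)

print(np.allclose(A @ x_star + b, x_star))  # True: x_star is a fixed point of g
</code></pre></div></div>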
<p>The common numerical method to solve for a fixed point of $g$ is the <em>fixed-point iteration</em>. Start with a randomly chosen $x_0$ and iteratively apply $g$ to it:</p>
\[\label{eqn:fixed_point}
x_{t+1} = g(x_t),\]
<p>until $\lVert g(x_{t+1}) - x_{t+1} \rVert \lt \epsilon$ for some predetermined precision $\epsilon$. In order for this to converge, we want to ensure that $g$ is a contraction mapping, that is, there exists an $L \in [0, 1)$ such that $\forall x, x’ \in \mathbb{R}^d, \lVert g(x) - g(x’) \rVert \leq L \lVert x - x’ \rVert$. This can be achieved when the <em>spectral radius</em> of $A$ is less than $1$.</p>
<p>We can prove that to achieve a precision of $\epsilon$, we need to apply $O\left(\kappa \log \frac{1}{\epsilon} \right)$ iterations, where $\kappa$ is the <em>condition number</em> of $A$, which is the ratio between $A$’s largest and smallest singular values.</p>
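<p>For reference, here is what the vanilla iteration looks like in NumPy, continuing the sketch above (the tolerance and iteration cap are arbitrary choices):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def fixed_point_iteration(g, x0, eps=1e-8, max_iter=10_000):
    """Iterate x_{t+1} = g(x_t) until the residual norm drops below eps."""
    x = x0
    for _ in range(max_iter):
        gx = g(x)
        if np.linalg.norm(gx - x) &lt; eps:  # converged
            break
        x = gx
    return x

g = lambda x: A @ x + b                   # the affine map from above
x_fp = fixed_point_iteration(g, np.zeros(d))
</code></pre></div></div>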
<h2 id="anderson-acceleration">Anderson Acceleration</h2>
<p>Fixed-point iteration can converge very slowly when the condition number of $A$ is large. (In real datasets, $\kappa$ can exceed $10^6$.) Anderson acceleration (AA) speeds up convergence considerably. Here’s how it works.</p>
<p>Define $f_t = g(x_t) - x_t$ to be the <em>residual</em> at iteration $t$. To find $x_{t+1}$, consider the previous $m_t+1$ iterates $\{x_{t - m_t}, x_{t - m_t + 1}, \ldots, x_t \}$, where $m_t$ is a <em>window size</em> you can choose. We form a linear combination of these vectors:</p>
\[\label{eqn:linear_comb}
\bar{x}_t = \sum_{i=0}^{m_t} \alpha_i^{(t)} x_{t - m_t + i},\]
<p>and find $\alpha^{(t)} \in \mathbb{R}^{m_t + 1}$, with entries summing to $1$, such that \(\| g(\bar{x}_t) - \bar{x}_t \|\) is minimized. In other words, we use the previous iterates to better guide us toward the solution. You can check the paper for the full derivation, but the $\alpha^{(t)}$ we should choose is</p>
\[\label{eqn:alpha}
\alpha^{(t)} = \frac{(F_t^\top F_t)^{-1}\boldsymbol{1}}{\boldsymbol{1}^\top (F_t^\top F_t)^{-1} \boldsymbol{1}},\]
<p>where $F_t = \left[ f_{t- m_t},\ldots, f_{t} \right] \in \mathbb{R}^{d \times (m_t + 1)}$ is the matrix of the last $m_t + 1$ residuals and $\boldsymbol{1}$ is the $(m_t + 1)$-dimensional column vector of all ones.</p>
<p>After finding $\alpha^{(t)}$, we set the new iterate to</p>
\[\label{eqn:extrapolate}
x_{t+1} = \beta \sum_{i=0}^{m_t} \alpha_i^{(t)} g(x_{t - m_t + i}) + (1 - \beta) \sum_{i=0}^{m_t} \alpha_i^{(t)} x_{t - m_t + i},\]
<p>where $\beta \in [0, 1]$ is a predetermined <em>mixing parameter</em>.</p>
<h3 id="regularization">“Regularization”</h3>
<p>You can see in Algorithm 1 of the paper that we actually set $\alpha^{(t)}$ as</p>
\[\label{eqn:alpha_reg}
\alpha^{(t)} = \frac{(F_t^\top F_t + \lambda I)^{-1}\boldsymbol{1}}{\boldsymbol{1}^\top (F_t^\top F_t + \lambda I)^{-1} \boldsymbol{1}},\]
<p>which is slightly different from \eqref{eqn:alpha}. The reason is we want to solve the regularized version of the problem</p>
\[\underset{\alpha^{(t)}: \boldsymbol{1}^\top \alpha^{(t)} = 1}{\min} \| g(\bar{x}_t) - \bar{x}_t \|^2 + \lambda \| \alpha^{(t)} \|^2\]
<p>for stability (Section II). Without regularization ($\lambda = 0$), we recover \eqref{eqn:alpha}.</p>
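<p>Computationally, this solve is only a few lines of NumPy. Here is a sketch (not the code from the repository), where <code class="language-plaintext highlighter-rouge">F</code> is the $d \times (m_t + 1)$ residual matrix $F_t$; setting <code class="language-plaintext highlighter-rouge">reg=0.0</code> recovers \eqref{eqn:alpha}.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def solve_alpha(F, reg=0.0):
    """alpha = (F^T F + reg I)^{-1} 1 / (1^T (F^T F + reg I)^{-1} 1)."""
    k = F.shape[1]                    # k = m_t + 1
    ones = np.ones(k)
    FtF = F.T @ F + reg * np.eye(k)
    z = np.linalg.solve(FtF, ones)    # solve a linear system instead of forming the inverse
    return z / (ones @ z)
</code></pre></div></div>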
<h3 id="the-algorithm">The algorithm</h3>
<p>Anderson acceleration is very similar to the vanilla fixed-point iteration: start with some $x_0$. In each iteration, find $\alpha^{(t)}$ like above, and <em>extrapolate</em> from the $m_t + 1$ previous iterates to find the next iterate $x_{t+1}$. In other words, in each iteration $t$:</p>
<ul>
<li>Calculate $g(x_t)$.</li>
<li>Compute the residual: $f_t = g(x_t) - x_t$.</li>
<li>Form the residual matrix: $F_t = \left[ f_{t- m_t},\ldots, f_{t} \right]$.</li>
<li>Solve for $\alpha^{(t)}$ according to \eqref{eqn:alpha_reg}.</li>
<li>Extrapolate from $m_t + 1$ previous iterates according to \eqref{eqn:extrapolate}.</li>
</ul>
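<p>Putting these steps together, below is a rough sketch of the whole loop in NumPy (again, not the implementation in <code class="language-plaintext highlighter-rouge">aa.py</code>), reusing <code class="language-plaintext highlighter-rouge">solve_alpha</code> from above and keeping the histories in plain Python lists:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def anderson_iteration(g, x0, window_size=5, reg=0.0, beta=1.0,
                       eps=1e-8, max_iter=1_000):
    xs, gxs = [x0], [g(x0)]                        # histories of x_i and g(x_i)
    for t in range(max_iter):
        fs = [gx - x for x, gx in zip(xs, gxs)]    # residuals f_i = g(x_i) - x_i
        if np.linalg.norm(fs[-1]) &lt; eps:           # converged
            return xs[-1]
        F = np.stack(fs, axis=1)                   # d x (m_t + 1) residual matrix
        alpha = solve_alpha(F, reg)
        x_next = (beta * np.stack(gxs, axis=1) @ alpha
                  + (1 - beta) * np.stack(xs, axis=1) @ alpha)
        xs.append(x_next)
        gxs.append(g(x_next))
        xs, gxs = xs[-(window_size + 1):], gxs[-(window_size + 1):]  # keep m_t + 1 iterates
    return xs[-1]
</code></pre></div></div>

<p>With <code class="language-plaintext highlighter-rouge">window_size=0</code> the histories hold a single iterate and, for $\beta = 1$, the loop reduces to vanilla fixed-point iteration.</p>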
<h2 id="python-implementation-of-aa">Python Implementation of AA</h2>
<p>You can find the implementation in the <a href="https://github.com/joshnguyen99/anderson_acceleration/blob/main/aa.py">aa.py</a> file. The <code class="language-plaintext highlighter-rouge">AndersonAcceleration</code> class should be instantiated with <code class="language-plaintext highlighter-rouge">window_size</code> ($m_t$, default $5$) and <code class="language-plaintext highlighter-rouge">reg</code> ($\lambda$, default $0$). Here’s an example.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="o">>>></span> <span class="kn">from</span> <span class="nn">aa</span> <span class="kn">import</span> <span class="n">AndersonAcceleration</span>
<span class="o">>>></span> <span class="n">acc</span> <span class="o">=</span> <span class="n">AndersonAcceleration</span><span class="p">(</span><span class="n">window_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">reg</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span> <span class="c1"># some iterate
</span><span class="o">>>></span> <span class="n">x_acc</span> <span class="o">=</span> <span class="n">acc</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># accelerated from x
</span></code></pre></div></div>
<p>You will need to apply $g$ to $x_t$ first. The result $g(x_t)$ should be the input to <code class="language-plaintext highlighter-rouge">acc.apply</code>, which will solve for $\alpha^{(t)}$ and extrapolate to find $x_{t+1}$. See <a href="https://github.com/joshnguyen99/anderson_acceleration">the repository</a> for more detail.</p>
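<p>For example, a full accelerated loop built around the class might look like this (a sketch based on the description above; <code class="language-plaintext highlighter-rouge">g</code>, <code class="language-plaintext highlighter-rouge">x0</code>, the tolerance, and the iteration cap are placeholders you would supply):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>acc = AndersonAcceleration(window_size=5, reg=1e-8)
x = x0
for t in range(1_000):
    gx = g(x)                          # apply the fixed-point map first
    if np.linalg.norm(gx - x) &lt; 1e-8:  # residual small enough: stop
        break
    x = acc.apply(gx)                  # solve for alpha and extrapolate to x_{t+1}
</code></pre></div></div>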
<h2 id="some-numerical-examples">Some numerical examples</h2>
<h3 id="minimizing-a-convex-quadratic-objective">Minimizing a convex quadratic objective</h3>
<p>We will minimize a strictly convex quadratic objective. Check <a href="https://github.com/joshnguyen99/anderson_acceleration/blob/main/quadratic_example.ipynb"><code class="language-plaintext highlighter-rouge">quadratic_example.ipynb</code></a> for more detail. The plot below shows the <em>optimality gap</em> $f(x_t) - f(x^\ast)$ over $t$. AA with a window size of 2 converges much faster than vanilla gradient descent (GD).</p>
<p align="center">
<img src="https://github.com/joshnguyen99/anderson_acceleration/raw/main/AA_GD_quadratic.png" title="Comparing GD to AA on a quadratic objective with very high condition number" />
</p>
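<p>Note that gradient descent itself fits the fixed-point template: for $f(x) = \frac{1}{2} x^\top Q x - c^\top x$, the GD map is $g(x) = x - \eta \nabla f(x) = x - \eta (Qx - c)$, which is affine. Here is a small sketch of such a setup (the matrix $Q$, the vector $c$, and the step size are made up for illustration and are not taken from the notebook), reusing <code class="language-plaintext highlighter-rouge">anderson_iteration</code> from earlier:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d = 100
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
Q = U @ np.diag(np.logspace(0, 6, d)) @ U.T   # condition number around 1e6
c = rng.standard_normal(d)

eta = 1.0 / np.linalg.eigvalsh(Q).max()       # step size 1 / L
g = lambda x: x - eta * (Q @ x - c)           # GD map; its fixed point solves Qx = c
x_aa = anderson_iteration(g, np.zeros(d), window_size=2, reg=1e-10)
</code></pre></div></div>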
<h3 id="minimizing-a-convex-non-quadratic-objective">Minimizing a convex non-quadratic objective</h3>
<p>We will minimize the $\ell_2$-regularized cross-entropy loss for logistic regression. Check <a href="https://github.com/joshnguyen99/anderson_acceleration/blob/main/logistic_regression_example.ipynb"><code class="language-plaintext highlighter-rouge">logistic_regression_example.ipynb</code></a> for more detail. As in the quadratic case, AA converges much faster than vanilla GD on this objective.</p>
<p align="center">
<img src="https://github.com/joshnguyen99/anderson_acceleration/raw/main/AA_GD_logistic_regression.png" title="Comparing GD to AA on a non-quadratic objective with very high condition number" />
</p>
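<p>The same template applies here: the GD map is $g(w) = w - \eta \nabla f(w)$ with $\nabla f(w) = \frac{1}{n} X^\top (\sigma(Xw) - y) + \lambda w$, where $\sigma$ is the sigmoid function. Below is a rough sketch on synthetic data (all names and values are made up for illustration; see the notebook for the actual setup):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n, d, lam = 500, 20, 1e-2
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
y = (X @ rng.standard_normal(d) > 0).astype(float)   # synthetic 0/1 labels

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
grad = lambda w: X.T @ (sigmoid(X @ w) - y) / n + lam * w

L = np.linalg.norm(X, 2) ** 2 / (4 * n) + lam        # smoothness constant of the objective
g = lambda w: w - (1.0 / L) * grad(w)                # GD map; no longer affine, but AA still applies
w_aa = anderson_iteration(g, np.zeros(d), window_size=5, reg=1e-8)
</code></pre></div></div>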