Josh Nguyen’s feed (https://joshnguyen.net/feed.xml), generated by Jekyll, 2022-08-01. Variational Bayes for Latent Dirichlet Allocation, 2022-08-01, https://joshnguyen.net/posts/lda<p>In this post we will learn about a widely-used topic model called Latent Dirichlet Allocation (LDA), proposed by <a href="https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf">Blei, Ng and Jordan in 2003</a>. Although research in probabilistic topic modeling has been long-standing, approaching it from the perspective of a newcomer can be quite challenging. Also, there is a large literature on the applications of topic models, especially LDA, across many disciplines; I would therefore need to dedicate at least a series of 10 posts to reasonably cover these applications. As such, I constrain myself to the following desiderata when writing this post, and hope that these goals are achieved after you finish reading.</p> <ul> <li>Explain what probabilistic topic modeling is, and what assumptions it makes.</li> <li>Recognize the observable and latent variables in a topic model, and specifically in LDA.</li> <li>Explain the generative process of LDA, and be able to derive the complete probability.</li> <li>Explain what inference means in a mixture model, and why it is hard in LDA.</li> <li>Find the approximate posterior distribution of LDA using variational inference, and explain the procedure to find the optimal variational parameters.</li> <li>Explain what it means to “fit” an LDA model to a corpus, and describe how this procedure works.</li> <li>Be able to write code for an LDA model, including training and inference.</li> </ul> <h2 id="introduction">Introduction</h2> <p>Being able to describe a large collection of documents is an important task in many disciplines. This task is often called “describe the haystack,” and the idea is to find the common <em>themes</em> that appear in the documents. 
For example, given a corpus of abstracts from papers published in <a href="https://www.pnas.org/">PNAS</a>, can we find the common scientific topics—such as “cellular biology”, “genetics” or “evolution”—that are covered in these abstracts? Another example is when you collect many tweets in a specific period and want to find out what common topics people tweet about during this period, in the hope of predicting what topics will be trending in the near future. To help us approach this, there are three discussions worth noting here.</p> <p>First, identifying topics by manually reading a collection of documents is probably the least efficient way to characterize its themes, and the sheer size of a corpus makes it infeasible anyway; we are looking at tens of thousands of abstracts, hundreds of thousands of Reddit posts, millions of Wikipedia articles, and tens of millions of tweets. Coming up with a way in which a computer can help us <em>automatically</em> identify the topics is much more desirable.</p> <p>Second, what do we mean by <em>topics</em>, or themes? Put simply, a topic is a probability distribution over the vocabulary. For example, a topic about natural language processing is a distribution with (much) higher probabilities for words such as “machine”, “token”, “vector” and “likelihood” than for words such as “mechanic”, “torts”, “cell” and “chemical”. Typically, we describe a topic by a list of its most likely words, of size say, 10 or 15. A human can look at this list and give the topic a representative name if necessary.</p> <p>Third, it is quite evident that a document is rarely exclusively about one topic. (Well, this depends on how fine-grained you define each topic to be, but note that the more fine-grained, the harder it is to generalize.) In fact, we often associate a document with a <em>mixture</em> of topics, perhaps with a higher weight to some than others. 
For example, a research paper in machine learning can be a mixture of topics such as optimization, statistics, statistical physics, and so on, and a human reader can probably tell which topic is weighted higher than others after reading the paper. A solution to modeling this is to have a probability distribution over topics, given a document.</p> <h3 id="probabilistic-topic-models">Probabilistic topic models</h3> <p>The two types of probability distributions described above are the main ingredients of probabilistic topic models such as LDA. If we are able to model them, we can do many useful things. First, using the topic-word distributions allows us to characterize the topics present in a corpus, thereby summarizing it in a meaningful way. Second, using the document-topic distributions allows us to draw inference on the topics that a document is about, also helping with summarization. The applications of these models are quite boundless, which is why they are so popular in many fields such as computational social science, psychology, cognitive science, and so on.</p> <p>However, in order to use these models correctly, and to identify their pros and cons so as to make good modeling decisions, one should not stop at calling <code class="language-plaintext highlighter-rouge">sklearn.decomposition.LatentDirichletAllocation</code> arbitrarily, but should understand the model, its assumptions, and how to tune its hyperparameters. To that end, let us dive into the details of the model.</p> <h2 id="latent-dirichlet-allocation">Latent Dirichlet Allocation</h2> <p>A probabilistic topic model, LDA remains one of the most popular choices for topic modeling today. 
It is an example of a <em>mixture model</em> whose structure contains two types of random variables:</p> <ul> <li>The <em>observable variables</em> are the words you observe in each document.</li> <li>The <em>latent variables</em> are those you do not observe, but which describe some internal <em>structure</em> of your data, in particular, the “topics”.</li> </ul> <p>You can readily see the assumption here, which is that there <em>is</em> some internal structure to your data, and our job is to model that structure using the latent variables.</p> <h3 id="generative-process">Generative process</h3> <p>In specifying a mixture model like LDA, we need to describe how data can be generated using this model. Before we do that, let us set up the notation carefully. Note that in this blog post, I have chosen the notation used in Hoffman, Blei and Bach’s <a href="https://papers.nips.cc/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf">paper on online learning for LDA</a>. This blog is intended to follow the “batch variational Bayes” part of the paper, with some more detail to help you read more easily.</p> <p>Suppose we have a collection of $D$ documents, where each document $d$ is of length $N_d$. Also suppose that we have a fixed vocabulary of $W$ words. We wish to discover $K$ topics in this collection, where each topic $k$ is specified by a probability distribution $\beta_k$ over all words. The generative process works as follows. For document $d$, sample a probability distribution $\theta_d$ over the topics $1, \ldots, K$. For each word $w_{di}$ in document $d$, sample a topic $z_{di}$ from the distribution $\theta_d$. With the chosen topic $z_{di}$, sample the word $w_{di}$ from the probability distribution $\beta_{z_{di}}$. 
In other words,</p> <ul> <li>Draw a topic-word distribution $\beta_k \sim \text{Dir}(\eta)$ for $k = 1, \ldots, K$.</li> <li>For each document $d = 1, \ldots, D$: <ul> <li>Draw document-topic distribution for document $d$: $\theta_d \sim \text{Dir}(\alpha)$.</li> <li>For each word $i$ in document $d$: <ul> <li>Draw a topic $z_{di} \sim \theta_d$.</li> <li>Draw a word $w_{di} \sim \beta_{z_{di}}$.</li> </ul> </li> </ul> </li> </ul> <p>The notation is summarized in the following table.</p> <table> <thead> <tr> <th style="text-align: center">Notation</th> <th style="text-align: center">Dimensionality</th> <th style="text-align: left">Meaning</th> <th style="text-align: left">Notes</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">$D$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Number of documents</td> <td style="text-align: left">Positive integer</td> </tr> <tr> <td style="text-align: center">$W$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Number of words in the vocabulary</td> <td style="text-align: left">Positive integer</td> </tr> <tr> <td style="text-align: center">$K$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Number of topics</td> <td style="text-align: left">Positive integer, typically much smaller than $D$</td> </tr> <tr> <td style="text-align: center">$N_d$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Number of words in document $d$</td> <td style="text-align: left">Positive integer</td> </tr> <tr> <td style="text-align: center">$\beta_k$</td> <td style="text-align: center">$W$</td> <td style="text-align: left">Word distribution for topic $k$</td> <td style="text-align: left">$\beta_k$ ($k = 1, \ldots, K$) are independent. 
Each $\beta_k$ is a non-negative vector and $\sum_{w=1}^{W} \beta_{kw} = 1$.</td> </tr> <tr> <td style="text-align: center">$\eta$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Dirichlet prior parameter for $\beta_k$</td> <td style="text-align: left">All $\beta_k$ share the same parameter $\eta$.</td> </tr> <tr> <td style="text-align: center">$\theta_d$</td> <td style="text-align: center">$K$</td> <td style="text-align: left">Topic distribution for document $d$</td> <td style="text-align: left">$\theta_d$ ($d = 1, \ldots, D$) are independent. Each $\theta_d$ is a non-negative vector and $\sum_{k=1}^{K} \theta_{dk} = 1$.</td> </tr> <tr> <td style="text-align: center">$\alpha$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Dirichlet prior parameter for $\theta_d$</td> <td style="text-align: left">All $\theta_d$ share the same parameter $\alpha$.</td> </tr> <tr> <td style="text-align: center">$w_{di}$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Word $i$ in document $d$</td> <td style="text-align: left">$w_{di} \in \{1, 2, \ldots, W\}$</td> </tr> <tr> <td style="text-align: center">$z_{di}$</td> <td style="text-align: center">Scalar</td> <td style="text-align: left">Topic assignment for word $w_{di}$</td> <td style="text-align: left">$z_{di} \in \{1, 2, \ldots, K\}$</td> </tr> </tbody> </table> <h3 id="complete-model">Complete model</h3> <p>The types of variables should be clear to us now. The only observables we have are $w$, the words in the documents. On the other hand, the latent variables are $z$, $\theta$ and $\beta$. 
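</p>

<p>To make the generative story concrete before we write down the joint distribution, here is a minimal sketch of it in NumPy. The corpus sizes, hyperparameter values and variable names are all assumptions made purely for illustration.</p>

```python
import numpy as np

# Assumed sizes, following the notation table above.
D, W, K = 5, 100, 3       # documents, vocabulary size, topics
alpha, eta = 0.1, 0.01    # symmetric Dirichlet parameters
rng = np.random.default_rng(0)

# Draw a topic-word distribution beta_k ~ Dir(eta) for each topic.
beta = rng.dirichlet(np.full(W, eta), size=K)   # shape (K, W)

docs = []
for d in range(D):
    N_d = int(rng.integers(20, 50))             # document length
    theta_d = rng.dirichlet(np.full(K, alpha))  # theta_d ~ Dir(alpha)
    z_d = rng.choice(K, size=N_d, p=theta_d)    # topic z_di ~ theta_d
    # Each word w_di is drawn from the topic it was assigned to.
    w_d = np.array([rng.choice(W, p=beta[z]) for z in z_d])
    docs.append(w_d)
```

<p>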
The generative process allows us to specify the complete model—i.e., the joint distribution of both observable and latent variables—as follows</p> \begin{align} p(w, z, \theta, \beta \mid \alpha, \eta) &amp; = p(\beta \mid \eta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) p(z_d \mid \theta_d) p(w_d \mid \theta_d, z_d, \beta) \label{eq:joint_prob}\\ &amp; = \prod_{k=1}^{K} p(\beta_k \mid \eta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{i=1}^{N_d} p(z_{di} \mid \theta_d) p(w_{di} \mid \theta_d, z_{di}, \beta). \nonumber \end{align} <h3 id="dirichlet-and-categorical-distributions">Dirichlet and categorical distributions</h3> <p>Note that there are two probability distributions used in this process. The first is the <a href="https://en.wikipedia.org/wiki/Dirichlet_distribution">Dirichlet</a> used to sample $\beta_k$ and $\theta_d$. For example, the probability of the topic distribution for document $d$ is</p> $p(\theta_d \mid \alpha) = \frac{\Gamma\left( K \alpha \right)}{\Gamma(\alpha)^K} \prod_{k=1}^K \theta_{dk}^{\alpha-1}.$ <p>The second is the <a href="https://en.wikipedia.org/wiki/Categorical_distribution">categorical distribution</a>, used to sample $z_{di}$ and $w_{di}$. For example, to find the probability of word $w_{di}$ given all other variables, we first need to find the value of $z_{di}$. Suppose $z_{di} = 2$. Then the distribution we need to use is $\beta_2$, or the second topic. Then the probability that $w_{di}$ equals some $w$ is</p> $p(w_{di} | z_{di} = 2, \beta, \theta_d) = \beta_{2, w},$ <p>that is, the $w$-th entry of $\beta_{2}$.</p> <h2 id="inference-approximate-inference-and-parameter-estimation">Inference, Approximate Inference and Parameter Estimation</h2> <p>Inference refers to the task of finding the probability of latent variables given observable variables. 
In our LDA example, the quantity we want to calculate is</p> \begin{align} p(z, \theta, \beta \mid w, \alpha, \eta) = \frac{p(w, z, \theta, \beta \mid \alpha, \eta)}{\int_{z, \theta, \beta} p(w, z, \theta, \beta \mid \alpha, \eta) dz d\theta d\beta}. \label{eq:bayes-infr} \end{align} <p>What is this quantity? Imagine you see a new document: how do you know what topics it belongs to, along with the topic weights? The probability in $\eqref{eq:bayes-infr}$ helps us do just that: use Bayes’ theorem to find the <em>posterior</em> distribution on the latent variables, enabling us to draw inference on the structure of the document.</p> <p>But there is a catch. The integral in the denominator of $\eqref{eq:bayes-infr}$, which is equal to $p(w \mid \alpha, \eta)$ and often called the <em>evidence</em>, is very hard to evaluate. This is mainly because of the coupling of the latent variables, and exactly calculating this would take exponential time. Instead, we will use a method called <em>variational inference</em> to approximate it.</p> <h3 id="variational-inference-vi">Variational inference (VI)</h3> <p>(To keep this blog post short enough, I will not explain the details of VI. You are encouraged to check out Chapter 10 in <a href="https://probml.github.io/pml-book/book2.html">Kevin Murphy’s textbook on probabilistic machine learning</a> for an introduction to VI.)</p> <p>Basically, the goal of VI is to approximate the distribution $p(z, \theta, \beta \mid w, \alpha, \eta)$ using a simpler distribution $q(z, \theta, \beta)$ that is “the closest” to $p$. Here “closeness” is defined by the Kullback-Leibler divergence between $q$ and $p$. 
In other words, we aim to solve the following optimization problem:</p> $\min_{q} \left\{ \text{KL}(q(z, \theta, \beta) \| p(z, \theta, \beta \mid w, \alpha, \eta)) = \mathbb{E}_q \left[ \log \frac{q(z, \theta, \beta)}{p(z, \theta, \beta \mid w, \alpha, \eta)} \right] \right\}.$ <h3 id="evidence-lower-bound-elbo-and-variational-bayes-vb">Evidence lower bound (ELBO) and variational Bayes (VB)</h3> <p>Interestingly, minimizing this KL divergence is equivalent to maximizing the <em>evidence lower bound</em> (ELBO) of the data, where the ELBO $\mathcal{L}$ is defined as</p> \begin{align} \mathcal{L}(w, \phi, \gamma, \lambda) = \mathbb{E}_q\left[ \log p(w, z, \theta, \beta \mid \alpha, \eta) \right] - \mathbb{E}_q\left[ \log q(z, \theta, \beta) \right]. \label{eq:elbo:def} \end{align} <p>As the name suggests, the ELBO is a lower bound on the log-likelihood of our data. Maximizing the ELBO gives us the “closest” approximation to the true posterior. Check Section 10.1.2 in <a href="https://probml.github.io/pml-book/book2.html">Murphy’s textbook</a> for a full derivation.</p> <p>To “fit” the data in the Bayesian sense, we will aim to approximate the true posterior as well as possible. Applying VI to this task is called <em>variational Bayes</em> (VB).</p> <h3 id="choosing-variational-parameters">Choosing variational parameters</h3> <p>We have mentioned the “simpler” distribution $q(z, \theta, \beta)$ above, but what exactly is it? In using VI for LDA inference, we assume that $q(z, \theta, \beta)$ factorizes into three marginal distributions:</p> <ul> <li>$q(z_{di} = k) = \phi_{d w_{di} k}$. The dimensionality of $\phi$ is $D \times W \times K$, and $\sum_{k=1}^{K} \phi_{d w k} = 1, \forall d, w$;</li> <li>$\theta_d \sim \text{Dir}(\gamma_d)$, where $\gamma_d$ is a vector of length $K$. Note that $\gamma_d$ is <em>not</em> symmetric;</li> <li>$\beta_k \sim \text{Dir}(\lambda_k)$, where $\lambda_k$ is a vector of length $W$. 
Similarly, $\lambda_k$ is <em>not</em> symmetric.</li> </ul> <p>This is an application of the <em>mean-field assumption</em>, which says that variational distributions for each set of latent variables are mutually independent, allowing the joint to be factorized into marginals.</p> <p>In summary,</p> \begin{align} q(z_d, \theta_d,\beta) = q(z_d) q(\theta_d)q(\beta), \label{eq:mean_field} \end{align} <p>and we have three types of variational parameters: $\phi$ of size $D \times W \times K$; $\gamma_d$ of size $K$, for $d = 1, \ldots, D$; and $\lambda_k$ of size $W$, for $k = 1, \ldots, K$.</p> <h3 id="factorizing-elbo">Factorizing ELBO</h3> <!-- $\log p(w, z, \theta, \beta \mid \alpha, \eta) = \log p(\beta \mid \eta) + \sum_{d=1}^{D} \left[ \log p(\theta_d \mid \alpha) + \log p(z_d \mid \theta_d) + \log p(w_d \mid z_d, \theta_d, \beta) \right]$ --> <!-- $\log q(z_d, \theta_d,\beta) = \log q(z_d) + \log q(\theta_d) + \log q(\beta)$. --> <p>Given the complete model in $\eqref{eq:joint_prob}$ and the variational distribution in $\eqref{eq:mean_field}$, we can decompose the ELBO as follows: \begin{align} \mathcal{L}(w, \phi, \gamma, \lambda) &amp; = \sum_{d=1}^{D} \left\{ \mathbb{E}_q\left[ \log p(w_d \mid \theta_d, z_d, \beta) \right] + \mathbb{E}_q\left[ \log p(z_d \mid \theta_d) \right] + \mathbb{E}_q\left[ \log p(\theta_d \mid \alpha) \right] \right\} \nonumber \\ &amp;~~~~ - \sum_{d=1}^{D} \left\{ \mathbb{E}_q\left[ \log q(z_d) \right] + \mathbb{E}_q\left[ \log q(\theta_d) \right] \right\} \nonumber \\ &amp;~~~~ + \mathbb{E}_q\left[ \log p(\beta \mid \eta) \right] - \mathbb{E}_q\left[ \log q(\beta) \right] \nonumber \\ &amp; = \sum_{d=1}^{D} \left\{ \mathbb{E}_q\left[ \log p(w_d \mid \theta_d, z_d, \beta) \right] + \mathbb{E}_q\left[ \log p(z_d \mid \theta_d) \right] - \mathbb{E}_q\left[ \log q(z_d) \right] \right. 
\nonumber\\ &amp;\quad \quad \quad ~ +\left.\mathbb{E}_q\left[ \log p(\theta_d \mid \alpha) \right] - \mathbb{E}_q\left[ \log q(\theta_d) \right] \right\} \nonumber \\ &amp; ~~~~ + (\mathbb{E}_q\left[ \log p(\beta \mid \eta) \right] - \mathbb{E}_q\left[ \log q(\beta) \right]). \label{eq:elbo} \\ \end{align}</p> <h3 id="elbo-as-a-function-of-variational-parameters">ELBO as a function of variational parameters</h3> <p>Let us analyze each term in the sum. \begin{align} \mathbb{E}_q\left[ \log p(w_d \mid \theta_d, z_d, \beta) \right] &amp; = \sum_{i=1}^{N_d} \mathbb{E}_q\left[ \log p(w_{di} \mid \theta_d, z_{di}, \beta) \right] \nonumber \\ &amp; = \sum_{i=1}^{N_d} \sum_{k=1}^{K} q(z_{di} = k) \mathbb{E}_q\left[ \log p(w_{di} \mid \theta_d, z_{di} = k, \beta) \right] \nonumber \\ &amp; = \sum_{i=1}^{N_d} \sum_{k=1}^{K} \phi_{d w_{di} k} \mathbb{E}_q\left[ \log \beta_{k w_{di}} \right], \nonumber \end{align}</p> <p>where the expectation on the last row is with respect to $q(\beta_k)$. We can see that in this formula, the contribution of each word $w$ to the term is $\sum_{k=1}^{K} \phi_{d w k} \mathbb{E} \left[ \log \beta_{k w} \right]$, which is the same regardless of the position of word $w$ in document $d$. Therefore, we can simply count the number of times $w$ appears in $d$, and then multiply it by this contribution to get the contribution of all occurrences of $w$. This gives us the equivalent expression: \begin{align} \mathbb{E}_q\left[ \log p(w_d \mid \theta_d, z_d, \beta) \right] = \sum_{w=1}^{W} n_{dw} \sum_{k=1}^{K} \phi_{d w k} \mathbb{E}_q\left[ \log \beta_{k w} \right], \label{eq:elbo:1} \end{align}</p> <p>where $n_{dw}$ is the number of occurrences of word $w$ in document $d$. 
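</p>

<p>As an aside, these counts are just the bag-of-words representation of the corpus. A minimal sketch of computing $n_{dw}$, assuming documents have already been tokenized into vocabulary indices:</p>

```python
import numpy as np

W = 6                               # assumed vocabulary size
docs = [[0, 2, 2, 5], [1, 1, 4]]    # toy tokenized documents

# n[d, w] = number of occurrences of word w in document d.
n = np.zeros((len(docs), W), dtype=int)
for d, doc in enumerate(docs):
    n[d] = np.bincount(doc, minlength=W)
```

<p>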
Using the same trick, we have \begin{align} \mathbb{E}_q\left[ \log p(z_d \mid \theta_d) \right] &amp; = \sum_{w=1}^{W} n_{dw} \sum_{k=1}^{K} \phi_{d w k} \mathbb{E}_q\left[ \log \theta_{dk} \right], \text{and} \label{eq:elbo:2} \\ \mathbb{E}_q\left[ \log q(z_d) \right] &amp; = \sum_{w=1}^{W} n_{dw} \sum_{k=1}^{K} \phi_{d w k} \log \phi_{d w k}. \label{eq:elbo:3} \end{align}</p> <p>For the last two terms inside the sum, first note that $p(\theta_d \mid \alpha)$ is a Dirichlet distribution with symmetric parameter $\alpha$, i.e., $p(\theta_d \mid \alpha) = \frac{\Gamma(K \alpha)}{\Gamma(\alpha)^K} \prod_{k=1}^{K} \theta_{dk}^{\alpha-1}$. Therefore, \begin{align} \mathbb{E}_q\left[ \log p(\theta_d \mid \alpha) \right] = \log \Gamma(K \alpha) - K \log \Gamma(\alpha) + (\alpha - 1) \sum_{k=1}^{K} \mathbb{E}_q\left[ \log \theta_{dk} \right]. \label{eq:elbo:4} \end{align}</p> <p>Similarly, because $q(\theta_d)$ is a Dirichlet distribution with asymmetric parameter $\gamma_d$, we have \begin{align} \mathbb{E}_q\left[ \log q(\theta_d) \right] = \log \Gamma\left(\sum_{k=1}^{K} \gamma_{dk} \right) - \sum_{k=1}^{K} \log \Gamma(\gamma_{dk}) + \sum_{k=1}^{K} (\gamma_{dk} - 1) \mathbb{E}_q\left[ \log \theta_{dk} \right]. \label{eq:elbo:5} \end{align}</p> <p>Now for the last two terms, also note that $p(\beta_k \mid \eta)$ is Dirichlet with symmetric $\eta$. Therefore, \begin{align} \mathbb{E}_q\left[ \log p(\beta \mid \eta) \right] &amp;= \sum_{k=1}^{K} \mathbb{E}_q\left[ \log p(\beta_k \mid \eta) \right] \nonumber \\ &amp;= K [\log \Gamma(W \eta) - W \log \Gamma(\eta)] + \sum_{k=1}^{K} \sum_{w=1}^{W} (\eta - 1) \mathbb{E}_q\left[ \log \beta_{k w} \right]. 
\label{eq:elbo:6} \end{align}</p> <p>Similarly, the final term is \begin{align} \mathbb{E}_q\left[ \log q(\beta) \right] &amp;= \sum_{k=1}^{K} \mathbb{E}_q\left[ \log q(\beta_k) \right] \nonumber \\ &amp;= \sum_{k=1}^{K} \left( \log \Gamma \left( \sum_{w=1}^{W} \lambda_{kw} \right) - \sum_{w=1}^{W} \log \Gamma(\lambda_{kw}) + \sum_{w=1}^{W} (\lambda_{kw} - 1) \mathbb{E}_q\left[ \log \beta_{k w} \right] \right). \label{eq:elbo:7} \end{align}</p> <p>Plugging $\eqref{eq:elbo:1}, \eqref{eq:elbo:2}, \eqref{eq:elbo:3}, \eqref{eq:elbo:4}, \eqref{eq:elbo:5}, \eqref{eq:elbo:6}, \eqref{eq:elbo:7}$ into $\eqref{eq:elbo}$, we have the ELBO as a function of variational parameters:</p> \begin{align} \mathcal{L} &amp;= \sum_{d=1}^{D} \left\{ \sum_{w=1}^{W} n_{dw} \sum_{k=1}^{K} \phi_{dwk} \left( \mathbb{E}_q\left[ \log \theta_{dk} \right] + \mathbb{E}_q\left[ \log \beta_{k w} \right] - \log \phi_{dwk} \right) \right. \nonumber\\ &amp; \left. \quad \quad \quad ~ - \log \Gamma\left( \sum_{k=1}^{K} \gamma_{dk} \right) + \sum_{k=1}^{K}\left( \log \Gamma(\gamma_{dk}) + (\alpha - \gamma_{dk}) \mathbb{E}_q\left[ \log \theta_{dk} \right] \right) \right\} \nonumber \\ &amp;~~~~ + \sum_{k=1}^{K} \left( - \log \Gamma\left( \sum_{w=1}^{W} \lambda_{kw} \right) + \sum_{w=1}^{W} \left( \log \Gamma(\lambda_{kw}) + (\eta - \lambda_{kw}) \mathbb{E}_q\left[ \log \beta_{k w} \right] \right) \right) \nonumber \\ &amp;~~~~ + D [\log \Gamma(K \alpha) - K \log \Gamma(\alpha)] + K [\log \Gamma(W \eta) - W \log \Gamma(\eta)]. \label{eq:elbo:var} \end{align} <h2 id="variational-bayes-for-lda">Variational Bayes for LDA</h2> <p>The main objective here is to maximize the ELBO $\mathcal{L}$ with respect to the variational parameters $\phi$, $\gamma$ and $\lambda$. To do so, we will use a procedure called <em>coordinate ascent</em>, in which we maximize $\mathcal{L}$ with respect to one set of parameters, keeping the others fixed. We will then alternate to another set of variables, keeping others fixed, and so on. 
In our LDA example, we first keep $\gamma$ and $\lambda$ fixed, and maximize $\mathcal{L}$ as a function of $\phi$ only. Then we do the same for $\gamma$ and $\lambda$.</p> <h3 id="maximizing-with-respect-to-phi">Maximizing with respect to $\phi$</h3> <p>Only keeping the terms involving $\phi_{dwk}$ in $\eqref{eq:elbo:var}$, and treating everything else as constants, we have the objective function w.r.t. $\phi_{dwk}$ as</p> $\mathcal{L}_{[\phi_{dwk}]} = \phi_{dwk} \left( \mathbb{E}_q\left[ \log \theta_{dk} \right] + \mathbb{E}_q\left[ \log \beta_{k w} \right] - \log \phi_{dwk} \right) + \text{const},$ <p>which gives the gradient:</p> $\frac{\partial \mathcal{L}}{\partial \phi_{dwk}} = \mathbb{E}_q\left[ \log \theta_{dk} \right] + \mathbb{E}_q\left[ \log \beta_{k w} \right] - \log \phi_{dwk} - 1.$ <p>Setting the gradient to zero and solving for $\phi_{dwk}$, we get the update rule for $\phi_{dwk}$:</p> \begin{align} \phi_{dwk} \propto \exp \left\{ \mathbb{E}_q\left[ \log \theta_{dk} \right] + \mathbb{E}_q\left[ \log \beta_{k w} \right] \right\}. \label{eq:update:phi} \end{align} <p>Here we have suppressed all multiplicative constants by using $\propto$. After this update for all $\phi_{dwk}$, we can simply rescale them so that $\sum_{k=1}^{K} \phi_{dwk} = 1, \forall d, w$.</p> <p>The final thing to handle is the expectations inside $\exp$. How do we calculate them exactly? Luckily, both of them can be calculated using the <a href="https://en.wikipedia.org/wiki/Digamma_function"><em>digamma function</em></a> $\Psi$—the first derivative of the logarithm of the gamma function—as follows:</p> \begin{align*} \mathbb{E}_q\left[ \log \theta_{dk} \right] &amp; = \Psi(\gamma_{dk}) - \Psi\left(\sum_{i=1}^{K} \gamma_{di}\right), \\ \mathbb{E}_q\left[ \log \beta_{k w} \right] &amp; = \Psi(\lambda_{kw}) - \Psi\left(\sum_{i=1}^{W} \lambda_{ki}\right). 
\end{align*} <h3 id="maximizing-with-respect-to-gamma">Maximizing with respect to $\gamma$</h3> <p>Similarly, the objective function w.r.t. $\gamma_{dk}$ is</p> \begin{align*} \mathcal{L}_{[\gamma_{dk}]} &amp; = \sum_{w=1}^{W} n_{dw} \phi_{dwk} \mathbb{E}_q \left[ \log \theta_{dk} \right] - \log \Gamma\left( \sum_{i=1}^{K} \gamma_{di} \right) \\ &amp; ~~~~+ \log \Gamma(\gamma_{dk}) + (\alpha - \gamma_{dk}) \mathbb{E}_q \left[ \log \theta_{dk} \right] + \text{const} \\ &amp; = \left( \alpha + \sum_{w=1}^{W} n_{dw} \phi_{dwk} - \gamma_{dk} \right) \left( \Psi(\gamma_{dk}) - \Psi\left(\sum_{i=1}^{K} \gamma_{di}\right) \right) \\ &amp; ~~~~ - \log \Gamma\left( \sum_{i=1}^{K} \gamma_{di} \right) + \log \Gamma(\gamma_{dk}) + \text{const}, \end{align*} <p>where we have used the digamma function $\Psi$ similarly to the previous section. A simple manipulation gives the gradient:</p> \begin{align*} \frac{\partial \mathcal{L}}{\partial \gamma_{dk}} = \left( \Psi'(\gamma_{dk}) - \Psi'\left(\sum_{i=1}^{K} \gamma_{di}\right) \right) \left( \alpha + \sum_{w=1}^{W} n_{dw} \phi_{dwk} - \gamma_{dk} \right). \end{align*} <p>Setting this gradient to zero and solving for $\gamma_{dk}$, we get the update rule for $\gamma_{dk}$:</p> \begin{align} \gamma_{dk} = \alpha + \sum_{w=1}^{W} n_{dw} \phi_{dwk}. \label{eq:update:gamma} \end{align} <p>The variational Bayes estimate of $\gamma$ has an intuitive explanation. The estimate of $\gamma_{dk}$ is the weighted number of times words in document $d$ are assigned to topic $k$, where the weight $\phi_{dwk}$ is the probability that word $w$ in document $d$ belongs to topic $k$—plus the Dirichlet prior $\alpha$.</p> <h3 id="maximizing-with-respect-to-lambda">Maximizing with respect to $\lambda$</h3> <p>Similar to $\gamma$, we can use the digamma function $\Psi$ in the objective function w.r.t. 
$\lambda_{kw}$ as follows</p> \begin{align*} \mathcal{L}_{[\lambda_{kw}]} &amp; = \left( \eta + \sum_{d=1}^{D} n_{dw} \phi_{dwk} - \lambda_{kw} \right) \left( \Psi(\lambda_{kw}) - \Psi\left(\sum_{i=1}^{W} \lambda_{ki} \right) \right) \\ &amp; ~~~~ - \log \Gamma\left(\sum_{i=1}^{W} \lambda_{ki} \right) + \log \Gamma(\lambda_{kw}) + \text{const}, \end{align*} <p>which gives the gradient:</p> \begin{align*} \frac{\partial \mathcal{L}}{\partial \lambda_{kw}} = \left( \Psi'(\lambda_{kw}) - \Psi'\left(\sum_{i=1}^{W} \lambda_{ki} \right) \right) \left( \eta + \sum_{d=1}^{D} n_{dw} \phi_{dwk} - \lambda_{kw} \right). \end{align*} <p>Setting the gradient to zero and solving for $\lambda_{kw}$, we get the update estimate:</p> \begin{align} \lambda_{kw} = \eta + \sum_{d=1}^{D} n_{dw} \phi_{dwk}. \label{eq:update:lambda} \end{align} <p>Similar to $\gamma_{dk}$, the variational Bayes estimate of $\lambda$ has an intuitive explanation. The count of word $w$ in topic $k$ is the weighted sum of the counts of $w$ in each document $d$, where the weight $\phi_{dwk}$ is the probability that word $w$ in document $d$ belongs to topic $k$—plus the Dirichlet prior $\eta$.</p> <h3 id="putting-everything-together">Putting everything together</h3> <p>We have shown the update rules for the variational parameters: $\phi_{dwk}$ in $\eqref{eq:update:phi}$, $\gamma_{dk}$ in $\eqref{eq:update:gamma}$, and $\lambda_{kw}$ in $\eqref{eq:update:lambda}$. The variational Bayes algorithm is complete. There is one final thing to note, taken from Section 2.1 of the <a href="https://papers.nips.cc/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf">original paper</a>.</p> <p>We can actually partition these updates into two steps, analogous to the two steps in the EM algorithm. In the “E”-step, we keep updating $\gamma$ and $\phi$ until convergence, keeping $\lambda$ fixed. 
In the “M”-step, we iteratively update $\lambda$, holding $\phi$ fixed.</p> <p>Now you can understand the paper’s <a href="https://papers.nips.cc/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf">Algorithm 1</a> fully and can start implementing it in your favorite language.</p>Josh NguyenAnderson Acceleration for Fixed-Point Iteration, 2022-04-13, https://joshnguyen.net/posts/anderson-acceleration<p>In the first ever post on my website (yay!), I will introduce you to the Anderson acceleration method in fixed-point iteration. 
It accompanies our paper <a href="https://joshnguyen.net/publication/2021-08-FedAA"><em>“Accelerating Federated Edge Learning”</em></a>. The code can be found in <a href="https://github.com/joshnguyen99/anderson_acceleration">this repository</a>.</p> <h2 id="fixed-point-iteration">Fixed-Point Iteration</h2> <p>Let $g: \mathbb{R}^d \rightarrow \mathbb{R}^d$ be an affine function of the form $g(x) = Ax + b$, where $A \in \mathbb{R}^{d \times d}$ and $b \in \mathbb{R}^d$. We would like to find a <em>fixed point</em> of $g$, which is a vector $x^\ast$ such that $g(x^\ast) = x^\ast$. The reason $x^\ast$ is called a fixed point is that applying $g$ to it does not change it.</p> <p>The analytical solution to this problem is $x^\ast = -(A - I)^{-1} b$, but there may be several issues with this. First, $A - I$ may not be invertible, in which case we need to resort to a least-squares solution using the pseudoinverse. Second, even if it is invertible, the cost of solving for $x^\ast$ is $O(d^3)$, where $d$ is the dimensionality, which is very costly in high dimensions.</p> <p>The common numerical method to solve for a fixed point of $g$ is <em>fixed-point iteration</em>. Start with a randomly chosen $x_0$ and iteratively apply $g$ to it:</p> $\label{eqn:fixed_point} x_{t+1} = g(x_t),$ <p>until $\lVert g(x_{t+1}) - x_{t+1} \rVert \lt \epsilon$ for some predetermined precision $\epsilon$. In order for this to converge, we want to ensure that $g$ is a contraction mapping, that is, there exists an $L \in [0, 1)$ such that $\forall x, x’ \in \mathbb{R}^d, \lVert g(x) - g(x’) \rVert \leq L \lVert x - x’ \rVert$. 
This can be achieved when the <em>spectral radius</em> of $A$ is less than $1$.</p> <p>We can prove that to achieve a precision of $\epsilon$, we need $O\left(\kappa \log \frac{1}{\epsilon} \right)$ iterations, where $\kappa$ is the <em>condition number</em> of $A$, the ratio between $A$’s largest and smallest singular values.</p> <h2 id="anderson-acceleration">Anderson Acceleration</h2> <p>Fixed-point iteration can converge very slowly when the condition number of $A$ is large. (In real datasets, $\kappa$ can exceed $10^6$.) Anderson acceleration (AA) can speed up convergence considerably. Here’s how it works.</p> <p>Define $f_t = g(x_t) - x_t$ to be the <em>residual</em> at iteration $t$. To find $x_{t+1}$, consider the space spanned by the previous $m_t+1$ iterates $\{x_{t - m_t}, x_{t - m_t + 1}, \ldots, x_t \}$, where $m_t$ is a <em>window size</em> you can choose. To find the next iterate, we consider a linear combination of these previous vectors:</p> $\label{eqn:linear_comb} \bar{x}_t = \sum_{i=0}^{m_t} \alpha_i^{(t)} x_{t - m_t + i},$ <p>and find $\alpha^{(t)} \in \mathbb{R}^{m_t + 1}$, whose entries sum to $1$, such that $$\| g(\bar{x}_t) - \bar{x}_t \|$$ is minimized. The idea is to use the previous iterates to better guide us to the solution.
You can check the paper for a full derivation, but the $\alpha^{(t)}$ we should choose is</p> $\label{eqn:alpha} \alpha^{(t)} = \frac{(F_t^\top F_t)^{-1}\boldsymbol{1}}{\boldsymbol{1}^\top (F_t^\top F_t)^{-1} \boldsymbol{1}},$ <p>where $F_t = \left[ f_{t- m_t},\ldots, f_{t} \right] \in \mathbb{R}^{d \times (m_t + 1)}$ is the matrix of all residuals and $\boldsymbol{1}$ is the $(m_t + 1)$-dimensional column vector of all ones.</p> <p>After finding $\alpha^{(t)}$, we set the new iterate to</p> $\label{eqn:extrapolate} x_{t+1} = \beta \sum_{i=0}^{m_t} \alpha_i^{(t)} g(x_{t - m_t + i}) + (1 - \beta) \sum_{i=0}^{m_t} \alpha_i^{(t)} x_{t - m_t + i},$ <p>where $\beta \in [0, 1]$ is a predetermined <em>mixing parameter</em>.</p> <h3 id="regularization">“Regularization”</h3> <p>In the paper’s Algorithm 1, we actually set $\alpha^{(t)}$ as</p> $\label{eqn:alpha_reg} \alpha^{(t)} = \frac{(F_t^\top F_t + \lambda I)^{-1}\boldsymbol{1}}{\boldsymbol{1}^\top (F_t^\top F_t + \lambda I)^{-1} \boldsymbol{1}},$ <p>which is slightly different from \eqref{eqn:alpha}. The reason is that we want to solve the regularized version of the problem</p> $\underset{\alpha^{(t)}: \boldsymbol{1}^\top \alpha^{(t)} = 1}{\min} \| g(\bar{x}_t) - \bar{x}_t \|^2 + \lambda \| \alpha^{(t)} \|^2$ <p>for stability (Section II). Without regularization ($\lambda = 0$), we recover \eqref{eqn:alpha}.</p> <h3 id="the-algorithm">The algorithm</h3> <p>Anderson acceleration is very similar to vanilla fixed-point iteration: start with some $x_0$. In each iteration, find $\alpha^{(t)}$ as above, and <em>extrapolate</em> from the $m_t + 1$ previous iterates to find the next iterate $x_{t+1}$.
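One AA update combining \eqref{eqn:alpha_reg} and \eqref{eqn:extrapolate} can be sketched in a few lines of NumPy. The function name and the column-stacked storage below are illustrative choices for this post, not the repository’s API.

```python
import numpy as np

def anderson_step(X, GX, beta=1.0, lam=0.0):
    """One Anderson extrapolation from m_t + 1 stored iterates (illustrative).

    X and GX have shape (d, m_t + 1): the columns are x_{t-m_t}, ..., x_t
    and g(x_{t-m_t}), ..., g(x_t), respectively."""
    F = GX - X                       # residual matrix F_t
    ones = np.ones(F.shape[1])
    # z solves (F_t^T F_t + lam * I) z = 1; normalizing makes alpha sum to 1.
    z = np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), ones)
    alpha = z / (ones @ z)
    # Mix the extrapolated g-values and iterates with parameter beta.
    return beta * (GX @ alpha) + (1 - beta) * (X @ alpha)

# Tiny affine example: g(x) = A x + b with spectral radius < 1.
A = np.diag([0.9, 0.5])
b = np.ones(2)
g = lambda x: A @ x + b

x0 = np.zeros(2)
x1 = g(x0)
x2 = anderson_step(np.column_stack([x0, x1]),
                   np.column_stack([g(x0), g(x1)]))

# The AA iterate has a smaller residual than one more plain application of g.
res_aa = np.linalg.norm(g(x2) - x2)
res_plain = np.linalg.norm(g(g(x1)) - g(x1))
```

Here the window holds only two iterates, yet the extrapolated point already beats a plain fixed-point step on the residual norm.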
In other words, in each iteration $t$:</p> <ul> <li>Calculate $g(x_t)$.</li> <li>Compute the residual: $f_t = g(x_t) - x_t$.</li> <li>Form the residual matrix: $F_t = \left[ f_{t- m_t},\ldots, f_{t} \right]$.</li> <li>Solve for $\alpha^{(t)}$ according to \eqref{eqn:alpha_reg}.</li> <li>Extrapolate from the $m_t + 1$ previous iterates according to \eqref{eqn:extrapolate}.</li> </ul> <h2 id="python-implementation-of-aa">Python Implementation of AA</h2> <p>You can find the implementation in the <a href="https://github.com/joshnguyen99/anderson_acceleration/blob/main/aa.py">aa.py</a> file. The <code class="language-plaintext highlighter-rouge">AndersonAcceleration</code> class should be instantiated with the <code class="language-plaintext highlighter-rouge">window_size</code> ($m_t$, defaulting to $5$) and <code class="language-plaintext highlighter-rouge">reg</code> ($\lambda$, defaulting to $0$). Here’s an example.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span> <span class="o">&gt;&gt;&gt;</span> <span class="kn">from</span> <span class="nn">aa</span> <span class="kn">import</span> <span class="n">AndersonAcceleration</span> <span class="o">&gt;&gt;&gt;</span> <span class="n">acc</span> <span class="o">=</span> <span class="n">AndersonAcceleration</span><span class="p">(</span><span class="n">window_size</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">reg</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="o">&gt;&gt;&gt;</span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span>
<span class="c1"># some iterate </span><span class="o">&gt;&gt;&gt;</span> <span class="n">x_acc</span> <span class="o">=</span> <span class="n">acc</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="c1"># accelerated from x </span></code></pre></div></div> <p>You will need to apply $g$ to $x_t$ first. The result $g(x_t)$ should be the input to <code class="language-plaintext highlighter-rouge">acc.apply</code>, which will solve for $\alpha^{(t)}$ and extrapolate to find $x_{t+1}$. See <a href="https://github.com/joshnguyen99/anderson_acceleration">the repository</a> for more detail.</p> <h2 id="some-numerical-examples">Some numerical examples</h2> <h3 id="minimizing-a-convex-quadratic-objective">Minimizing a convex quadratic objective</h3> <p>We will minimize a strictly convex quadratic objective. Check <a href="https://github.com/joshnguyen99/anderson_acceleration/blob/main/quadratic_example.ipynb"><code class="language-plaintext highlighter-rouge">quadratic_example.ipynb</code></a> for more detail. The plot below shows the <em>optimality gap</em> $f(x_t) - f(x^\ast)$ over $t$. AA with a window size of 2 converges much faster than vanilla gradient descent (GD).</p> <p align="center"> <img src="https://github.com/joshnguyen99/anderson_acceleration/raw/main/AA_GD_quadratic.png" title="Comparing GD to AA on a quadratic objective with very high condition number" /> </p> <h3 id="minimizing-a-convex-non-quadratic-objective">Minimizing a convex non-quadratic objective</h3> <p>We will minimize the $\ell_2$-regularized cross-entropy loss function for logistic regression. Check <a href="logistic_regression_example.ipynb"><code class="language-plaintext highlighter-rouge">logistic_regression_example.ipynb</code></a> for more detail.
Similarly, AA is much more favorable than the vanilla GD when optimizing this objective.</p> <p align="center"> <img src="https://github.com/joshnguyen99/anderson_acceleration/raw/main/AA_GD_logistic_regression.png" title="Comparing GD to AA on a non-quadratic objective with very high condition number" /> </p>Josh Nguyen