Understanding Latent Dirichlet Allocation
[Note - This is a repost of a post I made on my old blog while I was in undergrad. I’m including it in case someone finds it useful, since my old blog is defunct. I haven’t significantly edited it, so I’m sorry if it doesn’t fit into my current style.]
This post is directed to a lay CS audience. I am an undergraduate in CS, so I consider myself part of that audience. If you’re awesome at machine learning already and don’t want to help me along here, just read the paper.
Latent Dirichlet Allocation (LDA) is a common method of topic modeling. That is, if I have a document and want to figure out if it’s a sports article or a mathematics paper, I can use LDA to build a system that looks at other sports articles or mathematics papers and automatically decides whether this unseen document’s topic is sports or math.
To LDA, a document is just a mix of topics, where each topic has some particular probability of generating each particular word. Say the word “average” appears 4 times in our potential sports article. What’s the probability of a sports topic generating that many instances of “average”? We determine this by looking at each training document as a “bag of words” pulled from distributions that were themselves drawn from Dirichlet distributions.
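To make “bag of words” concrete, here’s a minimal sketch (the sentence in it is made up for illustration): word order is thrown away and only the counts survive, which is all LDA ever looks at.

```python
from collections import Counter

# A toy document reduced to a bag of words: order disappears, counts remain.
doc = "the batter raised his average the average fan cheered average average".split()
bag = Counter(doc)

print(bag["average"])  # 4 -- the kind of count LDA actually works with
```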
Dirichlet is a distribution specified by a vector parameter \(\alpha\) containing some \(\alpha_i\) corresponding to each topic \(i\), which we write as \(\textrm{Dir}(\alpha)\). Its probability density at a topic vector \(x\) is proportional to the product over all topics \(i\) of \(x_i^{\alpha_i - 1}\), that is, \(\prod_i x_i^{\alpha_i - 1}\). \(x_i\) is the probability that the topic is \(i\), so the items in \(x\) must sum to 1. That constraint keeps you from getting arbitrarily large probabilities by plugging in arbitrarily large values of \(x\).
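If poking at this numerically helps, here’s a small sketch using NumPy and SciPy; the three-topic \(\alpha\) values are arbitrary choices of mine, not anything prescribed by the model.

```python
import numpy as np
from scipy.stats import dirichlet

# Arbitrary alpha for three topics: values above 1 pull mass toward the
# middle of the triangle, values below 1 push it out toward the corners.
alpha = [4.0, 2.0, 2.0]

# One sample is one topic mixture: non-negative entries that sum to 1.
x = np.random.dirichlet(alpha)
print(x, x.sum())  # a random point on the triangle; entries sum to 1

# The density at that point, proportional to prod_i x_i ** (alpha_i - 1).
print(dirichlet.pdf(x, alpha))
```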
Confused? Ready for a picture ripped off Wikipedia?
Those graphs all show Dirichlet distributions for three topics. That triangle at the bottom has one corner for each topic, and the closer a point on the triangle is to corner \(i\), the higher the probability of topic \(i\). The purple curve is the probability density function over the mixture of topics. See how the edges of the triangle all have density 0 in these plots? We said that the pdf is proportional to \(\prod_i x_i^{\alpha_i - 1}\), so when every \(\alpha_i\) is greater than 1 (as it is here), any mixture with an \(x_i\) of 0 gets density 0. (With \(\alpha_i\) below 1, the mass piles up along the edges instead.) That restricts our model a bit and keeps us from ever being totally certain about the topic mixture of a document.
Okay, we’ve got “Dirichlet”, so let’s pop back up to the concept of LDA. If we want to find that mixture of topics for a document, we first need to determine the value of each \(x_i\). We also have another Dirichlet distribution in our model for each topic \(i\), where instead of the corners of the triangle being topics, they’re words. Picture the topic “sports article” like those distributions above, but instead of sitting on a triangle it sits on a shape with one corner per word in the vocabulary, so it lives in as many dimensions as we have words (minus one). If “average” tends to show up in sports articles, the bump pushes closer to the corner for “average”.
The Latent part of LDA comes into play because in statistics, a variable we have to infer rather than observe directly is called a “latent variable”. We only directly observe the words, not the topics, so the per-word topic choices are latent variables (along with the distributions themselves).
LDA assumes that each document \(k\) is generated by the following process (there’s a code sketch of it right after the list):
- From our Dirichlet distribution over topic mixtures, sample a random distribution of topics for document \(k\). That is, pick a place on that triangle that is associated with a certain probability of generating each topic. If we choose a place very close to the “sports article” corner, we have a higher probability of picking “sports article”. The probability of picking a particular place on the triangle is described by the pdf of the Dirichlet distribution (the placement of the purple mound).
- For each topic, pick a distribution of words for that topic from that topic’s Dirichlet. (Strictly speaking this happens once for the whole collection rather than once per document, but the intuition is the same.)
- For each word in document \(k\),
- From the distribution of topics selected for \(k\), sample a topic, like “sports article”.
- From the distribution selected for “sports article”, pick the current word.
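Here’s that generative story as a small Python sketch. The vocabulary, topic names, and Dirichlet parameters are all invented for illustration (a real LDA implementation infers these distributions from data rather than hard-coding them), and the per-topic word distributions are drawn once up front rather than per document.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["average", "the", "bat", "theorem", "proof"]  # made-up vocabulary
topics = ["sports", "math"]                            # made-up topic names

alpha = [1.5, 1.5]          # Dirichlet parameter over topics
beta = [0.8] * len(vocab)   # Dirichlet parameter over words, per topic

# One word distribution per topic, each drawn from Dir(beta).
word_dists = {t: rng.dirichlet(beta) for t in topics}

def generate_document(num_words):
    # Sample this document's topic mixture from Dir(alpha)...
    topic_mixture = rng.dirichlet(alpha)
    words = []
    for _ in range(num_words):
        # ...then for each word slot, sample a topic from the mixture...
        topic = rng.choice(topics, p=topic_mixture)
        # ...and sample the word from that topic's word distribution.
        words.append(rng.choice(vocab, p=word_dists[topic]))
    return words

print(" ".join(generate_document(8)))
```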
So let’s say your first four words all come from baseball and your document maybe starts off “average the bat bat”. If that’s not how you tend to write, that’s okay. All models are wrong.
The important thing to understand is that your Dirichlet priors are distributions over distributions: you first draw the topic and word distributions from them, and those drawn distributions are what actually generate each word.
We’re generally not making these distributions just for the heck of it, or to actually generate documents. We want to figure out which topic our lazy, word-by-word-randomizing writer probably used for each word. Maybe it’s been a while since you took probability, but do you remember this guy? \[ P(A|B) = \frac{P(B|A) P(A)}{P(B)} \]
This is Bayes’ Theorem. We already know the probability of generating a particular word given a topic according to our model. That’s the probability of sampling that word from the topic’s word distribution. So that’s \(P(B|A)\), where \(B\) is the event of “average” being generated for the current word and \(A\) is the event of picking the topic “sports article”. \(P(A)\) is the probability of “sports article” being picked from the document’s topic distribution. \(P(B)\) is the probability of “average” being generated at all, which is the sum over all topic selections \(A\) of \(P(B|A)P(A)\). Now we can use Bayes to find \(P(A|B)\), the probability that topic \(A\) generated word \(B\).
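As a quick worked sketch of that arithmetic, with two topics and numbers I’ve made up purely for illustration:

```python
# Made-up numbers: P(word | topic) for "average" under each topic, and
# P(topic) taken from the document's topic mixture.
p_word_given_topic = {"sports": 0.03, "math": 0.01}
p_topic = {"sports": 0.7, "math": 0.3}

# P(B): total probability of generating "average" at this position.
p_word = sum(p_word_given_topic[t] * p_topic[t] for t in p_topic)

# Bayes' theorem: P(topic | word) = P(word | topic) * P(topic) / P(word).
p_topic_given_word = {t: p_word_given_topic[t] * p_topic[t] / p_word
                      for t in p_topic}
print(p_topic_given_word)  # {'sports': 0.875, 'math': 0.125}
```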
So now we know how to figure out the probability of each topic for each word. We already know that our documents are assumed to be a mix of topics, but we want to find the most likely dominant topic. The lazy writer generates each word independently of every other word, so the probability that one topic produced the whole document is the product of \(P(A|B)\) over each word \(B\). Just pick the topic with the largest product.
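Continuing the made-up numbers from the sketch above, that last step might look like this; summing logs instead of multiplying raw probabilities is just a guard against underflow on long documents.

```python
import math

# Hypothetical per-word posteriors P(topic | word) for a four-word document.
per_word = [
    {"sports": 0.875, "math": 0.125},
    {"sports": 0.60,  "math": 0.40},
    {"sports": 0.90,  "math": 0.10},
    {"sports": 0.55,  "math": 0.45},
]

# Product of the per-word posteriors, computed as a sum of logs.
scores = {
    topic: sum(math.log(word_probs[topic]) for word_probs in per_word)
    for topic in per_word[0]
}
print(max(scores, key=scores.get))  # 'sports'
```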