Understanding Latent Dirichlet Allocation

Jul 9, 2012 · Naomi Saphra

[Note - This is a repost of a post I made on my old blog while I was in undergrad. I’m including it in case someone finds it useful, since my old blog is defunct. I haven’t significantly edited it, so I’m sorry if it doesn’t fit into my current style.]

This post is directed to a lay CS audience. I am an undergraduate in CS, so I consider myself part of that audience. If you’re awesome at machine learning already and don’t want to help me along here, just read the paper.

Latent Dirichlet Allocation (LDA) is a common method of topic modeling. That is, if I have a document and want to figure out if it’s a sports article or a mathematics paper, I can use LDA to build a system that looks at other sports articles or mathematics papers and automatically decides whether this unseen document’s topic is sports or math.
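(Aside: if you just want to play with LDA before working through the math, libraries already implement it. Here's a minimal sketch using scikit-learn; the four-document toy corpus and the choice of two topics are made up purely for illustration.)

```python
# A minimal sketch of fitting LDA with scikit-learn.
# The tiny corpus and the number of topics are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the batter hit a home run and raised his batting average",
    "the pitcher threw a fastball past the batter",
    "we prove the theorem by induction on the matrix dimension",
    "the proof follows from the lemma about vector spaces",
]

# LDA works on word counts (the "bag of words" described below).
counts = CountVectorizer().fit_transform(docs)

# Ask for two topics -- hopefully roughly "sports" and "math".
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(doc_topics)  # one row per document: its inferred mixture of topics
```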

To LDA, a document is just a collection of topics where each topic has some particular probability of generating a particular word. Say that in our potential sports article, the word “average” appears 4 times. What’s the probability of a sports topic generating that many instances of “average”? We determine this by looking at each training document as a “bag of words” pulled from a distribution that is itself drawn from a Dirichlet distribution.

A Dirichlet is a distribution specified by a vector parameter $\alpha$ containing some $\alpha_i$ corresponding to each topic $i$, which we write as $\textrm{Dir}(\alpha)$. Its probability density function at a topic vector $x$ is proportional to the product over all topics $i$ of $x_i^{\alpha_i - 1}$. $x_i$ is the probability that the topic is $i$, so the items in $x$ must sum to 1. That keeps you from getting arbitrarily large densities by plugging in arbitrarily large values of $x$.
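To make that concrete, here's a small sketch using NumPy and SciPy that draws a topic vector from a Dirichlet and evaluates its density; the particular $\alpha$ values are arbitrary, chosen just for illustration.

```python
# Sketch: sampling from and evaluating a Dirichlet over 3 topics.
# The alpha values are arbitrary, chosen just for illustration.
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 4.0])   # one alpha_i per topic

x = np.random.dirichlet(alpha)      # a random topic vector; entries sum to 1
print(x, x.sum())

# The density is proportional to the product of x_i ** (alpha_i - 1).
print(dirichlet(alpha).pdf(x))
print(np.prod(x ** (alpha - 1)))    # same value up to the normalizing constant
```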

Confused? Ready for a picture ripped off Wikipedia?

Dirichlet Distributions

Those graphs all show Dirichlet distributions for three topics. That triangle at the bottom has one corner for each topic, and the closer a point on the triangle is to corner $i$, the higher the probability of topic $i$. The purple surface is the probability density function over the mixture of topics. See how the edges of the triangle all have density 0? We said that the pdf is proportional to the product of $x_i^{\alpha_i - 1}$, so if some $x_i$ is 0 (and $\alpha_i > 1$, as in these pictures), the density at that mixture of topics is 0. That restricts our model a bit and ensures that we are never totally certain about the topic of a document.

Okay, we’ve got “Dirichlet”, so let’s pop back up to the concept of LDA. If we want to find that mixture of topics for a document, we first need to determine the value of each $x_i$. We’ve also got another Dirichlet distribution in our model for each topic $i$, where instead of the corners of the triangle being topics, they’re words. Picture the topic “sports article” like those distributions above, but instead of sitting on a triangle it sits on a shape with so many corners that it goes into as many dimensions as we have words. If “average” appears in a sports article, the bump pushes closer to the corner for “average”.

The Latent part of LDA comes into play because in statistics, a variable we have to infer rather than observe directly is called a “latent variable”. We only directly observe the words and not the topics, so the topics are latent variables (along with the distributions themselves).

LDA assumes that each document $k$ is generated as follows (there’s a code sketch of this whole process right after the list):

  1. From our Dirichlet distribution, sample a random distribution of topics for document $k$. That is, pick a place on that triangle that is associated with a certain probability of generating each topic. If we choose a place very close to the “sports article” corner, we have a higher probability of picking “sports article”. The probability of picking a particular place on the triangle is described by the pdf of the Dirichlet distribution (the placement of the purple mound).
  2. For each topic, pick a distribution of words for that topic from the Dirichlet for that topic.
  3. For each word in document $k$,
    1. From the distribution of topics selected for $k$, sample a topic, like “sports article”.
    2. From the distribution selected for “sports article”, pick the current word.
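Here's the sketch promised above: a toy version of that generative story in NumPy. The vocabulary, the two topic labels, and the $\alpha$ and $\beta$ values are all made up for illustration; in real LDA the topic-word distributions are exactly the unknowns we want to infer, not things we get to peek at.

```python
# Toy sketch of LDA's generative story (all names and numbers are made up).
import numpy as np

rng = np.random.default_rng(0)

vocab = ["average", "bat", "theorem", "proof", "the"]
n_topics = 2                      # say, "sports" and "math"
alpha = np.full(n_topics, 0.5)    # parameter of the document-level Dirichlet
beta = np.full(len(vocab), 0.1)   # parameter of the topic-level Dirichlet

# Step 2: each topic gets its own distribution over words.
topic_word = rng.dirichlet(beta, size=n_topics)

def generate_document(n_words=10):
    # Step 1: sample this document's mixture of topics.
    doc_topics = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=doc_topics)        # step 3.1: pick a topic
        w = rng.choice(len(vocab), p=topic_word[z])   # step 3.2: pick a word
        words.append(vocab[w])
    return words

print(" ".join(generate_document()))
```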

So let’s say your first four words all come from baseball and your document maybe starts off “average the bat bat”. If that’s not how you tend to write, that’s okay. All models are wrong.

The important thing to understand is that your Dirichlet priors are distributions over distributions, and it’s the distributions drawn from them that are used to generate each word.

We’re generally not just making these distributions for the heck of it or to actually generate documents. We want to figure out which topic our lazy writer, who randomly generates each word, probably used for each word. Maybe it’s been a while since you took probability, but do you remember this guy?

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

This is Bayes’ Theorem. We already know the probability of generating a particular word given a topic according to our model. That’s the probability of sampling that word from the topic’s word distribution. So that’s $P(B|A)$, where $B$ is the event of “average” being generated for the current word and $A$ is the event of picking the topic “sports article”. $P(A)$ is the probability of “sports article” being picked from the document’s topic distribution. $P(B)$ is the probability of “average” being generated at all, which is the sum over all topic selections $A$ of $P(B|A)P(A)$. Now we can use Bayes to find $P(A|B)$, the probability that topic $A$ generated word $B$.
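As a quick numerical sketch of that arithmetic, with completely made-up values for the word and topic probabilities:

```python
# Made-up numbers: P(word "average" | topic) and P(topic) for one document.
p_word_given_topic = {"sports": 0.030, "math": 0.010}
p_topic = {"sports": 0.70, "math": 0.30}

# P(B): total probability of generating "average" at this position.
p_word = sum(p_word_given_topic[t] * p_topic[t] for t in p_topic)

# Bayes' theorem: P(topic | word) = P(word | topic) * P(topic) / P(word).
p_topic_given_word = {
    t: p_word_given_topic[t] * p_topic[t] / p_word for t in p_topic
}
print(p_topic_given_word)   # "sports" comes out much more likely than "math"
```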

So now we know how to figure out the probability of each topic for each word. We already know that our documents are assumed to be a mix of topics, but say we want to find the most likely dominant topic. The lazy writer generates each word independently of each other word, so the overall probability of a topic throughout the document is the product of $P(A|B)$ at each word $B$. Just pick the most likely topic across the words.
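Continuing with made-up per-word posteriors, here's a sketch of that last step; the products are computed as sums of logs so that multiplying many small probabilities doesn't underflow.

```python
import math

# Made-up per-word posteriors P(topic | word) for a four-word document.
posteriors = [
    {"sports": 0.875, "math": 0.125},   # "average"
    {"sports": 0.500, "math": 0.500},   # "the"
    {"sports": 0.900, "math": 0.100},   # "bat"
    {"sports": 0.900, "math": 0.100},   # "bat"
]

# Multiply P(topic | word) across words (as log sums), then take the argmax.
scores = {
    topic: sum(math.log(p[topic]) for p in posteriors)
    for topic in posteriors[0]
}
print(max(scores, key=scores.get))   # -> "sports"
```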