## Latent Dirichlet Allocation
16 March 2017
## Outline
1. Introducing the LDA model
2. Deriving the collapsed Gibbs sampler
3. Results
4. Discussion
## LDA Model and Motivation
1. Introduced by David Blei, Andrew Ng, and Michael Jordan in 2002
2. Topic modeling, usually with text documents
3. Useful to get a sense for which things "occur together"
## LDA Model
Consider each "document" as an unordered dictionary mapping each word to
the number of times it occurred in that document.
1. Each "topic" is a distribution over all possible words.
2. A document is then represented as a distribution over topics.
3. We can optionally learn the marginal distribution of topics.
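
The dictionary view of a document can be sketched in a few lines of Python. This is only an illustration: the helper name `bag_of_words`, the whitespace tokenization, and the example sentence are assumptions, not taken from the analysis notebook.

```python
from collections import Counter

def bag_of_words(text):
    """Map each lowercase, whitespace-separated token to its count.

    Word order is discarded, so two texts with the same words in a
    different order produce the same representation.
    """
    return Counter(text.lower().split())

# Hypothetical example text, chosen only for illustration.
doc = bag_of_words("the war the battle the country")
print(doc["the"], doc["war"])  # 3 1
```

A real pipeline would normally also strip punctuation and drop very common words before counting.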
## LDA Generative Model
1. Pick a distribution over topics.
2. For each word, randomly select a topic according to that distribution.
3. Then randomly select a word according to that topic's distribution over words.
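
The three steps above can be simulated directly. The sketch below uses only Python's standard library. The function names, the symmetric prior value `theta_2=0.5`, and the two hand-picked topics are assumptions made for illustration. They are not fitted to any data and are not the notebook's code.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_document(topic_word, theta_2, n_words, vocab, rng):
    """Generate one document by following the three generative steps."""
    n_topics = len(topic_word)
    # 1. Pick a distribution over topics for this document.
    t_d = sample_dirichlet([theta_2] * n_topics, rng)
    words = []
    for _ in range(n_words):
        # 2. Select a topic according to that distribution.
        z = rng.choices(range(n_topics), weights=t_d)[0]
        # 3. Select a word according to that topic's distribution over words.
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return t_d, words

rng = random.Random(0)
vocab = ["love", "wife", "war", "battle"]
# Two hand-picked topics (illustrative only): one row of word probabilities per topic.
topic_word = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
t_d, words = generate_document(topic_word, theta_2=0.5, n_words=8, vocab=vocab, rng=rng)
print(t_d, words)
```

Fitting the model runs this process in reverse: given only the words, infer the topics and the per-document topic proportions.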
## Statistical Model
- $t_d$ is the distribution over topics in document d.
- $w_t$ is the distribution over words for topic t.
- $z_i$ is the latent topic from which word i is drawn.
- $y_i$ is word i.
$$ t_d \sim \text{Dirichlet}(\boldsymbol{\theta}_2)$$
$$ w_t \sim \text{Dirichlet}(\boldsymbol{\theta}_1)$$
$$ z_i \sim \text{Categorical}(t_d)$$
$$ y_i \sim \text{Categorical}(w_{z_i})$$
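## Gibbs Sampling
- The joint density of the parameters, latent topics, and words is then given by:
$$ \prod_d p(t_d) \prod_t p(w_t) \prod_i p(z_i \mid t_d) \, p(y_i \mid w_{z_i}) $$
- Here $d$ in $p(z_i \mid t_d)$ denotes the document that contains word $i$.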
## Collapsed Gibbs Sampling
- We only care about $z_i$.
- $t_d$ and $w_t$ can be recovered from all the $z_i$s.
- i.e., if you know all the latent states, you know the overall proportion
of states.
- We can therefore marginalize out $t_d$ and $w_t$.
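
For reference, here is the standard form of the result of that marginalization. It assumes symmetric priors, with every entry of the two prior vectors equal to a scalar $\theta_1$ or $\theta_2$. The counts are notation introduced here, not part of the slides: $n_{d,k}$ is the number of words in document $d$ allocated to topic $k$, $n_{k,v}$ is the number of times word $v$ is allocated to topic $k$, $n_k = \sum_v n_{k,v}$, $n_d$ is the length of document $d$, $K$ is the number of topics, and $V$ is the vocabulary size. A superscript $-i$ means word $i$ is left out of the count.

$$ p(z_i = k \mid z_{-i}, y) \propto \left(n_{d,k}^{-i} + \theta_2\right) \frac{n_{k,y_i}^{-i} + \theta_1}{n_{k}^{-i} + V\theta_1} $$

Given a full set of allocations, the marginalized distributions can be estimated by their posterior means:

$$ \hat{t}_{d,k} = \frac{n_{d,k} + \theta_2}{n_d + K\theta_2}, \qquad \hat{w}_{k,v} = \frac{n_{k,v} + \theta_1}{n_k + V\theta_1} $$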
## Collapsed Gibbs Sampling
- This considerably simplifies sampling.
- We start with random topic allocations.
- Then we update each topic's distribution over words based on those allocations.
- Finally, we sample the allocations based on those updated distributions.
- Repeat.
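
The loop above can be sketched in plain Python. This is a minimal illustration and not the code from the analysis notebook. It assumes symmetric scalar priors `theta_1` and `theta_2`, documents given as lists of integer word ids, and a fixed number of sweeps. The function name and all default values are hypothetical. In this version the "update" step is implicit: the count tables are adjusted each time a single allocation changes, and each allocation is resampled with probability proportional to (document-topic count + `theta_2`) × (topic-word count + `theta_1`) / (topic total + `vocab_size` × `theta_1`).

```python
import random

def collapsed_gibbs(docs, n_topics, vocab_size,
                    theta_1=0.1, theta_2=0.1, n_iters=50, seed=0):
    """docs: list of documents, each a list of word ids in [0, vocab_size)."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # words in doc d on topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # word v on topic k
    topic_total = [0] * n_topics                              # all words on topic k
    # Start with random topic allocations and count them.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, v in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][v] += 1
            topic_total[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                # Remove this word's current allocation from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][j] + theta_2)
                    * (topic_word[j][v] + theta_1)
                    / (topic_total[j] + vocab_size * theta_1)
                    for j in range(n_topics)
                ]
                # Resample the allocation and add it back to the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return z, doc_topic, topic_word

# Tiny hypothetical corpus: word ids 0-1 and 2-3 tend to occur together.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 1, 0, 2]]
z, doc_topic, topic_word = collapsed_gibbs(docs, n_topics=2, vocab_size=4)
print(doc_topic)
```

The returned count tables are all that is needed to recover estimates of the per-document topic proportions and the per-topic word distributions.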
## Code
Let's look at the
[reproducible analysis](https://github.com/bayes1/neuralnetwork-project/blob/master/LDA.ipynb)
## Results
Example topics (named by hand)
|Family|War|RomCom|
|---|---|---|
|love|war|love|
|life|men|comedy|
|story|world|characters|
|woman|american|romantic|
|wife|battle|together|
|husband|country|humor|
|man|government|women|
## Discussion
- Works better on some data sets than others
- Tried first on inaugural addresses, but they're all about the same thing
- Works in other contexts as well, especially data that fits the 'bag of
words' model (i.e. order is not important)
- Example: finding "topics" of car models in different areas of the U.S.
## Discussion: Extensions
- Put a Dirichlet process prior on the number of topics
- Incorporate a notion of ordering with an HMM