Topic Model Tutorial
A basic introduction to topic modelling for web scientists by Christoph Carl Kling, Lisa Posch, Arnim Bleier and Laura Dietz
Presentations and Animations
Presentations
Topic Model Tutorial - Part 1: The Intuition
(pdf)
Topic Model Tutorial - Part 2: Technical Foundations
(pdf)
Topic Model Tutorial - Part 3: Extensions and Adaptations
(pdf)
Topic Model Tutorial - Part 4: Evaluation of Topic Models
(pdf)
Topic Model Tutorial - Bonus slides 1: Parameter inference
(pdf)
Animations & Game
Animation Polya urn (restaurant) scheme for Dir(1,2,1,2,1,3)
(link)
Animation Polya urn (restaurant) scheme for Dir(.1,.1,.1,.1,.1,.1)
(link)
Animation LSA on AP corpus
(link)
Word intrusion game
(link)
Word intrusion live boxplot
(link)
Bonus: Animation Chinese restaurant process (CRP) for DP(2,H)
(link)
Standalone Animations
Animation Polya urn (restaurant) scheme
(link)
Animation Chinese restaurant process
(link)
(Re-usable version of the word intrusion game in preparation)
This tutorial is a basic introduction to topic modelling for web scientists.
Prior knowledge of probabilistic modelling or topic modelling is not required. The idea is to explain the fundamental mechanisms and ideas behind topic modelling, without using distracting formal notation unless necessary.
Outline
In this tutorial, we teach the intuition and the assumptions behind topic models. Topic models explain co-occurrences of words in documents with sets of semantically related words, called topics. These topics are semantically coherent and can be interpreted by humans. Starting with the most popular topic model, Latent Dirichlet Allocation (LDA), we explain the fundamental concepts of probabilistic topic modelling. We organise our tutorial as follows: After a general introduction, we will enable participants to develop an intuition for the underlying concepts of probabilistic topic models. Building on this intuition, we cover the technical foundations of topic models, including graphical models and Gibbs sampling. We conclude the tutorial with an overview of the most relevant adaptations and extensions of LDA.
Developing an Intuition
In the first part, we provide the participants with an intuition of the ideas and assumptions behind probabilistic topic models. First, we present easily understandable metaphors (following the Polya urn scheme) to introduce the multinomial and the Dirichlet-multinomial distribution and the role of the parameter of the Dirichlet distribution for probabilistic modelling. Furthermore, we introduce the notion that a corpus of documents can be modelled as a mixture of Dirichlet-multinomial distributions. We then train LDA on text corpora and demonstrate the effects of different parameter settings on the trained topic models. In order to deepen the intuition, we conclude this part with a game with a purpose, enabling a human evaluation of model parameters.
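The Polya urn metaphor mentioned above can be sketched in a few lines of Python. This is a minimal illustration and not the code behind the tutorial's animations: the function name `polya_urn_draws`, its arguments and the seed are assumptions made for this example. The two parameter vectors are the ones named in the animation titles, Dir(1,2,1,2,1,3) and Dir(.1,.1,.1,.1,.1,.1).

```python
import random
from collections import Counter


def polya_urn_draws(alpha, n_draws, seed=0):
    """Draw n_draws colours from a Polya urn (hypothetical helper).

    alpha holds the initial, possibly fractional, ball mass per colour;
    it plays the role of the Dirichlet parameter.
    """
    rng = random.Random(seed)
    weights = [float(a) for a in alpha]  # current ball mass per colour
    draws = []
    for _ in range(n_draws):
        # Pick a colour with probability proportional to its current mass.
        colour = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        draws.append(colour)
        # Reinforcement: put the ball back together with one more of its colour.
        weights[colour] += 1.0
    return draws


# Compare a large and a small Dirichlet parameter over the same number of draws.
for alpha in ([1, 2, 1, 2, 1, 3], [0.1] * 6):
    counts = Counter(polya_urn_draws(alpha, n_draws=100))
    print(alpha, [counts.get(c, 0) for c in range(len(alpha))])
```

With the small parameter, the first few draws dominate the urn, so the counts typically concentrate on one or two colours. With the larger parameter, the initial mass carries more weight and the counts tend to stay closer to the proportions given by the parameter vector.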
Technical Foundations
After developing the intuition, in the second part of the tutorial we show how the assumptions in the metaphors translate to the individual components of Latent Dirichlet Allocation (LDA), the most cited topic model in the scientific community. We translate the intuition gained in the first part into detailed definitions. In particular, we aim to cover concepts such as closed-form inference, approximate inference with a focus on Gibbs sampling, the generative storyline and plate notation. For each of the introduced concepts, we provide illustrative implementation examples.
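As one possible illustration of approximate inference with Gibbs sampling, the following is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Python. It is not taken from the tutorial slides. The function name `lda_gibbs`, the symmetric hyperparameters `alpha` and `beta`, their default values and the toy corpus are all assumptions made for this example. Documents are assumed to be lists of integer word ids.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch)."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                                # token totals per topic
    z = []                                              # topic of every token

    # Initialise every token with a uniformly random topic.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Full conditional, up to a constant:
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Resample the topic and add the token back to the counts.
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_dk, n_kw


# Toy corpus (made up): word ids 0-2 and 3-5 mostly occur in different documents.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3], [0, 2, 1, 2], [4, 5, 3, 5, 4]]
assignments, doc_topic, topic_word = lda_gibbs(toy_docs, n_topics=2, vocab_size=6)
print(doc_topic)
print(topic_word)
```

The count tables returned at the end are a single sample from the sampler's final state. In practice, one would usually discard a burn-in period and average over several samples before estimating the topic and document distributions.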
Adaptations and Extensions
LDA has been adapted and extended to a wide range of specific settings. In the final part of the tutorial, we will present adaptations relevant for the social sciences.
Examples include models exploiting context information, such as L-LDA, a supervised variant of LDA; PL-TM, a topic model for multilingual settings; and the Citation Influence Model, which models the influence of citations in a collection of publications.
Evaluation and Discussion of Pros and Cons
While a useful tool for exploratory analysis of unfamiliar data collections, topic models have been disputed in the recent past. Common error modes are discussed and the critique of topic models is summarized. We emphasize the importance of evaluating any exploratory tool in the domain of interest before drawing conclusions. To enable participants to make an informed decision, we discuss several avenues for in-domain evaluation.