Topic Modelling Portal

Stochastic data analysis for web scientists and computational social scientists

Topic Model Tutorial

A basic introduction to topic modelling for web scientists by Christoph Carl Kling, Lisa Posch, Arnim Bleier and Laura Dietz

Presentations and Animations

Presentations

Topic Model Tutorial - Part 1: The Intuition (pdf)
Topic Model Tutorial - Part 2: Technical Foundations (pdf)
Topic Model Tutorial - Part 3: Extensions and Adaptations (pdf)
Topic Model Tutorial - Part 4: Evaluation of Topic Models (pdf)
Topic Model Tutorial - Bonus slides 1: Parameter inference (pdf)

Animations & Game

Animation Polya urn (restaurant) scheme for Dir(1,2,1,2,1,3) (link)
Animation Polya urn (restaurant) scheme for Dir(.1,.1,.1,.1,.1,.1) (link)
Animation LSA on AP corpus (link)
Word intrusion game (link)
Word intrusion live boxplot (link)
Bonus: Animation Chinese restaurant process (CRP) for DP(2,H) (link)

Standalone Animations

Animation Polya urn (restaurant) scheme (link)
Animation Chinese restaurant process (link)
(Re-usable version of the word intrusion game in preparation)

This tutorial is a basic introduction to topic modelling for web scientists. Prior knowledge of probabilistic modelling or topic modelling is not required. The idea is to explain the fundamental mechanisms and ideas behind topic modelling, avoiding distracting formal notation unless it is necessary.

Outline

In this tutorial, we teach the intuition and the assumptions behind topic models. Topic models explain co-occurrences of words in documents with sets of semantically related words, called topics. These topics are semantically coherent and can be interpreted by humans. Starting with the most popular topic model, Latent Dirichlet Allocation (LDA), we explain the fundamental concepts of probabilistic topic modeling. We organise our tutorial as follows: After a general introduction, we enable participants to develop an intuition for the underlying concepts of probabilistic topic models. Building on this intuition, we cover the technical foundations of topic models, including graphical models and Gibbs sampling. We conclude the tutorial with an overview of the most relevant adaptations and extensions of LDA.

Developing an Intuition

In the first part, we give participants an intuition for the ideas and assumptions behind probabilistic topic models. First, we present easily understandable metaphors (following the Polya urn scheme) to introduce the multinomial and the Dirichlet-multinomial distribution and the role the parameter of the Dirichlet distribution plays in probabilistic modelling. Furthermore, we introduce the notion that a corpus of documents can be modelled as a mixture of Dirichlet-multinomial distributions. We then train LDA on text corpora and demonstrate the effects of different parameter settings on the trained topic models. To deepen the intuition, we conclude this part with a game with a purpose, which enables a human evaluation of model parameters.
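The Polya urn metaphor can be sketched in a few lines of Python. This is only an illustration, not the code behind the animations: the function name polya_urn_draws and its arguments are hypothetical. Each draw picks a colour with probability proportional to its current weight and then adds one unit of weight to that colour, so colours that were drawn before become more likely.

```python
import random
from collections import Counter


def polya_urn_draws(alpha, n_draws, seed=0):
    """Draw n_draws colours from a Polya urn with initial weights alpha.

    alpha plays the role of the Dirichlet parameter: one (possibly
    fractional) initial weight per colour.
    """
    rng = random.Random(seed)
    weights = [float(a) for a in alpha]
    draws = []
    for _ in range(n_draws):
        # Pick a colour proportionally to the current weights.
        colour = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        draws.append(colour)
        # Reinforcement: put the ball back together with one more of its colour.
        weights[colour] += 1.0
    return draws


if __name__ == "__main__":
    # The two parameter settings used in the animations listed above.
    for alpha in ([1, 2, 1, 2, 1, 3], [0.1] * 6):
        counts = Counter(polya_urn_draws(alpha, n_draws=1000))
        print(alpha, [counts.get(k, 0) for k in range(len(alpha))])
```

With small initial weights such as .1, the first few draws dominate the urn, so most of the mass tends to end up on few colours; with larger weights the counts stay closer to the initial proportions.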

Technical Foundations

After developing the intuition, in the second part of the tutorial we show how the assumptions in the metaphors translate to the individual components of Latent Dirichlet Allocation (LDA), the most cited topic model in the scientific community. We translate the gained intuition into detailed definitions. In particular, we aim to cover concepts such as closed-form inference, approximate inference with a focus on Gibbs sampling, the generative storyline and plate notation. For each of the introduced concepts, we provide illustrative implementation examples.
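To give an idea of what Gibbs sampling for LDA looks like in practice, here is a minimal sketch of a collapsed Gibbs sampler in plain Python. It is not taken from the tutorial slides or from Promoss. It assumes symmetric hyperparameters alpha and beta and documents given as lists of integer word ids, and all names (lda_gibbs, n_dk, n_kw, n_k) are chosen for illustration only.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    n_k = [0] * n_topics                                # token counts per topic
    z = []                                              # topic of every token

    # Random initialisation of the topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised conditional p(z = t | all other assignments).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Resample the topic and add the token back.
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_dk, n_kw
```

The two factors of the weight mirror the urn metaphor: a topic is more likely for a token if the document already uses that topic often and if the topic already contains that word often.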

Adaptations and Extensions

LDA has been adapted and extended to a wide range of specific settings. In the final part of the tutorial, we present adaptations relevant for the social sciences. Examples include models exploiting context information, such as L-LDA, a supervised variant of LDA; PL-TM, a topic model for multilingual settings; and the Citation Influence Model, which models the influence of citations in a collection of publications.

Evaluation and Discussion of Pros and Cons

While topic models are a useful tool for exploratory analysis of unfamiliar data collections, they have been disputed in the recent past. We discuss common error modes and summarize the critique of topic models. We emphasize the importance of evaluating any exploratory tool in the domain of interest before drawing conclusions. To enable participants to make an informed decision, we discuss several avenues for in-domain evaluation.
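One form of human evaluation is the word intrusion task used in the game listed above: annotators see the top words of a topic plus one intruder and try to spot the intruder. The following sketch builds such a question. It is not the code of the tutorial's game: the function name, the input format (one word-to-probability dict per topic) and the rule "the intruder comes from the bottom half of this topic's ranking" are simplifying assumptions.

```python
import random


def word_intrusion_question(topics, topic_id, n_top=5, seed=0):
    """Return (shuffled word list, intruder) for one topic.

    topics is a list of dicts mapping word -> probability, one per topic.
    """
    rng = random.Random(seed)
    topic = topics[topic_id]
    ranked = sorted(topic, key=topic.get, reverse=True)
    top_words = ranked[:n_top]
    # Assumption: "improbable in this topic" means bottom half of its ranking
    # (or missing from the topic altogether).
    low_words = set(ranked[len(ranked) // 2:])

    candidates = []
    for other_id, other in enumerate(topics):
        if other_id == topic_id:
            continue
        other_top = sorted(other, key=other.get, reverse=True)[:n_top]
        for word in other_top:
            improbable_here = word not in topic or word in low_words
            if improbable_here and word not in top_words:
                candidates.append(word)
    if not candidates:
        raise ValueError("no suitable intruder word found")

    intruder = rng.choice(candidates)
    shown = top_words + [intruder]
    rng.shuffle(shown)
    return shown, intruder
```

If annotators reliably find the intruder, the topic's top words form a coherent group; if their guesses are spread over all words, the topic is hard to interpret.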

Promoss Topic Modelling Toolbox

The Promoss topic modelling toolbox is free software, developed by GESIS, Leibniz Institute for the Social Sciences in Cologne.

Download: jar | Java source code

Latent Dirichlet Allocation (LDA)

Promoss implements LDA with an efficient online stochastic variational inference scheme, so memory consumption is lower than for standard implementations and inference is significantly faster.

Usage is simple: create a corpus.txt file in which each line corresponds to a document, then run promoss.jar with

java -Xmx11000M -jar ./promoss.jar -directory PATH_TO_DIRECTORY/ -method "LDA" \
-MIN_DICT_WORDS 100 -T 50

Here, -T 50 sets the number of topics to 50, and -MIN_DICT_WORDS 100 sets the minimum number of occurrences required for a word to be included in the analysis (in this case 100). An alternative input format, based on a dictionary and documents given in SVMlight format, is documented in the readme file.
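If your documents live in a Python program rather than in a file, a corpus.txt with one document per line can be produced as sketched below. The helper write_corpus is hypothetical and not part of Promoss, and UTF-8 is an assumed encoding. Line breaks inside a document are collapsed to spaces, since every line is read as a separate document.

```python
from pathlib import Path


def write_corpus(documents, directory):
    """Write documents to <directory>/corpus.txt, one document per line."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # Collapse all whitespace (including newlines) so each document is one line.
    lines = [" ".join(doc.split()) for doc in documents]
    path = directory / "corpus.txt"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    docs = ["First document,\nspanning two lines.", "Second document."]
    print(write_corpus(docs, "my_corpus"))
```

The directory passed to write_corpus is the one you then hand to promoss.jar as PATH_TO_DIRECTORY/.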

Hierarchical Multi-Dirichlet Process Topic Model (HMDP)

Do you want to include several kinds of document metadata in your topic model, such as geographical locations, timestamps or ordinal variables? And would you rather not spend weeks writing your own topic model, while still getting efficient inference?

Store the document metadata, separated by semicolons, in a file named meta.txt. Put the documents in a file named corpus.txt in which each line corresponds to a document. Documents can be raw text and will be processed by Promoss. You have to specify which metadata fields are geographical locations, timestamps, ordinal or nominal data. Timestamps can be used to extract yearly, monthly, weekly or daily cycles.

Then you just have to execute the .jar file with a few parameters (documented in the readme file). Example command line usage:

java -Xmx11000M -jar promoss.jar -directory PATH_TO_DIRECTORY/ \
-meta_params "T(L1000,W1000,D10,Y100,M20);N" -MIN_DICT_WORDS 1000
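A meta.txt that matches the example above (one timestamp field and one nominal field per document) could be written as in the following sketch. The helpers write_meta and same_line_count are hypothetical and not part of Promoss. The sketch assumes that line i of meta.txt describes line i of corpus.txt and that timestamps are given as integer seconds; check the readme file for the exact value formats before relying on this.

```python
from pathlib import Path


def write_meta(rows, directory):
    """Write one semicolon-separated metadata line per document to meta.txt."""
    lines = []
    for row in rows:
        fields = [str(field) for field in row]
        for field in fields:
            # A semicolon or line break inside a value would corrupt the format.
            if ";" in field or "\n" in field:
                raise ValueError(f"field {field!r} would break the line format")
        lines.append(";".join(fields))
    path = Path(directory) / "meta.txt"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def same_line_count(directory):
    """Check that corpus.txt and meta.txt have the same number of lines."""
    directory = Path(directory)
    n_docs = len((directory / "corpus.txt").read_text(encoding="utf-8").splitlines())
    n_meta = len((directory / "meta.txt").read_text(encoding="utf-8").splitlines())
    return n_docs == n_meta
```

For example, write_meta([(1400000000, "news"), (1400086400, "blog")], "my_corpus") writes the two lines 1400000000;news and 1400086400;blog; the values themselves are made up for illustration.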

If you need any support in using Promoss, feel free to contact us:
topicmodels (ät) c-kling.de

Contact

Do you have suggestions on how we could improve our material? Or would you like to host the topic model tutorial at your institute?
Please feel free to ask us any related question: topicmodels (ät) gesis.org