This post is Part 1 in a two-part series about topic modeling for qualitative research. In this post, we focus on what topic modeling is, how EPAR uses topic modeling, and the technical nitty-gritty of writing and running your own topic modeling code. In Part 2, we offer helpful tips for the challenging and creative process of interpreting the results of a topic model. Download the topic modeling script from our Github page and follow along!

What is Topic Modeling? 

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. This process is useful for qualitative data analysis, particularly for automating the review of large numbers of documents early in the research process. Imagine you are new to Shakespeare and interested in evaluating whether his plays fit into general categories. You could feed the plays (the “documents”) into a topic modeling algorithm, provide the number of “topics” you want, and receive two sets of results: a cluster of words for each of the topics you selected (in this case two; Figure 1) and the probability of each document fitting into each of the two topics (Figure 2):

Figure 1. Word Clusters for Each Topic

Topic 1: lord, king, god, must, hand, death, love, give, heart, know, man, speak, princ, enter, duke, son, fear, time, queen, men, hast, blood, live, father, mine

Topic 2: sir, love, man, know, master, give, must, ‘ll, time, pray, eye, fool, ladi, speak, mine, sweet, fair, hear, mistress, lord, word, hous, hamlet, father, marri

Figure 2. Probability of Document Fitting Each Topic (Only Subset of 5 Documents Shown)

Document Name                   Topic 1 (V1)   Topic 2 (V2)   Topic
The Winter's Tale.docx          0.490543       0.509457       2
A_Midsummer_Nights_Dream.doc    0.357087       0.642913       2
Julius Caesar.docx              0.775332       0.224668       1
Much Ado About Nothing.docx     0.282183       0.717817       2
Romeo and Juliet.docx           0.727093       0.272907       1

Each document is forced into the set of topics you provide at the outset, so the probabilities in each row will always sum to 1. For example, in Figure 2, The Winter’s Tale has a 49% probability of being Topic 1, a 51% probability of being Topic 2, and a 100% probability of being one of those two topics. The third column in Figure 2 (“Topic”) assigns the document to the topic with the highest probability of fitting the document.
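As a quick illustration of this bookkeeping, here is a minimal sketch in R, using an illustrative data frame that mirrors two rows of Figure 2 (the object name and values are ours, not output from our script): it checks that the probabilities in each row sum to 1 and derives the Topic column as the highest-probability topic.

```r
# Illustrative data frame mirroring two rows of Figure 2
doc_topics <- data.frame(
  document = c("The Winter's Tale.docx", "Julius Caesar.docx"),
  V1 = c(0.490543, 0.775332),
  V2 = c(0.509457, 0.224668)
)

# The topic probabilities in each row sum to 1
rowSums(doc_topics[, c("V1", "V2")])

# The "Topic" column is simply the topic with the highest probability
doc_topics$Topic <- apply(doc_topics[, c("V1", "V2")], 1, which.max)
doc_topics
```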

Given the context of these documents and your domain knowledge, you may conclude that Topic 1 generally classifies tragedies while Topic 2 generally classifies comedies. This may be less obvious in Figure 1, which shows the words most strongly associated with each topic, and more obvious in Figure 2, where we ask what A Midsummer Night’s Dream and Much Ado About Nothing have in common that is different from what The Tragedy of Julius Caesar and Romeo and Juliet have in common. Which results are more useful depends entirely on the types of documents you have and your research questions. It is interesting to notice that The Winter’s Tale balances on the edge of being classified as a tragedy or a comedy, which mirrors ongoing debate about how to categorize the play!

It is important to note that the designation of the topics as genres is driven by you as the researcher; the model only provides a cluster of words for each topic and the classification of each document into those topics. Additionally, topic classifications cannot be deductively falsified with topic modeling, so these topics should be used to guide the development of hypotheses, not treated as a method for testing hypotheses about the topics present in the documents.

How is EPAR Applying Topic Modeling?

EPAR has been using topic modeling to look for groupings among grant documents within an investment area and to surface relationships between concepts that a person might not otherwise have suspected. The focus has been less on identifying topics and more on what makes grants similar to and different from each other. For example, among grant documents about agricultural development, are there natural groupings among the methods applied and the constituents served?

As you may expect, interpreting the results of topic modeling for this purpose is not as straightforward as our first example. In the process of applying topic modeling to the applied social sciences and philanthropic research, we faced unique challenges and have gathered specific tips that could be useful to others considering applying this method to their own ambiguous, qualitative research questions. These tips are available in Part 2 of this blog series.

The next section dives into the nitty-gritty of topic modeling and how the EPAR topic modeling algorithm is coded, which you can see for yourself on our Github page.

Technical Nitty-Gritty of Running a Topic Model

Topic modeling can be done with a number of algorithms, but the most widely used is Latent Dirichlet Allocation (LDA). The goal of the algorithm is to uncover the hidden or “latent” topics by attempting to re-create the individual documents. It is an iterative procedure, and Gibbs sampling is one common way to estimate the topic assignments. The set of all documents constitutes a corpus, and each document is assumed to be composed of a mixture of the topics present across the corpus. In other words, LDA assumes that every document can be a mixture of topics (e.g. A Midsummer Night’s Dream is 36% tragedy and 64% comedy) and every topic is a mixture of words (e.g. the word “father” appears in both tragedies and comedies).

General Stages of the Algorithm:

  1. The user (you!) specifies the number of topics, k, as an input for the algorithm. Because k needs to be specified upfront, it is prudent to save the results for different k values and inspect them to decide on the optimal number of topics (see the sketch after this list). We provide more tips on how to determine the best number of topics retroactively in Part 2 of this blog post.
  2. Random assignment --- every word in a document is randomly assigned one topic out of k.  
  3. The procedure iterates over every word in every document, updating the association between that word and the topics under the assumption that every other assignment is correct. LDA is a generative model, meaning it tries to recreate the original documents from the document-topic probabilities and, in turn, the word distributions within those topics.
  4. After many iterations, the algorithm reaches a state where the associations between words and topics are stable and a distribution of topics can be assigned to each document.
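Because k must be chosen up front (step 1 above), one practical approach is to fit the model for several candidate values and keep every result for later comparison. Below is a minimal sketch using the topicmodels package; the candidate values and seed are illustrative, and `dtm` is assumed to be a document-term matrix built as described later in this post.

```r
library(topicmodels)

# Fit LDA with Gibbs sampling for several candidate numbers of topics,
# assuming `dtm` is a document-term matrix built from the corpus
candidate_k <- c(2, 5, 10)
models <- lapply(candidate_k, function(k) {
  LDA(dtm, k = k, method = "Gibbs", control = list(seed = 1234))
})
names(models) <- paste0("k_", candidate_k)

# Compare the top 10 words per topic across the different fits
lapply(models, terms, 10)
```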

The main R packages for text mining and applying the LDA model are tm and topicmodels. The textreadr package reads in a variety of file formats (.doc, .docx, .xlsx) and streamlines the input text into a single text string per document. The collection of documents with text forms the “corpus”, and all pre-processing is performed on the entire corpus. Standard pre-processing for text includes operations such as the following (a code sketch follows the list):

  • Converting all text to lower case

  • Removing punctuation, digits, and white space

  • Removing “stop words” that do not add much meaning to the text.

    • There are some standard stop words like “by”, “to”, and “the” that come with the package. However, you should add stop words based on the context of your work. For example, we added the words “project”, “program”, and “proposal” while running our topic model across grant proposal documents.

  • Stemming to identify the common root across different usages (win, winning, and winner), using the SnowballC package
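Here is a minimal sketch of this pre-processing pipeline with textreadr, tm, and SnowballC. The folder path is illustrative, and the custom stop words echo the grant-proposal example above; your own list will differ.

```r
library(textreadr)
library(tm)
library(SnowballC)

# Read each file into a single text string (folder path is illustrative)
files <- list.files("grant_documents", full.names = TRUE)
texts <- sapply(files, function(f) paste(read_document(f), collapse = " "))

# Build the corpus and apply the standard pre-processing steps listed above
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Context-specific stop words (ours for grant proposals; add your own)
corpus <- tm_map(corpus, removeWords, c("project", "program", "proposal"))
corpus <- tm_map(corpus, stemDocument)   # stemming via SnowballC
```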

Topic modeling expects a document-term matrix as input. A document-term matrix has documents as rows and all unique words as columns; the cells contain the word counts for each document.
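Continuing the sketch above, the pre-processed corpus is converted into a document-term matrix before the model is fit:

```r
# Convert the corpus into a document-term matrix:
# rows are documents, columns are unique (stemmed) words, cells are word counts
dtm <- DocumentTermMatrix(corpus)

# Drop any documents left empty after pre-processing (LDA requires at least
# one non-zero entry per row)
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]

inspect(dtm[1:5, 1:8])   # peek at a small corner of the matrix
```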

Gibbs sampling arrives at acceptable results through a random-walk approach and requires several parameters. The burn-in value (burnin) specifies the number of initial iterations to discard, since the early steps are essentially random. The thin value keeps only every nth of the total iterations (iter) to reduce correlation between successive samples. nstart sets the number of runs from different starting points, each with its own seed to ensure reproducibility. Setting best to TRUE returns only the run with the highest posterior likelihood. It is a good idea to run the algorithm with different combinations of parameters and check whether the results make sense, because Gibbs sampling does not guarantee a globally optimal solution and results may vary with the starting point.
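A minimal sketch of how these control parameters are passed to LDA in the topicmodels package; the particular values below are illustrative, not recommendations.

```r
library(topicmodels)

k <- 2
control <- list(
  burnin = 1000,                 # discard the first 1,000 (essentially random) iterations
  iter   = 2000,                 # number of Gibbs iterations after burn-in
  thin   = 100,                  # keep only every 100th iteration to reduce correlation
  nstart = 5,                    # repeat from 5 different random starting points
  seed   = list(1, 2, 3, 4, 5),  # one seed per start, for reproducibility
  best   = TRUE                  # return only the run with the highest posterior likelihood
)

lda_model <- LDA(dtm, k = k, method = "Gibbs", control = control)
```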

This process outputs two spreadsheets. The first lists each topic as a column and ranks the words within that topic by their prevalence (Figure 1). The second gives the probability of each document belonging to each topic (Figure 2).
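A minimal sketch of how these two outputs can be extracted from the fitted model and written out; the file names are illustrative, and our Github script may organize this differently.

```r
# Figure 1-style output: top-ranked words for each topic
topic_terms <- terms(lda_model, 25)
write.csv(topic_terms, "topic_top_words.csv", row.names = FALSE)

# Figure 2-style output: per-document topic probabilities and most likely topic
doc_probs <- as.data.frame(posterior(lda_model)$topics)
doc_probs$Topic <- topics(lda_model)
write.csv(doc_probs, "document_topic_probabilities.csv")
```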

You can run our code on Shakespeare plays or your own set of documents and recreate Figures 1 and 2 by visiting the Topic Modeling repository on our Github page. Let's move on to Part 2 about how to interpret topic modeling results! 

Resources for Learning More About Topic Modeling

  • Text Mining with R, Chapter 6: Topic Modeling
  • Topic Modeling with Scikit Learn
  • Patrick van Kessel's series of posts about topic models for text analysis

By Namrata Kolla, Rohit Gupta & Terry Fletcher