Topic Modeling applied to Gazette de Lausanne and Journal de Genève

Introduction:

Suppose you feed your computer the most recent speeches of American billionaire Donald Trump, and your program tells you that the famous business magnate is talking about the upcoming presidential elections, the economy of the United States, and the war on terror. This is called topic modeling. In formal terms, topic modeling refers to the techniques and algorithms used to reveal the main themes contained in a document or a collection of documents.

Coming from the world of computer science, machine learning, and information retrieval, the fifteen-year-old field has been used in many domains ranging from world literature to bioinformatics, and for documents ranging in size from tweets to monographs [1]. Topic modeling made its definitive way into Digital Humanities (DH) in 2003 with the paper by David Blei et al. on Latent Dirichlet Allocation (LDA) [2], which is currently the most popular topic modeling technique used in DH. Ever since, it has been a relatively hot field in the DH world, even leading some authors to refer to it as a synecdoche (a part standing for the whole) of DH [1].

In this blog post, we present our Master’s Social and Human Science project, which consists of applying topic modeling to more than 100 years of articles taken from the Swiss newspapers “Gazette de Lausanne” (GDL) and “Journal de Genève” (JDG). We first describe the potential and relevance of our work by looking at different projects in the Digital Humanities community that employ Topic Modeling. We then proceed with a formal definition of the project objectives and deliverables. After that, we present the methodology we intend to follow in order to achieve these objectives, and we finish with a detailed section about the plan and milestones for the next semester.

Topic Modeling in Digital Humanities:

Before we get into the details of our plan, we present the motivation behind our project and its potential interest in the Digital Humanities community by examining the use of Topic Modeling in DH projects.

The beginning of the 21st century was marked by the rise of many library digitization efforts such as Google Books or the Open Content Alliance. This produced a very large collection of digitized books, newspapers, and other texts. However, exploring and analyzing the contents of thousands of texts by hand is very cumbersome. Topic modeling has been used to explore some aspects of these large collections quickly and automatically. This is called “Distant Reading”. As an example, Riddell used topic models to “read” and study 22,198 journal articles and book reviews from four US-based German studies journals [3]. Mimno does something similar with 24 journals in classical philology and archaeology [4].

Another trend among Digital Humanists is to use Topic Modeling to detect and analyze changing topics in the news over time. For example, Newman et al. from the University of California, Irvine, use Topic Modeling to extract the themes covered by 80,000 articles and advertisements from a colonial U.S. newspaper, the Pennsylvania Gazette, between 1728 and 1800 [5]. They then analyze how those topics changed over time, uncovering important information about the content of the newspaper and contributing towards a better understanding of early American print culture and society. Another popular example is the “Mining the Dispatch” project by Nelson of the University of Richmond [6]. In that project, over 112,000 articles and advertisements from the Richmond Daily Dispatch dating between 1860 and 1865 were analyzed using a very popular Topic Modeling software tool, MALLET. Nelson analyzes the rise and fall of topics over time in order to gain insight into the changes and continuities in the social and political life of Richmond during the American Civil War. In [7], Yang et al. provide a more general overview of topic modeling applied to newspapers.

In another application, Cameron Blevins uses Topic Modeling to analyze the diary of Martha Ballard, an American midwife, healer, and diarist [8]. In particular, he praises the power of topic modeling for examining thematic trends across the entire diary and its ability to extract unseen patterns in word usage. For example, MALLET was able to reveal that the topic “emotion” occurred much more frequently in her entries between 1803 and 1804. Indeed, during that time, her husband was imprisoned for debt and her son was indicted by a grand jury for fraud.

Having even found its way into poetry [9,10], Topic Modeling is indeed a hot field and a widely used tool in Digital Humanities. Applied to the GDL and JDG, Topic Modeling can make research more accessible and more efficient. For example, the articles could be grouped by theme to give researchers faster and easier access to topics of interest. More importantly, Topic Modeling can be used in the same way as described above to reveal hidden themes and patterns, examine trends, detect meaningful events, and gain better insight into the culture and society of the French-speaking part of Switzerland throughout the years. Given how widely Topic Modeling is used in Digital Humanities, it is not far-fetched to say that the impact of our project may not be limited to the two Swiss newspapers, but may extend to similar work worldwide.

Project Objectives

The main goal of the project is to develop a tool to perform automated topic modeling.  The objectives are:

  • To find an appropriate model and the best model parameters for performing automated topic modeling on newspaper articles from the Journal de Genève and the Gazette de Lausanne.
  • To experiment with the effect of noisy (erroneous) words in the articles on the distributions of topics.
  • To explore the rise and decline of topics over time, and examine the properties of that rise and decline (sharp vs gradual).
  • To prepare a process/method/tool that, once the model configuration is complete, applies topic modeling to newly provided articles in the same automated fashion and produces good topic model results.

Deliverables

  • A software tool that is able to find the topics present in a given article and also analyze the dynamics of topics in a certain dataset.
  • Final Report of the project and Presentation.

Theoretical Background:

Before we dive into the details of the different methods we are going to adopt in our project, we first give some definitions and theoretical background that are necessary for better understanding the different stages of our methodology.

Definitions:

In our context, a topic is defined as a probability distribution over words. An article or document is defined as a probability distribution over topics, which is estimated by matching each word in the document against the words in the distribution of each topic. Since the number of words needed to identify a topic is not known before analyzing its word distribution, it may be better to focus on the pivotal part of that distribution, i.e. its most probable words. This makes the definition of a topic narrower and more robust.
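To make these definitions concrete, here is a minimal sketch in Python. The topic names, words, and probabilities are invented for illustration, and `core_words` is only one possible reading of what the "pivotal part" of a topic could mean (the most probable words covering a chosen share of the probability mass).

```python
# Toy illustration: all topic names, words, and probabilities are invented.
# A topic is a probability distribution over words.
topics = {
    "politics": {"election": 0.5, "vote": 0.3, "council": 0.2},
    "economy": {"bank": 0.5, "franc": 0.3, "market": 0.2},
}

# A document is a probability distribution over topics.
document = {"politics": 0.75, "economy": 0.25}


def word_probability(word, document, topics):
    """P(word | document): sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(weight * topics[name].get(word, 0.0)
               for name, weight in document.items())


def core_words(topic, mass=0.7):
    """Return the most probable words that together cover `mass` of the topic."""
    selected, covered = [], 0.0
    for word, prob in sorted(topic.items(), key=lambda item: -item[1]):
        if covered >= mass:
            break
        selected.append(word)
        covered += prob
    return selected


print(word_probability("election", document, topics))
print(core_words(topics["politics"]))
```

The `mass` threshold is an assumption; in practice it would be one of the parameters tuned against the evaluation set.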

Topic Modeling Methods and Tools:

Topic modeling has been a trending subject in digital humanities in recent years, which has encouraged the development of many techniques and algorithms for extracting topics from a textual corpus. The most popular technique is Latent Dirichlet Allocation (LDA), introduced by David Blei et al. in 2003 [2], as mentioned before. It is an unsupervised probabilistic model in which the topic distributions are given a Dirichlet prior. It is a ‘bag of words’ approach: word order within a document is ignored, and the model aims to attribute word distributions to topics and topic distributions to documents. To simplify, LDA starts by randomly assigning each word to a topic, and iteratively refines these assignments until a stable solution representing the topic distribution across the corpus is found. The topics are called latent because they cannot be directly observed, as opposed to the words in the documents. LDA has been extended by its author [11] to include aspects of supervised learning. Supervised Latent Dirichlet Allocation (sLDA) builds on LDA but adds a ‘response variable’ to each document. sLDA thus jointly models the topics and the response variables, which allows the responses of new documents to be inferred from the previously determined topics.
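The "random start, then iterative refinement" idea can be sketched with collapsed Gibbs sampling, one common inference scheme for LDA (the original LDA paper uses variational inference instead). The code below is a toy sketch, not the implementation found in MALLET or GenSim; the function name `lda_gibbs`, the hyperparameter values, and the four tiny documents are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenised documents."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    vocab_size = len(vocab)

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Step 1: assign every token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    # Step 2: repeatedly resample each token's topic given all the others.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1

                # Weight of each topic: how much the document likes the topic
                # times how much the topic likes the word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    return doc_topic, topic_word


# Invented toy corpus of already tokenised "articles".
docs = [
    ["election", "vote", "council", "vote"],
    ["bank", "franc", "market", "bank"],
    ["election", "council", "vote"],
    ["market", "franc", "bank"],
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
print(doc_topic)
```

Normalising each row of `doc_topic` gives a document's topic distribution, and normalising each entry of `topic_word` gives a topic's word distribution.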

There exist many software tools and libraries which implement LDA and some of its extensions. The most used by digital humanists is MALLET, which is a Java library written by Andrew McCallum from the University of Massachusetts Amherst.  However, GenSim, which is a Python library that seemingly offers more freedom, might be better suited for our project.

Methodology

In this section, we present the methodology we are adopting in order to fulfill the objectives and deliverables already mentioned.

Data Exploration and Extraction:

This step is useful for gaining better insight into the content as well as the structure of our dataset, and allows us to identify potential problems and limitations. If necessary, the dataset may be loaded into a new data structure that facilitates its access and manipulation.

Preprocessing:

After getting familiar with the dataset and shaping it into an appropriate data structure, we proceed to a very important step: preprocessing. This may involve:

  • The removal of stop words: The term ‘stop words’ refers to the most common words, which, due to their presence in almost any text, would not provide valuable information on said text.
  • Word stemming: Words that share the same root can be reduced to this common part, since they mostly refer to the same topic or are grammatical variations of the same idea.
  • The correction of OCR errors: Optical Character Recognition (OCR), which is used to digitize the articles, is imperfect and might introduce errors in the text. To address this issue, two possibilities exist: for samples where OCR mistakes are sufficiently few, these mistakes can simply be removed or ignored, whereas a sample where OCR errors severely hinder comprehension has to be removed from the corpus entirely. A quick exploration of the data showed that this problem is more frequent in older articles.

Preprocessing is very important in order to remove redundant and unnecessary data, as well as noise that could negatively affect the outcome of our model.
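The three preprocessing steps above could be chained roughly as follows. This is only a sketch under stated assumptions: the stop-word list is a tiny invented sample, the suffix stripping in `stem` is a crude stand-in for a real French stemmer, and the rule "a token containing anything other than letters is OCR noise" together with the 30% threshold are placeholders to be tuned.

```python
import re

# Tiny illustrative stop-word list; a real run would use a full French list.
STOP_WORDS = {"le", "la", "les", "de", "des", "du", "et", "un", "une", "en"}
# Crude suffix stripping, a stand-in for a real French stemmer.
SUFFIXES = ("ations", "ation", "ements", "ement", "es", "s")
PUNCTUATION = ".,;:!?\"()«»"


def tokenize(text):
    """Lowercase, split on whitespace and apostrophes, trim edge punctuation."""
    pieces = (t.strip(PUNCTUATION) for t in re.split(r"[\s'’]+", text.lower()))
    return [t for t in pieces if t]


def looks_valid(token):
    """Assumed noise rule: a clean token contains only letters (accents included)."""
    return re.fullmatch(r"[^\W\d_]+", token) is not None


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, max_noise_ratio=0.3):
    """Return cleaned tokens, or None when the article is too noisy to keep."""
    tokens = tokenize(text)
    if not tokens:
        return None
    valid = [t for t in tokens if looks_valid(t)]
    noise_ratio = 1 - len(valid) / len(tokens)
    if noise_ratio > max_noise_ratio:
        return None  # OCR too damaged: drop the whole article
    # Few OCR errors: ignore the noisy tokens, then drop stop words and stem.
    return [stem(t) for t in valid if t not in STOP_WORDS]


print(preprocess("Les conseils de la ville"))
print(preprocess("l3 c0ns3il d3 v1lle"))
```

Keeping `max_noise_ratio` as a parameter fits the plan of testing several preprocessing variations during model selection.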

After the preprocessing step, we plan on using MALLET or GenSim to perform Topic Modeling on our “cleaned” dataset. However, the results of these software tools depend on the choice of several parameters, which can improve or degrade their outcome.

Evaluation Set Construction:

To be able to measure the quality of the parameters as we modify them and explore the configuration possibilities, we plan to build a small set that consists of articles and their associated topics. This set, which we call the “evaluation set”, is built manually and is used as an evaluation criterion for finding the optimal configuration of parameters for our model. Without this evaluation set, it would be impossible to measure the quality of the results output by the topic modeling procedure. Since this evaluation set must be built manually, the number of articles will be relatively small, on the order of one per year for the time visualization, and one every 5 years for the overall topic modeling.

Constructing our model:

After the construction of the evaluation set, we apply different models with different sets of parameters to the evaluation set and look at their output. Note that these parameters also include some of the parameters used in preprocessing, so that different preprocessing variations are tested. We compare the output with our own labels to assess the “goodness” of each model using heuristics. The configuration that performs best defines our final model and can then be used on the rest of the articles.
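The selection loop described above could look like the sketch below. The scoring heuristic (Jaccard overlap between predicted keywords and manual labels) is an assumption, since the exact heuristics are still to be defined, and `make_predictor` is a deliberately trivial stand-in for training a real topic model with a given configuration. The parameter names `min_length` and `top_n` and the two labelled texts are invented.

```python
from collections import Counter
from itertools import product


def jaccard(a, b):
    """Overlap between two keyword collections, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def score_config(predict, evaluation_set):
    """Average overlap between predicted and manually assigned keywords."""
    scores = [jaccard(predict(text), labels) for text, labels in evaluation_set]
    return sum(scores) / len(scores)


def best_config(grid, make_predictor, evaluation_set):
    """Try every parameter combination and keep the highest-scoring one."""
    names = sorted(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = score_config(make_predictor(config), evaluation_set)
        if best is None or score > best[0]:
            best = (score, config)
    return best


def make_predictor(config):
    """Stand-in for a trained topic model: most frequent long-enough words."""
    def predict(text):
        words = [w for w in text.lower().split()
                 if len(w) >= config["min_length"]]
        return [w for w, _ in Counter(words).most_common(config["top_n"])]
    return predict


# Invented evaluation set: (article text, manually assigned keywords).
evaluation_set = [
    ("the council vote and the election vote", {"vote", "election", "council"}),
    ("the bank and the franc and the market", {"bank", "franc", "market"}),
]
grid = {"min_length": [1, 4], "top_n": [3]}
score, config = best_config(grid, make_predictor, evaluation_set)
print(score, config)
```

In the real project, `make_predictor` would wrap preprocessing plus a MALLET or GenSim run, and the grid would contain parameters such as the number of topics.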

Topic Extraction and Analysis:

We then use the optimal model and configuration found in the previous step to extract the topics of the articles in the whole dataset.

As our dataset consists of newspaper articles that span more than 100 years, new topics certainly emerged during this time, and others changed along the way. Therefore, an interesting application is to study the dynamics of topics over time, such as the emergence of new topics or the decline of others. However, classical topic modeling does not account for content that evolves over time, as it treats words and documents as exchangeable, i.e. independent of their time or order of appearance in the corpus. Luckily, this limitation can be addressed with dynamic topic modeling: the documents are divided into fixed time intervals, and the evolution of each topic from one interval to the next is modeled by chaining Gaussian distributions [12].

We will use at least two approaches in our project. The first is to apply our chosen topic model to the whole set of articles at once, which assumes that documents are not ordered, so the generated topics reflect the whole time span at once. The second approach consists of segmenting our documents into intervals of 5 to 10 years, and applying topic modeling to each segment separately. The topics will then depend on the time period they come from, which allows us to observe how various topics change.
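The second approach starts with a simple grouping step, sketched below. The function name `segment_by_period`, the `(year, tokens)` article representation, and the example years are assumptions for illustration; each resulting segment would then be passed to the topic model on its own.

```python
from collections import defaultdict


def segment_by_period(articles, start_year, period_length=10):
    """Group (year, tokens) articles into consecutive fixed-length periods."""
    segments = defaultdict(list)
    for year, tokens in articles:
        offset = (year - start_year) // period_length
        period_start = start_year + offset * period_length
        period = (period_start, period_start + period_length - 1)
        segments[period].append(tokens)
    return dict(segments)


# Invented sample: publication year and preprocessed tokens of each article.
articles = [
    (1900, ["election", "council"]),
    (1904, ["bank", "franc"]),
    (1912, ["market", "franc"]),
]
for period, docs in sorted(segment_by_period(articles, start_year=1900).items()):
    print(period, len(docs))
```

Setting `period_length=5` gives the finer end of the 5 to 10 year range mentioned above.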

Project Plan and Milestones:

In this section, we explain how we are going to apply the methodology we just described to our dataset during the spring semester.

During the first couple of weeks, the main activity will be article preparation. At the end of this two-week phase, we will have the set of articles produced by the preprocessing detailed above (article discarding, stop word elimination, stemming, etc.).

In the second phase, we will go through testing different tools, models, and parameters. We will extensively test MALLET, GenSim, and other tools to evaluate the results obtained with each.  A priori, it seems that GenSim is our best bet and the one that will prove to be more extensible to our needs, but this must be more thoroughly examined. The initial work done will be the construction of the evaluation set, and then the evaluation of all the different models and configurations against this expected result. After this stage we will have the modeling process optimally configured with the best parameters to extract/model the topics in a way which matches our expectations.

The next phase consists of applying our topic modeling process to the entire corpus, and continuing to fine-tune the model configuration to improve the output on the whole set. During the second half of this time period, we will also start to post-process and filter the results. This consists of refining the topic distributions and interpreting the raw results obtained from the topic modeling process. The resulting milestone for this phase is the final processed topic model built over the whole corpus.

In the following phase, we will turn our attention to a more dynamic topic modeling that takes the temporal dimension into account. We will first move towards a generalization of topics, to have a less granular overview of the texts, and then attempt different temporal segmentation techniques on the articles. We have a couple of ideas on how to approach this problem. First, there is the possibility of simply dividing the articles into separate periods and performing topic modeling on each set of articles independently. Second, there is the possibility of using a “sliding” window. Using both approaches should give us more information on how new topics emerge: whether they appear sharply, increase gradually, or do one or the other depending on the topic.
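As a rough sketch of the sliding-window idea and of the sharp versus gradual distinction: `sliding_windows` generates overlapping year ranges, and `classify_rise` labels a topic from its share of the corpus in each successive window. The window width, the step, the 60% threshold, and the example shares are all assumptions that would need to be tuned on real results.

```python
def sliding_windows(first_year, last_year, width=10, step=5):
    """Return overlapping (start, end) year ranges covering the whole span."""
    windows = []
    start = first_year
    while True:
        end = start + width - 1
        windows.append((start, min(end, last_year)))
        if end >= last_year:
            break
        start += step
    return windows


def classify_rise(shares, sharp_fraction=0.6):
    """Label a topic's rise from its share in each successive window.

    The rise is called 'sharp' when a single window-to-window step accounts
    for at least `sharp_fraction` of the total increase (assumed heuristic).
    """
    steps = [later - earlier for earlier, later in zip(shares, shares[1:])]
    total = shares[-1] - shares[0]
    if total <= 0 or not steps:
        return "no rise"
    return "sharp" if max(steps) / total >= sharp_fraction else "gradual"


print(sliding_windows(1900, 1920))
print(classify_rise([0.01, 0.02, 0.20, 0.21]))
print(classify_rise([0.05, 0.10, 0.15, 0.20]))
```

Overlapping windows smooth out year-to-year noise, at the cost of blurring the exact moment at which a topic appears.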

With all this information, we will dedicate the final weeks of the project to preparing the report and presentation. A summary and a Gantt Chart of the project plan are found below.

  • Week 1: Data Exploration and Familiarization
  • Week 2: Elimination of non-useful articles, Text Extraction
  • Milestone:  Set of articles ready for work
  • Week 3 & 4: Construction of Evaluation Set
  • Week 5 & 6: Testing of different models and parameters on Evaluation set and get final set of parameters.
  • Milestone: Final model and parameters set for topic modeling
  • Week 7 & 8: Application of the resulting model parameters on the corpus of data
  • Week 8 & 9: Fine tuning and re-evaluation of parameters
  • Week 9 & 10: Post-processing and filtering of the results
  • Milestone: Topic models of all articles
  • Week 10: Go from distribution of topics to general topics
  • Week 10 & 11: Analyze the emergence of new topics over time
  • Milestone: Model of topics over time
  • Week 11 & 12: Consolidate data and prepare skeleton of final report
  • Week 13 & 14: Write final report and prepare presentation
  • Milestone: Final report & presentation ready

Figure: Topic Modeling Project Plan – Gantt Diagram

References:

[1] Meeks, Elijah, and S. Weingart. “The digital humanities contribution to topic modeling.” Journal of Digital Humanities 2.1 (2012): 1-6.

[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research 3 (4–5) (2003): 993–1022. doi:10.1162/jmlr.2003.3.4-5.993.

[3] Riddell, Allen B. “How to Read 22,198 Journal Articles: Studying the History of German Studies with Topic Models.” USP Seminar,“Big Data and the Humanities,” Tuesday, Oct. Vol. 9. 2012.

[4] D. Mimno, “Computational historiography: Data mining in a century of classics journals,” Journal on Computing and Cultural Heritage 5 (1) (2012): 3:1–3:19. doi:10.1145/2160165.2160168.

[5] Newman, David J., and Sharon Block. “Probabilistic topic decomposition of an eighteenth‐century American newspaper.” Journal of the American Society for Information Science and Technology 57.6 (2006): 753-767.

[6] Nelson, Robert K. “Mining the dispatch.” Mining the Dispatch (2010).

[7] Yang, Tze-I., Andrew J. Torget, and Rada Mihalcea. “Topic modeling on historical newspapers.” Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. Association for Computational Linguistics, 2011.

[8] Blevins, Cameron. “Topic Modeling Martha Ballard’s Diary.” Pers. Blog (2010).

[9] http://lisa.therhodys.net/2012/04/why-use-visualizations-to-study-poetry/

[10] Nelson, Robert K. “Of Monsters, Men—And Topic Modeling.” New York Times (May 29, 2011) (2011).

[11] McAuliffe, Jon D., and David M. Blei. “Supervised topic models.” Advances in Neural Information Processing Systems. 2008.

[12] Blei, David M., and John D. Lafferty. “Dynamic topic models.” Proceedings of the 23rd international conference on Machine learning. ACM, 2006.