
Topic Modeling on 200 Years of Swiss Newspapers – Final Report

Introduction

Suppose you feed your computer the most recent speeches of American billionaire Donald Trump, and your program tells you that the famous business magnate is talking about the upcoming presidential elections, the economy of the United States, and the war on terror. This is topic modeling at work. In formal terms, topic modeling refers to the techniques and algorithms used to reveal the main themes contained in a document or a collection of documents.

Coming from the world of computer science, machine learning, and information retrieval, the fifteen-year-old field has been applied in many different domains, ranging from world literature to bioinformatics, and to documents ranging in size from tweets to monographs [1]. Topic modeling made its definitive way into Digital Humanities (DH) following David Blei et al.'s 2003 paper on Latent Dirichlet Allocation (LDA), which is currently the most popular topic modeling tool used in DH. Ever since, it has been a relatively hot field in the DH world, even pushing some authors to refer to it as a synecdoche (a part referring to the whole) of DH [1].

In this project, we apply topic modeling to 200 years of articles taken from two Swiss newspapers, the "Journal de Genève" (JDG) and the "Gazette de Lausanne" (GDL). The main objective of the project is to evaluate whether or not useful information can be extracted by applying topic modeling to this newspaper dataset.

Methods

Before applying topic modeling algorithms to our dataset, several pre-processing steps are necessary. They are carried out because natural language is full of redundancies, and not every word that appears in a document carries semantic meaning. In addition, some articles are too short or contain too many OCR errors to be interpretable. The pre-processing follows the pipeline below, implemented with custom-developed scripts and existing libraries (a minimal sketch of such a pipeline is given after the list):

  1. OCR correction
  2. Removal of non-corrected OCR errors (spell check)
  3. Removal of short words (two letters or fewer)
  4. Stemming of the words
  5. Removal of stop words
  6. Removal of articles with very few words (10 words or fewer)
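
As a rough illustration, the following Python sketch implements steps 2 through 6 using NLTK. The `lexicon` set used for the spell-check step and the function name `preprocess` are illustrative assumptions, not our actual code:

```python
# A minimal sketch of pipeline steps 2-6, assuming `lexicon` is a set of
# valid French word forms used for the spell-check step (step 1, OCR
# correction, happens upstream and is not shown here).
import re
from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()
# Stem the stop-word list too, since stemming (step 4) runs before removal (step 5).
stop_stems = {stemmer.stem(w) for w in stopwords.words('french')}

def preprocess(text, lexicon, min_word_len=3, min_article_len=10):
    """Return the cleaned token list, or None if the article is too short."""
    tokens = re.findall(r"[a-zà-ÿ]+", text.lower())
    tokens = [t for t in tokens if t in lexicon]            # drop residual OCR errors
    tokens = [t for t in tokens if len(t) >= min_word_len]  # drop words of <= 2 letters
    tokens = [stemmer.stem(t) for t in tokens]              # stem
    tokens = [t for t in tokens if t not in stop_stems]     # remove stop words
    return tokens if len(tokens) > min_article_len else None
```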

In our context, a topic is defined as a probability distribution over words, and a document as a probability distribution over topics. Topic modeling has been a trending subject in digital humanities over the past decade, which has encouraged the development of many techniques and algorithms to extract topics from a textual corpus. The most popular technique is Latent Dirichlet Allocation (LDA) [2], introduced by David Blei and colleagues in 2003, as mentioned before. It is an unsupervised natural language processing model that starts by randomly associating words with topics, and then iteratively refines these associations until it reaches a stable solution representing the topic distribution across the corpus. By feeding the corpus of articles into the LDA algorithm, we obtain a list of topics as well as the contribution of each topic to each article. The only user-defined input required by the algorithm is the number of topics, which we chose to be 30 based on trial and error and on evaluations against a manually topic-labeled set of articles.
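
As an illustration, a minimal version of this step could be written with the gensim library as follows; the variable `token_lists` (the pre-processed articles) is an assumption carried over from the sketch above, not our exact code:

```python
# A minimal LDA sketch using gensim; `token_lists` is assumed to hold the
# pre-processed articles, one list of tokens per article.
from gensim import corpora, models

dictionary = corpora.Dictionary(token_lists)                  # word <-> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in token_lists]

# The number of topics is the only user-defined input; we settled on 30.
lda = models.LdaModel(bow_corpus, num_topics=30,
                      id2word=dictionary, passes=10, random_state=0)

# Each article is a distribution over topics: [(topic_id, proportion), ...]
doc_topics = lda.get_document_topics(bow_corpus[0])
```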

Results

The pre-processing steps removed around 165 million words (24.5%) and 207,000 articles (10%), leaving us with a dataset of 506 million words distributed over 1.8 million articles.

Depending on the number of topics specified, LDA produced different topics: more general when fewer topics were requested, and more specific when more were requested. In the figure below, we present 6 of the 30 topics that we extracted from our dataset. As mentioned earlier, each topic is defined as a distribution over words and can be associated with a broader theme: Industry, War, Entertainment, Finance, Health, and Sports.

Fig 1: Six of the thirty topics extracted from our dataset, which can be associated with Finance (yellow), Health (violet), Sports (light blue), Industry (green), War (blue), and Entertainment (red)
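
The theme labels above come from manually inspecting each topic's highest-probability words. Continuing the gensim sketch above, such an inspection could look like this (a sketch, not our exact code):

```python
# Print each topic's ten most probable words, as a basis for manually
# attaching a theme label such as "Finance" or "Sports" to it.
# Assumes the `lda` model fitted in the earlier sketch.
for k in range(lda.num_topics):
    top_words = ", ".join(word for word, prob in lda.show_topic(k, topn=10))
    print(f"Topic {k:2d}: {top_words}")
```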

In the two figures below, we look at the rate of appearance of these topics over time. Figure 2 shows three topics with an interesting course over time. First, the topic associated with "Industry" starts increasing around 1820, reaches a rough plateau around 1860, and declines from 1900 onward. Its curve is relatively noisy, which can be explained by the fact that this topic peaks in the years before 1900, for which our corpus contained significantly more OCR artifacts. Nevertheless, a higher rate of appearance can still be observed during a period coinciding with the Swiss industrial revolution. The topic is led by the words 'eau' and 'electr', which probably refer to hydroelectric production; this might explain why the topic never fully declined, hydroelectricity still being very important in Switzerland to this day.

The second topic, strongly related to war, shows a peculiar time course with three easily distinguishable peaks of varying magnitude. The two most prominent peaks occur respectively before 1920 and right after 1940, and can readily be associated with the two World Wars. The third peak occurs approximately between 1960 and 1975, and we hypothesize that it is related to the Vietnam War. The topic depicted in red in Figure 2 relates to films, TV, and entertainment and, as might be expected, has grown increasingly steeply over the past 100 years, in particular after 1990. It even overtakes "War" in importance shortly before the year 2000, displaying a higher rate of appearance in the two newspapers around that date.

In Figure 3, we can see that the topics associated with "Health" and "Finance" increase monotonically in rate of appearance, each with a different starting point. The topic related to "Sports" emerges strikingly towards the end of the 19th century, and an increase in popularity can be seen towards the end of the 20th century, with more or less fluctuating behaviour in between.

Fig 2: The rate of appearance of the three topics associated with War (blue), Industry (green), and Entertainment (red)
Fig 3: The rate of appearance of the three topics related to Sports (light blue), Health (violet), and Finance (yellow)
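
For readers curious how such curves can be produced, below is a rough sketch of the aggregation. It assumes a hypothetical `records` list pairing each article's publication year with its topic distribution from the earlier sketch; the topic indices in the final line are placeholders:

```python
# Sketch of the yearly rate-of-appearance computation behind Figs 2-3.
# `records` is assumed to be a list of (year, doc_topics) pairs, where
# doc_topics comes from lda.get_document_topics() as sketched earlier.
# Requires pandas (and matplotlib for .plot()).
import pandas as pd

rows = []
for year, doc_topics in records:
    weights = dict(doc_topics)  # {topic_id: proportion in this article}
    rows.append({"year": year,
                 **{f"topic_{k}": weights.get(k, 0.0) for k in range(30)}})

# The mean topic proportion per year approximates the topic's rate of appearance.
yearly_rates = pd.DataFrame(rows).groupby("year").mean()
yearly_rates[["topic_3", "topic_7"]].plot()  # hypothetical topic indices
```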

Conclusion

Based on the observations above, we have strong evidence that topic modeling can be applied to extract meaningful topics from our newspaper dataset, and that the rate of appearance of these topics over time corresponds to interesting historical events. The advantage of this approach over others is that the topics are unknown beforehand and are extracted automatically from the corpus of texts. This means the technique can uncover unseen or non-obvious patterns in word usage, allowing the detection of potentially new and unexpected events or thematic trends in a more robust way than, for example, N-grams, which can be affected by noise such as misspellings, OCR errors, or word ambiguity. LDA uncovers more abstract topics through word association, yielding more stable topics and concepts than a simple N-gram tool can. We are therefore confident that a more optimized use of LDA and a more thorough analysis of the topics and their dynamics will allow digital humanists around the world to reveal hidden themes and patterns, examine trends, detect meaningful events, and gain better insight into the culture and society of the French-speaking part of Switzerland through the years.


Team:

Maurice Abou Jaoude
Samer Bekhazi
Nicolas Fumeaux
Daniel Gray

References

[1] Meeks, Elijah, and Scott B. Weingart. "The Digital Humanities Contribution to Topic Modeling." Journal of Digital Humanities 2.1 (2012): 1–6.

[2] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993–1022.