Topic Modeling on 200 Years of Swiss Newspapers – Progress Report 1

At the intersection of computer science, machine learning, and information retrieval, Topic Modeling has been a hot topic in Digital Humanities (DH), leading some authors to refer to it as a synecdoche (a part referring to the whole) of DH [1]. It refers to techniques and algorithms that reveal the main themes contained in a document or a collection of documents. As part of our Master's Social and Human Sciences project, we apply Topic Modeling to 200 years of articles from two Swiss newspapers, the "Journal de Genève" (JDG) and the "Gazette de Lausanne" (GDL). The outcome can be used to reveal hidden themes and patterns, examine trends, detect meaningful events, and gain better insight into the culture and society of the French-speaking part of Switzerland over the years.

As stated in our project plan, the first two weeks were devoted to exploring and familiarizing ourselves with the dataset and carrying out the first pre-processing steps. The dataset consists of around 5.5 million articles (approximately 2.5 million for GDL and 3 million for JDG) dating from 1798 to 1998. Each article is stored in XML format, which contains metadata along with the article text itself. The first pre-processing step was the elimination of articles that were too short to be relevant to our study or that contained no text at all. To do that, we built a Python script that goes through all the articles in our database, retrieves the word count of each article, and eliminates those with too few words. Articles containing too many OCR errors to be exploitable had to be eliminated as well. For this, we built another Python script that goes through each article, identifies OCR errors, and eliminates the articles with a large percentage of errors. OCR errors are identified using the Python library PyEnchant, which provides a French dictionary: if a word has no hit in the dictionary, it is considered an error. It is worth noting that we are currently exploring the possibility of correcting part of the OCR errors, but this process will require more time due to the substantial amount of evaluation and control it demands.
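
To make the filtering criteria concrete, the sketch below shows the kind of check our scripts perform once the article text has been extracted from the XML. The thresholds MIN_WORDS and MAX_ERROR_RATE are placeholder values for illustration, not the exact ones we use, and the "fr_FR" dictionary tag assumes a French dictionary is installed alongside PyEnchant.

    import enchant

    FRENCH = enchant.Dict("fr_FR")   # PyEnchant French dictionary (must be installed)
    MIN_WORDS = 50                   # illustrative threshold for "too short"
    MAX_ERROR_RATE = 0.5             # illustrative threshold for "too many OCR errors"

    def keep_article(text):
        """Return True if the article is long enough and clean enough to keep."""
        words = [w for w in text.split() if w.isalpha()]
        if len(words) < MIN_WORDS:
            return False                                   # empty or too short
        errors = sum(1 for w in words if not FRENCH.check(w))
        return float(errors) / len(words) <= MAX_ERROR_RATE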

When we tried to run the scripts on the whole dataset, we found that doing so takes too much time and too many resources for our laptops. A group member has ordered a server with 32 GB of ECC RAM and two AMD Opteron processors, which is due to arrive next week. We will use that server to store our data and execute the tasks that require a considerable amount of time. Nevertheless, to get a first idea of the quality of the articles, we ran the scripts on a subset of them and found that, out of 721,935 articles, 89,637 (12.4%) were empty. Even though the first milestone of our project, constructing the final set of articles ready for analysis, has been delayed, we are working in parallel on other tasks that are largely independent of it, so the overall project timeline is not affected.

In fact, we are currently working on constructing the "evaluation set" for our project, using articles that have been both manually and automatically verified as valid. The evaluation set consists of a relatively small number of articles, each manually associated with a topic, and will be used to assess the quality of our model parameters as we modify them and explore different configurations. We plan on having 40 articles in the evaluation set. At the same time, we are experimenting and getting familiar with Gensim and MALLET, two software tools widely used in Topic Modeling, which we will use extensively in the next phase of the project.
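
As a first step with Gensim, the snippet below sketches how a topic model (LDA) can be trained and inspected on a handful of tokenized documents. The toy documents, the number of topics, and the number of passes are placeholders for illustration and do not reflect our actual corpus or configuration.

    from gensim import corpora, models

    # Stand-in for our cleaned, tokenized articles.
    tokenized_docs = [
        ["conseil", "fédéral", "vote", "loi"],
        ["match", "football", "équipe", "victoire"],
    ]

    # Map words to ids and build a bag-of-words corpus.
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

    # Train a small LDA model and print the most probable words per topic.
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
    for topic_id, topic_words in lda.print_topics(num_words=4):
        print(topic_id, topic_words)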

Team:
Maurice Abou Jaoude
Samer Bekhazi
Nicolas Fumeaux
Daniel Gray

References:
[1] Meeks, Elijah, and Scott Weingart. "The Digital Humanities Contribution to Topic Modeling." Journal of Digital Humanities 2.1 (2012): 1-6.