Topic Modeling on 200 Years of Swiss Newspapers – Progress Report 2

Our SHS project consists of applying Topic Modeling techniques to 200 years of Swiss newspapers, the Gazette de Lausanne and the Journal de Genève. In the first phase of the project, we familiarized ourselves with our dataset, which consists of around 5.5 million articles dating from 1798 to 1998. The articles are provided in XML format, which contains metadata along with the article text itself. Many of these articles are empty, contain too few words to be useful, or contain too many OCR errors to be interpreted. A second objective achieved in the first phase was therefore the development of scripts that evaluate the suitability of a given article for the purposes of the project. A third objective was the elimination of all non-useful articles using these scripts. This goal, however, has been delayed because of the large amount of time and resources needed to run these scripts on the entire dataset. The delay did not affect the overall project timeline.
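
For illustration, a hypothetical sketch of such a suitability check is shown below; the thresholds, the OCR-noise heuristic, and the XML handling are simplifying assumptions and do not reproduce the exact logic of our scripts.

    # Hypothetical sketch of an article-suitability check (Python).
    # Thresholds and the OCR heuristic are illustrative assumptions.
    import re
    import xml.etree.ElementTree as ET

    MIN_WORDS = 20           # assumed minimum number of words for a usable article
    MAX_GARBLED_RATIO = 0.3  # assumed tolerance for OCR noise

    def is_usable(xml_path):
        """Return True if the article text looks long enough and clean enough."""
        root = ET.parse(xml_path).getroot()
        text = " ".join(root.itertext())
        words = text.split()
        if len(words) < MIN_WORDS:
            return False
        # Tokens mixing letters with digits or stray symbols often point to OCR errors.
        garbled = [w for w in words if re.search(r"[^A-Za-zÀ-ÿ'\-.,;:!?()]", w)]
        return len(garbled) / len(words) <= MAX_GARBLED_RATIO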

The second phase of our project consisted of working with a small set of articles in order to gain a fundamental understanding of the effect of the different parameters used to construct our model. Indeed, many transformations can be applied to an article, and they can either enhance or degrade the performance of the topic modeler. The first step was the construction of this set, which we call the evaluation set, in which we manually associate a topic with each article. We chose 2 articles from every decade (1 from the Journal de Genève and 1 from the Gazette de Lausanne), for a total of 40 articles dating between 1800 and 1998. The scripts developed previously were used to determine the suitability of the chosen articles. A subset of the chosen articles, with their dates and associated topics, is shown below:

Date        Article Number  Newspaper  Topic
1799-07-05  8               GDL        Politics, War
1810-05-11  2               GDL        Politics, International
1826-06-15  2               JDG        Politics, National
1832-08-15  8               JDG        Politics, Law
1840-12-22  3               GDL        Short Story
1850-03-02  4               GDL        Politics, International
1860-09-03  8               GDL        Miscellaneous
1870-12-05  2               JDG        Politics, International
1886-06-03  7               JDG        Economics, National Politics
1894-10-16  1               JDG        Politics, International
1905-12-27  11              GDL        Politics, Elections
1911-07-05  11              JDG        Politics, International
1922-01-01  35              GDL        Economics
1930-04-04  17              GDL        Education
1942-06-06  28              JDG        Sports
1950-01-17  23              GDL        Health
1961-01-17  81              JDG        Sports
1971-01-20  84              GDL        Economics
1981-01-17  78              GDL        Religious Events
1991-01-17  132             JDG        Economics, Financial News

The second step consisted of getting familiar with Topic Modeling software, in particular Gensim (a Python library), in order to better understand how it works and what we can do with it. In most of this software, we work in a Vector Space Model (VSM), where each document is represented as an array of features. A very common representation of a text is the “bag of words”: the text is represented as a vector in which each dimension is a feature, i.e. a number. The feature corresponds to a certain word, and the number corresponds to the number of times this word appears in the text. The more unique words there are in the dataset, the higher the dimensionality of the feature space.
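
A minimal sketch of this bag-of-words representation in Gensim is shown below; the two sample sentences are placeholders rather than articles from our corpus.

    # Build a bag-of-words corpus with Gensim; the sample texts are placeholders.
    from gensim import corpora

    documents = [
        "le conseil a vote la nouvelle loi",
        "le match de football a eu lieu hier",
    ]
    tokenized = [doc.split() for doc in documents]

    dictionary = corpora.Dictionary(tokenized)                   # maps each unique word to an integer id
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]  # one sparse vector per document
    print(bow_corpus[0])                                         # list of (word_id, count) pairs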

We went over the documentation of Gensim and completed the various tutorials that are available on their website, practicing on sample texts before actually using it on our dataset. We have found that the Python library is relatively easy to use and convenient for prototyping purposes. We have also found that with Gensim, we can apply many transformations to our feature matrix. For example, instead of representing an article as described in the previous paragraph (bag of words), we could transform it into a “Term Frequency * Inverse Document Frequency” (TF-IDF) representation, or into a feature space with reduced dimensionality.
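
The following sketch shows the TF-IDF transformation, reusing the dictionary and bag-of-words corpus from the previous example.

    # Reweight the bag-of-words vectors with TF-IDF.
    from gensim import models

    tfidf = models.TfidfModel(bow_corpus)   # learns the IDF weights from the corpus
    tfidf_corpus = tfidf[bow_corpus]        # applies the reweighting to every document
    print(list(tfidf_corpus)[0])            # list of (word_id, tf-idf weight) pairs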

We have identified two transformations that are potentially useful for the purposes of our project: Latent Semantic Indexing (LSI, also known as LSA) and Latent Dirichlet Allocation (LDA). Both transform bag-of-words counts into a topic space of lower dimensionality, with LDA being a probabilistic extension of LSA, so LDA’s topics can be interpreted as probability distributions over words. Texts are then interpreted as mixtures of these topics.
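
The sketch below applies both transformations, reusing the objects from the previous examples; the number of topics is an arbitrary value chosen for illustration.

    # Project the corpus into a lower-dimensional topic space with LSI and LDA.
    from gensim import models

    lsi = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=10)
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=10)

    print(lsi.print_topics(3))   # top words of the first latent dimensions
    print(lda.print_topics(3))   # topics as probability distributions over words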

Before working on the articles, there are many pre-processing steps that can be applied to the documents, and these can be beneficial or detrimental to the performance of the topic modeler. This step is carried out because natural language is full of redundancies, and not every word that appears in a document carries semantic meaning. Removing such words can therefore help reveal the true topics of the articles and avoid what we call “garbage in, garbage out”. We are currently studying the effects of different pre-processing parameters on the quality of our topic modeling; a sketch of such a pipeline is shown after the list. These parameters include, but are not limited to:

  1. Removing articles, prepositions, conjunctions, and pronouns.
  2. Removing common verbs (e.g. avoir, être, faire)
  3. Removing common adjectives (e.g. grand, petit)
  4. Removing words that appear in every document
  5. Removing words that occur only once
  6. Stemming words
  7. Removing OCR errors
  8. Correcting OCR errors
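
The following sketch illustrates steps 1, 2, 3, 5 and 6; the stopword list, the stemmer (NLTK’s French Snowball stemmer), and the frequency threshold are illustrative assumptions rather than the parameters we will finally settle on.

    # Illustrative pre-processing pipeline: stopword removal, removal of
    # words occurring only once in the corpus, and stemming.
    from collections import Counter
    from nltk.stem.snowball import FrenchStemmer

    STOPWORDS = {"le", "la", "les", "de", "et", "un", "une",   # articles, prepositions, ...
                 "avoir", "être", "faire",                     # common verbs
                 "grand", "petit"}                             # common adjectives

    stemmer = FrenchStemmer()

    def preprocess(texts):
        """texts: list of raw article strings -> list of token lists."""
        tokenized = [[w.lower() for w in text.split() if w.lower() not in STOPWORDS]
                     for text in texts]
        # Remove words that occur only once in the whole corpus (step 5).
        counts = Counter(w for doc in tokenized for w in doc)
        tokenized = [[w for w in doc if counts[w] > 1] for doc in tokenized]
        # Stem the remaining words (step 6).
        return [[stemmer.stem(w) for w in doc] for doc in tokenized]

Frequency-based filtering similar to steps 4 and 5 can also be performed with Gensim’s Dictionary.filter_extremes, whose no_above and no_below parameters filter words by document frequency.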

We are studying these effects by applying the pre-processing steps, as well as some model parameters, to our evaluation set, computing the feature matrix using Gensim, and looking at the topic distributions returned by applying LSI or LDA. We compare them to the topics that were manually assigned to the articles to see whether they make sense. We are also using similarity measures to assess the effect of the parameters, based on the following hypothesis: texts with the same assigned topic should have a higher degree of similarity than texts with different assigned topics.
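
A sketch of this similarity check, reusing the lsi model and tfidf_corpus from the earlier examples, could look as follows.

    # Index the articles in LSI topic space and query pairwise cosine similarities.
    from gensim import similarities

    lsi_corpus = [lsi[doc] for doc in tfidf_corpus]   # each article as a topic-space vector
    index = similarities.MatrixSimilarity(lsi_corpus, num_features=lsi.num_topics)

    sims = index[lsi_corpus[0]]                       # similarities of article 0 to every article
    print(sorted(enumerate(sims), key=lambda pair: -pair[1]))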

In parallel with the steps described above, we are working on mitigating the problem that arose in the first phase of the project. For this purpose, we ordered and set up a server with 32 GB of RAM at the home of one of the team members. We installed CentOS 7 on the server and created a virtual machine for external access through TeamViewer. The virtual machine runs Linux Mint 17.3 and is virtualized through KVM. The actual files are stored on the physical server and shared with the virtual machine via NFS. This setup allows us to access the data quickly and to test many different options with much faster feedback.

At the end of this phase, we will have a set of parameters and pre-processing steps that are quasi-optimal for extracting meaningful topics from the articles. In the third phase of the project, we will move from working on the evaluation set to applying the resulting model parameters to the whole corpus.

Team
Maurice Abou Jaoude
Samer Bekhazi
Nicolas Fumeaux
Daniel Gray