In the previous phase of the project, we manually built an evaluation set of 40 valid articles, each labeled with a topic such as Economics, Sports, or Politics. The purpose of this set is to evaluate different parameter values and find a reasonably good parameter set to use when applying topic modeling to the whole collection of articles.
In the previous blog post, we enumerated the different parameters to be tested, which are mostly related to the pre-processing steps applied to the corpus of articles. First, we looked at what topics LDA returned without any pre-processing, which resulted in topics such as “0.047*de + 0.031*la + 0.026*le + 0.023*les + 0.023*des + 0.021*et + 0.020*à + 0.015*a + 0.013*; + 0.013*du”. This shows that articles, prepositions, and punctuation, which do not carry any semantic meaning, are effectively noise and lead to bad results (garbage in, garbage out). It is therefore crucial to remove these words before attempting to extract topics from the articles. After removing them, we also removed all the words that occur only once in the articles and calculated the cosine similarity of one article in the evaluation set with respect to the other articles, using the tf-idf document representation. It turned out that the most similar articles also had the same assigned topic: Sports. This gave us the idea of using clustering techniques to group articles together. To do that, however, we need to find an appropriate representation (in a feature space of lower dimensionality) for each article.
We applied our pre-processing to the documents in three stages, so that we could test our classifier on the output of each stage separately. Ideally, the classifier would perform best on the output of the last processing step. However, this depends on the severity of the errors in each processing stage and on its ability to discard useless information while retaining what is useful. The first stage of the pre-processing strips the articles of all punctuation and extra white space, ignores short articles (we set the limit to 100 words), and removes stop words, numbers, and short words (we set the limit to 3 letters). The output of this stage is fed to the second pre-processing stage, which discards all words that are neither French, nor English, nor German. This removes most words corrupted by OCR errors. The last pre-processing stage transforms each word into its stem, allowing our classifier to group words that share a stem.
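A minimal sketch of the three stages, with the dictionary check and the stemmer injected as callables (in our pipeline these come from PyEnchant and a stemming library; the helper names below are ours, not a library API):

```python
import re

STOP_WORDS = {"de", "la", "le", "les", "des", "et", "du", "a"}  # tiny subset

def stage1_tokenize(text, min_article_words=100, min_word_len=3):
    """Strip punctuation/whitespace, drop stop words, numbers and short
    words; return None for articles under the word-count limit."""
    words = re.findall(r"[^\W\d_]+", text.lower())   # letter runs only
    if len(words) < min_article_words:
        return None
    return [w for w in words
            if w not in STOP_WORDS and len(w) >= min_word_len]

def stage2_dictionary_filter(words, known):
    """Keep only words recognised by a French/English/German dictionary
    check (drops most OCR-garbled tokens)."""
    return [w for w in words if known(w)]

def stage3_stem(words, stem):
    """Reduce every word to its stem so words sharing a stem are grouped."""
    return [stem(w) for w in words]
```

Each stage takes the previous stage's output, so the classifier can be tested on the result of any prefix of the pipeline.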
In order to build our training set, we selected a whole month every 5 years (June of every year that is a multiple of 5). This corresponds to around 26,000 articles for the Gazette de Lausanne and around 37,000 articles for the Journal de Genève. We applied our three pre-processing stages to the training set of each journal, and saved the results in a Gensim Dictionary for word-to-id mappings and in a Matrix Market corpus. We then applied LDA to these two corpora, using 20 topics. One of the resulting topics is “0.018*sud + 0.017*nuageux + 0.013*nord + 0.013*beau + 0.012*juin + 0.012*ouest + 0.010*plui + 0.009*temp + 0.008*pilot + 0.008*metr”, which clearly corresponds to weather forecasts.
It is worth noting that our topic modeling pipeline has some hyperparameters, in the LDA itself as well as in the pre-processing stages. These hyperparameters are:
- the word count limit to ignore articles,
- the letter count limit to ignore words, and
- the number of topics to be constructed by LDA.
These hyperparameters should be fine-tuned to give the best results and the topics that best represent our journals. Another possible way to optimize our results is to generate topics from different stages of the pre-processing, or to reorder the pre-processing stages. Applying LDA to the output of the second stage (i.e. without stemming) yields the topic “0.011*nuageux + 0.010*beau + 0.010*juin + 0.009*nord + 0.008*soir + 0.008*après + 0.008*mètres + 0.008*temps + 0.007*pluie + 0.007*sud”, which again corresponds to weather forecasts, is more readable than its stemmed counterpart, and performs comparably. As for reordering the pre-processing stages, moving stop word removal to the end of the pipeline, after stemming, could be more robust. In fact, some of our topics (after stemming) contain the word “tout” even though it is in the stop words list, because “toute”, “toutes”, and “tous” are stemmed after stop word removal. On the other hand, stemming before removing stop words could prevent some other stop words from being identified.
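The ordering pitfall can be illustrated with a toy stemmer (a stand-in for the real French stemmer, which maps “toute”/“toutes”/“tous” to “tout”):

```python
STOP_WORDS = {"tout", "le", "la"}   # contains "tout" but not its inflections

def toy_stem(word):
    # stand-in for a real French stemmer
    return {"toute": "tout", "toutes": "tout", "tous": "tout"}.get(word, word)

words = ["toutes", "pluie"]

# current order: stop words removed first, then stemming.
# "toutes" survives the filter and stems to "tout", which leaks into topics.
current = [toy_stem(w) for w in words if w not in STOP_WORDS]

# reordered: stemming first, then stop word removal catches "tout".
reordered = [s for s in (toy_stem(w) for w in words) if s not in STOP_WORDS]
```

Here `current` still contains `"tout"` while `reordered` does not, which is exactly the effect we see in our stemmed topics.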
In parallel, we were still struggling to run our algorithms on the whole dataset because of its huge size: a processing run through the articles would take around 24 hours (and possibly more). We therefore had to look for ways to make our scripts faster and more memory efficient. We implemented threading for the parsing of the articles, but experienced several problems (strange erratic errors and occasional segmentation faults). After checking the different possible causes, we found they were due to the PyEnchant library. Our first attempted fix was to create a separate instance for each thread, but the error persisted. This is probably because the back-end is still shared (we observed that every call to PyEnchant returns the same instance), so we have to ensure that only one thread uses it at a time. Additionally, we dedicated one thread to computing the progress and estimated time remaining, so that we are not blind as to how long an execution will take. In any case, we noticed (especially during the initial phases of article processing) that multithreading increased the processing speed quite significantly.
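A sketch of the locking fix: all calls into the shared spell-checker are serialized behind one lock while the rest of the parsing stays multithreaded. Here `check_word` is a stand-in for PyEnchant's `Dict.check()`:

```python
import threading

enchant_lock = threading.Lock()

def check_word(word):
    # stand-in for enchant.Dict("fr_FR").check(word)
    return word.isalpha()

def safe_check(word):
    with enchant_lock:          # only one thread inside the checker at a time
        return check_word(word)

def filter_known(words, out, idx):
    # worker: keep only dictionary words from one batch of articles
    out[idx] = [w for w in words if safe_check(w)]

batches = [["pluie", "xq3z"], ["nord", "t0tal"]]
results = [None] * len(batches)
threads = [threading.Thread(target=filter_known, args=(b, results, i))
           for i, b in enumerate(batches)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock only covers the back-end call, so tokenizing, filtering, and I/O in each worker still run concurrently.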
In conclusion, the algorithm is now ready to be applied to the whole corpus, with a valid set of parameters inferred from our experiments on the evaluation set. Of course, these parameters may still be fine-tuned as we include more of the corpus in our pipeline. The topic modeling is currently being executed on the corpus on our server. The next logical step is to analyze in depth the results yielded by the topic modeling algorithm. This comes with the challenge of finding ways to represent the data in an intuitive and understandable manner, which is necessary in order to analyze the dynamic emergence and disappearance of topics, as well as the similarities and differences between topics and their presence within the different articles. By the end of this next phase, we hope our topic modeling analysis of the articles will be close to complete, and ready to be included in a poster and a final report.
Maurice Abou Jaoude