Progress Report 3: Distant Reading 100 years of Archivio Veneto

Project Recap:

Our project involves conducting topic modelling on articles from the Archivio Veneto journal, with the aim of extracting interesting and meaningful information about the history of Venice. We also aim to develop a visualization interface in order to make it easier to understand our results.

Planned Progress by Week 9:

According to our project timeline, by the end of week 9 we had planned to finish pre-processing the data source, testing different topic modeling (TM) models and evaluating the results, and to be partly done with post-processing our results. We had also planned to finish the Python visualization platform and to partly complete the JS visualization platform.

Actual Progress by Week 9:

During the last three weeks, our focus was on improving our topic modeling results on the entire dataset. We improved the data pre-processing, tested different topic modeling algorithms while varying their hyperparameters, and manually evaluated the topics obtained. We also expanded our output to include topic frequencies per article and a list of the articles corresponding to a particular topic. One major deviation from the plan was that we did little work on an interactive visualization platform; instead, we simply created visualizations of our results for our own testing.

Improved Data Preprocessing

In order to improve our results, we manually identified some meaningless tokens such as “cfr”, “vol”, “ss”, “ut” and many others. We wrote a script that parsed our topics and identified the words that occurred in more than 50% of topics (with the number of LDA topics set to 20). We then manually added these words to our stop-words list after checking their meaning using Google Translate and consulting a friend who speaks Italian.
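The check for words shared by too many topics can be sketched as follows. This is a minimal illustration and not our actual script: the function name `words_common_to_topics` and the toy topics are hypothetical, and each topic is assumed to be available as a plain list of its top words.

```python
from collections import Counter

def words_common_to_topics(topics, threshold=0.5):
    """Return words that appear in more than `threshold` of the topics.

    `topics` is a list of topics, each given as a list of its top words.
    """
    # Count each word once per topic, even if a topic lists it twice.
    topic_counts = Counter(word for topic in topics for word in set(topic))
    cutoff = threshold * len(topics)
    return sorted(word for word, n in topic_counts.items() if n > cutoff)

# Toy example with 4 made-up topics (the real run used 20).
topics = [
    ["cfr", "venezia", "doge", "vol"],
    ["cfr", "papa", "roma", "vol"],
    ["cfr", "francia", "guerra", "vol"],
    ["cfr", "commercio", "mercanti", "doge"],
]
candidates = words_common_to_topics(topics)
print(candidates)
```

The returned words are only candidates: each one still has to be checked by hand before it goes into the stop-words list.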

 

Results before manually finding and removing common words (4 out of 20 topics shown as an example)

After a couple of iterations of this process we were able to generate topics that were more varied. While this improved our results, we realized that some commonly recurring words may still be important, for example “papa” (Pope) and “francia” (France). So we decided to remove these nouns from our stop words and keep only words that do not give meaning to topics, like “qui” (here), “nè” (neither) and “mentre” (while). The NLTK library already provided 279 Italian stopwords, and using this technique we added 174 more.

After manually finding and removing common words, except words considered important. (3 out of 20 topics shown as an example)

Of course, there are still commonly occurring tokens across topics, but overall the topics are much more varied. We also increased the number of words per topic.

Other improvements:
  • The articles are not written only in Italian; they contain other languages as well. We encountered blocks of text in Spanish and Latin, so we decided to exclude stopwords from these languages as well.
  • We made an effort to further clean the data. We manually removed some non-ASCII characters that the tokenizers could not handle properly. We also improved our tokenization process by using efficient sentence tokenizers. The result was satisfying: the number of unique tokens dropped from around 390 thousand to 300 thousand. This means that previously the same words were being treated as different words because they had not been tokenized properly.
  • Tokens that appeared in more than 40% of documents and tokens that appeared in just 1 document were removed. We were sceptical about the former step, but the topics generated were much more varied, so we proceeded with it.
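The document-frequency filter from the last point can be sketched in plain Python. This is an illustration under assumptions, not our pipeline code: the function and the toy documents are made up, and the parameter names `no_below` and `no_above` are borrowed from gensim's `Dictionary.filter_extremes`, which applies the same kind of rule.

```python
from collections import Counter

def filter_extremes(docs, no_below=2, no_above=0.4):
    """Drop tokens found in fewer than `no_below` documents or in more
    than the fraction `no_above` of all documents."""
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    keep = {t for t, df in doc_freq.items() if no_below <= df <= max_docs}
    return [[t for t in doc if t in keep] for doc in docs]

# Toy corpus: "venezia" is in every document, several tokens occur once.
docs = [
    ["venezia", "doge", "senato"],
    ["venezia", "papa", "roma"],
    ["venezia", "doge", "guerra"],
    ["venezia", "papa", "francia"],
    ["venezia", "mercanti", "francia"],
]
filtered = filter_extremes(docs)
print(filtered)
```

With `no_below=2` and `no_above=0.4`, this matches the thresholds described above: tokens seen in only one document and tokens seen in more than 40% of documents are dropped.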

Algorithm Development

  • This time, besides LDA, we also tested two other models. The first one was LSI. Like LDA, LSI extracts topics from the corpus, but it does so with a truncated singular value decomposition (SVD), a dimensionality reduction technique closely related to PCA. Before applying LSI we normalized the data using tf-idf weighting. Even so, the results produced were consistently worse than the LDA results. The second one was HDP, a hierarchical, nonparametric extension of LDA. Essentially, it treats the number of topics as something to be inferred from the data, so the number of topics does not have to be specified a priori. The results extracted with this technique are comparable to those of LDA.
  • In order to systematically test different parameters of the model, we implemented an application that automatically runs experiments over ranges of hyperparameters. Naturally, the first hyperparameter we optimized was the number of topics. Beyond that, we also tuned the way we filter extreme values in token occurrences. Currently, we filter words that appear in a high percentage of the total number of documents. It is common in NLP tasks to exclude these words, because they offer little value in distinguishing between documents. We additionally filter words that appear in only a very small number of documents, since LDA cannot extract reliable correlations, and thus topics, from them. By manually judging the produced topics, we decided on the appropriate number of topics for every model.
  • Initially we were using just one pass over the entire training corpus to train the LDA model. This time, we increased the number of passes to 15 and the chunk size to 2000. Although training took much longer, our LDA algorithm now converges properly on the training corpus and we are getting much more meaningful topics with higher likelihood.
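The experiment runner described above can be sketched as a small grid loop. This is a simplified illustration, not our application: `run_grid`, `fake_train` and the grid values are hypothetical, and the training call is replaced by a stand-in so the sketch runs without any topic modeling library.

```python
from itertools import product

def run_grid(train_fn, grid):
    """Call `train_fn(**params)` for every combination in `grid`.

    `grid` maps a parameter name to the list of values to try.
    Returns a list of (params, result) pairs in iteration order.
    """
    names = list(grid)
    runs = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        runs.append((params, train_fn(**params)))
    return runs

# Stand-in for a real training call; it only returns a label.
def fake_train(num_topics, no_above):
    return f"model(k={num_topics}, no_above={no_above})"

grid = {"num_topics": [10, 20, 30], "no_above": [0.4, 0.5]}
runs = run_grid(fake_train, grid)
for params, result in runs:
    print(params, result)
```

In a real run, `train_fn` would build the filtered corpus, train the model (for LDA, with settings such as 15 passes and a chunk size of 2000) and return the topics for manual inspection.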

Visualization

As we maintained our focus on improving our algorithm, we did not work on developing an interactive visualization interface as planned. Instead, we worked on creating visualizations of our algorithm development results. Some examples are shown below:

Topic trends over time:
Topic Distribution across decades

In this chart you can see the topic-article distribution averaged over a decade (or less depending on data available – for example 2010s include articles from 2010-2013). To produce this visualization, we took the average probability of a topic occurring in an article across every article from a given decade. We then stacked these up in our graph to see which topic is more likely in a given decade, and also overall. Clearly, topic 0 overwhelms the visualization making it harder to identify when other topics were prevalent.
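The per-decade averaging can be sketched as follows. This is a minimal illustration with made-up numbers: the function name and the input layout, a list of `(year, probs)` pairs with one probability per topic, are assumptions rather than our actual data format.

```python
from collections import defaultdict

def average_topics_by_decade(articles, num_topics):
    """Average per-article topic probabilities within each decade.

    `articles` is a list of (year, probs) pairs, where `probs` holds
    `num_topics` probabilities for that article.
    """
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for year, probs in articles:
        decade = (year // 10) * 10  # e.g. 1905 -> 1900
        counts[decade] += 1
        for k, p in enumerate(probs):
            sums[decade][k] += p
    return {d: [s / counts[d] for s in sums[d]] for d in sorted(sums)}

# Three toy articles with two topics each.
articles = [
    (1901, [0.5, 0.5]),
    (1905, [1.0, 0.0]),
    (2012, [0.25, 0.75]),
]
by_decade = average_topics_by_decade(articles, num_topics=2)
print(by_decade)
```

Each decade's list of averages is what gets stacked in the chart, one segment per topic.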

Topics stacked to see probability that an article from a given decade has that topic

This visualization gave us far more meaningful results, as it illustrated the broad range of topics encountered in articles before the 1980s. From then on, however, Topic 0 was very likely to be the main topic of every article. We need to confirm whether this is indeed the case or whether our algorithm requires further work.

Project Timeline

Green = Complete, Blue = To do, Brown = Did not follow plan

Conclusion

We have improved the results in terms of the quality of the words making up the topics after additional pre-processing. We still have to validate the quality of the topics and their correspondence with the articles. Following this, we plan to optimise the hyperparameters for LDA and HDP to improve our results further. For the visualisation, we will work on the Topic Tracker and Article Composition interfaces.