Progress Report 2: Distant Reading 100 years of Archivio Veneto

Project Recap:

Our project involves conducting topic modelling on articles from the Archivio Veneto journal, with the aim of extracting interesting and meaningful information about the history of Venice. We also aim to develop a visualization interface in order to make it easier to understand our results.

Planned Progress by Week 6:

According to our project timeline, by the end of week 6 we had planned to finish pre-processing the data source and to be halfway through testing different TM models. We had also planned to finish the Python visualization platform and to have started developing the JS visualization platform.

Actual Progress by Week 6:

This month our focus was on algorithm development. Because we already had a visualization platform suitable for testing our results, we prioritized the algorithm, data processing, and the testing of TM models.

Data Preprocessing
  • We saw last time that many punctuation marks still appeared in the topics even though our pipeline included punctuation removal. We later found the reason: the NLTK library handles only ASCII punctuation, whereas our text contains non-ASCII (Unicode) punctuation marks. Adding a separate filter for these solved the problem.
  • The dataset contains many tokens that are only one character long, probably because of bad OCR output. We now filter these out as well.
  • We used RDRPOSTagger for part-of-speech tagging and removed verbs, determiners and articles. This step did not help much, because the two steps above and the frequency-threshold filtering described under Algorithm Development do a very similar job, only without using context.
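The two token filters described above can be sketched as follows. This is a minimal illustration and not the project's actual code: the function names and the `min_length` parameter are assumptions, and punctuation is detected through the Unicode general category (any category starting with "P") rather than through a fixed ASCII list.

```python
import unicodedata


def is_punctuation(token):
    """Return True if every character of the token is Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def clean_tokens(tokens, min_length=2):
    """Drop punctuation-only tokens and tokens shorter than min_length.

    min_length=2 removes the one-character tokens that are
    probably OCR noise (hypothetical default).
    """
    return [
        token
        for token in tokens
        if not is_punctuation(token) and len(token) >= min_length
    ]


sample = ["«", "venezia", "»", "e", "…", "doge", "–", "x"]
print(clean_tokens(sample))  # ['venezia', 'doge']
```

Checking the Unicode category catches marks such as « », … and – that an ASCII-only punctuation list misses.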
Algorithm Development
  • We continued with LDA as our modelling algorithm and tried to improve its results.
  • We experimented with filtering out tokens whose frequency in the corpus exceeds a threshold. We varied this threshold between 30% and 80% of the corpus and saw that the topics generated with lower thresholds were better: they were made up of more unique tokens. We still need to validate the quality of the generated topics.
  • We also experimented with varying two essential parameters of LDA: the number of topics and the number of words per topic. Initially we did not have the formatted dataset, so we did not know exactly how many articles the corpus contained. We have since tested with the new dataset, which includes this metadata, forming corpora with varying numbers of articles. We need to experiment more with these parameters.
  • We also looked into Hierarchical LDA, where the number of topics is learnt by the algorithm itself. We need to experiment with it more, however, as the number of topics generated was the same as the number of articles given as input.
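The frequency-threshold experiment can be illustrated with a small sketch. It assumes that "frequency in the corpus" means document frequency, i.e. the fraction of articles in which a token occurs; the function name, the `max_doc_fraction` parameter and the toy documents are hypothetical.

```python
from collections import Counter


def filter_frequent_tokens(documents, max_doc_fraction=0.3):
    """Remove tokens that occur in more than max_doc_fraction of the documents.

    documents is a list of token lists, one per article.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each token once per document (document frequency).
        doc_freq.update(set(doc))
    too_common = {
        token for token, df in doc_freq.items() if df / n_docs > max_doc_fraction
    }
    return [[t for t in doc if t not in too_common] for doc in documents]


docs = [
    ["venezia", "doge", "senato"],
    ["venezia", "mercanti"],
    ["venezia", "doge", "arsenale"],
    ["venezia", "chiesa"],
]
# At 80% only "venezia" (in 4 of 4 articles) is dropped;
# at 30% "doge" (in 2 of 4 articles) is dropped as well.
print(filter_frequent_tokens(docs, max_doc_fraction=0.8))
print(filter_frequent_tokens(docs, max_doc_fraction=0.3))
```

Lowering the threshold leaves only the rarer, more article-specific tokens, which is consistent with the observation that lower thresholds gave topics made up of more unique tokens. If the pipeline is built on gensim, `Dictionary.filter_extremes(no_above=...)` offers a comparable cut-off.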
The figure shows the topic composition of the first ten articles. The first number corresponds to the topic number and the second is the probability that the article contains that topic.
Visualization Platform
  • This month we had a reduced focus on the visualization platform; nonetheless, we were able to achieve some of our goals.
  • We are now able to generate the graphs by reading the data from a file, and no longer have to enter it manually.
  • We deviated from the project plan, as we decided to postpone the development of the JS visualization platform. This is because we discovered that Plotly has a feature that enables us to generate HTML+JS files using Python. Since the results using this method were satisfactory, we focussed on algorithm development this month.
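As an illustration of reading results from a file and exporting an HTML+JS chart from Python, here is a minimal sketch. The CSV layout (columns `article`, `topic`, `probability`), the function names and the choice of a stacked bar chart are assumptions made for this example, not the project's actual format. `Figure.write_html` is Plotly's standard call for writing a standalone HTML file.

```python
import csv


def load_topic_weights(path):
    """Read (article, topic, probability) rows from a CSV file with a header."""
    rows = []
    with open(path, newline="", encoding="utf-8") as handle:
        for record in csv.DictReader(handle):
            rows.append(
                (record["article"], int(record["topic"]), float(record["probability"]))
            )
    return rows


def write_topic_chart(rows, out_path):
    """Write a stacked bar chart of topic probabilities per article as HTML."""
    # Imported here so the loader above can be used without Plotly installed.
    import plotly.graph_objects as go

    fig = go.Figure()
    for topic in sorted({t for _, t, _ in rows}):
        selected = [(a, p) for a, t, p in rows if t == topic]
        fig.add_trace(
            go.Bar(
                name=f"Topic {topic}",
                x=[a for a, _ in selected],
                y=[p for _, p in selected],
            )
        )
    fig.update_layout(
        barmode="stack", xaxis_title="Article", yaxis_title="Topic probability"
    )
    fig.write_html(out_path)  # standalone HTML+JS file
```

A call such as `write_topic_chart(load_topic_weights("topics.csv"), "topics.html")` (both file names hypothetical) would then produce a page that opens in any browser without a separate JS code base.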

Project Timeline

Green = Complete, Brown = Yet to be done


Since our last progress report we have improved our pre-processing steps and obtained better results. We received the final dataset last week, so we have been able to run our experiments on a subset of it. From here on we plan to experiment with the entire dataset to generate the topics. We plan to expand our output to include topic frequencies per article and a list of the articles corresponding to a particular topic. Lastly, we will produce the visualizations for these results.