
Distant Reading 100 years of Archivio Veneto — Final Report

Introduction

We are provided with the full digitized text of the journal “Archivio Veneto”, consisting of 821 articles published between 1927 and 2013. Given access to this large dataset, compiled over many years, we can extract a lot of interesting and meaningful information about the history of Venice. However, it is very difficult to do so by just reading every article of every issue of the Archivio Veneto ever published. It would be much easier to have an automated way to summarize, visualize, and discover hidden structure in the articles. Topic modeling is one such document clustering technique: it can be applied to distant read the entire run of the journal by identifying the main themes and topics contained in individual articles or in sets of articles.

Our primary objective in this project is to distant read 100 years of Archivio Veneto journal entries in a systematic way using topic modeling. Through this project, we want to answer the following research questions:

  • Can Topic Modeling be applied to distant read the entire run of a scholarly journal, and if so to what extent?
  • Does Topic Modeling help to single out articles about Venice, or even articles on a specific topic related to Venice?
  • Is it possible to identify topics discussed by individual articles?

Input Dataset

The dataset is plain text generated by Optical Character Recognition (OCR) on the images of the journal. Each article is stripped of all its metadata and stored individually in a text file. The metadata for all 821 articles is provided in a JSON file and includes:

  • Title
  • Author
  • Year
  • Filename
  • Issue Number
  • Start Page
  • End Page

 

Method

We used the Gensim and Natural Language Toolkit (NLTK) libraries for the modeling algorithms and Natural Language Processing (NLP). Before running the topic modeling algorithms, we preprocess the text to reduce it to meaningful data. The preprocessing step includes tokenization, punctuation removal, and stopword removal.

Pre-processing

We had to extend the stopword list to 539 words, as the Italian list provided by the NLTK library (279 words) was inadequate. We wrote a script that parsed our topics and identified the words that occurred in more than 50% of the topics (with LDA set to 20 topics), and manually added them to our stopword list after checking their meaning using Google Translate and consulting a friend who spoke Italian. These were mostly meaningless tokens such as “cfr”, “vol”, “ss”, “ut” (possibly OCR errors) and non-noun words like “nè” (neither) and “mentre” (while). Furthermore, we removed tokens appearing in at most 2 documents, as they are too rare to form a topic.
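A minimal sketch of this pre-processing pipeline using NLTK and Gensim is given below; article_files (the list of OCR text files) and extra_stopwords (our 260 manual additions) are hypothetical placeholder names.

    from nltk.tokenize import RegexpTokenizer
    from nltk.corpus import stopwords
    from gensim.corpora import Dictionary

    # \w+ keeps word characters only, which also strips punctuation.
    tokenizer = RegexpTokenizer(r'\w+')

    # NLTK's 279 Italian stop words plus our manual additions
    # (extra_stopwords is a placeholder for that hand-built list).
    stop_words = set(stopwords.words('italian')) | set(extra_stopwords)

    def preprocess(text):
        tokens = tokenizer.tokenize(text.lower())
        return [t for t in tokens if t not in stop_words]

    # article_files is a placeholder for the 821 OCR text files.
    texts = [preprocess(open(f, encoding='utf-8').read()) for f in article_files]

    # Drop tokens that appear in at most 2 documents: too rare to form a topic.
    dictionary = Dictionary(texts)
    dictionary.filter_extremes(no_below=3, no_above=1.0, keep_n=None)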

A few articles consisted of Spanish, Latin, French, and English words. We decided to keep them, as it was interesting to see articles clustered by languages other than Italian.

Topic Modeling

Once we had the processed tokens, we used a bag-of-words model to feed them to the topic modeling algorithms. We tested Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI), and the Hierarchical Dirichlet Process (HDP). LSI is a dimensionality reduction technique, a variant of PCA. HDP is the hierarchical version of LDA: it includes an additional latent variable corresponding to the number of topics, so that the number of topics need not be specified a priori.
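A sketch of fitting the three models with Gensim, reusing the dictionary and tokenized texts from pre-processing (20 topics is the value we used while building the stopword list; the final topic count is tuned below):

    from gensim.models import LdaModel, LsiModel, HdpModel

    # Bag-of-words representation of every article.
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]

    lda = LdaModel(corpus, num_topics=20, id2word=dictionary)
    lsi = LsiModel(corpus, num_topics=20, id2word=dictionary)
    hdp = HdpModel(corpus, id2word=dictionary)  # infers the number of topics itself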

The output of the topic modeling techniques is two-fold: the word distribution within each topic and the topic composition of each article. The former gives the significance of each word within a topic; the latter, the probability distribution of topics within an article.
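Both outputs can be read directly off a fitted Gensim model, as in this brief sketch:

    # Word distribution within a topic: the 10 most significant words of topic 13.
    print(lda.show_topic(13, topn=10))         # [(word, probability), ...]

    # Topic composition of an article: the topics making up the first document.
    print(lda.get_document_topics(corpus[0]))  # [(topic_id, probability), ...]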

Evaluation

With each variation of the algorithm, we proceeded to analyze the actual topics generated and their corresponding articles. As none of the members of our team spoke Italian, we required the assistance of online translation tools to caption each cluster of words. We then sent the list of topics we “suspected” to have either meaningful or interesting results to Dr. Matteo Romanello from EPFL’s DH Lab to confirm our labels for the topics.

Example of an “uninteresting” topic: the words are too general and appear in all the articles, probably because most articles mention dates with months.
Example of a “good” topic: all the words are coherent. This topic refers to Venetian merchants, trade, and shipping, and occurs only in the relevant articles.

Of all the algorithms, LDA worked best, as the topics it generated were distinct from each other and the words within the topics were more coherent. Once we had finalised the model, we further tuned its parameters: k (the number of topics for LDA), the chunk size, and the number of passes. The best results were obtained with k = 60, chunk size = 200, and passes = 50.
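The final model thus corresponds to a call along these lines (a sketch; the fixed random seed is our own addition for reproducibility, not part of the tuning):

    lda60 = LdaModel(corpus,
                     num_topics=60,    # k
                     chunksize=200,
                     passes=50,
                     id2word=dictionary,
                     random_state=42)  # hypothetical seed, for reproducibility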

Results

Now that we had settled on the best tuned version of the algorithm, we attempted to answer one of our guiding questions: “Does Topic Modeling help to single out articles about Venice, or even articles on a specific topic related to Venice?” Analyzing the topics, we were able to identify which ones were about Venice and which ones were not. Moreover, the topics about Venice were not generic; rather, they focused on specific aspects of Venice. For example, here is Topic-13 from our LDA-60 model:

Topic-13 using LDA-60

This topic discusses the history of Venice, more specifically elements of Venetian society. We also verified our topics’ validity by examining the articles that contained them. For example, this topic (Topic-13) occurred in an article from 2009 titled “Fisco statale e fisco locale nei domini di Terraferma. La Repubblica di Venezia e il dibattito sulle gravezze de mandato domimii nel 700”, among several others. While the topic doesn’t seem like an exact match judging by the title alone, the article certainly contained elements that matched this topic. This and other similar results led us to conclude that it is indeed possible to identify not just articles about Venice, but specific topics related to Venice.

Our second test involved determining whether we could identify patterns in how the topics varied over time. To investigate this, we computed the probability distribution of all the topics in each decade. We then plotted this on a time-probability chart and tried to identify interesting patterns, such as prominent global maxima and minima, as well as noticeable increasing and decreasing trends. We also used a Sankey diagram as another visualization tool to easily spot interesting results.
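A sketch of that computation, assuming a list named years holding the publication year of each article in corpus (taken from the JSON metadata); both names are our own placeholders:

    import numpy as np
    import matplotlib.pyplot as plt

    decades = sorted({(y // 10) * 10 for y in years})

    # Sum each topic's probability over the articles of a decade, then average.
    decade_probs = np.zeros((len(decades), 60))
    decade_counts = np.zeros(len(decades))
    for bow, year in zip(corpus, years):
        d = decades.index((year // 10) * 10)
        decade_counts[d] += 1
        for topic_id, prob in lda60.get_document_topics(bow, minimum_probability=0.0):
            decade_probs[d, topic_id] += prob
    decade_probs /= decade_counts[:, None]

    for topic_id in (13, 34, 51):
        plt.plot(decades, decade_probs[:, topic_id], label='Topic-%d' % topic_id)
    plt.xlabel('Decade')
    plt.ylabel('Average topic probability')
    plt.legend()
    plt.show()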

Topic Probability over time – 60 topics generated by LDA

In the graph above, we identified Topic-13, Topic-34, and Topic-51 from LDA-60, along with several others, as having rather interesting results. Topic-34 is a historical topic about war. Topic-51 is about the relationship between Italy as a secular state and its religious institutions.

Topic-34 (left) and Topic-51 (right) using LDA-60
Topic Probability over time for Topic-13, Topic-34, and Topic-51

We identified Topic-34 as one of the most prominent topics across all decades, with a 10-15% chance of occurring in any article in every decade. However, what really caught our eye was a spike in the 1950s, when there was almost a 35% probability of it occurring in an article from that decade. Our first assumption was that it referred to World War II, which had recently concluded, but closer examination of the articles revealed that they discussed the civil wars that took place in Europe in the 1840s and 1850s, exactly a century earlier. This was a significant result, as we were able to use topic modeling to identify which topics were popular during specific periods of time.

Finally, we tried to answer “Is it possible to identify topics discussed by individual articles?” To do so, we visualized each article’s topic composition using a pie chart (a code sketch for this visualization follows the example below). Some articles fragmented into one dominant topic and many marginal topics; others clearly broke down into several distinct topics. One example is the article “Venezia a la quarta crociata”, written by R. Cessi in 1951.

Breakdown of the article “Venezia a la quarta crociata” into component topics.

The article was about Venice in the 4th Crusade, and the topics we identified were significant and relevant. Such examples showed that topic modeling can be used to discover the topics discussed in individual articles.
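A minimal sketch of producing such a pie chart with Matplotlib from the tuned model; article_idx (the index of the article in the corpus) and the 2% cut-off for marginal topics are our own placeholder choices:

    import matplotlib.pyplot as plt

    # Topic composition of one article, ignoring marginal topics below 2%.
    composition = lda60.get_document_topics(corpus[article_idx],
                                            minimum_probability=0.02)
    labels = ['Topic-%d' % topic_id for topic_id, _ in composition]
    sizes = [prob for _, prob in composition]

    plt.pie(sizes, labels=labels, autopct='%1.1f%%')
    plt.title('Topic composition of the article')
    plt.show()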

Conclusion

To summarize our work on this project: through topic modeling, a form of unsupervised learning, we were able to identify topics in a corpus of texts that were unknown beforehand. After trying various algorithms and tuning them, the generated topics were mostly coherent and meaningful. We were able to single out articles about Venice, and more specifically articles related to a particular theme in Venice, and we were also able to get a preview of an article by identifying the topics and themes it discusses. Finally, we were able to identify interesting patterns in the occurrence of topics over time, and to justify some of them.

Based on our results, we can conclude that we have strong evidence that topic modeling can be applied to distant read the entire run of a scholarly journal. We feel fairly confident that, with further tuning, topic modeling can be used in the field of Digital Humanities to identify themes and topics in a corpus of documents. This can prove to be a useful tool for previewing the nature of documents without having to go through every single page.

References

[1] David M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, April 2012.

[2] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

[3] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, pages 50–57, New York, NY, USA, 1999. ACM.

[4] Radim Rehurek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.

[5] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101, 2004.

[6] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1–305, January 2008.


Team Members:
  1. Fokeas, Sotirios
  2. Jain, Smriti
  3. Lal, Sourabh
  4. Vidit, Vidit