The Archivio Veneto is a historical Italian journal published by the Deputazione di Storia Patria per le Venezie (Department of Venetian National History). It was first published in 1871 and focuses on the history of the Republic of Venice and the Veneto. Because it is published annually, its volumes give us access to a large amount of historical information about Venice.
Project Objectives and Deliverables
Given access to a large dataset that has been compiled over many years, we can extract a lot of interesting and meaningful information about the history of Venice. However, it is very difficult to do so by reading every article of every volume of the Archivio Veneto ever published. It would be much easier to have a visual interface that can be queried and that presents the required results. Such a task requires algorithms that help us summarize, visualize and discover hidden structure in the articles.
Topic modeling is one such technique: a document clustering method that can be applied to distant-read the entire run of the journal by identifying the topics discussed in individual articles. It is a form of text mining and a way of identifying patterns in a corpus. Our primary objective in this project is to distant-read the full digitized text of the journal Archivio Veneto in a systematic way using topic modeling. With topic modeling, we hope to develop a system that can identify an appropriate number of topics in an article and can also identify journal articles based on a list of topics.
Topic modeling output is not easily human-readable. Therefore, once we have used topic modeling to analyze the journal articles, our secondary objective is to develop a visual interface that outputs different visual representations. These will show the relationships between articles in terms of topics, as well as the breakdown of a particular article by the topics discussed in it. In addition to serving as the interface for interactive exploration of the journal's content, this visualization will also help us test and debug our topic model for accuracy.
The Archivio Veneto journals are divided into articles, which are mostly in Italian. There are 100 such journals spanning 100 years of publication.
We have the text of the journal, obtained by running Optical Character Recognition (OCR) on images of it. We plan to take the article as our data unit for the generation of topics. Before running topic modeling algorithms, we plan to reduce the journal articles to their meaningful content. We will try out the following well-known preprocessing methods:
- Tokenization: The sequence of words and paragraphs is sliced into tokens, which can be individual words (unigrams) or sequences of several words (n-grams).
- Stopword removal: Stopwords, which are generally the most common words and carry little meaning in the tokenized data, are removed.
- Stemming: Tokens are trimmed to their morphological roots, so that plurals and suffixed forms are treated as the same token. For example, ‘apple’ and ‘apples’ are not counted as separate tokens after stemming.
- Part-of-speech tagging: We can further process the journal articles by tagging parts of speech and removing words that are not nouns, since these contribute less to the ‘thematic’ meaning of the document.
These processing methods also have caveats, as data processing is said to be more of an art than a science. The choice of stopword list, of light-weight or aggressive stemming, and of token length all strongly affect the end result.
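The first three steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the stopword set and the suffix list are tiny hypothetical placeholders, not a real Italian resource, and all function names are ours. Part-of-speech tagging is left out because it needs a trained Italian tagger.

```python
import re

# Hypothetical, deliberately tiny Italian stopword list (assumption);
# a real run would use a much fuller list.
STOPWORDS = {"il", "lo", "la", "i", "gli", "le", "di", "a", "da", "in",
             "con", "su", "per", "e", "che", "un", "una", "del", "della"}

# Hypothetical suffixes for very light stemming (assumption): only the
# final vowel that marks gender and number is trimmed.
SUFFIXES = ("i", "e", "a", "o")


def tokenize(text):
    """Lowercase the text and split it into alphabetic unigrams."""
    return re.findall(r"[^\W\d_]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def light_stem(token):
    """Trim one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one n-gram."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(text):
    """Tokenize, remove stopwords, then stem what is left."""
    return [light_stem(t) for t in remove_stopwords(tokenize(text))]
```

With this sketch, ‘galee’ and ‘galea’ reduce to the same token, which mirrors the ‘apple’ and ‘apples’ example. Swapping in a longer suffix list gives a more aggressive stemmer, which is one of the choices the caveat above refers to.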
Various algorithms exist for topic modeling, each with its own pros and cons. In our project we will focus on modern topic modeling algorithms such as Latent Dirichlet Allocation (LDA). LDA is the standard model used by most topic modeling projects, so we will use it as our baseline.
Next, we plan on applying the Pachinko Allocation Model (PAM). PAM improves upon LDA by modeling not only correlations between words but also correlations between topics. Its greater flexibility and expressive power generally lead to better results.
Another algorithm we are considering is Non-negative Matrix Factorization (NMF). We are not very optimistic about NMF, as it tends to identify less coherent topics. The end users who will interact with the extracted topics are human, so topic coherence is critical if they are to comprehend the results effortlessly. Moreover, it will be much easier to give topics human-readable labels if they are coherent. Nonetheless, we will test and evaluate the algorithm ourselves if we have enough time.
While LDA and PAM are very competent models, they require the number of topics to be specified manually. This often proves difficult, so we plan to experiment with the hierarchical versions of these models. The hierarchical Dirichlet process and hierarchical Pachinko allocation automatically determine the number of topics needed to describe the data.
After evaluating the results from each of the above-mentioned topic modeling algorithms, we will be able to conclude which technique is most appropriate.
We searched for software packages that could help us perform the task at hand. We looked at several frameworks but quickly singled out two of them; the others were either incomplete or obsolete. The only real candidates were the Gensim and Mallet libraries. We did not observe a difference in capabilities or popularity, but we decided to work with Gensim because it is implemented in Python, while Mallet is implemented in Java. We want to work in Python because it is more concise, so we picked a framework native to Python.
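To make the baseline concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python. It is not the implementation we will use in the project and it is not how any particular library works internally; the function name, the hyperparameter defaults (alpha, beta, iterations) and the toy documents are all assumptions chosen for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling on tokenized documents.

    Returns (theta, phi, vocab): per-document topic proportions,
    per-topic word probabilities, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]            # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word w in topic k
    topic_total = [0] * num_topics                          # all tokens in topic k

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token from the counts before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed estimates from the final counts.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi, vocab


# Toy, made-up documents (already tokenized) for a quick run.
toy_docs = [
    ["doge", "senato", "doge", "consiglio"],
    ["galea", "arsenale", "galea", "mare"],
    ["doge", "consiglio", "senato"],
    ["mare", "galea", "arsenale"],
]
theta, phi, vocab = lda_gibbs(toy_docs, num_topics=2, iterations=100)
```

Each row of theta is the topic breakdown of one article, and each row of phi ranks the vocabulary for one topic. Note that num_topics has to be supplied by hand in this sketch.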
To verify whether our models identify relevant topics, we need an evaluation method. We plan to pick a few journal articles and manually identify their topics by reading them. This will give us a reference set for testing the performance of our models. The set should be small enough for the task to be feasible, but large enough for a meaningful evaluation. We will experiment with the number of articles in the evaluation set. The articles should be selected randomly so that the evaluation is unbiased.
As the output of topic modeling is not easily human-readable, once we have completed topic modeling we want to process the results by automatically assigning a human-readable label to each topic. This is quite tricky, and there are very few examples of feasible solutions to this problem. We believe we can use a word association library to find a label that is appropriate for the set of words in a topic. While several such libraries exist, we are still looking for one in Italian.
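The procedure above could be sketched as follows. The text does not fix a scoring metric, so precision and recall are used here as an assumption. The sketch also assumes that the model's topics have already been mapped to the same labels the human readers use. All function names are hypothetical.

```python
import random


def sample_articles(article_ids, sample_size, seed=0):
    """Draw a random, reproducible evaluation set of article ids."""
    rng = random.Random(seed)
    return rng.sample(sorted(article_ids), sample_size)


def precision_recall(predicted, reference):
    """Compare the model's topics for one article with the manual ones."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


def evaluate(model_labels, manual_labels):
    """Average precision and recall over all manually read articles."""
    if not manual_labels:
        return 0.0, 0.0
    scores = [
        precision_recall(model_labels.get(article, ()), topics)
        for article, topics in manual_labels.items()
    ]
    count = len(scores)
    return (sum(p for p, _ in scores) / count,
            sum(r for _, r in scores) / count)
```

Fixing the seed keeps the random selection reproducible, so the same reference set can be reused when comparing different models.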
Data Presentation and Visualization:
There are many patterns that can be identified through the results of Topic Modelling. We would like to output a visual representation for each of the following:
- Topic tracker across articles:
In order to display a topic tracker across articles, we would like to output a histogram or a line graph that displays how the relevancy of a topic varies for a series of articles. This can be further used to show the same information for a cluster of articles (for example by year).
- Article Network linked by topics:
Second, given a set of articles, we would like to output a graph visualization in which each node is an article and each edge between two nodes represents the topics that the two articles have in common.
- Topic Composition of an article:
In order to present the breakdown of an article by topic, we plan to use a pie chart in which the size of each topic's segment is proportional to the relevancy of that topic in the given article.
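The data behind these three views can be prepared before any plotting library is chosen. The sketch below assumes that each article is represented as a mapping from topic name to weight and that the publication year of each article is known. The function names, the 0.2 threshold for a "common" topic, and the 5% cut-off for small pie segments are all hypothetical choices.

```python
from collections import defaultdict
from itertools import combinations


def topic_by_year(article_years, article_topics, topic):
    """Topic tracker: mean weight of one topic per publication year."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for article, year in article_years.items():
        totals[year] += article_topics.get(article, {}).get(topic, 0.0)
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}


def build_article_graph(article_topics, threshold=0.2):
    """Article network: link two articles that share a prominent topic.

    Returns {(article_a, article_b): set_of_shared_topics}.
    """
    prominent = {
        article: {t for t, w in weights.items() if w >= threshold}
        for article, weights in article_topics.items()
    }
    edges = {}
    for a, b in combinations(sorted(prominent), 2):
        shared = prominent[a] & prominent[b]
        if shared:
            edges[(a, b)] = shared
    return edges


def pie_segments(topic_weights, min_share=0.05):
    """Topic composition: normalized shares, small ones merged into 'other'."""
    total = sum(topic_weights.values())
    if total <= 0:
        return {}
    segments = {}
    other = 0.0
    for topic, weight in sorted(topic_weights.items(), key=lambda kv: -kv[1]):
        share = weight / total
        if share >= min_share:
            segments[topic] = share
        else:
            other += share
    if other > 0:
        segments["other"] = other
    return segments
```

The outputs are plain dictionaries, so they can be handed to whichever charting or graph-drawing tool the interface ends up using.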
Project Plan & Timeline:
Through topic modeling, we intend to summarize, visualize and discover hidden structure in the articles of the Archivio Veneto. In doing so, we want to answer three questions: whether topic modeling is a feasible way to distant-read the entire run of a scholarly journal, whether we can identify topics given a journal article, and whether we can identify journal articles given a topic. We then plan to visualize these results to make them easier to interpret and to make patterns easier to identify.