Topic Modeling of ambassadors’ relations from England – Second progress blogpost

Over the past three weeks we have focused on refining and analyzing the topic model built on our clean dataset – a collection of twenty-two Relazioni edited by S. Villani, spanning from 1489 to 1763.

Training and qualitative analysis

First, we made a final decision about the tool to use – Mallet. In our view, it offers a good compromise between performance, quality of results, and user-friendliness.

To improve our model, we refined our Italian stopword list iteratively. After running the program with a standard Italian stopword list, we manually checked the results and found additional stopwords, such as archaic Italian forms (e.g. et, ipso) and words that recurred across many different topics; then we ran the model again. By iterating this procedure, we built a list of almost 500 stopwords, which we used to obtain the results shown below.
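As an illustration of this loop (not our exact scripts – we drove Mallet from the command line), a short Python sketch along the following lines can flag stopword candidates from Mallet's topic-keys output; the file names, the threshold, and the archaic forms listed are placeholders.

```python
# Flag stopword candidates: words that appear in the top-word list of many topics.
# Assumes a file produced with Mallet's --output-topic-keys option, one line per topic:
#   <topic id> \t <alpha> \t <space-separated top words>
from collections import Counter

TOPIC_KEYS = "topic-keys.txt"   # illustrative file name
MIN_TOPICS = 8                  # illustrative threshold

counts = Counter()
with open(TOPIC_KEYS, encoding="utf-8") as f:
    for line in f:
        _, _, words = line.rstrip("\n").split("\t")
        counts.update(set(words.split()))

# Archaic forms spotted by manual inspection (examples only)
extra = {"et", "ipso"}
extra.update(w for w, c in counts.items() if c >= MIN_TOPICS)

with open("extra-stopwords.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(extra)))
```

The resulting file can then be merged with the standard stopword list before re-importing the corpus and retraining.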

Before moving to the post-processing phase, we first needed a qualitative evaluation of our results. To this end, we labelled each topic according to its content by summarizing its twenty most frequent words. This task turned out to be tricky for some topics, while for others it was quite straightforward. As an example, consider the following list, referring to topic 28:

grande corte prencipe partito guglielmo prencipi grandi grado tempi governo poter parte olanda particolare europa oranges auttorità appresso nazione corso

Some keywords (such as “guglielmo”, “oranges”, “olanda”) suggest that this topic mainly concerns William III, Prince of Orange (1650-1702). Consequently, it is likely to appear especially in the Relazioni from the second half of the 17th century and to be almost absent in the earlier ones. In other words, we cannot assume topic distributions across documents to be consistent over time. The post-processing was therefore also designed to detect the impact of time on our model. For this reason, we divided the dataset and the related results into 5 time chunks. Each chunk spans 50 years, except for the last one, which covers 74 years so as to include the last Relazione in our database, written almost 60 years after the previous one. With this division, the number of Relazioni per chunk varies from a minimum of three (1489-1539) to a maximum of eight (1589-1639).
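In code, the chunk assignment reduces to a simple mapping from a Relazione’s year of composition; a minimal Python sketch (the handling of boundary years below is purely illustrative) is:

```python
# Map a Relazione's year to one of the 5 time chunks:
# four 50-year chunks starting in 1489, plus a final longer chunk up to 1763.
CHUNK_STARTS = [1489, 1539, 1589, 1639, 1689]

def chunk_of(year: int) -> int:
    """Return the chunk index (0-4) for a given year."""
    if not 1489 <= year <= 1763:
        raise ValueError(f"year {year} outside the dataset range")
    return min((year - 1489) // 50, 4)

# Example: a Relazione from 1668 falls in the fourth chunk (index 3).
assert chunk_of(1668) == 3
```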

Post-processing

We used Matlab for the post-processing. To gain insight into how topics evolve over time, we averaged the document-topic distributions within each time chunk and gathered the results in the following stacked bar chart (a sketch of the computation is given after the figure).

Topic-document distribution averaged over time chunk.
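We computed the averages in Matlab, but the same idea can be sketched in a few lines of Python with numpy and matplotlib; the .npy file names below are placeholders for however the Mallet output has been parsed and stored.

```python
# Average the per-document topic proportions within each time chunk
# and draw a stacked bar chart (Python sketch of the Matlab computation).
import numpy as np
import matplotlib.pyplot as plt

# doc_topics: (num_docs, num_topics) array parsed from the file produced by
# Mallet's --output-doc-topics option; chunk_ids: chunk index (0-4) per document.
doc_topics = np.load("doc_topics.npy")   # illustrative file name
chunk_ids = np.load("chunk_ids.npy")     # illustrative file name

num_chunks, num_topics = 5, doc_topics.shape[1]
avg = np.vstack([doc_topics[chunk_ids == c].mean(axis=0)
                 for c in range(num_chunks)])

bottom = np.zeros(num_chunks)
for t in range(num_topics):
    plt.bar(range(num_chunks), avg[:, t], bottom=bottom)
    bottom += avg[:, t]
plt.xticks(range(num_chunks),
           ["1489-1539", "1539-1589", "1589-1639", "1639-1689", "1689-1763"])
plt.ylabel("average topic proportion")
plt.show()
```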

This seems to confirm that the model is not stationary over time. Only a few topics are evenly distributed over the full dataset (see topic 12), while the majority have an uneven distribution. As another check, it is useful to compare this chart with the labels previously attached to each topic. Consider topic 28 again: as expected, its frequency is small up to the end of the sixteenth century and increases in the second half of the dataset. Conversely, there are topics whose distribution is concentrated in the earliest years, such as topic 25. This is consistent with the upper-right peak in the topic-document distribution color plot (see figure below).

Topic-document distribution for the full dataset.

The last analysis we performed concerns the topic-word distribution. In labelling the topics, we simply looked at the top twenty words assigned to each topic. Since their relevance can vary considerably, it is necessary to identify the words that are actually relevant. Inspired by the idea presented in [1], we extracted the top ten words of each topic and visualized the result in the following plot: at each word-topic intersection we drew a circle whose radius is proportional to the relevance of that word to that topic (a sketch of how such a plot can be drawn is given after the figure).

Word-topic distribution for the most relevant words for each topic.
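We drew the actual figure in Matlab; the following Python sketch shows the general recipe with matplotlib’s scatter. The dictionary structure and the probabilities are placeholders (only the topic 28 words come from our model), and a real plot would of course include all topics.

```python
# Termite-style word-topic plot: one circle per (word, topic) pair,
# sized according to how relevant the word is to the topic.
import matplotlib.pyplot as plt

# top_words[t] = list of (word, probability) for the most relevant words
# of topic t (illustrative structure and dummy probabilities).
top_words = {
    28: [("grande", 0.031), ("corte", 0.027), ("prencipe", 0.022)],
}

vocab = sorted({w for ws in top_words.values() for w, _ in ws})
y_of = {w: i for i, w in enumerate(vocab)}

xs, ys, sizes = [], [], []
for topic, words in top_words.items():
    for word, prob in words:
        xs.append(topic)
        ys.append(y_of[word])
        # scatter sizes are areas (points^2); square the scaled value so
        # the circle *radius* grows with the word's weight
        sizes.append((prob * 1000) ** 2)

plt.scatter(xs, ys, s=sizes, alpha=0.6)
plt.yticks(range(len(vocab)), vocab)
plt.xlabel("topic")
plt.show()
```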

Considerations

So far, we are on track with our original schedule. We consider the results obtained on the clean dataset reasonable and quite satisfactory. We can now consider extending the dataset by looking for further sources in the Internet Archive.

Reference

[1] Chuang, J., Manning, C. D., Heer, J. Termite: Visualization Techniques for Assessing Textual Topic Models. In Proceedings of the International Working Conference on Advanced Visual Interfaces (AVI), 2012.