Topic Modeling of ambassadors’ relations from England – First progress blogpost

The objective of our project is to combine stylometry and topic modelling in the study of the relazioni released by Venetian ambassadors after their stay in England. So far, we have mainly focused on testing different tools in order to select the most suitable for us, as set out in our plan.

Stylometry

To perform stylometric analysis we use stylo, an R package. Given a corpus, stylo can extract statistics and stylistic features, which can later be used to group texts according to common characteristics. To test its efficacy, we have run stylo on the clean dataset available online ( http://www.storiadivenezia.net/sito/index.php?option=com_content&view=article&id=105%3Arel-inghilterra&catid=41%3Aricerca&Itemid=64 ), made up of 22 relazioni from 1489 to 1763. As can be seen in the figure below, one text is isolated from all the others, which points to a significant difference in style. Indeed, the relazione of 1519 by Giustinian is the only one written in a mix of vernacular and Latin rather than Italian.

Figures: stylo cluster analysis (100 MFWs, culling 0%, Classic Delta distance) and multidimensional scaling plot (100 MFWs, culling 0%).
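
For readers curious about what is behind these plots, the snippet below is a minimal Python sketch of the Classic (Burrows') Delta measure used in the cluster analysis: relative frequencies of the 100 most frequent words, z-scored per word, compared by mean Manhattan distance. It is only an illustration of the measure, not the code we ran (our analysis was done with stylo in R), and the corpus path is a placeholder.

```python
# Minimal sketch of Classic (Burrows') Delta; paths are placeholders.
import glob
from collections import Counter

import numpy as np


def tokenize(text):
    # very naive tokenizer, kept simple for illustration
    return [w.lower() for w in text.split() if w.isalpha()]


# load the corpus (placeholder path, one plain-text file per relazione)
texts = {}
for path in glob.glob("corpus/*.txt"):
    with open(path, encoding="utf-8") as f:
        texts[path] = tokenize(f.read())

# 100 most frequent words (MFWs) over the whole corpus, no culling
total = Counter(w for tokens in texts.values() for w in tokens)
mfws = [w for w, _ in total.most_common(100)]

# relative frequency of each MFW in each text
names = sorted(texts)
counters = {n: Counter(tokens) for n, tokens in texts.items()}
freqs = np.array([[counters[n][w] / len(texts[n]) for w in mfws]
                  for n in names])

# z-score each word across texts; Classic Delta = mean absolute difference
z = (freqs - freqs.mean(axis=0)) / freqs.std(axis=0)
delta = np.abs(z[:, None, :] - z[None, :, :]).mean(axis=2)
```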

Topic modelling

Several tools for building a topic model (TM) can be found online. We decided to restrict ourselves to a comparison of three of them: StanfordTM, Gensim and Mallet. We have based our evaluation essentially on two aspects: their usability and the output of the training phase. Indeed, a topic model first needs to be learned. For this purpose we have again used the clean dataset, with its texts gathered in a single CSV or txt file, which is then tokenized by the toolbox. Here we faced some difficulties. Since all these tools are designed to work with English text, some of their filters are not applicable here: for example, they cannot automatically detect Italian (or Latin!) stop words, which therefore end up in the model. To overcome this issue, we first simply looked at the words attached to each topic in order to identify the ones to disregard. We then manually typed a relatively complete set of Italian stop words into a txt file (the stop words were found online and compiled from two sources: http://snowball.tartarus.org/ and www.ranks.nl). In addition, we decided to remove from the model those words that are very rare. As a result, we have been able to carry out a relatively ‘clean search’.
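
To make the pipeline more concrete, here is a minimal sketch of how these steps (custom stop word list, rare-word removal, model training) might look in Gensim; file names and parameter values are placeholders rather than our final settings.

```python
# Minimal Gensim sketch: custom Italian stop word list, rare-word filtering,
# LDA training. File names and parameters (20 topics, no_below=3) are
# placeholders, not our final settings.
import glob

from gensim import corpora, models

# our manually compiled stop word list, one word per line
with open("stopwords_it.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

# naive tokenization plus stop word removal
docs = []
for path in glob.glob("relazioni/*.txt"):
    with open(path, encoding="utf-8") as f:
        tokens = [w.lower() for w in f.read().split() if w.isalpha()]
    docs.append([w for w in tokens if w not in stopwords])

# build the dictionary and drop words appearing in fewer than 3 documents
dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=3, no_above=1.0)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# train an LDA topic model and print the top words of a few topics
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=20, passes=10)
for topic in lda.print_topics(num_topics=5, num_words=10):
    print(topic)
```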

Once the model is learned, it has to be evaluated. To do this, we look at the terms-per-topic distributions and analyse the coherence of the words associated with each topic. Another evaluation technique is to test the model against the same dataset used for training: the inferred distributions are unlikely to match the trained ones exactly, and the size of these differences tells us how good our model is.
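
Continuing the hypothetical Gensim sketch above, these two checks could look roughly like this (the coherence measure and the probability threshold are only examples):

```python
# Two simple evaluation checks on the model trained above:
# (1) topic coherence as a proxy for how well each topic's words hang together,
# (2) re-applying the model to the training documents to inspect the inferred
#     document-topic distributions.
from gensim.models import CoherenceModel

# 1) coherence of the word sets associated with each topic (c_v measure)
cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_v")
print("average topic coherence:", cm.get_coherence())

# 2) document-topic distributions for the first few training documents
for i, bow in enumerate(corpus[:3]):
    print("document", i, lda.get_document_topics(bow, minimum_probability=0.05))
```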

So far we have not yet decided which tool is best for us, even though we have noticed some differences. For example, in the tokenization step StanfordTM requires each user-defined stop word to be written explicitly in the Scala script, while in Mallet we can simply point to the text file (e.g. the stylo output) where we have collected them. In general, taking into account also the quite satisfactory first results, Mallet may be our choice. The reader can find some screenshots of Mallet’s output at the end of the blog post. Still, we prefer to take a few more days to better evaluate the models generated by the toolboxes; in the meantime we are working on all of them in parallel.
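
For comparison, this is roughly how Mallet can be pointed at a stop word file. The snippet wraps the command line in Python; the paths are placeholders, the flags are the ones documented in Mallet's topic-modelling tutorial, and they should be verified against Mallet's own help before use.

```python
# Rough sketch of the Mallet workflow, wrapped in Python via subprocess.
# Paths are placeholders; double-check flags with `bin/mallet import-dir --help`.
import subprocess

# import the corpus, replacing the default English stop list with our own file
subprocess.run([
    "bin/mallet", "import-dir",
    "--input", "relazioni/",
    "--output", "relazioni.mallet",
    "--keep-sequence",
    "--remove-stopwords",
    "--stoplist-file", "stopwords_it.txt",
], check=True)

# train a 20-topic model, saving topic keys and document-topic proportions
subprocess.run([
    "bin/mallet", "train-topics",
    "--input", "relazioni.mallet",
    "--num-topics", "20",
    "--output-topic-keys", "topic_keys.txt",
    "--output-doc-topics", "doc_topics.txt",
], check=True)
```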

Next steps

One of the most active research areas in topic modelling concerns its representation. Our next task will be to find an adequate visualization tool, which will have to be compatible with the output files generated while building the model and, above all, give easy-to-read results. In this sense, the Termite tool could be a good choice. Moreover, visualization will give us another way to evaluate the three TM toolboxes and so help us reach a definitive choice.

After that, we are going to split our dataset into time slots – say, chunks of 50 years – and look at how the topic distribution changes over time and whether topics are related to each other. This will also be useful for drawing a network graph of the topics.
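
A very rough sketch of how this could work, assuming we keep the Gensim model from above and have the year of composition of each relazione as metadata (the `years` list below is a placeholder):

```python
# First sketch of the planned diachronic analysis: average the document-topic
# distributions of the trained model over 50-year slots. The `years` list is
# placeholder metadata, one entry per document in corpus order.
from collections import defaultdict

import numpy as np

years = [1489, 1519, 1531, 1763]  # placeholder: the real list has 22 entries

by_period = defaultdict(list)
for year, bow in zip(years, corpus):
    # full topic distribution of this document
    dist = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        dist[topic_id] = prob
    by_period[(year // 50) * 50].append(dist)

# average topic distribution and top topics per 50-year slot
for start in sorted(by_period):
    mean_dist = np.mean(by_period[start], axis=0)
    top_topics = mean_dist.argsort()[::-1][:3]
    print(f"{start}-{start + 49}: top topics {top_topics.tolist()}")
```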

Figure 1: Manually compiled list of Italian stop words.
Figure 2: The 20 sets of most prevalent topics across the tested texts (topic number and prevalence).