
Topic Modeling of ambassadors’ relations from England – Final Report

From the 15th to the 18th century, Venetian ambassadors returning from their postings had to deliver an oral end-of-mission report in front of the Senate. These reports, or relazioni in Italian, were transcribed and have now been digitized. The relations covered a wide range of topics and reflected the attitudes of the Venetian government towards other states; they have therefore become one of the most valuable sources for modern history [3]. However, the large number of available relations calls for a distant reading, capable of extracting the main topics and possibly detecting their shift over time. With these objectives, we decided to apply two different approaches – topic modeling and words’ network – to a clean dataset of relations edited by Villani [5], which is made up of 22 relations by Venetian ambassadors in England, spanning almost three centuries from 1489 to 1761. For both methods, we present here the methodology employed and the results obtained.

Topic Modeling

We use Mallet for topic modeling since, from our perspective, it offers a good compromise between performance, results and user-friendliness. We based our evaluation essentially on two aspects: usability and the output of the training phase. Indeed, a topic model needs to be learned in the first place. For this purpose, we used the clean dataset, with its texts grouped into separate .txt files. These files were then tokenized by the toolbox.

As a pre-processing step, stop words must not be included in the model. To this end, we manually compiled a relatively complete set of Italian stop words into a .txt file (the stop words were gathered from two sources: http://snowball.tartarus.org/ and www.ranks.nl). We also decided to remove rarely used words from the model. We then refined our Italian stop word list iteratively: after running the program with a preset number of 30 desired topics, we manually checked the results to find further stop words, such as archaic Italian forms (e.g. et, ipso) and words recurring across different topics, and then ran the model again. By iterating this procedure, we ended up with a list of almost 500 stop words and were thus able to carry out a relatively clean search. Once the model is learned, it has to be evaluated. We limited ourselves to a qualitative assessment, looking at the terms-per-topic distributions and analyzing the coherence among the words associated with the same topic.
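
As a rough illustration of this pre-processing, below is a minimal Python sketch (not the MALLET pipeline we actually ran from the command line) that tokenizes the relations, removes the stop words and drops rarely used terms; the directory and file names and the frequency threshold are assumptions on our part.

import re
from collections import Counter
from pathlib import Path

# Hypothetical locations: one .txt file per relation, plus our stop word list.
CORPUS_DIR = Path("relations")             # assumption: cleaned dataset, one relation per file
STOPWORDS_FILE = Path("stopwords_it.txt")  # assumption: ~500 Italian stop words, one per line
MIN_FREQUENCY = 3                          # assumption: drop words seen fewer than 3 times

stopwords = set(STOPWORDS_FILE.read_text(encoding="utf-8").split())

# Tokenize every relation into lowercase alphabetic tokens (accented letters included).
docs = {}
for path in sorted(CORPUS_DIR.glob("*.txt")):
    tokens = re.findall(r"[a-zàèéìòù]+", path.read_text(encoding="utf-8").lower())
    docs[path.stem] = tokens

# Corpus-wide frequencies, used to remove rarely used words.
freq = Counter(tok for tokens in docs.values() for tok in tokens)

filtered = {
    name: [t for t in tokens if t not in stopwords and freq[t] >= MIN_FREQUENCY]
    for name, tokens in docs.items()
}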

In order to study the temporal shift of topics, we split our dataset into five time slots, each spanning 50 years (except for the last one, which spans 74 years) and comprising 3 to 8 relations. To gain a first insight into the topics’ evolution over time, we averaged the document-topic distributions over each time slot and gathered the results in the following stacked bar chart.

Figure 1: Stacked bar chart reporting the topics-per-document distributions averaged over each time period
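
For readers who wish to reproduce this aggregation, here is a sketch in Python of how the averaging and the stacked bar chart could be produced; the slot boundaries, the mapping from relations to years and the doc_topics structure (e.g. parsed from MALLET's doc-topics output) are assumptions, and the values below are placeholders.

import numpy as np
import matplotlib.pyplot as plt

NUM_TOPICS = 30
# Assumption: doc_topics maps each relation's year to its topic-proportion vector;
# random placeholders stand in for the proportions read from the trained model.
doc_topics = {year: np.random.dirichlet(np.ones(NUM_TOPICS))
              for year in (1489, 1531, 1575, 1607, 1668, 1761)}

# Five time slots of about 50 years each (the last one is longer), as in the text.
slots = [(1489, 1539), (1539, 1589), (1589, 1639), (1639, 1689), (1689, 1762)]

# Average the document-topic distributions of the relations falling in each slot.
averages = []
for start, end in slots:
    vectors = [v for year, v in doc_topics.items() if start <= year < end]
    averages.append(np.mean(vectors, axis=0) if vectors else np.zeros(NUM_TOPICS))
averages = np.array(averages)              # shape: (num_slots, num_topics)

# Stacked bar chart: one bar per time slot, one segment per topic.
bottom = np.zeros(len(slots))
for k in range(NUM_TOPICS):
    plt.bar(range(len(slots)), averages[:, k], bottom=bottom)
    bottom += averages[:, k]
plt.xticks(range(len(slots)), [f"{s}-{e - 1}" for s, e in slots])
plt.show()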

This seems to confirm that the topic proportions are not steady over time. Only a few topics are evenly distributed over the full dataset (see topic 12), while the majority have an uneven distribution. To better understand the content of the topics, we can identify typical statistical distributions among the topics’ shifts, yielding a topology on the topics’ space. Some of the most illustrative examples are reported in figure 2. For each topic we also suggest a possible label summarizing its most characteristic words. These labels often turned out to be consistent with the observed distribution.

For instance, consider topic 3, whose most frequent words allow us to associate it with Elizabeth I: its time distribution follows a Gaussian-like curve peaking between 1589 and 1639, a period covered in part by her reign; it remains a relevant topic just before and after this peak, while in the first and last periods it is almost irrelevant. This kind of distribution is typical of topics that deal with specific people. A different behavior can be found in topics about more general subjects, whose relevance remains high throughout; for example, topic 17 concerns a general description of England and is present in almost all the documents. Finally, a third typical behavior regards topics whose importance increases or decreases over time. These topics also concern general subjects, so they can be relevant in different time periods; however, they are typically linked to historical developments and for this reason they gain or lose importance. For example, topics 18 and 11 are about the Catholic Church and the Continental wars: the former progressively loses importance with the rise of Anglicanism, while the latter becomes more and more important as England gets increasingly involved in wars with France.

Figure 2: Benchmark distributions of topics

In order to test the reliability of our results, we performed a robustness analysis with respect to changes in the stop word list. In the previous model we had manually listed all the Italian stop words to be excluded. In the new model we additionally excluded several hundred of the most frequent words, together with all the words occurring fewer than three times. To compare the results, we used the topic distribution over time:

Figure 3: Stacked bar charts for the output of our model with two different stop word lists

On the left is the output of the previous model and on the right that of the new model. Visually, the outputs are quite similar: in both charts we can recognize similar patterns of time distribution. Looking at the most frequent words of the topics, we can also find the same topics in the two models. For instance, topic 23 on the left and topic 30 on the right share many words and can be considered the same topic. These results suggest that our model is robust with respect to changes in the stop word list.
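
This visual comparison can be made slightly more quantitative by matching topics across the two models through the overlap of their top words, for example with a Jaccard score. The sketch below assumes that the top words of each model were saved with MALLET's topic-keys output (one topic per line: index, weight, then the top words); the file names are hypothetical.

def load_top_words(path, top_n=20):
    # Read the top_n words of every topic from a topic-keys file.
    topics = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.split()
            topics.append(set(parts[2:2 + top_n]))   # skip topic index and weight
    return topics

def jaccard(a, b):
    return len(a & b) / len(a | b)

old_topics = load_top_words("keys_manual_stopwords.txt")      # hypothetical file names
new_topics = load_top_words("keys_frequency_stopwords.txt")

# For each topic of the old model, report its best match in the new one.
for i, old in enumerate(old_topics):
    j, score = max(((j, jaccard(old, new)) for j, new in enumerate(new_topics)),
                   key=lambda pair: pair[1])
    print(f"old topic {i:2d} ~ new topic {j:2d}  (Jaccard = {score:.2f})")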

The last analysis we carried out was to change the number of desired topics. We ran our new model searching for 10, 20, 30, 40 and 50 topics. These are the results for the two extreme cases:

Figure 4: Stacked bar charts for the output of our model when run with 10 (left) and 50 (right) topics

Words’ network

So far, we have focused mainly on topics: we built a distribution over topics for each document and studied the temporal shift of these distributions, considering words only at a later stage. An alternative approach, developed in [4] and explained hereunder, is to work directly on words. The final goal is to construct a weighted, oriented graph, where each node represents a word and the weight of an edge is a suitable proximity score between the endpoint terms. Nodes are then gathered according to modularity, and the resulting word clusters stand for our new topics, which can be compared with the topic modeling output.

After applying the same pre-processing routine already employed for topic modeling, from the resulting filtered set of terms we extract the N = 400 most frequent ones, which make up our vocabulary – the words we take into consideration. Since we are interested in time evolution, we again need to divide the dataset into time chunks. However, instead of employing equally long time intervals as before, we let the splitting be driven by a suitable criterion. Once we have computed the occurrences in the whole dataset for each word in the vocabulary, we can calculate the tf.idf score of word w in the t-th document:

A^t(w) = f^t(w) log(\frac {M} {f(w)}),

where f^t(w) and f(w) are the occurrences of w in the t-th document and in the corpus, respectively, while M = 22 is the total number of texts. Basically, the tf.idf gives a measure of how peculiar a word is to a certain document. We use the normalized tf.idf score vectors to compute a cosine-like dissimilarity between two documents t and t':

D(t, t') = 1 - \sum_{w \in W} A^t(w) A^{t'}(w)
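
These two quantities can be computed compactly in Python; the sketch below assumes that docs is the list of filtered token lists produced by the pre-processing step and that vocab holds the N = 400 most frequent terms (the names are ours).

import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    # Normalized tf.idf vectors A^t, following the formula above.
    M = len(docs)
    corpus_freq = Counter(tok for doc in docs for tok in doc)    # f(w)
    vectors = []
    for doc in docs:
        doc_freq = Counter(doc)                                  # f^t(w)
        vec = {w: doc_freq[w] * math.log(M / corpus_freq[w])
               for w in vocab if doc_freq[w] > 0 and corpus_freq[w] > 0}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})
    return vectors

def dissimilarity(v1, v2):
    # D(t, t') = 1 - sum_w A^t(w) A^t'(w), on the normalized vectors.
    return 1.0 - sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())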

Now we divide the texts into two groups, p^-(t) = [1:t] and p^+(t) = [t+1:M], and compute the average dissimilarity between documents belonging to the same period. We then take a weighted convex combination H(t) of the two resulting dissimilarities, with weights proportional to the lengths of the respective intervals. The idea is that a minimum of H(t) corresponds to a maximum of homogeneity within each period, so we take the minimum point of H(t) as the time cut. The left plot of figure 5 shows how this criterion suggests placing two cuts in the first quarter of the 17th century. The same procedure can be iterated on the third period (right plot of figure 5), so as to obtain a total of four time windows.

Figure 5: First (left) and second (right) time cuts
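
The time-cut search itself can then be sketched as follows, reusing the dissimilarity function defined above; the function names and the handling of the boundary cuts are our own choices.

def average_dissimilarity(vectors, indices):
    # Mean pairwise dissimilarity within one group of documents.
    pairs = [(i, j) for i in indices for j in indices if i < j]
    if not pairs:
        return 0.0
    return sum(dissimilarity(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def best_time_cut(vectors):
    # H(t): convex combination of the within-period average dissimilarities,
    # weighted by the lengths of the two periods; the cut is its minimum point.
    M = len(vectors)
    scores = {}
    for t in range(1, M):
        before, after = range(0, t), range(t, M)
        h = (len(before) * average_dissimilarity(vectors, before)
             + len(after) * average_dissimilarity(vectors, after)) / M
        scores[t] = h
    return min(scores, key=scores.get), scores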

We are now ready to build the graph between the words in the vocabulary. The following routine has been applied to the full dataset and to each time chunk, so hereunder by “set” we will denote one of the five possible collections of texts (the four time windows plus the full dataset).

First, for all pairs (w_1, w_2) \in W \times W and for all the texts in the set, we compute the co-occurrence rate f(w_1, w_2), i.e. the number of times that w_1 and w_2 appear in the same paragraph. Let us denote by I(w_1, w_2) the pointwise mutual information between w_1 and w_2:

I(w_1, w_2) = log \frac {p(w_1, w_2)} {p(w_1) p(w_2)} \cong log \frac {\frac {f(w_1, w_2)} {N}} {\frac {f(w_1)} {N} \frac {f(w_2)} {N}},

with f(w_1) and f(w_2) the occurrences in the set of w_1 and w_2, respectively, and N the total number of occurrences of vocabulary words in the set. The weight of the edge from w_1 to w_2 is given by their proximity score

S(w_1, w_2) = \frac {\sum_{c \in W \setminus {w_1, w_2}, I(w_1, c) > 0} min(I(w_1, c), I(w_2, c))} {\sum_{c \in W \setminus {w_1, w_2}, I(w_1, c) > 0} I(w_1, c)}
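
A possible Python sketch of these three quantities, under the assumption that each document of the current set is available as a list of paragraphs, each paragraph being a list of tokens, and that missing co-occurrences are simply treated as zero PMI (the names are ours):

import math
from collections import Counter
from itertools import combinations

def cooccurrence_counts(paragraphs):
    # f(w1, w2): number of paragraphs in which w1 and w2 appear together.
    co = Counter()
    for par in paragraphs:
        for w1, w2 in combinations(set(par), 2):
            co[(w1, w2)] += 1
            co[(w2, w1)] += 1
    return co

def pmi_scores(paragraphs, vocab):
    # I(w1, w2), estimated from the counts as in the formula above.
    occ = Counter(tok for par in paragraphs for tok in par if tok in vocab)
    N = sum(occ.values())             # total occurrences of vocabulary words in the set
    co = cooccurrence_counts(paragraphs)
    return {(w1, w2): math.log((f12 / N) / ((occ[w1] / N) * (occ[w2] / N)))
            for (w1, w2), f12 in co.items() if w1 in vocab and w2 in vocab}

def proximity(w1, w2, vocab, pmi):
    # S(w1, w2): share of w1's positive-PMI context that w2 also covers.
    num = den = 0.0
    for c in vocab:
        if c in (w1, w2):
            continue
        i1 = pmi.get((w1, c), 0.0)
        if i1 > 0:
            num += min(i1, pmi.get((w2, c), 0.0))  # missing pairs treated as 0
            den += i1
    return num / den if den else 0.0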

We remark that the proximity matrix S is not symmetric, so the graph is indeed oriented. Moreover, S is usually dense, resulting in many connections in the graph. In order to reduce the computational effort, we filter the graph, minimizing the number of edges while maximizing the sum of the weights and keeping the graph connected. In other words, we introduce a threshold \vartheta and consider only the edges whose weight is greater than \vartheta. Starting from \vartheta = 0, we keep increasing it and retain the largest threshold for which the filtered graph is still connected.
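
A possible implementation of this filtering with networkx is sketched below, assuming the proximity scores are stored in a dictionary S indexed by ordered word pairs; the step size and the use of weak connectivity for the oriented graph are assumptions on our part.

import networkx as nx

def filter_graph(S, step=0.01):
    # Keep only edges with weight above a threshold, retaining the largest
    # threshold for which the graph stays (weakly) connected.
    def build(threshold):
        g = nx.DiGraph()
        g.add_weighted_edges_from((w1, w2, s) for (w1, w2), s in S.items()
                                  if s > threshold)
        return g

    theta, graph = 0.0, build(0.0)
    while True:
        candidate = build(theta + step)
        if candidate.number_of_nodes() == 0 or not nx.is_weakly_connected(candidate):
            return theta, graph            # last threshold that kept connectivity
        theta, graph = theta + step, candidate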

Once the graph is set, we can cluster its nodes. To accomplish this task, we employ the code developed by Blondel et al. implementing their community detection algorithm, also known as the Louvain method [2]. Starting from an initial configuration where each node is assigned to a different community, words are moved from one cluster to another in order to maximize a cost function called modularity. Figures 6 and 7 illustrate the resulting networks for the full dataset and for the first interval. Nodes have been colored according to the cluster they belong to, while their size is proportional to their degree, i.e. the number of edges incident to the vertex. Both graphs have been drawn with the Gephi toolbox.
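
The authors of [2] provide their own implementation; an equivalent sketch in Python can rely on the python-louvain package, which optimizes modularity on undirected graphs, so projecting our oriented graph onto an undirected one is an assumption on our side rather than a faithful reproduction of the original pipeline.

import networkx as nx
import community as community_louvain      # the python-louvain package

# Assumption: graph is the filtered, weighted, oriented graph built above.
undirected = nx.Graph()
for u, v, data in graph.edges(data=True):
    w = data.get("weight", 1.0)
    if undirected.has_edge(u, v):
        undirected[u][v]["weight"] += w    # merge the two directions
    else:
        undirected.add_edge(u, v, weight=w)

partition = community_louvain.best_partition(undirected, weight="weight")

# Group words by the community they were assigned to.
clusters = {}
for word, comm in partition.items():
    clusters.setdefault(comm, []).append(word)
for comm, words in sorted(clusters.items()):
    print(comm, words[:10])                # first few words of each cluster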

We can detect a couple of shortcomings in the output networks, namely the large number of nodes per cluster and the difficulty of identifying a general theme among the words in the same group, which leaves room for different interpretations. Although the former is a well-known issue of modularity-based clustering, here another possible reason may lie in the excessive homogeneity of the texts and in the limited size of the dataset, which make the search for the relevant links hard.

Figure 6: Words’ network for the full dataset
Figure 7: Words’ network for the first interval

Conclusion

We have tested two different methods for a distant reading of a corpus of twenty-two relations by Venetian ambassadors: topic modeling and words’ networks. The former turned out to be effective for this task: looking at the most characteristic words of each topic, detecting a general trend was often straightforward. Moreover, averaging the per-document topic distributions over temporal subintervals gave us the chance to introduce a topology on the space of topics, yielding a clustering criterion at the topic level as well. On the other hand, although promising, the words’ network approach has proven not to be suited to a small, homogeneous dataset such as ours. We cannot exclude that it would work on a different dataset – further tests are needed, perhaps employing a clustering criterion not driven by modularity. In any case, working directly on words has some advantages, such as the possibility of splitting the texts in a meaningful fashion according to their dissimilarity. Finally, a distance between clusters belonging to consecutive periods could be introduced so as to detect how they split and merge over time [4]. This is left as a future development.

References

[1] Blei D. M. Probabilistic Topic Models. Communications of the ACM, Vol. 55, No. 4, 2012.

[2] Blondel V. D., Guillaume J., Lambiotte R., Lefebvre E. Fast unfolding of communities in large networks. arXiv, 2008.

[3] De Vivo F. How to Read Venetian Relazioni. Renaissance and Reformation, 34.1-2, Winter-Spring, 2011.

[4] Rule A., Cointet J., Bearman P. S. Lexical shifts, substantive changes, and continuity in State of the Union discourse, 1790-2014. PNAS, Vol. 112, No. 35, 2015.

[5] Villani S. Relazioni degli ambasciatori veneziani in Inghilterra. http://www.storiadivenezia.net/