Topic Modeling of ambassadors’ relations from England – Third progress blogpost

According to our original plan, at this point of the semester we should have already applied TM to a wider dataset – the Internet Archive. However, following our TA’s suggestion, we decided not to move to a new, unexplored database but to approach our cleaned dataset again with a new method. So far we have mainly focused on topics: we built a distribution over topics for each document and we saw how these distributions vary over time; the words themselves were only considered at a later stage. The new idea, developed in [1] and explained below, is instead to work directly on words. The final goal is to construct a weighted directed graph, where each node represents a word and the weight of an edge is a suitable proximity score between its end-point terms, and to detect clusters of words within the graph. These clusters stand for our new topics, which will be compared with our original TM output.

Methodology and results

First of all, we need to pre-process the text and remove the stopwords. We use the same list of Italian stopwords already employed for training our topic model. From the resulting filtered set of terms, we extract the N (e.g. N = 500) most frequent ones, which make up our vocabulary, i.e. the words we take into account. Since the vocabulary still contains undesired terms, such as adverbs or plural forms, we manually detect and remove them in order to obtain a new vocabulary. The procedure is iterated until we reach a satisfying result, as sketched below.
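
A minimal sketch of this step in Python, assuming the corpus is available as a list of raw document strings and that the stopword list and blacklist are passed in (all names are illustrative):

from collections import Counter
import re

def build_vocabulary(documents, stopwords, n_top=500, manual_blacklist=()):
    # Return the n_top most frequent terms, excluding stopwords and the
    # manually blacklisted terms (adverbs, plural forms, ...).
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-zàèéìòù]+", doc.lower())   # crude tokenizer for Italian text
        counts.update(t for t in tokens
                      if t not in stopwords and t not in manual_blacklist)
    return [w for w, _ in counts.most_common(n_top)]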

Since we are interested in time evolution, we would like to split our dataset into two time intervals according to some criterion. For this purpose, once we have computed the occurrences of each vocabulary word in the whole dataset, we can compute the tf.idf score of word w in the t-th document:

A^t(w) = f^t(w) log(\frac {M} {f(w)})

where f^t(w) and f(w) are the occurrences of w in the t-th document and in the whole corpus, respectively, while M is the total number of texts. Basically, the tf.idf score measures how peculiar a word is to a certain document. We use the normalized tf.idf vectors to compute a cosine-like dissimilarity between two documents t and t':

D(t, t') = 1 - \sum_{w \in W} A^t(w) A^{t'}(w)
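
The following sketch computes the normalized tf.idf vectors and the dissimilarity matrix with NumPy, assuming the raw counts f^t(w) are already collected in an M x |W| matrix (variable names are ours):

import numpy as np

def tfidf_dissimilarity(term_counts):
    # term_counts: M x |W| matrix of raw counts f^t(w), documents in chronological order.
    # Assumes every document contains at least one vocabulary word and every
    # vocabulary word occurs at least once in the corpus.
    M = term_counts.shape[0]
    corpus_freq = term_counts.sum(axis=0)            # f(w)
    A = term_counts * np.log(M / corpus_freq)        # A^t(w) = f^t(w) log(M / f(w))
    A /= np.linalg.norm(A, axis=1, keepdims=True)    # normalise each document vector
    return 1.0 - A @ A.T                             # D(t, t') = 1 - sum_w A^t(w) A^{t'}(w)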

Now we divide the texts into two groups, p^-(t) = [1:t] and p^+(t) = [t+1:M], and compute the average dissimilarity between documents belonging to the same period. Then we take a weighted convex combination H(t) of the two resulting dissimilarities, where the weights are proportional to the lengths of the respective intervals. The idea is that a minimum of H(t) corresponds to a maximum of homogeneity within each period, so we take the minimizer of H(t) as the time cut. Figure 1 shows how this criterion suggests cutting the dataset between the seventh and the eighth Relazione, i.e. at the beginning of the seventeenth century.

[Figure 1: first time cut]
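
A possible implementation of the cut criterion, assuming D is the M x M dissimilarity matrix computed above (the function names are ours):

import numpy as np

def average_within(D, idx):
    # Average dissimilarity between distinct documents with indices in idx.
    sub = D[np.ix_(idx, idx)]
    n = len(idx)
    return sub.sum() / (n * (n - 1)) if n > 1 else 0.0

def best_time_cut(D):
    M = D.shape[0]
    H = np.full(M, np.inf)
    for t in range(1, M):                            # first period: documents 0..t-1, second: t..M-1
        before, after = np.arange(t), np.arange(t, M)
        # Convex combination with weights proportional to the interval lengths.
        H[t] = (t * average_within(D, before) + (M - t) * average_within(D, after)) / M
    return int(np.argmin(H)), H                      # the minimiser of H gives the time cut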

We can now apply the same procedure to the second period, and so on, to obtain a total of four time sections.

[Figure 2: second time cuts]

We are now ready to build the graph on the words of the vocabulary. The following routine has been applied to the full dataset and to each time chunk, so hereunder “set” will denote one of the five possible intervals.

First, for all pairs (w_1, w_2) \in W \times W and for all the texts in the set, we compute the co-occurrence count f(w_1, w_2), i.e. the number of times that w_1 and w_2 appear in the same paragraph. Let us denote by I(w_1, w_2) the pointwise mutual information between w_1 and w_2:

I(w_1, w_2) = log \frac {p(w_1, w_2)} {p(w_1) p(w_2)} \cong log \frac {\frac {f(w_1, w_2)} {N}} {\frac {f(w_1)} {N} \frac {f(w_2)} {N}},

with f(w_1) and f(w_2) the occurrences of w_1 and w_2 in the set, respectively, and N the total number of occurrences of vocabulary words in the set. The weight of the edge from w_1 to w_2 is given by their proximity score

S(w_1, w_2) = \frac {\sum_{c \in W \setminus {w_1, w_2}, I(w_1, c) > 0} min(I(w_1, c), I(w_2, c))} {\sum_{c \in W \setminus {w_1, w_2}, I(w_1, c) > 0} I(w_1, c)}
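
A sketch of how the PMI matrix and the proximity scores can be computed, assuming the paragraph-level co-occurrence counts and the word occurrences for the set are already available; pairs that never co-occur are assigned I = 0 here, which is our own simplification:

import numpy as np

def proximity_matrix(cooc, freq):
    # cooc: |W| x |W| paragraph-level co-occurrence counts f(w1, w2)
    # freq: occurrences f(w) of each vocabulary word in the set
    N = freq.sum()                                   # total occurrences of vocabulary words
    expected = np.outer(freq, freq) / N              # f(w1) f(w2) / N
    pmi = np.zeros(cooc.shape)
    nz = cooc > 0
    pmi[nz] = np.log(cooc[nz] / expected[nz])        # I(w1, w2); never co-occurring pairs kept at 0 (our simplification)
    V = len(freq)
    S = np.zeros((V, V))
    for w1 in range(V):
        ctx = pmi[w1] > 0                            # contexts c with I(w1, c) > 0
        ctx[w1] = False
        for w2 in range(V):
            if w2 == w1:
                continue
            mask = ctx.copy()
            mask[w2] = False                         # c ranges over W \ {w1, w2}
            denom = pmi[w1, mask].sum()
            if denom > 0:
                S[w1, w2] = np.minimum(pmi[w1, mask], pmi[w2, mask]).sum() / denom
    return S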

We remark that the proximity matrix S is not symmetric, so the graph is actually directed. Moreover, S is usually dense, resulting in many connections in the graph. In order to reduce the computational effort, we filter the graph, minimizing the number of edges while maximizing the sum of the weights and keeping the graph connected. In other words, we introduce a threshold \vartheta and consider only the edges whose weight is greater than \vartheta. Starting from \vartheta = 0, we keep increasing it until the filtered graph is no longer connected.
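
A possible way to carry out the filtering with networkx; connectivity of the directed graph is checked in the weak sense, which is an assumption on our part, and the threshold step is arbitrary:

import networkx as nx
import numpy as np

def filtered_graph(S, vocab, step=0.01):
    # Build the directed graph whose edges have proximity above theta.
    def graph_above(theta):
        G = nx.DiGraph()
        G.add_nodes_from(vocab)
        for i, j in zip(*np.nonzero(S > theta)):
            G.add_edge(vocab[i], vocab[j], weight=float(S[i, j]))
        return G

    theta, G = 0.0, graph_above(0.0)
    while True:
        candidate = graph_above(theta + step)
        if not nx.is_weakly_connected(candidate):
            return G, theta        # largest tested threshold keeping the graph connected
        theta, G = theta + step, candidate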

Once the graph is built, we can cluster its nodes. To accomplish this task, we employ the code developed by Blondel et al. implementing the community detection algorithm also known as the Louvain method [2]. Starting from an initial configuration where each node is assigned to a different community, words are moved from one cluster to another in order to maximize a cost function called modularity. So far, however, this algorithm has not produced the results we expected. For the full dataset, it tends to create very large clusters (only three clusters for a 500-word vocabulary). This may be due to the limited amount of data at our disposal and to the homogeneity of the topics throughout the dataset, which results in a highly connected graph even after filtering. Conversely, for the single time intervals even one-word topics may appear.
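
For illustration, a similar clustering can be obtained with networkx’s built-in Louvain implementation as a stand-in for the original code by Blondel et al.; collapsing the directed graph to an undirected one by summing reciprocal edge weights is our own choice:

import networkx as nx

def word_communities(G_directed, resolution=1.0, seed=0):
    # Collapse the directed graph to an undirected one by summing the
    # weights of reciprocal edges (our own choice, not prescribed by [2]).
    G = nx.Graph()
    G.add_nodes_from(G_directed.nodes())
    for u, v, data in G_directed.edges(data=True):
        w = data.get("weight", 1.0)
        if G.has_edge(u, v):
            G[u][v]["weight"] += w
        else:
            G.add_edge(u, v, weight=w)
    # Louvain greedily moves nodes between communities to maximise modularity.
    return nx.community.louvain_communities(G, weight="weight",
                                            resolution=resolution, seed=seed)

The resolution parameter is exposed only because, in modularity-based methods, values larger than 1 tend to favour smaller communities, which might help with the oversized clusters mentioned above.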

Upcoming

We will first try to tackle the problems with the clustering phase of the new algorithm. A possible solution would be to consider not the most frequent words in the dataset but the terms with the highest tf.idf scores in each document. Another possibility is to replace the modularity-based community detection algorithm with an edge-betweenness-based one, also provided by Gephi – an open-source graph visualization tool which we will use to present our results.

References

[1] Rule A., Cointet J. P., Bearman P. S. Lexical shifts, substantive changes, and continuity in State of the Union discourse, 1790–2014. Proceedings of the National Academy of Sciences, vol. 112, no. 35, 2015.

[2] Blondel V. D., Guillaume J. L., Lambiotte R., Lefebvre E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008.