Topic modeling of ambassadors’ relazioni from England

Introduction

For centuries, Venice sent ambassadors to other European states. The ambassadors held their posts for years and gathered valuable information about each country. On returning to Venice, each ambassador delivered a relazione, an oral report on the government, politics and diplomacy of the host country, as well as its geography and economy, in front of the Venetian authorities. These reports were later transcribed and printed, and eventually came to be recognized as a genre in their own right.

Relazioni played a vital role in Venetian diplomacy. They were regarded as manuals of politics and even circulated outside Venice, not only because they covered a large number of topics and states, but also because they were well preserved.

These features make the relazioni an important source for identifying trends in Europe during the centuries between the Renaissance and the French Revolution. Our goal is to use two different approaches to capture two main trends:

  1. study how linguistic style changed over time with a stylometry approach
  2. study how topics in relazioni shifted over time with a topic modeling approach

Our first source will be the partial collection of relazioni from ambassadors to England between 1489 and 1763, on which we will use a train-and-test strategy. We will then extend our dataset to other relazioni found on the internet.

To clarify our intentions, we give some examples from this first data set. With stylometry, we would like to capture changes in linguistic style. For example, to express the meaning of the English word "and", the Latin word "et" was used in the earliest relazioni, whereas the Italian word "e" is dominant in more recent ones. With topic modeling, we could identify topics whose importance changed over time, such as religion, due to the spread of the Reformation, or party politics, with the Whigs and the Tories. We could also detect changes in how a topic was treated. For example, in the earliest relazioni the ambassadors referred to beer as something largely unknown in Venice (“bevande, alle quali poi danno il nome di birra”, that is, "drinks, to which they then give the name of beer", Falier 1531), while later on they showed familiarity with it (“l’accisa sopra la birra”, that is, "the excise on beer", Mocenigo 1706).

Methodology

To identify trends in linguistic style and shifts in topics, we will combine two methods: stylometry and topic modeling.

1. Stylometry

The word ‘stylometry’ literally means ‘measurement of style’. In our project, we use computer software to recognize and compare the patterns that lie within the collection of relazioni. We plan to use software such as Stylo or Signature.

Through computational methods such as machine learning, these tools can reveal subtle textual characteristics: the frequency of function words, the tendency to choose one form of a conjunction over another, and even the prevalence of certain word combinations (such as ‘poor peasants’, ‘powerful nobles’ and so on).
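
As an illustration of the simplest of these features, the following sketch computes relative frequencies of a few function words. It is not how Stylo or Signature work internally; the word list, the function name and the two toy sentences are our own assumptions, chosen to mirror the "et" versus "e" example from the introduction.

```python
import re
from collections import Counter

# Hypothetical short list of function words; a real study would use a fuller list.
FUNCTION_WORDS = ["et", "e", "che", "di"]

def function_word_profile(text, words=FUNCTION_WORDS):
    """Return the relative frequency of each function word in `text`."""
    # Keep runs of letters only (accented letters included), lower-cased.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {w: counts[w] / total for w in words}

# Invented toy sentences, not quotations from the relazioni.
early = "il re et la regina et il parlamento"
late = "il re e la regina e il parlamento"

early_profile = function_word_profile(early)
late_profile = function_word_profile(late)
print(early_profile["et"], late_profile["e"])
```

Comparing such profiles between texts or periods is one way to quantify the kind of stylistic drift we are looking for.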

In our project, stylometry serves as a supplementary tool for studying the relationship between stylistic features and topics. Because the relazioni span many volumes, features such as word frequencies, word choices, sentence structures and sentence lengths differ from one text to another. We are interested in whether these features correlate with different topic categories.

Stylometric analysis of these features will first be carried out on individual texts, and each record of feature data will be indexed by the name of the text it comes from. After that, we need to adopt the same categories as the topic modeling, so that the two analyses can be compared and the historical trends behind the texts can be uncovered.

A first method is to divide the years covered by the relazioni into several periods, for example periods of 50 years or the reigns of different British rulers. We then assign each text to the period in which it was written, which links stylistic features to periods. If time allows a deeper study, the texts could also be categorized by the writers’ political views, the writers’ age, and so on.

By placing the stylometric feature data and the topics under the same categories, we can then readily examine the relationship between the two. A database might be built for this purpose.

Meanwhile, some confounding factors should be removed from the data. For example, if a word is used in one era but falls out of use in another, we should either drop it from the database or replace it with its counterpart that serves the same function in the new era.
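
The fixed-width variant of this period division could be sketched as follows. The start year 1489 comes from the collection described in the introduction; the function names and the (name, year) representation of a text are assumptions made for the example.

```python
from collections import defaultdict

def period_of(year, start=1489, width=50):
    """Return (first_year, last_year) of the fixed-width period containing `year`."""
    index = (year - start) // width
    first = start + index * width
    return (first, first + width - 1)

def group_by_period(documents, start=1489, width=50):
    """Group (name, year) pairs by the period in which they were written."""
    groups = defaultdict(list)
    for name, year in documents:
        groups[period_of(year, start, width)].append(name)
    return dict(groups)

# The two relazioni quoted in the introduction.
docs = [("Falier", 1531), ("Mocenigo", 1706)]
groups = group_by_period(docs)
print(groups)
```

Dividing by reigns instead would only require replacing `period_of` with a lookup in a table of regnal years.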

2. Topic modeling

As said in the introduction, we focus on detecting discourses about England and its relationship with Venice: this is where topic modeling comes into play. Topic modeling refers to a range of probabilistic tools that aim to find recurring topics in a large corpus of texts. Among these models, we adopt Latent Dirichlet Allocation (LDA). Formally, a topic is defined as a probability distribution over words. Starting from the documents (the observed data), the algorithm estimates the probability that each word belongs to each topic and, for each document, a distribution over topics. In the end, we obtain a list of topics and a list of the most relevant words per topic.

As with stylometry, the model needs to be learned, so we split our clean dataset into a training set and a test set. How we split depends on how we are going to evaluate the algorithm. One option is to identify paragraphs in the documents, split them into training and test sets, let the algorithm learn the model on the training paragraphs, and then perform a manual check, namely read the test paragraphs and verify the topics assigned to them. Working with paragraphs instead of full texts saves time. Since we do not know in advance how many topics will be of interest, we train the model with several different numbers of topics.
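
To make the inputs and outputs of LDA concrete, here is a minimal sketch of one common inference scheme, collapsed Gibbs sampling, written in plain Python. It is not the implementation used by any of the software packages we will test, and the hyperparameters `alpha` and `beta`, the iteration count and the toy documents are assumptions chosen only for illustration.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: document-topic, topic-word, and tokens per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token from the counts ...
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # ... and resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed estimate of each document's distribution over topics.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    # Most frequent words currently assigned to each topic.
    top_words = [
        [w for w, c in topic_word[k].most_common(3) if c > 0]
        for k in range(n_topics)
    ]
    return theta, top_words

# Invented toy documents (already tokenised), not taken from the relazioni.
toy_docs = [
    ["re", "parlamento", "corte", "re", "parlamento"],
    ["birra", "accisa", "commercio", "birra", "accisa"],
    ["re", "corte", "parlamento", "corte"],
    ["commercio", "birra", "accisa", "commercio"],
]
theta, top_words = lda_gibbs(toy_docs, n_topics=2)
print(theta)
print(top_words)
```

The two returned structures correspond to the two outputs described above: `theta` holds the distribution over topics for each document, and `top_words` lists the most relevant words per topic.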
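
A paragraph-level split of this kind might look like the sketch below. The 80/20 proportion, the fixed random seed and the candidate topic counts are assumptions for the example, not decisions we have made.

```python
import random

def split_paragraphs(paragraphs, test_fraction=0.2, seed=42):
    """Shuffle paragraphs reproducibly and split them into train and test lists."""
    shuffled = list(paragraphs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder texts standing in for real paragraphs.
paragraphs = [f"paragraph {i}" for i in range(10)]
train_set, held_out = split_paragraphs(paragraphs)

# Candidate numbers of topics to train with; the values are illustrative only.
candidate_topic_counts = [5, 10, 20]
print(len(train_set), len(held_out))
```

One model would then be trained on `train_set` for each value in `candidate_topic_counts`, and the paragraphs in `held_out` read by hand to check the topics assigned to them.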

Two other important issues are which software to use and how to label the topics. We have not yet decided on the software but, as mentioned in the milestones, we will spend the first two weeks testing three packages: the Stanford Topic Modeling Toolbox, MALLET and Gensim. For the labels, a standard approach is to denote a topic by one of its most characteristic words and to intervene manually when this could be misleading.

Once the model is learned, we can apply it to the new documents available on the internet. Our goal is not merely to detect topics but also to look at how they shift over time. So, as for stylometry, we split the documents into temporal chunks, apply the algorithm to each period, and then analyze whether and how the associated words change over the years. However, we must again take into account that language is continuously evolving. A word may disappear from a topic not because it is no longer relevant but simply because it is no longer used. This suggests combining topic modeling with stylometry, both to understand how linguistic features influence the discourses of the relazioni and to detect which topic shifts are actually meaningful.

A meaningful visualization of a topic model is essential for interpreting the results. This is one of the current challenges in digital humanities, and different solutions have been proposed. We can start by computing the topic-by-topic correlation matrix and then plot it as a network graph or project it with PCA (Principal Component Analysis). Combined with a spreadsheet listing all the topics and their related words, this lets us find the topics we need and the relations among them.
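
The labelling rule can be sketched as follows. Taking the single highest-weight word as the label and flagging topics that end up sharing a label are our own simplifications of "most characteristic" and "potentially misleading"; the topic weights are invented.

```python
def label_topics(topic_words):
    """Label each topic by its highest-weight word and flag duplicate labels.

    `topic_words` maps a topic id to a dict of word -> weight.
    """
    labels = {t: max(ws, key=ws.get) for t, ws in topic_words.items()}
    # Collect the topics that share each label.
    by_label = {}
    for t, label in labels.items():
        by_label.setdefault(label, []).append(t)
    # Topics whose label is not unique need a manual decision.
    needs_review = sorted(t for ts in by_label.values() if len(ts) > 1 for t in ts)
    return labels, needs_review

# Invented topic-word weights for illustration.
topics = {
    0: {"chiesa": 0.30, "re": 0.10},
    1: {"re": 0.25, "parlamento": 0.20},
    2: {"re": 0.40, "corte": 0.15},
}
labels, needs_review = label_topics(topics)
print(labels, needs_review)
```

Here topics 1 and 2 both receive the label "re", so they would be passed to a human reader for relabelling.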
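
One possible way to separate the two cases is sketched below: for every word that leaves a topic between two periods, we check whether it still occurs at all in the later period. The function name, the `min_count` threshold and the toy word lists and counts are assumptions for the example.

```python
def explain_dropped_words(early_top, late_top, late_corpus_counts, min_count=1):
    """For words that leave a topic, tell a linguistic change from a topic shift."""
    result = {}
    for word in early_top:
        if word in late_top:
            continue  # still part of the topic, nothing to explain
        if late_corpus_counts.get(word, 0) < min_count:
            result[word] = "no longer used"  # the word left the language
        else:
            result[word] = "topic shift"     # still used, but not in this topic
    return result

# Invented top words of one topic in two periods, and invented word counts
# for the whole later period.
early_top = ["et", "papa", "chiesa"]
late_top = ["chiesa", "parlamento"]
late_counts = {"papa": 12, "chiesa": 30, "parlamento": 25}

dropped = explain_dropped_words(early_top, late_top, late_counts)
print(dropped)
```

Only the words marked "topic shift" would count as evidence that the discourse itself has changed.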
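
The correlation step and the construction of the graph could be sketched as follows; the PCA alternative is not shown. The Pearson correlation is computed over the documents' topic weights, and the 0.5 threshold for drawing an edge, like the toy matrix, is an assumption.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def topic_edges(doc_topic, threshold=0.5):
    """Return (i, j, r) for topic pairs whose document weights correlate strongly."""
    n_topics = len(doc_topic[0])
    # One column of weights per topic, across all documents.
    columns = [[row[k] for row in doc_topic] for k in range(n_topics)]
    edges = []
    for i in range(n_topics):
        for j in range(i + 1, n_topics):
            r = pearson(columns[i], columns[j])
            if r >= threshold:
                edges.append((i, j, round(r, 3)))
    return edges

# Invented document-topic weights: one row per document, one column per topic.
doc_topic = [
    [0.6, 0.3, 0.1],
    [0.4, 0.2, 0.4],
    [0.2, 0.1, 0.7],
]
edges = topic_edges(doc_topic)
print(edges)
```

Each returned triple is an edge of the network graph, with the correlation as its weight; the list can be handed to any graph-drawing tool.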

Milestones

  • Week 1: Gather information. Study software and tools.
  • Week 2: Test different software for topic modeling and R packages.
  • Weeks 3-6: Train on the “clean” datasets. Make experimental datasets to evaluate the model. Repeat the training-testing process until we obtain a satisfactory model.
  • Weeks 7-9: Apply the model to the expanded data sets. Retrieve and clean the results.
  • Weeks 10-13: Post-processing. Visualize and compare the results. Identify trends and changes in topics and writing style.
  • Week 14: Presentation.

Group members

Francesco Migliorini, Jiacheng Jiang, Stefano Ubbiali, Jiaxi Gu

References

  1. De Vivo F. How to read Venetian relazioni. Renaissance and Reformation, 2011, 34(1-2): 25-59.
  2. Blei D M. Probabilistic topic models. Communications of the ACM, 2012, 55(4): 77-84.
  3. Eder M, Rybicki J, Kestemont M. ‘Stylo’: a package for stylometric analysis, 2012.
  4. Scarfato, C. I. Connections in the Italian Literature at the Turn of the 19th and 20th Centuries.
  5. Storia di Venezia, Relazioni degli ambasciatori veneziani in Inghilterra, 2012.