In this blog post, we present the steps that led us to a first important result: grouping the contracts that refer to the same individual. We then introduce a web application that lets users explore the data and experience the grouping procedure in real time. Finally, we outline the final steps of our project.
To group contracts together, we proceed in several steps. We first compare each pair of contracts using a suitable distance and keep only the pairs whose distance is below a certain threshold. This first step links the contracts that are likely to refer to the same master. Based on these links, we then compute a coherence value for each contract. Finally, the grouping keeps only the contracts whose coherence value lies in a certain interval.
Having presented the overall steps of this grouping method, we now describe in more detail the distance measure we are using, illustrate why the coherence value matters, and finally discuss the different thresholds.
The purpose of this measure is to reflect how likely it is that the masters in two contracts are actually the same person (despite differences in the spelling of names and surnames, or even missing information). The lower the distance, the more likely the masters are the same person.
For each contract, we have several pieces of information about the master (name, surname, profession, father’s name, geographical origin and the name of the shop he works in), about the guarantor (name, surname and profession), and finally the starting year of the contract.
To compare two contracts, we divided their information into two groups:
- Type 1: information that needs to be the same in both contracts
- Type 2: information that helps if it is similar
For example, if the masters’ names are identical in two contracts, it is more likely that the masters are the same person, and if the names are completely different, the masters are likely to be different persons. The guarantor behaves differently: if the guarantor is the same in both contracts, it is a strong indication that the masters might be the same, but different guarantors do not imply that the masters are different.
Information of type 1 penalizes a mismatch by adding a positive value to the distance, whereas information of type 2 rewards a match by subtracting a positive value from it, which yields negative distances in certain cases. A negative distance indicates a close-to-perfect match on the type 1 information together with a good match on the type 2 information.
We combined all these “atomic” distances using weights that were chosen empirically. After several tests, we concluded that a threshold of 0.15 on the combined distance was the best choice. We are currently investigating a method that would allow us to estimate this parameter by performing a discriminant analysis on the projections of the distances onto a multidimensional space.
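The combination described above can be sketched as follows. This is only an illustration under stated assumptions: the field names, the weights and the use of `difflib.SequenceMatcher` as the atomic string similarity are all hypothetical, and fields missing from either contract are simply skipped. Only the threshold of 0.15 comes from our experiments.

```python
from difflib import SequenceMatcher

# Hypothetical weights and field names; the real weights were tuned empirically.
TYPE1_WEIGHTS = {"master_name": 0.5, "master_surname": 0.5}
TYPE2_WEIGHTS = {"guarantor_surname": 0.2, "master_profession": 0.1}
THRESHOLD = 0.15  # threshold reported in this post


def similarity(a, b):
    """Atomic string similarity in [0, 1] (assumption: difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def contract_distance(c1, c2):
    """Combine atomic distances into one signed distance."""
    distance = 0.0
    # Type 1: a mismatch adds a positive value to the distance.
    for field, weight in TYPE1_WEIGHTS.items():
        a, b = c1.get(field), c2.get(field)
        if a and b:  # assumption: missing fields are skipped
            distance += weight * (1.0 - similarity(a, b))
    # Type 2: a match subtracts a positive value from the distance.
    for field, weight in TYPE2_WEIGHTS.items():
        a, b = c1.get(field), c2.get(field)
        if a and b:
            distance -= weight * similarity(a, b)
    return distance


def likely_same_master(c1, c2, threshold=THRESHOLD):
    """Keep a pair only if its distance is below the threshold."""
    return contract_distance(c1, c2) < threshold
```

With this sketch, two contracts with identical master names and the same guarantor get a negative distance, which matches the behavior described above.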
As explained in the previous blog post, the coherence measure represents the connectivity of the subgraphs. To have a better understanding of this concept, we will present different examples.
A coherence value of 1 implies that the contract belongs to a complete subgraph (see Figure 1). The lower the coherence measure, the less connected the subgraph is. In Figure 2, the highlighted contract has a coherence value of 0.8 because it is the connecting node between two subgraphs. In Figure 3, the coherence value drops to 0.3 since the connectivity of the left (yellow) subgraph is high. Empirically, the best interval for the coherence measure was found to be between 0.9 and 1. Note that a coherence value of -1 means that the contract has no link at all: every other contract is at a distance greater than the threshold.
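To make the idea concrete, here is a minimal sketch of one possible coherence measure. It is an assumption made for illustration, not the definition from the previous post: it takes the contract together with its direct neighbors and returns the fraction of pairs in that group that are linked. It reproduces the two boundary cases mentioned above (1 for a complete subgraph, -1 for an unlinked contract), but it is not meant to reproduce the exact values shown in the figures.

```python
from itertools import combinations


def coherence(node, edges):
    """Density of the subgraph formed by `node` and its direct neighbors.

    `edges` is a set of frozensets, one per linked pair of contracts.
    Returns -1.0 when the contract has no link at all.
    """
    neighbors = {n for edge in edges if node in edge for n in edge if n != node}
    if not neighbors:
        return -1.0
    group = sorted(neighbors | {node})
    pairs = list(combinations(group, 2))
    linked = sum(1 for a, b in pairs if frozenset((a, b)) in edges)
    return linked / len(pairs)


def select_contracts(nodes, edges, low=0.9, high=1.0):
    """Keep the contracts whose coherence lies in [low, high]."""
    return [n for n in nodes if low <= coherence(n, edges) <= high]
```

In this sketch, a contract that bridges two otherwise separate triangles gets a lower value than the other contracts, so it is excluded by the interval from 0.9 to 1.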
Shiny Web Application
The Shiny web application is a good way to visualize and understand this grouping procedure. It shows the connectivity between the different contracts depending on the threshold distance (slider « Maximal Distance ») and the coherence interval (slider « Coherence Range »). If two nodes are connected by an edge, the grouping procedure considers that the two corresponding masters are actually the same individual.
Since the dataset has already been classified manually, we can test our procedure. Two nodes with the same color were linked manually. The tab « Classification Error » summarizes how well the program is doing. There are three possibilities: the link was detected by both the program and the human, the program misses a connection that the human has seen, or the program finds a connection that the human has not. This table is only an indication of the behavior of the procedure. Manual checks should be done in order to get a better understanding.
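The three cases can be counted with a few set operations. The sketch below is an illustration, not the code of the application: the function name and the input format (a manual ID per contract and a list of links found by the program) are assumptions.

```python
from itertools import combinations


def classification_summary(manual_ids, predicted_links):
    """Compare the program's links with the manual classification.

    manual_ids: dict mapping a contract index to its manually assigned ID.
    predicted_links: iterable of (contract, contract) pairs found by the program.
    """
    # Two contracts are linked manually when they share the same ID.
    manual = {
        frozenset((a, b))
        for a, b in combinations(sorted(manual_ids), 2)
        if manual_ids[a] == manual_ids[b]
    }
    predicted = {frozenset(pair) for pair in predicted_links}
    return {
        "found_by_both": len(manual & predicted),
        "missed_by_program": len(manual - predicted),
        "found_only_by_program": len(predicted - manual),
    }
```

Counting pairs rather than contracts keeps the three cases mutually exclusive, so every link falls into exactly one of them.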
To do this manual check, the user only needs to click on the node of interest and then enter its index in the box « Select An Index ». There are two ways of comparing contracts: the user can look at the contracts that have been linked together manually (option « With the Same Id ») or at all contracts whose distance is at most the maximal distance (option « Less than the max. Distance »). The corresponding contracts are displayed in the box « Contract Comparison ». The first option is usually used when two contracts that should be linked are not. The second is used when unexpected connections appear and we want to see which contracts are linked together.
The last result we would like to achieve is a social graph representing the connections between members of Venetian society during the 16th and 17th centuries. Using a preprocessed dataset provided by our supervisor, the idea is to picture each master and each guarantor as a node and to connect two nodes if they are cited in the same contract. However, this network will be built from the raw data, that is, with no disambiguation.
The igraph package in R is a powerful tool for visualizing and analyzing network data. It requires loading two dataframes, the first one containing vertex information and the second one containing edge information. We decided to only consider masters and guarantors in our graph, as the apprentice is simply the individual who links a master with a guarantor in a given contract. This approach assumes that an apprentice cannot have more than one master, an assumption that is violated in only a negligible number of contracts.
More precisely, each vertex has a certain number of attributes, namely the ID, the type (master or guarantor), and the list of contracts in which the individual is cited. We could potentially add even more attributes, such as profession and geographical origin, while bearing in mind that a trade-off has to be found between the quantity of information displayed and the readability of the graph.
We would then add edges between the individuals who are linked in a contract. The igraph package can represent these attributes by varying colors, sizes and the thickness of the edges. Using concepts from graph theory, we can then compute vertex centrality measures and clustering coefficients, and also retrieve connected components in order to gain insight into the groups that formed in Venetian society.
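As a rough illustration of this construction, the sketch below builds the graph and computes two of the quantities mentioned above in plain Python. It does not use igraph, and the record format (one dictionary per contract with hypothetical `master` and `guarantor` keys) is an assumption made for the example.

```python
from collections import defaultdict, deque


def build_graph(contracts):
    """Link the master and the guarantor cited in the same contract."""
    adjacency = defaultdict(set)
    for contract in contracts:
        master, guarantor = contract["master"], contract["guarantor"]
        adjacency[master].add(guarantor)
        adjacency[guarantor].add(master)
    return dict(adjacency)


def degree_centrality(adjacency):
    """Number of neighbors divided by the maximum possible number."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


def connected_components(adjacency):
    """Retrieve the groups of individuals with a breadth-first search."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components
```

Each connected component is a candidate social group: every individual in it can be reached from any other through a chain of shared contracts.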
In future research, we could use the disambiguation process to collapse several nodes that refer to the same individual into one.