In this project, we build a similarity measure enabling the disambiguation of entities listed in the “Accordi dei Garzoni” database of contracts between Venitian masters and apprentices. We then derive a social graph representing the different relationships between masters and guarantors.
Step 1: Disambiguation using a similarity measure
The following figure gives an idea of where the data we are working with comes from.
One of the two main objectives of this project was to be able to find all contracts for a given person in which this entity was mentioned. To do that, we have constructed a measure that computes the distance between each entity in the set of all available contracts using the provided attributes.
The measure is based on the following assumptions: information is not necessarily discriminant in the sense that low similarity does not imply that entities are different. On the other hand, high similarity implies it is probable for both entities to actually be the same person (e.g. if the contracts’ starting years have a difference of 20 years it does not imply that the entities specified in these contracts are different, but the fact that the difference is 3 years gives a good indication of similarity).
To deal with this, we have created a measure that incorporates two types of distances. The first type only penalizes (increases) the distance between two entities (e.g. if the names are different), while the other only reduces the distance if the showcased information is similar (e.g. if the geographical origins are close to each other). From this we then heuristically weighted each “atomic” distance to produce a “global” distance (we define a distance to be atomic if it is a distance between the attributes of two contracts, as opposed to a “global” distance that combines all these atomic distances).
Since we not only have one type of entity, the measure exhibits little variation when comparing two masters, a master with a guarantor or two guarantors.
Using this measure, we are now able to compute the distances between each entity present in the set of contracts. At this point, we need to talk about computational complexity.
Since each contract is compared with all the others, the computational time of this procedure is quadratic in the number of contracts. Furthermore, since a great number of operations is done at each step, the time to compute all these distances grows quickly. We propose two ways to tackle this issue. The first consists in doing some pre-processing work: for example, we can group all entities by alphabetical order and only compare those entities for which the names start with the same letter. The other approach is to implement a divide-and-conquer-type method that groups the entities into different subsets of contracts and then collapses all groups into one entity summarizing the entire group. However, none of these methods have been tested and we need to dig deeper into their analysis in order to set them up. Nevertheless, we have been able to test our grouping procedure on a set of 300 contracts. We will discuss the efficiency of this procedure later on.
In order to decide whether two entities are in fact the same person, we use a threshold. All pairs of entities at a distance smaller than the threshold are supposed to be the same person and all pairs of entities at a distance above the threshold are supposed to be different. It is easy to see that this selection procedure is very strict. We also have some interesting cases, where a master is linked with two guarantors but these two guarantors are not linked together (we say that two entities are linked if, given the threshold, they are supposed to be the same person).
Intuitively, this link structure is not complete since either both guarantors should be linked or the master should not have been linked to at least one of the two guarantors. In practice, we have seen that compensating this lack of efficiency by logically completing the links highly increases the proportion of found links. This is showcased in Figure 2. From Figure 3, we can see that this completion does not create links that should not exist. Indeed, the number of elements that should not have been linked together stays very low while the number of found links quickly reaches a very high level.
Thus, for this first part of the project, we have been able to construct a similarity measure that determines whether or not two entities are in fact the same person. This step will be of great help in order to classify the different entities that will be transcribed in the following months.
To improve these results one could increase the number of explanatory variables when computing the distance between two entities, create a statistical model for the estimation of the measure’s parameters using Linear Discriminant Analysis or even logistic regression and finally implement an algorithm which might be able to speed up the computation of the similarity measures.
Step two: Social graph
The second major task of this project was to build a social graph from a pre-processed dataset containing 9559 contracts that was provided to us by Maud Ehrmann, who joined the Digital Humanities Lab during the semester.
The underlying idea is to represent each master and guarantor as a node and connect them if they are cited in the same contract. We used the igraph package in R, a powerful tool for visualizing and analyzing network data. It requires loading two dataframes, one containing the information on the vertices and another one containing the information on the edges.
We decided to only consider masters and guarantors in our graph, as the apprentice is simply the individual which links a master with a guarantor in a given contract. This approach relies on the assumption that an apprentice cannot have more than one master, which in fact happens for only a negligeable amount of contracts.
Surprisingly, we observe that the connectivity of the graph is pretty low. In other words, a lot of pairs of nodes are isolated. For this reason, we decided to focus on a subgraph of the social graph containing only the clusters with at least three elements. These clusters represent only 9% of the total number of vertices. The subgraph we are interested in is represented in Figure 4.
The igraph package was designed to represent these attributes by playing with the color, the size and the thickness of the edges. We used three colors to distinguish nodes representing masters, guarantors and those individuals which are both masters and guarantors. The size of the nodes is related to their degree, that is the number of edges for which the considered node is an endpoint. This has the effect of inflating the nodes which are at the « center » of a cluster.
We observe two major categories of structures : on one hand we have star-shaped clusters which suggest that a master appears in a certain number of contracts and chooses a different guarantor each time, and on the other hand we have tree-shaped clusters suggesting that a master knows a guarantor who knows another master etc…
To conclude, this social graph highlights in an elegant way the different relationships that were present in the Venetian society during the 16th and 17th centuries. We must however bear in mind that the connectivity of the graph is very low, hence there is only a very limited amount of data that is sufficiently interesting to be extracted and can then be compared with the existing literature.
We have managed to follow the milestones we set at the beginning of the semester and we believe that our work can be reused in order to account for the transcriptions that will be done in the next months.
In further work, we could try to determine what are the chances that indivuduals at endpoints of a tree-structured cluster know each other or not. Another direction for further development would be to create a dynamic graph where we could access information on the individuals represented in the graph by clicking on the nodes.
 Statistical Analysis of Network Data with R. E.D. Kolaczyk, G. Csardi, Springer, 2014.