# Network Analysis of the social graph of Venetian citizens based on the Garzoni font

Progress Blog Post 2

As mentioned in our milestones for the semester, we have spent the last few weeks developing a probability measure that will enable us to tell contracts apart. We decided to restrict our attention to the four variables that appear most relevant for describing a contract: the master’s name, the master’s surname, the master’s profession and the enrolment year of the apprentice. We implemented the Levenshtein distance, which measures the dissimilarity between two strings of characters. It can be seen as a generalization of the Hamming distance: where the Hamming distance only counts substitutions between strings of equal length, the Levenshtein distance gives the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. For each pair of contracts, we can thus compute the distances between names, surnames, professions and enrolment years, add them up with weights (chosen provisionally for the moment), and store the results in a dataframe. The function masterMeasure was specifically designed to represent these distances in a dataframe format. For missing data, we created, for each variable, a boolean indicator that equals 1 if the value is missing and 0 otherwise. We also took advantage of parallel computing, using 2 or more processors to speed up our computations.

In the coming weeks, we will look more closely at the choice of the weights in order to get the best possible measure. We would also like to relax the distances between names and between surnames to account for potential typographic mistakes in the transcription process.
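To make the construction above concrete, here is a minimal sketch in Python of the Levenshtein distance and of a weighted distance between two contracts. It is not the masterMeasure function itself: the dictionary layout, the name `contract_distance`, the unit weights, the use of the absolute difference for enrolment years and the choice to skip missing fields are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop a character of a
                               current[j - 1] + 1,       # add a character of b
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]


# Hypothetical placeholder weights; the post does not state the actual values.
WEIGHTS = {"name": 1.0, "surname": 1.0, "profession": 1.0, "year": 1.0}


def contract_distance(c1: dict, c2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of per-field distances between two contracts.

    Assumption: a field that is missing (None) in either contract is skipped.
    """
    total = 0.0
    for field in ("name", "surname", "profession"):
        if c1[field] is not None and c2[field] is not None:
            total += weights[field] * levenshtein(c1[field], c2[field])
    if c1["year"] is not None and c2["year"] is not None:
        # Assumption: years are compared by absolute difference.
        total += weights["year"] * abs(c1["year"] - c2["year"])
    return total


# Two made-up contracts that differ by one letter in the name and two years.
a = {"name": "Zuane", "surname": "Rossi", "profession": "marangon", "year": 1590}
b = {"name": "Zuanne", "surname": "Rossi", "profession": "marangon", "year": 1592}
d = contract_distance(a, b)
```

With unit weights, a single transcription slip in a name costs as much as one year of difference in enrolment, which is one reason the choice of weights deserves a closer look.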

Considering our four covariables of interest, we performed a principal component analysis (PCA) to investigate whether we could restrict our attention to a space of lower dimension. When carrying out a PCA, one must standardize each variable to have mean zero and unit variance, since the variables are measured in different units. The principal directions we obtained (each of which corresponds to a line through the origin in R^4) each capture approximately a fourth of the total variance. This suggests that the four variables are essentially uncorrelated, so that no lower-dimensional subspace summarizes them well. For this reason, we moved back to a matrix representation of the distances and defined a threshold to form clusters. This leaves us with a sparse triangular matrix from which we can determine to what extent different masters are related and extract some valuable information. We illustrate this in the following figure. In this case, we observe an upper triangular matrix where the darkness of the pixels represents the strength of the relationships between contracts.
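The PCA check described above can be sketched as follows with NumPy. The data here is synthetic (four independent columns on very different scales), and the function name `explained_variance_ratio` is ours, so this only illustrates what "each direction captures about a fourth of the variance" looks like, not the actual computation on the Garzoni data.

```python
import numpy as np


def explained_variance_ratio(X) -> np.ndarray:
    """Share of the total variance captured by each principal direction,
    after standardizing every column to mean zero and unit variance."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # The covariance of standardized data is the correlation matrix of X.
    corr = np.cov(Z, rowvar=False)
    # eigvalsh returns eigenvalues in ascending order; reverse to get the largest first.
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]
    return eigenvalues / eigenvalues.sum()


# Synthetic stand-in for the four variables: independent columns, different units.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
ratios = explained_variance_ratio(X)
```

When the columns are uncorrelated, as in this synthetic case, all four ratios come out close to 0.25 and dropping any direction loses roughly a quarter of the variance. Strongly correlated columns would instead concentrate most of the variance in the first one or two directions.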

At this stage we must deal with clusters that have a non-empty intersection. If two clusters happen to intersect, a first idea would be to choose the one with the highest probability of membership. For instance, if a contract x belongs to two clusters C1 and C2, we keep the cluster to which x has the greater probability of belonging. Another approach that we have considered is to measure the overlap between clusters. More precisely, for a given contract x, we denote by G1, …, Gk the groups of contracts that are sufficiently close to x, where "sufficiently" is determined by the threshold introduced earlier. We tested two alternative definitions of the so-called coherence measure, which quantifies the intersection of these clusters. These definitions are given in the following figure. This allows us to keep only those groups of contracts that are sufficiently close to contract x.
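As an illustration of the overlap idea, the sketch below builds, from an upper triangular distance matrix and a threshold, the group of contracts close to each contract, and then scores the overlap of two groups. The two coherence measures we tested are defined in the figure and are not reproduced here; the Jaccard index used below is only a stand-in, and the names `neighbourhoods` and `jaccard`, the toy matrix and the threshold value are all hypothetical.

```python
import numpy as np


def neighbourhoods(D, threshold: float) -> list:
    """For each contract i, the set of contracts j with distance <= threshold.

    D is read as an upper triangular matrix of pairwise distances; each
    contract belongs to its own group because its self-distance is zero.
    """
    D = np.asarray(D, dtype=float)
    upper = np.triu(D, k=1)   # keep only the strict upper triangle
    full = upper + upper.T    # rebuild the symmetric matrix
    return [{int(j) for j in np.flatnonzero(full[i] <= threshold)}
            for i in range(full.shape[0])]


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (stand-in overlap score)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy distances between four contracts (upper triangle only).
D = [[0, 1, 2, 9],
     [0, 0, 1, 8],
     [0, 0, 0, 7],
     [0, 0, 0, 0]]
groups = neighbourhoods(D, threshold=1)
overlap_01 = jaccard(groups[0], groups[1])
overlap_03 = jaccard(groups[0], groups[3])
```

In this toy example contracts 0 and 1 share most of their neighbours, while contract 3 is isolated and has no overlap with the others. An overlap score of this kind could then be compared against a cut-off to decide which groups to keep.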