Progress Blog Post 2
As mentioned in our milestones for the semester, we have spent the last few weeks developing a probability measure that will enable us to distinguish contracts. We decided to restrict our attention to the following four variables, which appear to be the most relevant for describing a contract: master’s name, master’s surname, master’s profession and enrolment year of the apprentice. We implemented the Levenshtein distance, which measures how dissimilar two character strings are. It can be seen as a generalization of the Hamming distance: whereas the Hamming distance only counts substitutions between strings of equal length, the Levenshtein distance gives the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. For each pair of contracts, we can thus compute the distances between names, surnames, professions and enrolment years, add them up (with weights that we consider reasonable for the moment) and store the result in a dataframe. In the coming weeks, we will look more closely at the choice of weights in order to get the best possible measure. For missing data, we created, for each variable, a boolean indicator that equals 1 if the variable is missing and 0 otherwise. The function masterMeasure was specifically designed to represent these distances in a dataframe format. We also took advantage of parallel computing, running on two or more processors to speed up our computations. In the coming weeks, we would also like to relax the distances between names and between surnames to account for potential typographical mistakes in the transcription process.
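To make the idea concrete, here is a minimal Python sketch of a weighted distance between two contracts. It is not our masterMeasure function: the names levenshtein, contract_distance and WEIGHTS, the weight values, the choice to skip missing fields, and the two sample contracts are all assumptions made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical weights, chosen only for this example.
WEIGHTS = {"name": 1.0, "surname": 1.0, "profession": 0.5, "year": 0.1}


def contract_distance(c1: dict, c2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of per-field distances between two contracts.

    Assumption: a field that is missing (None) in either contract is
    skipped, so it contributes nothing to the total.
    """
    total = 0.0
    for field, weight in weights.items():
        v1, v2 = c1.get(field), c2.get(field)
        if v1 is None or v2 is None:
            continue
        if field == "year":
            d = abs(v1 - v2)         # numeric gap in years
        else:
            d = levenshtein(v1, v2)  # edit distance between strings
        total += weight * d
    return total


# Made-up sample contracts (not taken from the real data).
c1 = {"name": "Zuane", "surname": "Bortollo", "profession": "marangon", "year": 1590}
c2 = {"name": "Zuanne", "surname": "Bortolo", "profession": None, "year": 1592}
print(contract_distance(c1, c2))
```

With these hypothetical weights, one edit in the name and one in the surname each count as much as a ten-year gap between enrolment years, which shows how strongly the choice of weights shapes the measure.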
Considering our four covariables of interest, we performed a principal component analysis (PCA) to investigate whether we could focus our attention on a space of lower dimension. When carrying out a PCA, the variables should be standardized to have mean zero and unit variance, since they are measured in different units. The principal directions we obtained suggest that the four variables are essentially uncorrelated, as each direction (which corresponds to a line in R^4) captures approximately a quarter of the total variance, so no lower-dimensional subspace summarizes the data well. For this reason, we went back to a matrix representation of the distances and defined a threshold to form clusters. This leaves us with a sparse triangular matrix from which we can determine to what extent different masters are related and extract some valuable information. We illustrate this in the following figure. In this case, we observe an upper triangular matrix where the darkness of the pixels represents the strength of the relationships between contracts.
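The thresholding step can be sketched as follows in Python. This is only an illustration under stated assumptions: the function names thresholded_pairs and neighbour_groups, the toy distance matrix and the threshold value are hypothetical, and we assume here that a pair is kept when its distance is at most the threshold.

```python
from itertools import combinations


def thresholded_pairs(distances, threshold):
    """Keep only the upper-triangular entries (i < j) whose distance is
    at most `threshold`. The result is a sparse representation: pairs
    that are too far apart are simply not stored."""
    n = len(distances)
    return {(i, j): distances[i][j]
            for i, j in combinations(range(n), 2)
            if distances[i][j] <= threshold}


def neighbour_groups(pairs, n):
    """For each contract index, the set of contracts close to it
    (including itself), read off the sparse pair dictionary."""
    groups = {i: {i} for i in range(n)}
    for i, j in pairs:
        groups[i].add(j)
        groups[j].add(i)
    return groups


# Toy symmetric distance matrix for four contracts (made-up values).
D = [[0.0, 1.2, 7.0, 0.8],
     [1.2, 0.0, 6.5, 1.0],
     [7.0, 6.5, 0.0, 9.1],
     [0.8, 1.0, 9.1, 0.0]]

pairs = thresholded_pairs(D, threshold=2.0)
groups = neighbour_groups(pairs, n=len(D))
print(pairs)
print(groups)
```

In this toy example, contracts 0, 1 and 3 end up in each other's groups while contract 2 stays alone, which is the kind of structure the darker pixels in the figure reveal.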
At this stage we must deal with clusters that have a non-empty intersection. If two clusters happen to intersect, a first idea would be to choose the one with the highest probability of membership. For instance, if a contract x belongs to two clusters C1 and C2, we keep the cluster to which x has the greatest probability of belonging. Another approach that we have considered is to measure how much the clusters overlap. More precisely, for a given contract x, we denote by G1, …, Gk the groups of contracts that are sufficiently close to x, where "sufficiently close" is defined by the threshold introduced earlier. We tested two alternative definitions of the so-called coherence measure, which quantifies the intersection of these clusters. These definitions are given in the following figure. This allows us to keep only those groups of contracts that are sufficiently close to contract x.
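As a rough illustration of what quantifying the overlap of groups can look like, the Python sketch below averages the pairwise Jaccard index over a list of groups. The Jaccard index is a stand-in chosen for this example; it is not one of the two coherence definitions we tested, and the function names and toy groups are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union.
    Two empty sets are treated as identical (value 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_overlap(groups) -> float:
    """Average pairwise Jaccard index over all pairs of groups.
    Returns 1.0 when there are fewer than two groups to compare."""
    pairs = list(combinations(groups, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy groups of contract indices (made-up values).
groups = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
print(mean_overlap(groups))
```

A value close to 1 means the groups contain nearly the same contracts, while a value close to 0 means they barely intersect; a cutoff on such a score is one possible way to decide which groups to keep.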
In the next figure we show our distance measure applied to the name Bortollo. In this particular example, we are able to recover the missing variables, since the available information is very similar across the four contracts.
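One simple way to recover a missing variable from a group of similar contracts is a majority vote over the values that are observed in the group. The Python sketch below assumes this rule purely for illustration; the function name impute_from_group and the sample contracts are hypothetical and do not reproduce the figure.

```python
from collections import Counter


def impute_from_group(target: dict, group: list, fields: list) -> dict:
    """Return a copy of `target` in which every missing field (None)
    is filled with the most frequent value observed for that field
    among the contracts in `group`. Fields that nobody in the group
    has observed are left as None."""
    filled = dict(target)
    for field in fields:
        if filled.get(field) is None:
            observed = [c[field] for c in group if c.get(field) is not None]
            if observed:
                filled[field] = Counter(observed).most_common(1)[0][0]
    return filled


# Made-up contracts that fell into the same group as the target.
group = [
    {"surname": "Bortollo", "profession": "marangon", "year": 1590},
    {"surname": "Bortollo", "profession": "marangon", "year": 1591},
    {"surname": "Bortolo", "profession": None, "year": 1590},
]
target = {"surname": "Bortollo", "profession": None, "year": 1590}
print(impute_from_group(target, group, ["surname", "profession", "year"]))
```

A majority vote is only sensible when the group is tight, as in the Bortollo example; for looser groups the filled value should be treated as a guess.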
In the final stages of this project, we would like to use this probability measure as a prediction tool for missing variables. We are also currently setting up a verification procedure that will be able to confirm, albeit with some uncertainty, that a given master belongs to a given category. To stay on track with our milestones, the main task in the following weeks will be to start building a social network that reflects the different relationships between masters based on the measure we defined.