Progress Blog Post 1
We spent the first two weeks of the semester exploring the Garzoni database. We made some slight modifications to the dataset to make it easier to process in code. Here are a few summary statistics describing the Garzoni data: out of 4530 observations, there are 2401 different surnames and 547 different first names for masters. A closer look at the masters’ surnames highlights a first difficulty: in 1204 contracts, the master’s surname is missing. This may be because, at that time, masters were often referred to by their first names only. Another issue we will have to deal with is the high number (2764) of masters with only one contract. The masters’ professions appear to be highly informative for telling masters apart, so if we can form groups of professions, we may be able to recover some of the missing surnames.
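As a rough illustration of how counts like these can be obtained in R, here is a minimal sketch on toy data. The column names (master_name, master_surname) and all values are assumptions made for the example, not the actual Garzoni schema, and identifying a master by name and surname alone is a simplification.

```r
# Toy data; the column names and values are illustrative, not the Garzoni data.
contracts <- data.frame(
  master_name    = c("Zuane", "Zuane", "Piero", "Marco", "Piero"),
  master_surname = c("Rossi", "Rossi", NA, "Bon", NA),
  stringsAsFactors = FALSE
)

# Number of distinct (non-missing) surnames and of distinct first names.
n_surnames <- length(unique(na.omit(contracts$master_surname)))
n_names    <- length(unique(contracts$master_name))

# Number of contracts in which the master's surname is missing.
n_missing_surname <- sum(is.na(contracts$master_surname))

# Contracts per master, identifying a master by first name and surname
# (an assumption; missing surnames are kept under a "<NA>" placeholder key).
master_key <- paste(
  contracts$master_name,
  ifelse(is.na(contracts$master_surname), "<NA>", contracts$master_surname)
)
contracts_per_master <- table(master_key)

# Number of masters that appear in exactly one contract.
n_single_contract <- sum(contracts_per_master == 1)
```

Note that the two contracts with a missing surname are lumped under one key here, which is exactly the kind of ambiguity the missing surnames introduce.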
We also found a strong correlation between the apprentice’s annual salary and the length of the contract. However, this correlation may be driven by the age variable, since we expect older apprentices to be paid more because of their experience. These facts are illustrated in the next two figures.
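One possible way to check whether age drives this correlation is a partial correlation: regress both salary and contract length on age, then correlate the residuals. The sketch below is not taken from our analysis; the data frame, its column names and every value in it are made up, so the sign and size of the resulting correlations carry no meaning.

```r
# Made-up values for illustration only.
apprentices <- data.frame(
  age    = c(12, 13, 14, 15, 16, 17, 18, 19),
  length = c(6, 5, 6, 4, 4, 3, 3, 2),   # contract length in years
  salary = c(2, 3, 3, 5, 6, 7, 8, 10)   # annual salary
)

# Raw correlation between salary and contract length.
raw_cor <- cor(apprentices$salary, apprentices$length)

# Partial correlation: correlate what is left of salary and of length
# once a linear fit on age has been removed from each.
res_salary  <- resid(lm(salary ~ age, data = apprentices))
res_length  <- resid(lm(length ~ age, data = apprentices))
partial_cor <- cor(res_salary, res_length)
```

If the partial correlation is much weaker than the raw one, age accounts for a large part of the association. This only removes a linear effect of age.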
Keeping in mind our objective of building a probability measure that quantifies the strength of the relationships between individuals, we implemented an R function called multipleEntries that selects only the masters with a given number of contracts. The function gatheringListedInformations returns information on a list of masters, for instance how much they pay their apprentices or the variance of salaries per master. A third function, statisticalSummary, outputs basic descriptive statistics for a given quantity, for example the mean, variance, and median of the salaries that masters pay their apprentices.
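To give an idea of what such helpers can look like, here is a minimal sketch in base R. It is not the code of multipleEntries, gatheringListedInformations or statisticalSummary: the function names selectMastersByContracts and summariseQuantity, the columns master_id and salary, and the choice of "at least" as the selection rule are all assumptions made for the example.

```r
# Toy contracts; master_id, salary and all values are illustrative assumptions.
contracts <- data.frame(
  master_id = c("A", "A", "A", "B", "C", "C"),
  salary    = c(4, 6, 8, 5, 3, 7)
)

# Keep only the contracts of masters with at least `min_contracts` contracts.
selectMastersByContracts <- function(data, min_contracts) {
  counts <- table(data$master_id)
  keep   <- names(counts)[counts >= min_contracts]
  data[data$master_id %in% keep, , drop = FALSE]
}

# Mean, variance and median of a numeric vector, ignoring missing values.
summariseQuantity <- function(x) {
  c(mean     = mean(x, na.rm = TRUE),
    variance = var(x, na.rm = TRUE),
    median   = median(x, na.rm = TRUE))
}

# Contracts of masters with two or more contracts, and their overall summary.
frequent <- selectMastersByContracts(contracts, min_contracts = 2)
overall  <- summariseQuantity(frequent$salary)

# Mean and variance of salaries for each remaining master.
per_master <- aggregate(
  salary ~ master_id, data = frequent,
  FUN = function(s) c(mean = mean(s), variance = var(s))
)
```

Filtering before summarising matters for the variance: a master with a single contract has no defined salary variance, which is one reason to restrict attention to masters with several contracts.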
We have decided to focus on masters, since they appear to capture the most relevant information on how contracts between masters and apprentices were set up in Venice during the 16th and 17th centuries. The first step is to preprocess the dataset by considering, for each master, the variables name, surname and profession. We use the strict code for masters’ professions, since these are transcriptions of the professions listed in the registers. Concretely, we build an array in which each row corresponds to a master and the columns hold the indices of the master’s name, surname and profession, respectively. We can then define a Hamming distance, which counts the number of positions at which the corresponding indices differ, in order to tell two masters apart. A major issue with this distance is how to index missing information: assigning the value 0 to every missing variable makes all missing entries look identical, so it cannot distinguish information that may in fact differ. In the coming weeks, we will have to tackle this problem in order to disambiguate the dataset and eventually build a social graph illustrating the hierarchy among Venetian citizens in that period.
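The index array and the Hamming distance can be sketched as follows. The index values are made up, the row labels m1 to m4 are hypothetical, and 0 is used for "missing" as described above, so the example also shows where that encoding goes wrong.

```r
# Each row is one master; columns are integer indices for first name, surname
# and profession. The values are made up, and 0 stands for "missing".
masters <- rbind(
  m1 = c(name = 12, surname = 40, profession = 3),
  m2 = c(name = 12, surname = 40, profession = 7),
  m3 = c(name = 12, surname = 0,  profession = 3),
  m4 = c(name = 12, surname = 0,  profession = 3)
)

# Hamming distance: number of positions at which two index vectors differ.
hamming <- function(a, b) sum(a != b)

# Pairwise distance matrix over all masters.
n <- nrow(masters)
d <- matrix(0, n, n, dimnames = list(rownames(masters), rownames(masters)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- hamming(masters[i, ], masters[j, ])
  }
}
```

In this toy example, m3 and m4 are at distance 0 although their unknown surnames could differ, while m1 and m3 are at distance 1 although m3 could be the same person as m1 with the surname left out. Both cases come from treating 0 as an ordinary index value.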