Introduction and Objectives:
We are provided with a subset of the Garzoni database, which contains information on contracts between Venetian masters and apprentices during the 16th and 17th centuries. The main purpose of our project is to resolve the ambiguities in this large dataset, in particular about who the recorded individuals are, in order to better understand Venetian society during this period. For instance, building a social graph that highlights the relationships between these individuals will give us better insight into the hierarchical structure of the society and its socio-economic conditions.
Project Methodology:
The « Accordi dei Garzoni » comprise a total of 26 registers. A typical contract from one of these registers records the master, the apprentice and the guarantor of the contract, together with temporal and financial conditions such as the enrolment year, the duration of the contract and the salary type. Because the provided database is very large and highly intricate, there is a plethora of possible methods to clean it. Our goal is to identify the potential ambiguities and resolve them, which will allow us to characterize the relationships between masters, apprentices and guarantors. Our approach is to construct a probability or similarity measure that characterizes individuals sharing a certain number of attributes (e.g. same master, same apprentice, same enrolment year). A possible solution is to set a threshold on this measure and to decide, by quantifying how closely individuals are related, whether they belong to the same « class ».
However, this can only be done after a careful look at the data to understand the underlying mechanisms. The first step of our project will therefore be to explore the data, examine the relationships between the variables and find the variables that are most relevant for characterizing a person. We will use the statistical software R for this purpose. For example, we shifted the enrolment year to a scale from 0 to 103 for interpretability and plotted several graphs to get a first impression of the dataset. A regression model explaining salary by enrolment year and contract length indicates that apprentices are paid more for shorter contracts and that salaries increase over time for similar contracts.
This increase could be interpreted as inflation over the roughly hundred-year span covered by this study, which was probably a general phenomenon in Venice at that time and not specific to apprenticeship contracts. Another issue we must consider is how to deal with missing data and how it affects our interpretation of the results. For computational reasons, since some names are missing, we can for example replace them with the string 'noname'. Another problem that often arises in statistical modeling is the presence of outliers. For all these reasons, a rigorous analysis of the data is a crucial step in our project.
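The exploratory steps described above (shifting the enrolment year, replacing missing names and regressing salary on enrolment year and contract length) can be sketched as follows. The project itself plans to use R; this is only a minimal illustration in Python using the standard library. All function names are hypothetical, and the years, durations and salaries are synthetic values generated from a known formula purely to exercise the code. They are not taken from the Garzoni data.

```python
def shift_years(years):
    """Re-express years as offsets from the earliest year in the sample."""
    base = min(years)
    return [y - base for y in years]


def fill_missing_name(name, placeholder="noname"):
    """Replace a missing or blank name with a placeholder string."""
    if name is None or not name.strip():
        return placeholder
    return name.strip()


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes the system is non-singular (predictors are not collinear).
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, target):
    """Least-squares fit of target ~ intercept + predictors (normal equations)."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * t for xi, t in zip(x, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic example (invented values, not Garzoni records).
raw_years = [1600, 1610, 1620, 1630, 1640, 1650]
durations = [5, 3, 6, 4, 7, 2]
years = shift_years(raw_years)
# Salaries generated from a known linear rule so the fit can be checked.
salaries = [20 + 0.5 * y - 2 * d for y, d in zip(years, durations)]

intercept, year_coef, duration_coef = fit_ols(list(zip(years, durations)), salaries)
names = [fill_missing_name(n) for n in ["Zuane", None, "  "]]
```

In this synthetic setting a positive `year_coef` and a negative `duration_coef` correspond to the pattern described in the text: salaries rising over time and falling with contract length. On real data the fit would also need outlier checks, which this sketch omits.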
Once this is done, we will be able to disambiguate people sharing the same name or other attributes: the model will distinguish individuals based on the known data. We can then apply it to the remaining records that still carry ambiguity.
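One possible shape for such a similarity measure with a decision threshold is sketched below in Python. The attribute weights, the threshold of 0.8, the field names and the example records are all assumptions made for illustration; none of them comes from the project or from the Garzoni data. Names are compared with `difflib.SequenceMatcher` from the standard library so that spelling variants can still score highly, and a missing field is treated conservatively as providing no evidence of a match.

```python
from difflib import SequenceMatcher

# Hypothetical weights: how much each attribute counts toward a match.
WEIGHTS = {"name": 0.5, "master": 0.25, "year": 0.25}
MISSING = (None, "", "noname")


def similarity(a, b):
    """Weighted agreement score in [0, 1] between two contract records."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        va, vb = a.get(field), b.get(field)
        if va in MISSING or vb in MISSING:
            continue  # missing data contributes no evidence
        if field == "name":
            # Fuzzy comparison so that spelling variants can still match.
            score += weight * SequenceMatcher(None, va.lower(), vb.lower()).ratio()
        elif va == vb:
            score += weight
    return score


def group_records(records, threshold=0.8):
    """Group record indices whose pairwise similarity reaches the threshold."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Invented records for illustration only.
records = [
    {"name": "Zuane Bon", "master": "Piero Moro", "year": 12},
    {"name": "zuane bon", "master": "Piero Moro", "year": 12},
    {"name": "Zuane Bon", "master": "Marco Zen", "year": 40},
    {"name": "noname", "master": "Marco Zen", "year": 40},
]
groups = group_records(records)
```

Because groups are merged transitively, two records can end up in the same group through a chain of intermediate matches. Whether that behaviour is desirable, and which weights and threshold are appropriate, would have to be decided from the data exploration.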
To better grasp the information contained in the dataset, a natural next step is data visualization. The goal is to build a social graph of the Venetian citizens involved in our study that represents their contractual relations. This will be done using Gephi. By analyzing this network, we will be able to compare our results with the existing literature on Venetian contracts of that time. It will be interesting to see to what extent our model reproduces the historical facts reported in the literature.
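As an illustration of how contracts could be turned into a graph for Gephi, the Python sketch below writes an edge table in CSV form with `Source`, `Target`, `Type`, `Label` and `Weight` columns, a layout that Gephi's spreadsheet import can read. The record fields, the identifiers and the choice to draw directed edges from master and guarantor to apprentice are assumptions for this example, not the project's final design.

```python
import csv
import io
from collections import Counter

# Invented, already disambiguated contract records (placeholder identifiers).
contracts = [
    {"master": "master_1", "apprentice": "apprentice_1", "guarantor": "guarantor_1"},
    {"master": "master_1", "apprentice": "apprentice_2", "guarantor": None},
    {"master": "master_1", "apprentice": "apprentice_1", "guarantor": "guarantor_1"},
]


def contract_edges(contracts):
    """Count (source, target, role) edges; repeated contracts raise the weight."""
    edges = Counter()
    for c in contracts:
        if c.get("master") and c.get("apprentice"):
            edges[(c["master"], c["apprentice"], "master_of")] += 1
        if c.get("guarantor") and c.get("apprentice"):
            edges[(c["guarantor"], c["apprentice"], "guarantor_of")] += 1
    return edges


def write_edge_csv(edges, handle):
    """Write the edges as a CSV edge table."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Type", "Label", "Weight"])
    for (source, target, role), weight in sorted(edges.items()):
        writer.writerow([source, target, "Directed", role, weight])


buffer = io.StringIO()
write_edge_csv(contract_edges(contracts), buffer)
edge_table = buffer.getvalue()
```

To produce a file for import, the same function can be called with a handle opened via `open("edges.csv", "w", newline="")`. The node identifiers should be the disambiguated individuals, so that one person appears as a single node.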
Finally, we will propose a way to predict missing data by analyzing the structure of the resulting graph and performing statistical inference. Since we also have partial access to the geolocation of the individuals involved, we could use TileMill, a tool for creating interactive maps, to locate masters as well as apprentices and thereby gain insight into the areas of Venice that were particularly active at that time. Another challenging but interesting task would be to infer relationships between individuals that the data does not state explicitly. For instance, people living in the same neighborhood during the same period could be strongly related even if the data does not specify this kind of relation.
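The neighborhood idea can be made concrete with a small Python sketch that proposes candidate links between people recorded in the same area within a few years of each other. The field names, the five-year window and the records are hypothetical. The output is only a list of candidates to be examined further, not evidence of an actual relationship.

```python
from itertools import combinations

# Invented records: parish labels and shifted years are placeholders.
residents = [
    {"id": "p1", "parish": "parish_A", "year": 10},
    {"id": "p2", "parish": "parish_A", "year": 12},
    {"id": "p3", "parish": "parish_A", "year": 60},
    {"id": "p4", "parish": "parish_B", "year": 11},
    {"id": "p5", "parish": None, "year": 11},
]


def candidate_links(people, max_gap=5):
    """Pair people from the same parish whose years differ by at most max_gap."""
    by_parish = {}
    for p in people:
        # Records with a missing location or year cannot be compared.
        if p["parish"] is not None and p["year"] is not None:
            by_parish.setdefault(p["parish"], []).append(p)
    links = []
    for parish, group in by_parish.items():
        for a, b in combinations(group, 2):
            if abs(a["year"] - b["year"]) <= max_gap:
                links.append((a["id"], b["id"], parish))
    return links


links = candidate_links(residents)
```

Such candidate pairs could be added to the social graph as a separate, weaker edge type so that they remain distinguishable from relations documented in the contracts.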
Expected outcomes: a relevant probability measure that groups individuals sharing a certain number of attributes into clusters and thereby removes ambiguity, a social graph representing the interactions between individuals, a network analysis, and the inference of missing data.
- Exploration of the Garzoni database: extraction and analysis of patterns [Weeks 1-2]
- Building a probability measure to characterize an individual [Weeks 2-3-4]
- Disambiguating multiple persons based on this probability measure [Weeks 5-6-7]
- Establishing a social network based on the interactions between individuals [Weeks 8-9-10]
- Analysis of the graph and inference of missing data [Weeks 11-12]
- Final presentation [Weeks 13-14]