Introduction, objectives and final deliverables
The dataset «Accordi dei Garzoni» contains information about contractual relationships between Venetian masters and their apprentices during the 16th and 17th centuries, which can give us an insight into the lives of Venice’s citizens at that time.
The aim of this project is to describe the Garzoni source data quantitatively and to discover relations between its main features. We are interested in gaining new knowledge about Venetian citizens from the dataset, as well as in researching correlations between different aspects of Venetian people's lives (salary, profession, gender, et cetera) and their evolution on a social graph.
Based on our findings, we will deliver a project poster with:
- Visualisations of our research findings;
- A visualisation of the social graph's evolution over time.
A typical contract in the «Accordi dei Garzoni» registers contains information about the master, the apprentice, the guarantor, and the financial terms of the agreement between them. The data is sparse, but we will apply a wide range of data mining techniques in order to extract previously unknown information from this dataset.
1. Data pre-processing
First of all, this task requires data pre-processing: exploring and cleaning the data, and removing contracts that are empty, uninformative, or likely unreliable. This problem gives us an opportunity to collaborate with other course groups that are working on the same dataset. In any case, since this project treats the dataset as a whole rather than as isolated contracts in separate registers, the goal of data pre-processing in our case is to make sure that the data is suitable for pattern classification and machine learning purposes.
That means that it is necessary to remove outliers – instances in the dataset that lie too far from the others – and to normalize the data. We also think that our results will be more accurate if we create additional features. Consider the feature of a master's profession: in the Garzoni source, professions belonging to the same category can be named differently due to sometimes irrelevant specifics. Putting, for example, "Stampador al componer" and "Stampador al torcollo" into the wider category "Stampador" can greatly increase the quality of our predictions for this profession, thanks to the larger number of samples available for analysis.
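The two steps above can be sketched as follows. This is an illustration in Python (the project itself will use MATLAB); the grouping rule, the 1.5-sigma threshold in the example call, and the sample values are our assumptions, not part of the Garzoni data model.

```python
def broaden_profession(label):
    """Map a specific profession label to a broader category by keeping
    its first word, e.g. "Stampador al torcollo" -> "Stampador"."""
    return label.split()[0] if label else label

def remove_outliers(values, k=3.0):
    """Drop values lying more than k standard deviations from the mean."""
    m = sum(values) / len(values)
    sd = (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5
    return [v for v in values if sd == 0 or abs(v - m) <= k * sd]

print(broaden_profession("Stampador al componer"))    # Stampador
print(remove_outliers([10, 11, 12, 11, 500], k=1.5))  # [10, 11, 12, 11]
```

In the real pipeline the threshold `k` would be tuned per feature, since salary and age have very different spreads.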
If our hypothesis that a profession matches a geolocation – or any other theory – is confirmed, we can perform data enrichment by adding our inferences to the existing data in order to improve the quality of our descriptive statistics and predictions. It is possible that we will be able to enrich the data not only from our analysis, but from historical sources as well.
2. Descriptive statistics
Secondly, we will use MATLAB, a numerical computing environment and programming language, to explore relationships between the features of the Garzoni dataset. By treating the Garzoni dataset as a whole in order to gain new knowledge about Venetian citizens, we can apply descriptive statistics methods – exploring the overall sample size, the sample sizes of important subgroups (e.g., each profession), and demographic characteristics such as the average age and the proportion of subjects of each sex. Descriptive statistics will give us a glimpse of patterns that are not obvious from separate, isolated contracts.
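A minimal sketch of these summary statistics, shown in Python for illustration (the project uses MATLAB). The field names and the records below are invented examples, not real Garzoni data.

```python
from collections import Counter
from statistics import mean

# Invented example records; the real Garzoni fields and values will differ.
contracts = [
    {"profession": "Stampador", "age": 14, "sex": "M"},
    {"profession": "Stampador", "age": 12, "sex": "M"},
    {"profession": "Tesser",    "age": 13, "sex": "F"},
    {"profession": "Tesser",    "age": 15, "sex": "M"},
]

print("overall sample size:", len(contracts))                     # 4
print("mean apprentice age:", mean(c["age"] for c in contracts))  # 13.5
print("subgroup sizes:", dict(Counter(c["profession"] for c in contracts)))
sexes = Counter(c["sex"] for c in contracts)
print("sex proportions:", {s: n / len(contracts) for s, n in sexes.items()})
```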
For example, we can estimate a master's popularity and skill level, a profession's demand, etc. Using the geolocation data of masters and apprentices, we can look for a geographical distribution that supports the hypothesis that craftsmen and artisans historically settled in the vicinity of their colleagues, sometimes creating entire neighborhoods.
We will explore the gender distribution in contractual relationships and determine correlations between gender and other features.
3. Pattern recognition
Thirdly, we will compute cross-correlation functions over the dataset in order to discover correlations between features.
Unlike descriptive statistics, pattern recognition is based on probability theory. We will try to assess which distributions best fit the features of the dataset.
Computing and visualizing the cross-correlation functions not only for the whole dataset, but also separately by time and geolocation, can give us information about local distributions of the data. This step can reveal, for example, differences between family traditions in different parts of Venice and in different time periods.
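The per-subgroup correlation idea can be sketched with the Pearson coefficient, shown in Python for illustration (the project uses MATLAB). The age and salary values below are invented numbers for one hypothetical subgroup, not real Garzoni data.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Invented numbers: apprentice age vs. yearly salary in one hypothetical
# subgroup (e.g. one profession in one district and decade).
ages     = [12, 13, 14, 15, 16]
salaries = [10, 12, 15, 18, 20]
print(round(pearson(ages, salaries), 3))  # 0.997
```

Running the same computation separately per district and per decade, as described above, would yield a grid of local coefficients that can then be visualized.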
Once we have completed feature extraction, we will experiment with different classification algorithms, using supervised and unsupervised learning techniques, and perform data clustering using high-order features (not simple correlations).
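As one unsupervised option, the clustering step could look like the following plain k-means sketch, again in Python for illustration (the project uses MATLAB). The 2-D feature vectors are invented, well-separated examples, not derived Garzoni features.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on tuples of feature values."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared distance).
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster.
        centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                   else centers[i] for i, cl in enumerate(clusters)]
    return centers, clusters

# Two invented, well-separated groups of 2-D feature vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(cl) for cl in clusters))  # [3, 3]
```

In practice the feature vectors would come from the extracted high-order features, and `k` would be chosen by inspecting the resulting clusters.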
4. Data visualization
Project timeline:

| Date | Milestone |
|------------|-----------|
| 12.03.2015 | Exploration of the Garzoni database – extraction and analysis of features |
| 09.04.2015 | Data pre-processing – removing outliers, normalization and data enrichment |
| 07.05.2015 | Implementation of descriptive statistics and pattern recognition methods |
| 21.05.2015 | Projection of our findings – features' evolution on a social graph |
| 28.05.2015 | Data visualisation – the project poster |