At the beginning of this semester we started working on our Digital Humanities project (last semester’s proposal can be found here: Venice citizens social graph and descriptive statistics based on the Garzoni font). During the first two weeks we explored our data and met with our teaching assistant, Giovanni Colavizza, to discuss our proposal and possible ways to improve it. As a result of that meeting, we decided to slightly modify our project’s goals.
Looking at the data, we decided to concentrate mostly on discovering profession profiles: the key features that uniquely identify a profession. An important part of this task is implementing algorithms that automatically extract important features from the data and determine whether a new data instance belongs to an existing profession or to a completely new category.
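To make the "existing profession or new category" idea concrete, here is a minimal sketch of one possible approach: a nearest-centroid rule with a distance threshold. It is not the algorithm we have settled on. The profession names, the three numeric features (age, annual salary in ducats, apprenticeship length in years), the sample values, and the `max_distance` threshold are all assumptions made up for illustration.

```python
import math
from collections import defaultdict


def build_profiles(records):
    """Average the feature vectors of each profession into one centroid."""
    grouped = defaultdict(list)
    for profession, features in records:
        grouped[profession].append(features)
    return {
        profession: tuple(sum(column) / len(column) for column in zip(*rows))
        for profession, rows in grouped.items()
    }


def assign_profession(features, profiles, max_distance):
    """Return the closest known profession, or None if nothing is close enough."""
    best, best_distance = None, math.inf
    for profession, centroid in profiles.items():
        distance = math.dist(features, centroid)  # Euclidean distance
        if distance < best_distance:
            best, best_distance = profession, distance
    # None signals a candidate for a completely new category.
    return best if best_distance <= max_distance else None


# Hypothetical records: (profession, (age, salary in ducats, length in years)).
records = [
    ("goldsmith", (14, 6, 5)),
    ("goldsmith", (13, 7, 6)),
    ("printer", (16, 12, 3)),
    ("printer", (17, 10, 4)),
]
profiles = build_profiles(records)
known = assign_profession((14, 6, 6), profiles, max_distance=3.0)
novel = assign_profession((25, 45, 1), profiles, max_distance=3.0)
```

In this toy example `known` is assigned to the closest profile, while `novel` is too far from every centroid and comes back as `None`. The features are left unscaled here; on real data they would probably need to be normalized so that salary does not dominate the distance.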
Profession profiles could give us insights into the everyday life of Venice’s citizens: knowing the key characteristics of a profession, we can formulate and test interesting social hypotheses.
This modification does not change our plan much: we started with data preprocessing as planned, and in the next stages we will apply descriptive statistics and machine learning methods to classify the data into profession profiles. The only difference from the previous plan is that we drop the graph analysis: we prefer to focus on discovering profession profiles and presenting our results in a form convenient for future use.
Now it’s time to present what we did during the last two weeks.
First, let’s describe our data, the Garzoni Font. In the left column you can see the dataset column name, and in the right column its description.
We started by taking a general look at the data and plotting histograms of the most meaningful and interesting features:
- First of all, we were interested in the apprentices’ origins: it is very interesting that apprentices came from all around Europe! We are sure that this information will give us fascinating insight into some profession profiles. During the next iterations we will look at it more closely: the distribution of origins by year, profession, salary, etc. For the moment we were only interested in getting a feel for the data, so this is enough for our current purposes.
- Then we were interested in the apprentices’ annual salary distribution: there were no surprises here. In general, we see one salary peak around 5-7 ducats and some outliers that appear above 40 ducats. We will need to look at the annual salary distributions more closely in order to get more meaningful information.
- Apprentices’ age distribution: the peak is at 13-14 years old, so no surprises there either.
- Apprenticeship length: the apprenticeship length distribution will reveal more interesting information when we look at it in correlation with other features, as is also the case for the age and salary distributions.
- Enrollment year: the histogram of enrollment years is already interesting. We have data for the end of the XVI century, then a gap of 20 years, then a lot of data between 1620 and 1630, then another gap, and then data between 1650 and 1660. There are various reasons for this: some of the data were probably lost or corrupted, but the main reason for the gaps is that these data have not yet been digitized. For the moment we will not use the data near these gaps, but at the end of the project we may try to extrapolate it (if we have enough time).
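The gaps in the enrollment years can also be located programmatically rather than by eye. The following is only a sketch under assumptions: the list of years is invented for illustration (it is not taken from the Garzoni Font), and the helper names `counts_per_decade` and `find_gaps` as well as the `min_gap` threshold are our own.

```python
from collections import Counter


def counts_per_decade(years):
    """Count contracts per decade, i.e. the data behind a coarse histogram."""
    return Counter((year // 10) * 10 for year in years)


def find_gaps(years, min_gap=10):
    """Return (last year before, first year after) for every run of at
    least `min_gap` consecutive years that have no records at all."""
    distinct = sorted(set(years))
    return [
        (before, after)
        for before, after in zip(distinct, distinct[1:])
        if after - before - 1 >= min_gap
    ]


# Made-up enrollment years that mimic the shape described above.
enrollment_years = [1592, 1595, 1598, 1621, 1624, 1624, 1629, 1653, 1658]
histogram = counts_per_decade(enrollment_years)
gaps = find_gaps(enrollment_years)
```

With this toy input, `gaps` contains two intervals, one after 1598 and one after 1629. Records close to the interval boundaries are the ones we would set aside for now.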
These first two weeks were productive and interesting for us. We got a feel for the dataset and sharpened our project’s goals.
Our plans for the next three weeks are to find interesting correlations between features, extract the most obvious profession profiles, and implement naive classification algorithms. And, of course, to plot more beautiful pictures that we will use afterwards in the visualization part of our project.
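As a first idea of what the correlation step could look like, here is a minimal sketch that computes the Pearson correlation coefficient between two numeric features. The sample values are invented for illustration and are not results from our dataset; Pearson is just one possible measure, and we have not yet decided which one we will use.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical apprentices: age at enrollment, contract length in years,
# and annual salary in ducats.
ages = [12, 13, 14, 15, 16]
lengths = [7, 6, 5, 4, 3]
salaries = [5, 6, 5, 7, 6]

age_vs_length = pearson(ages, lengths)
age_vs_salary = pearson(ages, salaries)
```

In this made-up sample, older apprentices always sign shorter contracts, so `age_vs_length` is strongly negative, while `age_vs_salary` is only moderately positive.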