Previous progress report: http://veniceatlas.epfl.ch/garzoni-font-datasets-profession-profiling-and-descriptive-statistics-project-progress-report-i/
During the last three weeks we worked on finding interesting correlations. We also started building a website where all our results will be available in an interactive form.
The main difficulty we encountered was the syntax and specifics of the R language. We are learning it through this project, and it took us somewhat longer than expected to understand how to do things more complex than histograms and map plots.
So far we have completed three steps in our project:
I. Choose the most meaningful years to work with
As you may remember from our last progress report, the distribution of enrollment years is far from uniform:
We decided to work with the years that contain 80% of the information in the entire dataset. The years that remain after this filtering are:
1582-1584, 1591, 1592, 1596-1598, 1620-1622, 1625, 1626, 1632, 1645, 1653, 1654, 1656, 1657, 1658, 1664
These years contain 80% of all information, regardless of the feature we are interested in. We also implemented a simple function that selects the years containing 80% of the information for one specific feature of interest.
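A minimal sketch of how such a selection could work, written in Python for illustration rather than in the R used for the project. It assumes that "information" means the number of records with a known value for the chosen feature, and that years are taken from the most populated downwards until the target share is reached. The function name `select_years`, the record layout and the field name `enrollment_year` are hypothetical.

```python
from collections import Counter


def select_years(records, feature=None, coverage=0.8, year_key="enrollment_year"):
    """Pick the most populated years until they hold `coverage` of usable records.

    A record is usable if its year is known and, when `feature` is given,
    the value of that feature is known as well (assumed definition).
    """
    usable = [
        r for r in records
        if r.get(year_key) is not None
        and (feature is None or r.get(feature) is not None)
    ]
    counts = Counter(r[year_key] for r in usable)
    target = coverage * len(usable)

    chosen, covered = [], 0
    # most_common() yields (year, count) pairs, largest count first.
    for year, n in counts.most_common():
        if covered >= target:
            break
        chosen.append(year)
        covered += n
    return sorted(chosen)
```

Calling `select_years(records)` would give the feature-independent selection, while `select_years(records, feature="salary")` would restrict the count to records where the hypothetical `salary` field is present.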
II. Compute basic correlations
We were mostly interested in correlations between the annual salary of an apprentice and the following factors: the apprentice's profession, the master's profession, the length of the apprenticeship and the apprentice's age. The other features are not interesting for correlation computation because they are binary or categorical.
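For reference, and assuming the coefficient in question is Pearson's r (the default method of R's `cor()`; this report does not state which method was used), the correlation between a feature x, such as apprenticeship length, and the annual salary y over n paired records is:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

Here \bar{x} and \bar{y} are the sample means. The value always lies between -1 and 1: values near 0 indicate no linear relationship, and a negative value means that salary tends to decrease as the feature increases.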
Here are our findings on how the annual salary depends on these factors for the entire dataset, regardless of profession or enrollment year:
|Salary vs apprentice profession correlation||Salary vs master profession correlation||Salary vs apprenticeship length correlation||Salary vs apprentice age correlation|
The results are not surprising:
- No correlation between profession and salary
- A slight dependence of salary on the apprentice's age. Greater age means greater salary, but the correlation is weak and can be considered noise
- A significant negative correlation with apprenticeship length: apprentices who take more time to finish their apprenticeship are paid less
We were also interested in the same correlations broken down by apprentice profession. Because our data is very noisy (misspelled profession names, many missing or corrupted values), correlation values could be computed only for some professions.
|Profession||Salary vs apprenticeship length correlation||Salary vs apprentice age correlation|
|cuori d’oro marangon||—||1|
|stampador al torcollo||-1||—|
|depentor da casse||-1||—|
The values for the professions “cuori d’oro marangon”, “desegnador”, “stampador al torcollo” and “depentor da casse” are probably just noise: it is extremely unlikely that a correlation computed on such noisy data would be exactly 1 or -1. The other values agree with the overall correlations shown above.
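One way to guard against such artifacts is to require a minimum number of usable records per profession before reporting a correlation. With exactly two data points, Pearson's coefficient is always 1 or -1 as long as both values differ, which would be a plausible (but here unverified) explanation for these entries. The sketch below is in Python for illustration and is not the R code used in the project; the field names `profession` and `salary`, the threshold `min_pairs` and both function names are assumptions.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def correlation_by_profession(records, feature, min_pairs=10):
    """Correlate `feature` with salary per profession; None if too few records."""
    groups = defaultdict(list)
    for r in records:
        prof, x, y = r.get("profession"), r.get(feature), r.get("salary")
        if prof is not None and x is not None and y is not None:
            groups[prof].append((x, y))

    result = {}
    for prof, pairs in groups.items():
        if len(pairs) < min_pairs:
            result[prof] = None  # too few points to be meaningful
        else:
            xs, ys = zip(*pairs)
            result[prof] = pearson(xs, ys)
    return result
```

The right value of `min_pairs` is a judgment call; the default of 10 is only a placeholder.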
III. Build a web platform that visualizes our findings
In parallel with the work described above, we started building a website that will visualize our results. It is still at a very early stage of development, so there is no working URL to provide yet, but we can say a few words about the technologies behind it.
On the back end we use io.js as the server runtime with express.js as the framework. The cleaned data is sent to the client, where it is filtered and displayed using chart.js for the plots. The website will be hosted on the Heroku platform at the following URL: garzoni.herokuapp.com. For the moment there is nothing there, so please do not be surprised by a “No such app” error.
We will continue to work on correlations and on the website for visualizations.
To finish this progress report, we would like to show you a small GIF animation of how the apprentices' places of origin evolve from year to year: