Descriptive statistics of the Garzoni dataset

Abstract

While taking the “Digital Humanities” course (DH-101), our team was interested in describing the large amount of data collected by the “Venice Time Machine” project. This blog post presents our findings, the details of our implementation, and the problems we encountered while developing the project.

The results of our project can be found on this website. We will make the source code public in case future students are interested in developing our idea further.

Motivation

We were wondering: which key characteristics of social groups change over time, and in what fashion? Which ones stay the same for a long time? That was not an easy question to answer in three months of work, so we concentrated our efforts on a project that can help other people find the answer.

During the DH-101 class we discovered the “Accordi dei garzoni”, a well-structured and frequently updated database of apprenticeship contracts in early modern Venice. At the time of writing, the accessible data covers the years 1579 to 1685. It contains a variety of information that was very useful for our research, such as the annual salary paid by the master, the length of the apprenticeship, the financial terms of a given contract, etc. You can see the full description of the provided information below:

[Figure: description of the dataset variables]

We decided that it would be beneficial to develop a tool for extracting information from this dataset, in order to help other people in the Digital Humanities community work with it. This meant that our final product should expose as much information about the dataset as possible and should also visualize it well. To achieve this goal, we had to approach the dataset from both the mathematical and the graphical side, which turned out to be quite challenging for us.

First steps

We started by exploring our dataset and extracting some simple information from it. We used the R language for that purpose. In the course of the project we decided to abandon that choice for various reasons, which are explained below.

Our first plots were created to get a feel for the data we had. For example, the distribution of apprentices’ origins showed us that apprentices came not only from Venice but also from neighboring regions:

[Figure: distribution of apprentices’ origins]

And the apprentices’ age distribution has a peak at 14 years old:

These plots were made using R. Even at these first steps of the project, we were not happy with the results. Firstly, our plots did not look good visually. R allowed us to change the style of the plots, but we found the available options insufficient for our purposes. Secondly, we realized that with this approach we would not be able to create a tool that can be used by everyone; we would just plot some graphics that we found interesting but that would be useless to other people. We continued to explore our data using R, but we also started looking for other tools that would allow us to create the product we wanted to provide to the Digital Humanities community.

New approach to analyzing our dataset

As stated in the motivation of our project, we wanted to create a tool that can be used by everyone interested in analyzing the “Accordi dei garzoni”, without requiring any programming skills. The obvious solution is a website that extracts and visualizes the information from the dataset that the user uploads to it.

After finishing our project, we can honestly say that switching direction was an excellent decision. Had we worked with R to the end, we would have had a lot of beautiful plots of correlations between different variables, but those plots would have had to be recreated at each dataset update. And we could not realistically generate, in one pass, all the information that the Digital Humanities community might find interesting.

With our tool, everyone can extract information from the “Accordi dei garzoni” dataset. For example, the histogram of the average annual salary of apprentices in the “marangon” profession has the following shape:

[Screenshot: annual salary histogram for the “marangon” profession]

And the total number of apprentices in this profession, 383, is composed of apprentices between 9 and 20 years old. All this information can be found on the garzoni.herokuapp.com website in a few clicks.

Another issue: the dataset is not complete. For example, there are years for which we have more data than for others. This can easily be seen in the enrollment year distribution:

[Screenshot: enrollment year distribution]

While this histogram is supposed to be more or less flat, we have big holes between 1591 and 1596 and between 1632 and 1653, and a distribution that is far from uniform. Using this information, we can easily obtain more representative statistics by choosing only statistically valid years.

There are many other amazing things we can do with this service. For example, the most popular professions that pay more than others (more than 8 ducati per year on average) are “marzer” (123 apprentices) and “spechier” (109 apprentices). The most popular workshops for these professions are “in Marzaria” and “in Rialto”, and they mostly do not provide clothing for their apprentices:

[Screenshots: profession, workshop, and clothing statistics]

Go ahead and try it out for yourself!

How does it work?

The website uses a .csv dump of the “Accordi dei garzoni”. All computations are done on the user’s machine, because they are not very costly. Charts are created with the open-source library Chart.js; everything else was developed by us.
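
To give a concrete idea of the pipeline, here is a minimal sketch of how such client-side computation can work. This is an illustration rather than our actual source code: the fetch call, the naive CSV parsing, and the column name “register” are assumptions for the example.

// A minimal sketch of the client-side pipeline (an illustration, not the
// site's actual code). The column name "register" is assumed for the example.
function parseCsv(text) {
  const lines = text.trim().split("\n");
  const header = lines[0].split(",");
  return lines.slice(1).map(line => {
    const cells = line.split(","); // naive: assumes no quoted commas in fields
    const row = {};
    header.forEach((name, i) => { row[name] = cells[i]; });
    return row;
  });
}

fetch("garzoni.csv") // the .csv dump served with the site
  .then(response => response.text())
  .then(text => {
    const rows = parseCsv(text);
    // "Number of records" is the row count; "Number of registers" is the
    // number of distinct values in the (assumed) "register" column.
    const registers = new Set(rows.map(r => r.register));
    console.log("Number of records:", rows.length);
    console.log("Number of registers:", registers.size);
  });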

In the top left corner of the webpage you can find general information about the dataset:

[Screenshot: general information panel]

• “Source: file” means that the data is taken from the .csv file provided by the server

• “Number of records” is the total number of records in a given dataset

• “Number of registers” is the total number of registers used in a given dataset

In the middle of the webpage you can see the general description of the dataset: the distribution of records by register and by year, shown as a doughnut plot and a bar plot.

[Screenshot: distribution of records by register and by year]
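
These overview plots boil down to counting records per register or per year and handing the counts to Chart.js. The hedged sketch below uses the Chart.js 2+ configuration style (new Chart(ctx, config)); the canvas id “years-chart” and the column name “year” are invented for the example, and the rows array comes from the parsing sketch above.

// Sketch: count records per enrollment year and draw the bar plot.
// Use type: "doughnut" for the per-register plot.
function countBy(rows, column) {
  const counts = {};
  for (const row of rows) {
    counts[row[column]] = (counts[row[column]] || 0) + 1;
  }
  return counts;
}

const byYear = countBy(rows, "year"); // "year" is a hypothetical column name
const years = Object.keys(byYear).sort();
new Chart(document.getElementById("years-chart").getContext("2d"), {
  type: "bar",
  data: {
    labels: years,
    datasets: [{ label: "Records per year", data: years.map(y => byYear[y]) }],
  },
});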

On the right side of the webpage you can find all the variables present in the dataset, together with their descriptions.

[Screenshot: list of dataset variables with descriptions]

If you click on these variables, detailed statistics will be shown. Say we want to see the distribution of “annual_salary” versus “length” (of apprenticeship). To do so, we click on “annual_salary” and “length” in the descriptions, and the following will be shown in the middle of the webpage:

[Screenshot: annual_salary versus length view]
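
Behind such a view is a simple aggregation of one selected column against the other. A sketch of one possible implementation, assuming the rows array from above and numeric values stored as strings in the .csv:

// Sketch: average one numeric column per value of another, e.g. the mean
// "annual_salary" for each "length" of apprenticeship.
function averageBy(rows, keyColumn, valueColumn) {
  const sums = {};
  const counts = {};
  for (const row of rows) {
    const value = parseFloat(row[valueColumn]);
    if (isNaN(value)) continue; // skip records with a missing value
    const key = row[keyColumn];
    sums[key] = (sums[key] || 0) + value;
    counts[key] = (counts[key] || 0) + 1;
  }
  const averages = {};
  for (const key in sums) averages[key] = sums[key] / counts[key];
  return averages;
}

const salaryByLength = averageBy(rows, "length", "annual_salary");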

At the same time, on the left side of the page you will see the list of selected columns, with filters that you can apply to the data you are currently looking at.

[Screenshot: selected columns and filters]

If you click on the “Clear selections” link, all the columns you selected will be deselected, and you will return to the general statistics view.

In the filters section, you can choose values for given columns, and all plots will change accordingly.
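
Conceptually, a filter is just a set of allowed values per column, and every plot is rebuilt from the rows that pass all filters. A small sketch of this idea (the concrete selection is hypothetical):

// Sketch: a filter is a map from column name to a set of allowed values.
// An empty set means "no filter on this column".
const filters = { annual_salary: new Set(["9", "10", "12"]) }; // hypothetical selection

function applyFilters(rows, filters) {
  return rows.filter(row =>
    Object.entries(filters).every(
      ([column, allowed]) => allowed.size === 0 || allowed.has(row[column])
    )
  );
}

// Rebuild every plot from the surviving rows, e.g. with countBy()/averageBy().
const visibleRows = applyFilters(rows, filters);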

Let’s see the distribution of the length of the apprenticeship for average annual salaries greater than 8. For this we choose all values of annual salary greater than 8 (use ctrl/cmd + click for multiple selection):

[Screenshot: filter selection]

[Screenshot: resulting length-of-apprenticeship distribution]

With the “Clear filters” button you can remove all current filters.

Using a database instead of the .csv file

Near the end of our project, we asked for a dump of the MongoDB database. We wanted to connect the database directly to our service instead of using the .csv file. But we encountered a big issue: the data schema in the database is completely different from the one used in the .csv file. We started to integrate our service with the database, but it will take some time to finish. In our test version, all data coming from the database is computed very slowly, both because of the different scale of the data and because of the different schema. But nothing other than time is needed to make our service work with MongoDB, so the solution is scalable.
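
One way to make the database version fast enough would be to push the aggregations into MongoDB itself instead of shipping every document to the client. The sketch below uses the official Node.js driver; the database, collection, and field names are hypothetical, since the real schema differs from the .csv one:

// Sketch (Node.js with the official "mongodb" driver): let the database
// compute per-profession counts instead of transferring every document.
// The database, collection, and field names here are hypothetical.
const { MongoClient } = require("mongodb");

async function countByProfession(url) {
  const client = await MongoClient.connect(url);
  try {
    const contracts = client.db("garzoni").collection("contracts");
    return await contracts
      .aggregate([
        { $group: { _id: "$profession", count: { $sum: 1 } } },
        { $sort: { count: -1 } },
      ])
      .toArray();
  } finally {
    await client.close();
  }
}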

Future work

The most important and difficult improvement that can be made to this project is the integration with MongoDB. This would increase the amount of statistical information that can be obtained from the dataset and would allow users to explore the most current data directly from the database. Otherwise, the service is ready to be used by anyone interested in it. If our idea proves useful for the community, we suggest assigning the development of the MongoDB integration to one of the teams in the next generation of students taking this course.

Acknowledgments

We would like to thank Frédéric Kaplan for teaching us the Digital Humanities course, Giovanni Colavizza and Maud Ehrmann for their productive supervision and helpful advice on our project, and, of course, the Digital Humanities community for their interest in our research.