Inference system based on Venetian extraction data – Second progress report

The aim of this post is to present the Garzoni data set, which was not available at the time of the previous blog post, as well as the progress we have made and the changes that our preliminary study of the data set has led to.

Garzoni data set

The data set consists of 11,532 contracts between masters and apprentices, spanning several decades in Venice and represented as RDF entries. In total, we identified 47,760 different persons who appear or are referred to in the contracts. These figures already give a sense of the considerable size of the data set, which posed significant problems when approached with our initial choice of tools.

In our last post, we argued for the use of Prolog for the analysis, based on results obtained on small-scale examples.[1] However, when trying to apply Prolog to the whole data set, we encountered several difficulties and have therefore decided to consider other tools. We hope that this change will allow us to deliver, by the end of the project, a functional app that can check the consistency of the entries and identify recurring patterns among them.

New approach

The central decision was to abandon our initial idea of working directly with an inference tool on the RDF graphs, in favour of a web application that provides the same functionality. We set out to develop a flexible and user-friendly app that passes SPARQL queries to the SPARQL endpoint of the Garzoni data. This eliminates the need to store and analyse the data on our own computers. Moreover, the output of a query can be stored in a data frame and then analysed in R.
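To make the query step concrete, here is a minimal R sketch of how such a query could be built and sent. The endpoint URL and the RDF predicate names are illustrative assumptions, not taken from the actual Garzoni schema; the actual call is shown commented out, using the CRAN `SPARQL` package.

```r
# Hypothetical endpoint URL -- stands in for the real Garzoni SPARQL endpoint.
endpoint <- "http://example.org/garzoni/sparql"

# Build the query as a plain string; the grz: namespace and the
# hasMaster / hasApprentice predicates are assumptions for illustration.
query <- paste(
  "PREFIX grz: <http://example.org/garzoni#>",
  "SELECT ?contract ?master ?apprentice WHERE {",
  "  ?contract grz:hasMaster ?master ;",
  "            grz:hasApprentice ?apprentice .",
  "} LIMIT 10",
  sep = "\n"
)

# With the CRAN 'SPARQL' package, the result arrives directly as a
# data frame, ready for analysis in R:
# res <- SPARQL::SPARQL(endpoint, query)$results
```

In the app itself, the query string would be assembled from the user's input fields before being passed to the endpoint.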

Our new approach was influenced by a blog post from one of last year's projects, which also worked on the Garzoni data set and encountered similar difficulties. The alternative we propose here avoids having to regenerate the analysis results, as well as the plots, after each update of the data set.[2]

The app is developed with Shiny, a web application framework for R provided by RStudio.[3]

Advantages of the new approach

The size of the data is no longer problematic under this approach, as the computations are performed entirely on the server where the data are stored, which is considerably faster than running them on our own machines.

The app is flexible: arbitrary SPARQL queries can be passed to the endpoint, and the results can be stored in a data frame for further analysis in R.

The results and plots are displayed in a concise and clear manner, which makes them accessible even to users from non-technical backgrounds. Additionally, the web application can be hosted on a Shiny server, making it available to a wider audience.

It is easy to add new features to the app, in order to expand its field of application.

Further developments

For the moment, we have only a simple, yet functional, version of the app. It can access the data, store results in a data frame, run queries, and perform basic statistics. Here you can see the results of such an operation:
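The "basic statistics" step can be sketched in a few lines of base R. The toy data frame and its column names below are illustrative stand-ins for a real query result, not the actual Garzoni variable names.

```r
# Toy stand-in for the data frame returned by a SPARQL query;
# column names (master, year) are assumptions for illustration.
res <- data.frame(
  master = c("Zuane", "Zuane", "Marco", "Piero"),
  year   = c(1580, 1582, 1580, 1591)
)

n_contracts <- nrow(res)                   # number of rows returned by the query
contracts_per_master <- table(res$master)  # contract counts grouped by master
earliest <- min(res$year)                  # earliest contract year in the result
```

Once the result is an ordinary data frame, any of R's summary and plotting functions can be applied to it directly, which is what the app exposes through its interface.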

[Screenshot of a SPARQL query example.]

For the remainder of the semester, we will focus on expanding the features of the app, while continually testing new additions against the Garzoni data set. The new directions to explore will be decided based on the results obtained with the existing features and on the suggestions of our project supervisors.

References

[1] Inference system based on Venetian extraction data – First progress report http://veniceatlas.epfl.ch/inference-system-based-on-venetian-extraction-data-first-progress-report/

[2] Garzoni font dataset’s descriptive statistics http://veniceatlas.epfl.ch/atlas/graphs-data-visualization/garzoni-font-datasets-descriptive-statistics/

[3] Shiny (RStudio) http://shiny.rstudio.com/

Authors: Loris Michel, Lavinia Ghita