Inference system based on Venetian extraction data – Final report

Automatically Detecting Inconsistencies in Venetian Data


 

Despite the large amount of information on the World Wide Web, much of it is not easily accessible because it is stored without a common organisation or format. Algorithmic learning is becoming increasingly important as computational costs decrease, for example due to faster CPUs, parallelisation using cloud computing and more efficient algorithm designs. In spite of these developments, classical HTML pages, which still represent the majority of the Web, are marked up for presentation and are meant to be read by humans. This makes automated information extraction difficult and justifies the need to organise the content of the Web through systematic and unifying procedures, written in a machine-readable standard. The Semantic Web is an extension of the Web that attempts to address this type of challenge. For our project, we work with a well-established and parsimonious standard known as the Resource Description Framework (RDF).

RDF and URI

RDF is a standard that organises data as triples. An RDF statement is composed of a subject, a predicate and an object. Recall that a binary relation R on a set of terms D is defined as a subset of the Cartesian product D × D. In this sense, each predicate encodes a binary relation, with one (subject, object) pair per triple, which makes RDF one of the simplest ways to encode relations in a mathematical framework. The RDF logic is not stratified, in the sense that meta-knowledge about the terms is at the same organisational level as the terms themselves. This implies that a relation R defined on the set of terms D is again a term that can be used as a subject or an object. In contrast to a simple set of words, a language also has rules, which are introduced in our framework as constraints on the terms. It remains to define a machine-readable representation of the terms. This is done using URIs (Uniform Resource Identifiers), which identify each term unambiguously. The most familiar kind of URI is the URL, which can additionally be dereferenced over HTTP, for example in a web browser.
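As a small illustration of the triple model, the following Python sketch stores a handful of statements as plain tuples and recovers the binary relation encoded by one predicate. The `http://example.org/` namespace and every identifier in it are made up for the example; they are not the Garzoni vocabulary.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical namespace, used only for illustration.
EX = "http://example.org/"

graph = {
    Triple(EX + "person/1", EX + "hasRole", EX + "role/Apprentice"),
    Triple(EX + "person/2", EX + "hasRole", EX + "role/Master"),
    Triple(EX + "person/1", EX + "apprenticeOf", EX + "person/2"),
    # Not stratified: a predicate is itself a term and can be a subject.
    Triple(EX + "apprenticeOf", EX + "inverseOf", EX + "masterOf"),
}


def relation(graph, predicate):
    """Return the set of (subject, object) pairs linked by one predicate."""
    return {(t.subject, t.obj) for t in graph if t.predicate == predicate}


def terms(graph):
    """Return the set D of all terms appearing in any position of a triple."""
    return {term for t in graph for term in t}


apprentice_of = relation(graph, EX + "apprenticeOf")
```

Here `apprentice_of` is a subset of D × D, and the predicate `apprenticeOf` is itself a member of `terms(graph)`, which is what the absence of stratification means in practice.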

Garzoni data set

The project studies the “Accordi dei Garzoni”, a database of apprenticeship contracts recorded during the 16th and 17th centuries in Venice, hereafter denoted simply as the Garzoni data set.

It consists of 11532 contracts between masters and apprentices over several decades in Venice, represented as RDF entries. In total, we identified 47760 different persons who appear or are referred to in the contracts. The size of the data set has played a central role in choosing the methods and tools for the analysis.

Moreover, the data set is constantly being extended. Any new tool for the Garzoni data set should therefore be flexible enough to accommodate ongoing changes.

Building an App

The central decision was to abandon our initial idea of working directly with the inference tools for RDF graphs, in favour of a web application that provides this functionality. Further details about the reasoning behind this decision can be found in the intermediate reports [1], [2], [3]. We then set out to develop a flexible and user-friendly app that passes SPARQL queries to the SPARQL endpoint of the Garzoni data and then runs the logical consistency tests in the interface. This eliminates the need to store the data on our computers. Moreover, the output of a query can be stored in a data frame and then analysed in R.

The idea for this line of action was influenced by a blog post from one of last year’s projects, which also worked on the Garzoni data set and encountered similar difficulties. The alternative we propose here no longer requires regenerating the analysis results and plots after each update of the data set [4].

The app is developed using Shiny, a web application framework for R provided by RStudio [5]. It is flexible and can pass SPARQL queries to the SPARQL endpoint of the Garzoni data. The results and plots are displayed concisely and clearly, which makes them accessible even to users from unrelated backgrounds. Additionally, the web application can be hosted on a Shiny server, making it available to more users.
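The query-then-tabulate step can be sketched independently of Shiny. The app itself is written in R; the Python sketch below only illustrates the general pattern of sending a query following the SPARQL 1.1 Protocol and flattening the standard JSON results format into rows. The endpoint URL, the `ex:` vocabulary and the sample payload are all assumptions made for the example, not the actual Garzoni endpoint or schema.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint URL; replace with the real SPARQL endpoint.
ENDPOINT = "http://example.org/garzoni/sparql"

# Hypothetical query; the ex: vocabulary is an assumption.
QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?person ?role WHERE { ?person ex:hasRole ?role } LIMIT 10
"""


def build_request(endpoint, query):
    """Build a GET request that passes the query in the 'query' parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_results(payload):
    """Flatten a SPARQL JSON results document into one dict per result row."""
    variables = payload["head"]["vars"]
    return [
        {var: binding.get(var, {}).get("value") for var in variables}
        for binding in payload["results"]["bindings"]
    ]


def run_query(endpoint, query, timeout=30):
    """Send the query and return the rows; requires network access."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return rows_from_results(json.load(response))


# Offline example with a made-up payload in the standard results format.
sample = {
    "head": {"vars": ["person", "role"]},
    "results": {
        "bindings": [
            {
                "person": {"type": "uri", "value": "http://example.org/person/1"},
                "role": {"type": "uri", "value": "http://example.org/role/Apprentice"},
            },
            # Unbound variables are simply absent from a binding.
            {"person": {"type": "uri", "value": "http://example.org/person/2"}},
        ]
    },
}
rows = rows_from_results(sample)
```

Because the data are fetched at query time, nothing has to be stored locally, and the list of rows plays the role of the data frame mentioned above.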

Results

Historical data are, by their nature, sparse and incomplete. However, the knowledge base can be extended by performing inference on the existing data. It is important to ensure the consistency of the data before moving on to more complex analyses. Even a few faulty entries can greatly influence the results of test statistics and lead to incorrect hypotheses and conclusions. The project aims to check the consistency of the entries in the Garzoni data graph after the rule-based inference step.

Annotation errors may cause inconsistencies in the results of the inference process conducted directly on the RDF graph. There are two ways to detect logical anomalies in the knowledge graph:

  • Build a single query that describes the whole inconsistency and check whether the resulting data frame is empty.
  • Design complex variables from simpler ones based on one or more queries.

We prefer the second approach because it is more flexible and keeps the SPARQL queries simple.

Hereafter, we present three examples of logical consistency tests based on some queries from the Garzoni SPARQL endpoint [6].

They are performed by selecting the RDF entries that satisfy the given properties and checking whether their inclusion and intersection relations follow the expected logic. Whenever possible logical inconsistencies are detected, we write the results to a file so that the RDF entries can be verified and, if necessary, corrected by hand. It should be mentioned, however, that most of the tested queries do not find any contradictions, which suggests that the Garzoni data set is generally clean and coherent.
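The inclusion and intersection checks can be sketched with ordinary set operations on the outputs of simple queries. In the Python sketch below, the three input sets stand in for query results, and both rules (every apprentice entry must also appear among the persons mentioned in the contract; nobody is apprentice and master in the same contract) are assumptions chosen for illustration rather than the rules implemented in the app.

```python
# Hypothetical query outputs: sets of (contract, person) pairs.
apprentice_in = {("c1", "p1"), ("c2", "p2"), ("c3", "p3")}
master_in = {("c1", "p4"), ("c2", "p5"), ("c3", "p3")}
mentioned_in = {
    ("c1", "p1"), ("c1", "p4"),
    ("c2", "p2"), ("c2", "p5"),
    ("c3", "p3"),
}


def check_subset(name, smaller, larger):
    """Inclusion test: entries of `smaller` missing from `larger` are violations."""
    return {"check": name, "violations": sorted(smaller - larger)}


def check_disjoint(name, left, right):
    """Intersection test: entries present in both sets are violations."""
    return {"check": name, "violations": sorted(left & right)}


reports = [
    check_subset("apprentices are mentioned", apprentice_in, mentioned_in),
    check_disjoint("apprentice and master differ", apprentice_in, master_in),
]

# Only checks with at least one violation need manual verification.
flagged = [report for report in reports if report["violations"]]
```

With these made-up sets, only the second check is flagged, for person `p3` in contract `c3`. Each input comes from a simple query, and the logic combining them lives outside SPARQL.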

Role ordering through time: Apprentice – Master

The query checks that a person has not become a master before being an apprentice. The results, partially displayed in Figure 1, indicate possible consistency errors for four persons.

Figure 1: Output of the Master before Apprentice query in the Shiny App.
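A role-ordering test of this kind can be sketched as follows, assuming a simple query returns one row per person, role and contract year. The rows and the exact comparison (first year in each role) are assumptions for illustration, not the query used in the app.

```python
from collections import defaultdict

# Hypothetical rows: (person, role, year of the contract).
rows = [
    ("p1", "apprentice", 1585),
    ("p1", "master", 1598),
    ("p2", "master", 1590),
    ("p2", "apprentice", 1594),  # master before apprentice: suspicious
    ("p3", "master", 1601),      # never an apprentice: not flagged
    ("p4", "guarantor", 1600),
    ("p4", "apprentice", 1605),
]


def first_years(rows):
    """Map each person to the first year in which they appear in each role."""
    first = defaultdict(dict)
    for person, role, year in rows:
        previous = first[person].get(role)
        if previous is None or year < previous:
            first[person][role] = year
    return first


def order_violations(rows, expected_first, expected_second):
    """Flag persons whose first `expected_second` year precedes their first
    `expected_first` year. Persons lacking either role are ignored."""
    flagged = []
    for person, roles in sorted(first_years(rows).items()):
        if expected_first in roles and expected_second in roles:
            if roles[expected_second] < roles[expected_first]:
                flagged.append(
                    (person, roles[expected_second], roles[expected_first])
                )
    return flagged


master_before_apprentice = order_violations(rows, "apprentice", "master")
```

The same function covers other orderings by changing the role names, for instance `order_violations(rows, "apprentice", "guarantor")`.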

 

Apprentices simultaneously in different contracts

The query checks whether a person is registered as an apprentice in more than one contract at the same time. The results, partially displayed in Figure 2, should be studied in more detail, as such an event is unlikely but not necessarily impossible.

Figure 2: Output of the query checking if an apprentice is in more than one contract at a given time.
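One way to sketch this test is as an interval-overlap check. The Python example below assumes that each contract can be reduced to a start year and an end year for its apprentice, with the end year exclusive; this representation and the rows themselves are assumptions, since the fields used by the actual query are not described here.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows: (apprentice, contract, start_year, end_year),
# where the interval is half-open: start inclusive, end exclusive.
contracts = [
    ("p1", "c1", 1585, 1590),
    ("p1", "c2", 1590, 1594),  # starts when c1 ends: no overlap
    ("p2", "c3", 1600, 1605),
    ("p2", "c4", 1603, 1607),  # overlaps c3 from 1603 to 1605
]


def overlapping_contracts(contracts):
    """Return (person, contract_a, contract_b) for each pair of contracts
    of the same apprentice whose time intervals intersect."""
    by_person = defaultdict(list)
    for person, contract, start, end in contracts:
        by_person[person].append((start, end, contract))

    flagged = []
    for person, spans in sorted(by_person.items()):
        for (s1, e1, c1), (s2, e2, c2) in combinations(sorted(spans), 2):
            if s1 < e2 and s2 < e1:  # half-open intervals intersect
                flagged.append((person, c1, c2))
    return flagged


simultaneous = overlapping_contracts(contracts)
```

With half-open intervals, a contract that begins in the year the previous one ends is not flagged, which avoids reporting ordinary successive apprenticeships.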

Role ordering through time: Apprentice – Guarantor

Somewhat similar to the first example, this query checks whether a person appears as a guarantor before being an apprentice. Again, the results, partially displayed in Figure 3, should be studied in more detail, as such an event is unlikely, but not necessarily impossible.

Figure 3: Output of the Guarantor before Apprentice query in the Shiny App.

Further Developments

To conclude, we propose a few directions along which the project can be extended and improved.

  • Run more specific queries to identify other possible errors or inconsistencies across multiple data sets.
  • Use the R package data.table to manipulate the extracted data more efficiently.
  • Create a more complex and systematic procedure of logical verification using statistical rule-based methods.
  • Parse the SPARQL queries such that new queries can be automatically generated.

References

[1] Inference system based on Venetian extraction data – First progress report http://veniceatlas.epfl.ch/inference-system-based-on-venetian-extraction-data-first-progress-report/

[2] Inference system based on Venetian extraction data – Second progress report http://veniceatlas.epfl.ch/inference-system-based-on-venetian-extraction-data-second-progress-report/

[3] Inference system based on Venetian extraction data – Third progress report http://veniceatlas.epfl.ch/inference-system-based-on-venetian-extraction-data-third-progress-report/

[4] Garzoni font dataset’s descriptive statistics http://veniceatlas.epfl.ch/atlas/graphs-data-visualization/garzoni-font-datasets-descriptive-statistics/

[5] Shiny Apps RStudio http://shiny.rstudio.com/

[6] Series of queries on the Garzoni data written by Maud Ehrmann https://gist.github.com/ehrmama/

The Shiny App is publicly available at: http://inconsistencydetect.cf/

Authors: Loris Michel, Lavinia Ghita