Inference system based on Venetian extraction data

The Venice Time Machine is a joint project between Ca’ Foscari University of Venice and EPFL. It aims to model the evolution of the city of Venice throughout its history, using the information held in the Archivio di Stato in Venice. The historical documents in the archive must first be digitised and structured in a database that allows easy access. A significant difficulty in interpreting the documents arises from the diversity of the entries and of their formats: the archive consists of handwritten records produced by many different agents, in a language that evolved over time. These aspects are analysed in several projects within the Venice Time Machine, and they lead to the development of a database structure that facilitates the extrapolation and inference of more complex relations between the people of Venice.

The Inference System based on Venetian Extraction Data project sets out to develop an ontology and a logical inference system that can structure the information extracted from the database.

Objectives

The way the data is stored and organised is closely linked to the way information is extracted from the database. This interdependence should be taken into account when choosing the structure: a good organisation may help reveal inconsistencies in the data, and clearly extracted information may in turn suggest better ways to organise it. With this in mind, we will look for a method of storing the data that facilitates access to information and allows fast searching through the database. The choice will be heavily influenced by the access policy (public or private), as this determines the expected number of users.

Similarly, it must be decided who can contribute to the project: whether it is open to everyone, open only to members of academic circles, or whether a hybrid approach can be used. In a hybrid approach, for example, anyone can submit a contribution, but it is added to the platform only once it has been validated by experts. This option, while somewhat slower than a fully open model, would avoid some of the pitfalls that websites like Wikipedia have fallen into.
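
The hybrid model described above can be pictured as a small review queue: anyone may submit, but only entries approved by an expert become visible. The sketch below, in Python, is one hypothetical way to express this. The class names, the three statuses and the idea of a registered expert list are assumptions made for illustration, not part of any existing platform.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class Status(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class Contribution:
    author: str
    content: str
    status: Status = Status.PENDING


@dataclass
class Platform:
    experts: Set[str]  # hypothetical list of people allowed to validate
    contributions: List[Contribution] = field(default_factory=list)

    def submit(self, author: str, content: str) -> Contribution:
        # Anyone may submit; new entries always start as pending.
        item = Contribution(author, content)
        self.contributions.append(item)
        return item

    def review(self, reviewer: str, item: Contribution, accept: bool) -> None:
        # Only registered experts may change the status of an entry.
        if reviewer not in self.experts:
            raise PermissionError(f"{reviewer} is not a registered expert")
        item.status = Status.ACCEPTED if accept else Status.REJECTED

    def published(self) -> List[str]:
        # Readers only ever see validated content.
        return [c.content for c in self.contributions if c.status is Status.ACCEPTED]


# Hypothetical usage: the names and the text are placeholders.
platform = Platform(experts={"expert_1"})
entry = platform.submit("visitor", "Proposed transcription of a contract")
before_review = platform.published()
platform.review("expert_1", entry, accept=True)
after_review = platform.published()
print(before_review, after_review)
```

The cost of this design is the delay between submission and publication, which is the slowdown mentioned above.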

The ease of use of the platform should also be carefully considered. The design should make it easy to understand and use for less experienced contributors and for people from unrelated fields.

Methodology

Levels of variability

The first question that must be addressed is identifying the factors of variability involved in the processes of inference and information extraction. The objective of inference is the retrieval of information about past events, such as the context, the location and the people involved. We place the problem in the framework of an assumed knowledge structure, where common knowledge and relations can be represented by the structure of the language and its vocabulary.

Secondly, we must deal with the variability associated with the interpretation of names, dates, quantities, etc., especially when the ontology is not given explicitly but only implied. Take the example of a book. Without an associated ontology it is just a collection of words, so it only makes sense when analysed in a contextual framework. Thus, the data can only be studied if we assume a standing knowledge structure or, in a weaker sense, an ontological framework imposed by the language in which the documents are written.

Another possible source of errors is the digitisation step, in which the original documents are scanned, translated and saved in a digital format using machine learning and pattern recognition algorithms. Because of the age and state of the original records in the archive, technical malfunctions during scanning, and the evolution of the language over the period studied, this step may partially alter the data.

Design of the model

Based on the conclusions of the initial parts of the project, we will decide on a practical way to organise the data and proceed to build an inference system. As a first step, we will restrict ourselves to a specific domain of inference, for example family relations, and then generalise it to handle more complex relationships. The inference algorithm should be able to check the consistency of the annotation system, suggest new structures and patterns in the data and validate new facts on the basis of the existing rules.
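
A minimal sketch of what such a system could look like for family relations is given below, in plain Python for brevity rather than the Scala the project plans to use. It applies three hand-written rules by forward chaining until no new facts appear, then runs one consistency check. The rules, the relation names (parentOf, ancestorOf, siblingOf) and the people are assumptions made for illustration and are not taken from the project's data.

```python
from typing import List, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object)


def infer(facts: Set[Fact]) -> Set[Fact]:
    """Forward chaining: apply the rules until a fixed point is reached."""
    known = set(facts)
    while True:
        new: Set[Fact] = set()
        parents = [(s, o) for s, r, o in known if r == "parentOf"]
        ancestors = [(s, o) for s, r, o in known if r == "ancestorOf"]
        # Rule 1: every parent is an ancestor.
        for p, c in parents:
            new.add((p, "ancestorOf", c))
        # Rule 2: ancestorOf is transitive.
        for a, b in ancestors:
            for b2, c in ancestors:
                if b == b2:
                    new.add((a, "ancestorOf", c))
        # Rule 3: two distinct children of the same parent are siblings.
        for p1, c1 in parents:
            for p2, c2 in parents:
                if p1 == p2 and c1 != c2:
                    new.add((c1, "siblingOf", c2))
        if new <= known:  # nothing new was derived, so stop
            return known
        known |= new


def inconsistencies(facts: Set[Fact]) -> List[str]:
    """Consistency check: nobody may end up being their own ancestor."""
    return sorted(s for s, r, o in infer(facts) if r == "ancestorOf" and s == o)


# Hypothetical records; the names are placeholders.
recorded = {
    ("Anna", "parentOf", "Bortolo"),
    ("Anna", "parentOf", "Caterina"),
    ("Bortolo", "parentOf", "Domenico"),
}
derived = infer(recorded)
print(("Anna", "ancestorOf", "Domenico") in derived)  # a fact that was never recorded
print(inconsistencies(recorded))

# A candidate fact that would create a cycle is flagged before it is accepted.
candidate = ("Domenico", "parentOf", "Anna")
print(inconsistencies(recorded | {candidate}))
```

Validating a new fact then amounts to adding it to the known facts and checking that the list of inconsistencies stays empty. A real ontology would need many more rules and checks than this.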

Technical aspects

The data is linked using the Resource Description Framework (RDF), a graph-based representation format in which each statement is a triple of subject, predicate and object that specifies a relationship between entries. For data extraction and the user interface we will use R and the additional tools offered by RStudio, while for the inference we have decided to use Scala. For the ontology we will use the OWL API with Scala. The choice of programming languages is justified by their respective strengths; in particular, Scala offers strong support for domain-specific languages and constraint programming.
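
As a rough illustration of RDF's three-part statements (usually called triples of subject, predicate and object), the sketch below stores statements as tuples and answers pattern queries in which None acts as a wildcard. It is written in plain Python for brevity, not in Scala or with the OWL API that the project intends to use. Every identifier in it (the ex: prefix, the person names, the property names) is a hypothetical placeholder, not drawn from the Garzoni data.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """A toy in-memory graph of RDF-style statements."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        # A set ignores duplicates, so asserting a fact twice is harmless.
        self._triples.add((subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> Iterator[Triple]:
        # None is a wildcard: it matches any value in that position.
        # Sorting only makes the output order reproducible.
        for s, p, o in sorted(self._triples):
            if subject is not None and s != subject:
                continue
            if predicate is not None and p != predicate:
                continue
            if obj is not None and o != obj:
                continue
            yield (s, p, o)


# Hypothetical example data; the "ex:" names are placeholders.
store = TripleStore()
store.add("ex:Marco", "ex:apprenticeOf", "ex:Zuane")
store.add("ex:Marco", "ex:hasProfession", "ex:Printer")
store.add("ex:Zuane", "ex:hasProfession", "ex:Printer")

# Who is recorded as a printer?
printers = [s for s, _, _ in store.match(predicate="ex:hasProfession", obj="ex:Printer")]
print(printers)

# Everything known about one person.
about_marco = list(store.match(subject="ex:Marco"))
print(about_marco)
```

In actual RDF the three components are IRIs, literals or blank nodes rather than bare strings, and a dedicated library would normally handle storage and querying.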

Project plan

The first step of the project will be an exploratory analysis of the data and of the existing ontologies. The conclusions might suggest several different approaches for the next stages, which will have to be carefully compared.

Secondly, we will design an inference system that is sufficiently robust to handle the difficulties mentioned in the previous paragraphs.

Then, we will wrap a user-friendly interface around it. This should allow all users, regardless of their background and familiarity with the subject, to search the database based on the proposed selection criteria. The best approach for this step will hopefully be suggested by the conclusions of the previous ones.

Finally, we should be able to test the system and obtain conclusive results on the Garzoni data or other available datasets. Moreover, we aim to structure the system in such a way that it can be further developed and improved by future projects, and easily integrated into the larger framework of the Venice Time Machine.

Group members:

  • Lavinia Ghita
  • Loris Michel