Inference system based on Venetian extraction data – First progress report

Abstract

The Semantic Web community shows an increasing interest in organising data and extracting information efficiently and systematically. This article reviews some of the well-known APIs and tools that support RDF inference. The aim is to get a better understanding of their advantages and disadvantages and of the tasks they are best adapted for.

Introduction

Despite the large amount of information on the World Wide Web, it is often not easily available due to the lack of organisation and unification in the way it is stored. The power of algorithmic learning has an increasing importance as computational costs decrease, for example due to faster CPUs, parallelisation using cloud computing and more efficient algorithm designs. In spite of these developments, classical HTML pages, which still represent the majority of the Web, are marked up for presentation and are meant to be read by humans. This causes computational difficulties when aiming to extract information, thus justifying the need to better organise the content of the Web through systematic and unifying procedures, written in a machine-readable standard. The Semantic Web is an extension of the Web which attempts to solve this type of challenge. For our project, we will work with a well-established and parsimonious standard known as the Resource Description Framework (RDF).

RDF and URI

RDF is a standard that organises data as logical triples. An RDF statement is composed of a subject, a predicate and an object. Recall that a (binary) mathematical relation R on a set of terms D is defined as a subset of the Cartesian product D × D. In this sense, each predicate encodes a binary relation: a triple (s, p, o) states that the pair (s, o) belongs to the relation named by p. RDF is therefore one of the simplest ways to encode relations in a mathematical framework. The RDF logic is not stratified, in the sense that meta-knowledge about the terms sits at the same organisational level as the terms themselves. This implies that a relation R defined on the set of terms D is itself a term that can be used as a subject or an object. In contrast to a simple set of words, a language also has rules, which are introduced in our framework as constraints on the terms. It remains to define an adequate machine-readable representation of the terms. This is done using URIs (Uniform Resource Identifiers), which identify resources unambiguously. One common kind of URI is the URL, which additionally specifies how the resource can be retrieved, for example over the HTTP protocol.
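To make the correspondence between triples and relations concrete, the following is a minimal Python sketch. It stores a graph as a set of (subject, predicate, object) tuples and extracts the relation attached to one predicate. The namespace http://example.org/ and all the names built from it are hypothetical and are not taken from the Garzoni data.

```python
# Hypothetical namespace and terms, used for illustration only.
EX = "http://example.org/"

# An RDF graph modelled as a set of (subject, predicate, object) triples.
graph = {
    (EX + "giovanni", EX + "apprenticeOf", EX + "marco"),
    (EX + "marco", EX + "memberOf", EX + "glassmakersGuild"),
    # Non-stratified logic: the predicate apprenticeOf is itself used as a subject.
    (EX + "apprenticeOf", EX + "hasDomain", EX + "Person"),
}


def relation(graph, predicate):
    """Return the binary relation (a subset of D x D) encoded by one predicate."""
    return {(s, o) for (s, p, o) in graph if p == predicate}


# The set of (subject, object) pairs linked by the apprenticeOf predicate.
apprentice_pairs = relation(graph, EX + "apprenticeOf")
```

The third triple shows the absence of stratification: the term naming the relation appears in the subject position, like any other term.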

Existing inferential methods

Suppose that we have an RDF graph. We want to draw inferences based on the rules and produce new connections that expand the knowledge graph. This operation may be formalised as subgraph pattern matching on the initial graph. The following paragraphs offer a review of several existing inference methods.
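As an illustration of subgraph pattern matching, the sketch below matches a list of triple patterns against a graph and returns the variable bindings that satisfy all of them. The convention that variables are strings starting with "?" and the function names are assumptions of ours; this is not the algorithm of any of the libraries reviewed here.

```python
def is_var(term):
    """Variables are, by our convention, strings that start with '?'."""
    return term.startswith("?")


def match_pattern(pattern, triple, bindings):
    """Extend bindings so that pattern equals triple, or return None on failure."""
    new = dict(bindings)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or check consistency with an earlier binding.
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def query(patterns, graph, bindings=None):
    """Yield every binding that matches all the patterns in the graph."""
    bindings = {} if bindings is None else bindings
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        extended = match_pattern(first, triple, bindings)
        if extended is not None:
            yield from query(rest, graph, extended)


# Hypothetical data: find chains x -> y -> z along the 'knows' predicate.
small_graph = {("a", "knows", "b"), ("b", "knows", "c")}
chains = list(query([("?x", "knows", "?y"), ("?y", "knows", "?z")], small_graph))
```

Each pattern is matched against every triple, so the cost grows quickly with the number of patterns. A real engine would index the graph rather than scan it.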

The Jena library

The Jena reasoner [3] is part of a Java framework for inference on RDF graphs. The idea is to combine the RDF statements that make up our data set with pre-existing rules, defined in a domain-specific vocabulary known as an ontology.

It allows one to bind ontologies to RDF graphs and to make inferences with respect to user-defined or ontology-defined rules. It also allows one to validate the logical consistency of an ontology with a given graph. The weakness of the API in terms of computational time appears when one tries to query the structure with a complex pattern involving multistage indetermination. This situation requires keeping track of the history of the whole model at each step.

RDFS and OWL reasoner

Jena allows the use of RDFS and OWL reasoners, whose performance depends on the nature and structure of the ontology.

Rule engines

Three rule engines are available in Jena: forward chaining, backward chaining and a hybrid engine. In a forward strategy scheme, starting from the data, all the rules act on the RDF graph and produce new deduced triples. This process repeats until the logical closure under the given rules is reached, that is, until no rule can derive a new triple. The enlarged graph is then ready for the querying phase.
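The forward strategy can be sketched in a few lines of Python. This is a naive fixed-point loop written for illustration, under the assumption that a rule is a pair (body, head) of triple patterns with "?"-prefixed variables. It does not reproduce Jena's forward engine, and the partOf data are hypothetical.

```python
def match_all(patterns, graph, bindings):
    """Yield every binding of the variables that satisfies all body patterns."""
    if not patterns:
        yield bindings
        return
    for triple in graph:
        new = dict(bindings)
        ok = True
        for p, t in zip(patterns[0], triple):
            if p.startswith("?"):
                if new.setdefault(p, t) != t:
                    ok = False
                    break
            elif p != t:
                ok = False
                break
        if ok:
            yield from match_all(patterns[1:], graph, new)


def substitute(pattern, bindings):
    """Replace the variables of a pattern by their bound values."""
    return tuple(bindings.get(term, term) for term in pattern)


def forward_closure(graph, rules):
    """Apply all rules repeatedly until no new triple can be derived."""
    closure = set(graph)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            # The set is built completely before the closure is modified.
            derived = {substitute(head, b) for b in match_all(body, closure, {})}
            new_triples = derived - closure
            if new_triples:
                closure |= new_triples
                changed = True
    return closure


# Hypothetical rule: partOf is transitive.
transitivity = ([("?x", "partOf", "?y"), ("?y", "partOf", "?z")],
                ("?x", "partOf", "?z"))
facts = {("a", "partOf", "b"), ("b", "partOf", "c"), ("c", "partOf", "d")}
closure = forward_closure(facts, [transitivity])
```

The loop stops when a full pass over the rules adds nothing, which is exactly the logical closure. All the work is done before any query is asked, so querying the enlarged graph is then a plain lookup.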

The drawback is that we have no control over the order in which the rules are applied to the data, nor over the internal order in which the graph is explored. Conversely, the backward strategy acts as a logic programming strategy and starts from a goal. It works by pattern matching queries against the RDF graph and by successively splitting complex goals into smaller subgoals until a match is found. Hybrid strategies try to combine the two methods described above in the best possible way.
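The backward strategy can be sketched in the same style. Starting from a goal, the code below either matches it against a stored triple or replaces it by the body of a rule whose head unifies with it. The rule format, the "?" convention for variables and the depth bound that guarantees termination are simplifications of ours, not a description of the Jena or Prolog engines.

```python
import itertools

_fresh_ids = itertools.count()


def walk(term, subst):
    """Follow variable bindings until a constant or an unbound variable is reached."""
    while term.startswith("?") and term in subst:
        term = subst[term]
    return term


def unify(a, b, subst):
    """Unify two triples under subst; return the extended substitution or None."""
    subst = dict(subst)
    for x, y in zip(a, b):
        x, y = walk(x, subst), walk(y, subst)
        if x == y:
            continue
        if x.startswith("?"):
            subst[x] = y
        elif y.startswith("?"):
            subst[y] = x
        else:
            return None
    return subst


def rename(rule):
    """Give the variables of a rule fresh names to avoid clashes between uses."""
    body, head = rule
    n = next(_fresh_ids)

    def fresh(triple):
        return tuple(f"{t}_{n}" if t.startswith("?") else t for t in triple)

    return [fresh(t) for t in body], fresh(head)


def prove(goals, facts, rules, subst, depth=8):
    """Yield substitutions proving all goals, using at most `depth` nested rules."""
    if not goals:
        yield subst
        return
    if depth == 0:
        return
    goal, rest = goals[0], goals[1:]
    for fact in facts:  # the goal matches a stored triple directly
        s = unify(goal, fact, subst)
        if s is not None:
            yield from prove(rest, facts, rules, s, depth)
    for rule in rules:  # the goal is split into the subgoals of a rule body
        body, head = rename(rule)
        s = unify(goal, head, subst)
        if s is not None:
            yield from prove(body + rest, facts, rules, s, depth - 1)


# Hypothetical data and rule: partOf is transitive.
facts = {("a", "partOf", "b"), ("b", "partOf", "c")}
rules = [([("?x", "partOf", "?y"), ("?y", "partOf", "?z")], ("?x", "partOf", "?z"))]
provable = next(prove([("a", "partOf", "c")], facts, rules, {}), None) is not None
```

In contrast with the forward strategy, nothing is derived until a goal is asked, and only the triples relevant to that goal are explored.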

Prolog and RDF

Prolog is historically one of the first logic programming languages. It enables the fast design of knowledge bases in a framework that allows one to define complex relations from simpler ones and that integrates a query system. RDF statements map naturally onto Prolog facts because of their triple format. Although there might be some drawbacks regarding computational efficiency in the inferential step, Prolog seems quite well adapted to our problem.

GraphPath

GraphPath [1] is a small Python API for RDF queries and inference. It uses a backward strategy and is comparable to Prolog in the way it operates. It inherits its adaptability from Python. However, using it in practice may prove difficult due to the lack of documentation and examples available.

Conclusions

Since we do not yet have access to the Garzoni data, we are at the moment unable to make a definitive decision regarding the inference methods that are best adapted to the data set. We therefore analysed and compared the two main tools for RDF inferential systems: the Java API Jena and Prolog. Even though Jena is better documented and more user-friendly, enabling the fast development of working systems, we consider that Prolog is probably better suited for our needs. During the following weeks, we will thus mainly focus on the Prolog design of an inferential system [2] in order to get a better grasp of the inferential algorithms and their efficiency.

References

[1] GraphPath, http://www.langdale.com.au/GraphPath/ consulted on 3rd March 2016.

[2] Expert systems in Prolog, http://www.amzi.com/ExpertSystemsInProlog/xsipfrtop.htm consulted on 3rd March 2016.

[3] Apache Jena: reasoner and rules engines, https://jena.apache.org/documentation/inference/ consulted on 3rd March 2016.

Authors: Loris Michel, Lavinia Ghita