On February 19th we started our journey with the Venice Time Machine project, with the aim of building a Facebook-like representation of historical Venetian figures. As this is the first phase of our project, we are focusing on data extraction. In this post we will first mention the tools we are currently using to extract information about these figures, then discuss some problems that arose during our research, and finally present the next step of our project.
The project plan we presented last semester states that the first phase of our project is information extraction, so we decided to start by querying DBpedia. As described on its homepage, dbpedia.org:
“DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data”. Retrieved March 12, 2014 from http://dbpedia.org/About
The language used for querying DBpedia is called SPARQL, and it is quite different from its widely used counterpart, SQL. While learning SPARQL on the go, we were also querying DBpedia to get information about people who might be related to Venice by birth, by death, or in other ways we could think of. At this point, we realized that both DBpedia and Wikipedia organise their entries into categories.
While exploring DBpedia, we noticed that the entries for people were already categorized in different ways, and one of the categories we came across was “People from Venice”, exactly the subgroup we were looking for. In fact, this category led us to a Wikipedia page that presents the subcategories and the Wikipedia entries falling under it. Among these subcategories we find families, doges, princes, mayors, patriarchs, merchants, composers, painters, architects, explorers, etc. The category also includes entries for people who do not fall under any of these subcategories, though.
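To give an idea of what such a query looks like, here is a minimal Python sketch that builds a SPARQL request for this category against the public DBpedia endpoint (the category URI comes from the page mentioned above; the `LIMIT 10` is just for illustration):

```python
from urllib.parse import urlencode

# SPARQL query selecting people tagged with the
# "People from Venice" category on DBpedia.
query = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?person WHERE {
    ?person dcterms:subject
            <http://dbpedia.org/resource/Category:People_from_Venice> .
}
LIMIT 10
"""

# The public endpoint accepts the query as a GET parameter;
# here we only build the request URL without sending it.
url = "http://dbpedia.org/sparql?" + urlencode({"query": query, "format": "json"})
print(url)
```

Pasting the query itself into the endpoint's text box, as described below, gives the same results interactively.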
DBpedia offers various ways to obtain its data. The SPARQL endpoint it provides is the most basic one, and it is also the one we have used. Users enter a query into the provided text box and hit “enter”; after a few seconds the data appears on the screen, as we can see in the following image:
It is also possible to download the data in several formats, such as CSV, JSON, XML, and Atom.
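The CSV export, for instance, is straightforward to process with standard tools. A small sketch of what that looks like (the sample response below is illustrative, not an actual query result):

```python
import csv
import io

# A sample of what a one-column CSV response from the endpoint
# might look like; the row content is a made-up example.
sample_csv = '"person"\n"http://dbpedia.org/resource/Marco_Polo"\n'

# csv.DictReader maps each row to the column names in the header.
rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(rows[0]["person"])
```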
In our previous plan, we stated that we would use the tools presented above to gather information, but while checking DBpedia's content we realized that it is extremely diverse; for example, we found connections among people that we might not have predicted without using DBpedia. So, how could we start sorting the information that will appear on our website without skipping relevant information?
We decided that a good approach would be to automate the information-extraction process as much as possible, which differs from our previous plan. In this way, we would be able to proceed more efficiently, and the categories should emerge more naturally.
What is next?
Wikipedia and DBpedia are great at providing information about individuals and families. However, they do not reveal the connections among these individuals, which are the focus of our project. In order to store and represent personal relationships, we need our own database where we can keep tables for friendships and social groups.
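A minimal sketch of the kind of schema we have in mind, using SQLite for simplicity (the table and column names are our own working assumptions, not a final design):

```python
import sqlite3

# In-memory database just for the sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    dbpedia_uri TEXT
);
CREATE TABLE relationship (
    person_a INTEGER REFERENCES person(id),
    person_b INTEGER REFERENCES person(id),
    kind     TEXT  -- e.g. 'family', 'friendship', 'social group'
);
""")

# Illustrative row; real data would come from the extraction step.
conn.execute("INSERT INTO person (name) VALUES ('Marco Polo')")
print(conn.execute("SELECT name FROM person").fetchone()[0])
```

Keeping relationships in a separate table with a `kind` column would let us add new types of connection later without changing the schema.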
We have to think of a way of extracting information from Wikipedia pages and gathering the hyperlinks within them, so that we can capture as many connections as possible. We also need to assess our skills and knowledge of the technologies available for this task, so that we can automate as much of it as we can. After gathering and classifying the information, we will be able to design and build a good, enriched database.
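The hyperlink-harvesting step could be sketched with Python's standard HTML parser; the snippet below stands in for a downloaded Wikipedia page, and the filter on internal `/wiki/` links is our own assumption about what counts as a connection:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects internal Wikipedia links (hrefs starting with /wiki/)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("/wiki/"):  # keep internal links only
                self.links.append(href)

# Illustrative fragment of page HTML; a real run would feed in
# the full page source fetched from Wikipedia.
html = ('<p>He met <a href="/wiki/Marco_Polo">Marco Polo</a> in '
        '<a href="http://example.com">Venice</a>.</p>')

collector = LinkCollector()
collector.feed(html)
print(collector.links)  # only the internal /wiki/ link is kept
```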