The Ancient Venice Routing Machine (S6.1)

Project

The goal of this project is to create a route planner based on the transit data of the Venetian ships leaving Venice between the 13th and 14th centuries. We also aim to represent the queried routes on a map, for visualization purposes. In order to achieve this, we will start from a database containing the planned routes of all ships leaving Venice, as well as a list of mandatory stops. We will then provide a cleaned version of the route database, a script to query the database, a graphical interface for the end user, and a visualization of the planned itineraries on map tiles.

Deliverables

  • Web interface to search for a given itinerary at a given time
  • Map representation of the chosen itinerary
  • Cleaned database in a given format, probably CSV and JSON

Deadlines

  • March 20th 2014: Import and clean the database into a given CSV/JSON format, extrapolate some data based on the existing dataset. Start working on the query interface.
  • April 17th 2014: The web search interface is almost finished; start working on the map representation of the itineraries.
  • May 7th 2014: Project almost finished; work on the presentation.
  • May 21st 2014: Presentation ready, everything working as expected.

Methodology

  • Code versioning with Git; we are considering GitHub for hosting
  • Collaborative writing on Framapad
  • Use the Leaflet integration in WordPress to display the queried itineraries on a map

Technical details

We are given a database containing information about the routes of the Venetian ships leaving Venice between the 13th and 14th centuries. This database was obtained by running OCR software on the source material provided to us.

The first step of our project will be to clean the provided database. The coding is very inconsistent and contains many OCR errors. We will thus start by understanding the coding associated with the data relevant to our project (destinations, planned stops, …), before structuring it in a more detailed and intelligible way. To track down OCR errors, we will rely on the raw data that was provided to us alongside the database. As the dataset is not too big, we will spend some time cleaning it semi-automatically: rather than developing dedicated automated tools, we will make the most of the functions available in spreadsheet software. We plan to keep the post-cleaning database as a .csv file, since its size does not justify more sophisticated management tools.
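The kind of normalization we have in mind can be shown as a short script for concreteness, although spreadsheet find-and-replace would do the same job; the variant table and the two-column layout below are invented for illustration:

```python
import csv
import io

# Hypothetical mapping of frequent OCR misreadings to canonical port names;
# the real table would be built while inspecting the spreadsheet.
CANONICAL = {
    "Venezla": "Venezia",
    "Ragusa.": "Ragusa",
    "C0rfu": "Corfu",
}

def clean_row(row):
    """Trim whitespace and replace known OCR variants in every cell."""
    return [CANONICAL.get(cell.strip(), cell.strip()) for cell in row]

raw = "origin,destination\n Venezla , Ragusa.\n"
reader = csv.reader(io.StringIO(raw))
rows = [clean_row(r) for r in reader]
print(rows)  # [['origin', 'destination'], ['Venezia', 'Ragusa']]
```

The variant table grows as cleaning proceeds, so re-running the script stays cheap compared to repeating manual passes.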

Even after the data has been cleaned and checked, it still needs to be parsed. The details of the routes and the number of ships are stored in a format that is complicated to parse and requires a good understanding of the original coding. Since the data is both large and intricate, we have not yet settled on an approach and will need some time to think it through. We will also need to change the underlying data structure so that it can be processed efficiently. That is why we plan to keep the .csv file for our own understanding and for practical purposes, alongside another representation that is more suitable for machines, closer to a real database.
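To make the idea concrete, here is a sketch of such a parsing step on an entirely hypothetical coding (a dash-separated list of stops, optionally followed by a ship count); the real coding is still under study and will certainly differ:

```python
import re

# Assumed coding for illustration only: "Venezia-Zara-Ragusa x3" means a
# route through those stops, sailed by three ships.
ROUTE_RE = re.compile(r"^(?P<stops>[A-Za-z -]+?)(?:\s+x(?P<ships>\d+))?$")

def parse_route(coded):
    """Turn one coded route string into a structured record."""
    m = ROUTE_RE.match(coded.strip())
    if m is None:
        raise ValueError(f"unparseable route: {coded!r}")
    stops = [s.strip() for s in m.group("stops").split("-")]
    ships = int(m.group("ships") or 1)  # default: a single ship
    return {"stops": stops, "ships": ships}

print(parse_route("Venezia-Zara-Ragusa x3"))
# {'stops': ['Venezia', 'Zara', 'Ragusa'], 'ships': 3}
```

Records of this shape can then be serialized to JSON, the machine-friendly representation mentioned above.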

There is a lot of data missing from the database (missing stops, for instance, or estimated departure dates), and we will have to deal with it in order to construct a robust and complete database. We will interpolate some of the missing data. For instance, we might add a "virtual" stop, consistent with historical data, between two planned stops, especially in the Adriatic Sea region. We can also estimate some departure dates and stopover lengths based on existing data. We will also contemplate whether it is necessary to interpolate data that appears to be missing. It is very likely that not all ships leaving Venice during the considered period are reported in the database. Thus, if a trip that ran twice a year suddenly disappears for one year with no obvious historical explanation, we will consider adding it back to the database.
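A minimal sketch of how such an estimate could work, on invented records (the estimator, a plain mean of observed values, is one simple option among others), tagging each value so that historical and generated data stay distinguishable:

```python
from statistics import mean

# Hypothetical records: (year, stopover length in days at a given port);
# None marks a value missing from the source that we interpolate.
stopovers = [(1301, 4), (1302, None), (1303, 6), (1305, 5)]

known = [d for _, d in stopovers if d is not None]
estimate = round(mean(known))  # simple estimator: mean of observed lengths

# Tag every value so the interface can tell historical from generated data.
filled = [(y, d, "historical") if d is not None else (y, estimate, "generated")
          for y, d in stopovers]
print(filled)
# [(1301, 4, 'historical'), (1302, 5, 'generated'),
#  (1303, 6, 'historical'), (1305, 5, 'historical')]
```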

In doing so, we will lose strict historical accuracy, so we will clearly show the end user whether the data displayed is historical or generated. However, we believe it makes sense, when designing a route planner, to be able to propose itineraries leaving "soon": no one would realistically wait several years for a ship, and generated entries help us address that concern. Finally, we noticed some data that is too incomplete or inconsistent to provide any relevant information in the context of this project; we will simply ignore it.

Once the database is cleaned, the goal will be to implement a way to query it from the Venice Atlas environment. The user will input a city of origin, a city of destination and a desired departure date, and will then be presented with a list of the possible itineraries. The routing is done on the client side, after downloading the cleaned database locally in JSON format. The algorithm we are considering is the famous A* pathfinder, well known for solving this kind of routing problem.
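As a minimal illustration of the routing step (shown in Python rather than the browser-side JavaScript we would actually ship, and with invented legs, durations in days, and rough port positions feeding a straight-line heuristic):

```python
import heapq
from math import dist

# Toy graph of sailing legs (durations in days) and rough (lon, lat)
# positions used for the A* heuristic. All values are invented.
EDGES = {
    "Venezia": {"Zara": 3, "Pola": 2},
    "Pola": {"Zara": 2},
    "Zara": {"Ragusa": 4},
    "Ragusa": {},
}
POS = {"Venezia": (12.3, 45.4), "Pola": (13.8, 44.9),
       "Zara": (15.2, 44.1), "Ragusa": (18.1, 42.6)}

def heuristic(a, b):
    # Straight-line distance in degrees, read as an optimistic day count.
    return dist(POS[a], POS[b])

def a_star(start, goal):
    """Return (path, total days) for the cheapest route, or None."""
    frontier = [(heuristic(start, goal), 0, start, [start])]
    best = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for nxt, cost in EDGES[node].items():
            ng = g + cost
            if ng < best.get(nxt, float("inf")):
                best[nxt] = ng
                heapq.heappush(
                    frontier, (ng + heuristic(nxt, goal), ng, nxt, [*path, nxt]))
    return None

print(a_star("Venezia", "Ragusa"))  # (['Venezia', 'Zara', 'Ragusa'], 7)
```

The real planner would additionally filter legs by the requested departure date, but the search itself would follow the same pattern.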

The last step of our project will involve geographical data visualization. Wouldn’t it be nice if the route you wished to take was highlighted for you on a map, after your search? In order to achieve this, we will be using the Leaflet JavaScript mapping library to display map tiles on the blog page. This framework will also allow us to highlight the selected route on the map. To this end, we will use the provided KML (Keyhole Markup Language) files containing the geographical data of all ports visited by the ships, which may have to be converted to GeoJSON for technical reasons.
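A possible conversion sketch, using Python's standard XML parser on a minimal hypothetical KML snippet (the real files will be richer, and existing converter tools may replace this entirely):

```python
import json
import xml.etree.ElementTree as ET

# Minimal hypothetical KML for one port; coordinates are illustrative.
KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Ragusa</name>
    <Point><coordinates>18.09,42.64,0</coordinates></Point>
  </Placemark>
</kml>"""

NS = {"k": "http://www.opengis.net/kml/2.2"}

def kml_to_geojson(kml_text):
    """Convert KML Placemark points into a GeoJSON FeatureCollection."""
    root = ET.fromstring(kml_text)
    features = []
    for pm in root.iter("{http://www.opengis.net/kml/2.2}Placemark"):
        name = pm.find("k:name", NS).text
        lon, lat, _ = pm.find("k:Point/k:coordinates", NS).text.split(",")
        features.append({
            "type": "Feature",
            "properties": {"name": name},
            "geometry": {"type": "Point",
                         "coordinates": [float(lon), float(lat)]},
        })
    return {"type": "FeatureCollection", "features": features}

print(json.dumps(kml_to_geojson(KML)))
```

The resulting GeoJSON can be handed directly to Leaflet's GeoJSON layer for display on the map.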