The aim of this project is to organize information about movies that are set in Venice and to design a tool that allows users to experience Venice as portrayed in movies. We plan to develop an Android app which, based on current user location in Venice, gives the list of movies and relevant scenes which are set at or near that location. We also plan to export a database and/or map of movie locations in Venice.
In our previous blog post, we described how we planned to extract data from websites, and the schema of the database that we intend to store the data in. We also showed some sample data that we extracted. As per the timeline we had planned, we next worked on data extraction.
In that post we also discussed some of the problems we expected to encounter during extraction. Very few websites had enough information to be worth extracting, and some others gave only a textual description of locations, which requires a dictionary of places to parse. For this reason, we decided to first extract information from an existing database, IMDb. Our initial idea was to reuse, for this IMDb page, a Python script similar to the one we showed for filmaps.com in our previous blog post. However, this proved impractical: the webpages were too complex, and we want to extract more than just movie names and locations. Extracting information such as actor names, genre and director from a page such as the one for Casino Royale cannot be done with a simple HTML parser; it requires a more complex parser that keeps state specific to how IMDb organizes this information on its pages. We were, however, able to import the data in text format from an alternative interface. We parsed the locations.list file and extracted 910 entries for movies/TV series set in different locations within Venice. Of these, only 239 entries had more information about the location than just “Venice”. We show these in the map below. (We have still not solved the clustering issue discussed in the previous blog post, so overlapping locations may not be visible.)
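The filtering step described above can be sketched roughly as follows. This is only an illustration, not our actual script: it assumes each line of locations.list holds a title, one or more tabs, a comma-separated location and an optional note, and the sample lines and function names are made up for the example.

```python
import re

# Made-up sample lines in the ASSUMED layout:
# "<title> (<year>)" + tab(s) + "<location>" + optional tab + "(note)".
SAMPLE_LINES = [
    "Example Film (2006)\t\tPiazza San Marco, Venice, Veneto, Italy",
    "Example Film (2006)\t\tVenice, Veneto, Italy",
    "Another Film (1971)\tGrand Canal, Venice, Veneto, Italy\t(exteriors)",
    "Unrelated Film (1999)\t\tParis, France",
]

def parse_location_line(line):
    """Split one line into (title, location); return None if it does not fit."""
    fields = [f.strip() for f in re.split(r"\t+", line.rstrip("\n")) if f.strip()]
    if len(fields) < 2:
        return None
    return fields[0], fields[1]

def venice_entries(lines):
    """Yield (title, location, is_precise) for every entry located in Venice, Italy."""
    for line in lines:
        parsed = parse_location_line(line)
        if parsed is None:
            continue
        title, location = parsed
        parts = [p.strip() for p in location.split(",")]
        # Requiring "Italy" avoids matching other places that are also called Venice.
        if "Venice" not in parts or "Italy" not in parts:
            continue
        # An entry is "precise" when something more specific precedes "Venice".
        is_precise = parts.index("Venice") > 0
        yield title, location, is_precise

entries = list(venice_entries(SAMPLE_LINES))
precise = [e for e in entries if e[2]]
```

Here `entries` plays the role of the 910 Venice entries and `precise` the role of the 239 that say more than just “Venice”.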
We also extracted the genre of all these movies (or TV series) from genres.list. Similarly, the actors and directors can be extracted from actors.list, actresses.list and directors.list. We have not yet merged these different data into the database, so they are not displayed on the map. You can also see the processed data for genres, locations and movies with precise locations.
We will first complete data collection. With all these locations, we now have a dictionary of sufficient size to parse sources that describe locations in a paragraph of text. For the remaining sources, which are either too complex or have too few entries, we will extract the data manually. Then we will shift our focus to books. World Film Locations: Venice and Venezia, Si gira! are rich in information regarding movie locations. We are planning to buy or borrow the books and obtain information from them manually. After the data collection step is completed, we will proceed to the next step in our planned timeline, which is database building.
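As a rough idea of how the dictionary of places could be used on a paragraph description, here is a minimal sketch. The place names, the sample sentence and the function names are assumptions for illustration; in practice the dictionary would be filled from the locations we extracted.

```python
import re

# Hypothetical dictionary of places; the real one would come from the extracted locations.
KNOWN_PLACES = ["Piazza San Marco", "Rialto Bridge", "Grand Canal", "San Marco"]

def build_place_pattern(places):
    """Compile one regex that matches any known place name, ignoring case."""
    # Longest names first, so "Piazza San Marco" wins over "San Marco".
    ordered = sorted(places, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)

def find_places(text, pattern, places):
    """Return the known places mentioned in text, in order of first appearance."""
    canonical = {p.lower(): p for p in places}
    found = []
    for match in pattern.finditer(text):
        name = canonical[match.group(1).lower()]
        if name not in found:
            found.append(name)
    return found

pattern = build_place_pattern(KNOWN_PLACES)
text = "The chase starts at the Rialto bridge and ends in Piazza San Marco."
places = find_places(text, pattern, KNOWN_PLACES)
```

Exact string matching like this will miss spelling variants and Italian/English name differences, so it is a starting point rather than a complete solution.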
The IMDb web interface was too complex to parse, so we used the text interface instead. However, the information in the text interface is completely uncoupled, and we have to link it together in the database. To link it, we need to use the movie names. But as movie names are not unique, we need to use both movie name and year to identify a specific movie.
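One possible way to build such a key is sketched below. It assumes that titles in the text files look like "Title (Year)"; the regular expression, the lower-casing and the function name are our own choices for the example.

```python
import re

# ASSUMED title layout: "<title> (<4-digit year>)", possibly with quotes around the title.
TITLE_YEAR = re.compile(r"^(?P<title>.+?)\s+\((?P<year>\d{4})\)")

def movie_key(raw):
    """Return a (normalized title, year) tuple, or None if no year is found."""
    m = TITLE_YEAR.match(raw.strip())
    if m is None:
        return None
    # Lower-casing and dropping quotes is a normalization choice, so that the
    # same movie gets the same key in every list.
    title = m.group("title").strip('"').strip().lower()
    return title, int(m.group("year"))

key_2006 = movie_key("Casino Royale (2006)")
key_1967 = movie_key("Casino Royale (1967)")
```

The two Casino Royale entries get different keys, so records from locations.list and genres.list can be joined on this tuple without mixing them up.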
Both the locations and genres lists have the movie name as the key, so it was easy to identify information regarding specific movies. However, the actor and director lists are ordered by person name, with movies as the values. Furthermore, each name appears only once, next to the person's first movie; for the remaining movies of the same actor/director, we have just the movie name. This makes it difficult to extract the actor/director for a given movie name, as the parser needs to remember the current actor/director whose movies are being processed.
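A parser that remembers the current person could look roughly like this. The sketch assumes the name and the first title are separated by a tab, continuation lines start with a tab, and a blank line separates people; the sample data and names are invented.

```python
from collections import defaultdict

# Invented sample in the ASSUMED layout: the name appears only on the first line.
SAMPLE = """\
Actor One\tFilm A (2001)
\t\t\tFilm B (2005)

Actor Two\tFilm A (2001)
"""

def invert_credits(lines):
    """Turn a person -> movies listing into a movie -> people dictionary."""
    movie_to_people = defaultdict(list)
    current = None  # the person whose movies are being processed
    for line in lines:
        if not line.strip():
            current = None  # a blank line ends the current person's block
            continue
        name, _, rest = line.partition("\t")
        if name.strip():
            current = name.strip()  # a new person starts on this line
        title = rest.strip()
        if current is None or not title:
            continue
        movie_to_people[title].append(current)
    return dict(movie_to_people)

index = invert_credits(SAMPLE.splitlines())
```

After one pass over the file, `index` can be queried by movie title, which is the reverse lookup we need.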
The actor list contains all the movies that each actor has acted in. Even after we solve the above problem and do a reverse lookup of actors for a particular movie, we will get ALL the actors for the movie, while we are only interested in those playing crucial roles (2 or 3).
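If each credit carried a billing position, keeping only the top-billed actors would be one way to approximate "crucial roles". The sketch below assumes a trailing marker such as "<1>" at the end of each credit; we have not confirmed that every entry has one, and the sample credits are invented.

```python
import re

# ASSUMPTION: a credit may end with a billing marker such as "<1>".
BILLING = re.compile(r"<(\d+)>\s*$")

def billing_position(credit):
    """Return the billing position as an int, or None if there is no marker."""
    m = BILLING.search(credit)
    return int(m.group(1)) if m else None

def lead_cast(credits, top_n=3):
    """credits: list of (person, raw credit text). Return the top-billed people."""
    ranked = []
    for person, credit in credits:
        pos = billing_position(credit)
        if pos is not None and pos <= top_n:
            ranked.append((pos, person))
    return [person for _, person in sorted(ranked)]

credits = [
    ("Actor Three", "Film A (2001)  [Waiter]  <14>"),
    ("Actor Two", "Film A (2001)  [Villain]  <2>"),
    ("Actor One", "Film A (2001)  [Hero]  <1>"),
    ("Actor Four", "Film A (2001)  [Extra]"),
]
leads = lead_cast(credits)
```

Credits without a marker are dropped here, so movies whose entries lack billing information would still need another rule or manual selection.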
When we use multiple sources, merging them into a single database could be difficult, as information, especially that relating to locations, could be represented in different ways, and linking the records could be an issue.
There is also a problem similar to the first one: we need to resolve ambiguities with respect to movies, both within a source and across sources.