VenitianBot: Progress post 1

This blog post summarizes the work we have done on the VenitianBot in the first two weeks. As planned, we have implemented the underlying skeleton of the bot: a simple Java program that can post in the name of @VenitianBot and retrieve tweets from Twitter. This part took less time than we had initially predicted, so we have already started working on the second phase of the project, tweet recognition.

Progress
Progress bar
The Venitian bot can now perform basic Twitter actions such as tweeting and searching for tweets.

The search is currently implemented with ordinary queries to Twitter's REST API. This has some downsides: with Twitter4j we cannot search for tweets that are both located in Venice and contain given keywords, and the number of requests we can perform is limited.

In the next few days we will therefore switch to the Twitter Streaming API, which lets us specify multiple locations and search by keywords. This is not a request-response type of interaction: the Venitian bot opens a stream to Twitter’s servers, which then send, one by one, all tweets matching the given criteria (keywords, locations, …). One downside is that the Streaming API does not seem to let us send a list of words that must not appear in a tweet, which is possible with the REST API. It seems easier, however, to receive all these tweets and filter them locally against the blacklist of keywords.
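
The local blacklist filtering mentioned above could look roughly like the following. This is only a minimal sketch: the class name BlacklistFilter, its method accepts, and the example blacklist terms are our own assumptions and are not part of Twitter4j or of the bot's current code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of Twitter4j): rejects tweets that contain
// at least one blacklisted term.
class BlacklistFilter {
    private final List<String> blacklist = new ArrayList<>();

    BlacklistFilter(List<String> terms) {
        // Lower-case the terms once so that matching is case-insensitive.
        for (String term : terms) {
            blacklist.add(term.toLowerCase(Locale.ROOT));
        }
    }

    // Returns true when the tweet text contains none of the blacklisted terms.
    // This is plain substring matching, so "casino" would also match "casinos".
    boolean accepts(String tweetText) {
        String text = tweetText.toLowerCase(Locale.ROOT);
        for (String term : blacklist) {
            if (text.contains(term)) {
                return false;
            }
        }
        return true;
    }
}
```

Each tweet delivered by the stream would be passed through accepts before any further processing; a blacklist such as "las vegas" or "venice beach" (illustrative terms only) would then discard tweets about other places that share the name.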

We have already implemented the first part of the tweet recognition mechanism, which uses geo-location. We defined areas where important monuments are located (e.g. the Rialto Bridge, St Mark’s Square), and for each tweet we receive from Twitter we check whether it lies inside one of those areas. If not, we also check whether it was sent from within Venice (again using a defined area). Each area is defined by a center location and a radius.
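
As an illustration of the area check, here is a minimal sketch. The names AreaCheck, Area, Distance and countAreas are hypothetical, and the distance computation is deliberately left pluggable, since any function returning the distance in metres between two latitude/longitude points can be used.

```java
import java.util.List;

// Hypothetical sketch of the area membership test; all names are illustrative.
class AreaCheck {

    // Pluggable distance function: metres between two points given in degrees.
    interface Distance {
        double metres(double lat1, double lon1, double lat2, double lon2);
    }

    // An area is a center point plus a radius, as described in the post.
    static final class Area {
        final String name;
        final double lat;
        final double lon;
        final double radiusMetres;

        Area(String name, double lat, double lon, double radiusMetres) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
            this.radiusMetres = radiusMetres;
        }

        // A tweet is inside the area when it is no further from the center than the radius.
        boolean contains(double tweetLat, double tweetLon, Distance distance) {
            return distance.metres(lat, lon, tweetLat, tweetLon) <= radiusMetres;
        }
    }

    // Counts how many of the given areas contain the tweet's location.
    static int countAreas(List<Area> areas, double tweetLat, double tweetLon, Distance distance) {
        int count = 0;
        for (Area area : areas) {
            if (area.contains(tweetLat, tweetLon, distance)) {
                count++;
            }
        }
        return count;
    }
}
```

The same Area type can represent both a monument and the whole of Venice; only the radius differs.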

 

The area of Venice

We assign a rank to every tweet in order to classify it. The rank is computed with the classification function (see below); it lets us sort the tweets, and we use a threshold t to decide whether a tweet is relevant. As the next step of the implementation, we will run experiments to choose the values of the parameters used in the classification function. Intuitively, we will try parameters such that α > β > γ > δ and α < β + γ + δ: we think that knowing the exact location of a tweet will allow us to post a better response, so α should be the largest parameter.

Our current classification function is the following: α·A + β·B + γ·C + δ·D > t, where A is the number of areas the tweet belongs to, B ∈ {0,1} indicates whether the tweet is in the Venice area, C is the number of relevant keywords in the body of the tweet, and D ∈ {0,1} indicates whether the body contains the word “Venice”.
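
To make the classification function concrete, here is a minimal sketch. The weights and the threshold are placeholders of our own choosing: the real values are still to be determined by experiment, and these were picked only because they satisfy α > β > γ > δ and α < β + γ + δ. The class and method names are hypothetical as well.

```java
// Hypothetical sketch of the classification function alpha*A + beta*B + gamma*C + delta*D > t.
class TweetRanker {
    // Placeholder weights, NOT tuned values: 4 > 3 > 2 > 1 and 4 < 3 + 2 + 1.
    static final double ALPHA = 4.0;
    static final double BETA = 3.0;
    static final double GAMMA = 2.0;
    static final double DELTA = 1.0;

    // Placeholder threshold t, also an assumption.
    static final double THRESHOLD = 3.5;

    // areaCount      = A, number of monument areas containing the tweet
    // inVenice       = B, whether the tweet was sent from within the Venice area
    // keywordCount   = C, number of relevant keywords in the tweet body
    // mentionsVenice = D, whether the body contains the word "Venice"
    static double rank(int areaCount, boolean inVenice, int keywordCount, boolean mentionsVenice) {
        return ALPHA * areaCount
                + BETA * (inVenice ? 1 : 0)
                + GAMMA * keywordCount
                + DELTA * (mentionsVenice ? 1 : 0);
    }

    // A tweet is considered relevant when its rank is strictly above the threshold.
    static boolean isRelevant(int areaCount, boolean inVenice, int keywordCount, boolean mentionsVenice) {
        return rank(areaCount, inVenice, keywordCount, mentionsVenice) > THRESHOLD;
    }
}
```

With these placeholder values, a tweet sent from inside one monument area in Venice is relevant on location alone, while a tweet with one keyword and the word “Venice” but no location falls below the threshold.
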
The locations received from Twitter are given as latitude and longitude. We store the different areas the same way, so we have to compute distances between pairs of coordinates. We tried two approaches. The first uses the Haversine formula, which computes the distance between two points along the surface of a sphere. Even though the earth is not perfectly spherical, the result is rather precise even for distant locations. The downside is that it is somewhat expensive to compute, so depending on the number of tweets we have to process, this method may be too slow. The second approach uses an approximation, the Equirectangular projection. It is a simple projection that maps meridians and parallels to straight lines with constant spacing. We then use Pythagoras’s theorem to compute the distance between the two projected points (that is, the straight-line distance between them).
We ran performance and precision tests for both solutions. The Haversine-based solution turned out to be around six times slower than the Equirectangular one, while the difference between the computed distances is far less than a meter when considering only locations close to Venice. Given these results, we chose the second approach.
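
For reference, the two distance computations can be sketched as follows. This is not the bot's actual code: the class and method names are hypothetical, and we assume a spherical earth with a mean radius of 6,371 km.

```java
// Hypothetical sketch of the two distance computations compared above.
class GeoDistance {
    // Assumed mean earth radius in metres (spherical model).
    static final double EARTH_RADIUS_M = 6_371_000.0;

    // Haversine formula: great-circle distance on a sphere. Inputs are in degrees.
    static double haversine(double lat1, double lon1, double lat2, double lon2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double sinHalfDPhi = Math.sin((phi2 - phi1) / 2);
        double sinHalfDLambda = Math.sin(Math.toRadians(lon2 - lon1) / 2);
        double a = sinHalfDPhi * sinHalfDPhi
                + Math.cos(phi1) * Math.cos(phi2) * sinHalfDLambda * sinHalfDLambda;
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    // Equirectangular approximation: project both points onto a plane, scaling
    // the longitude difference by the cosine of the mean latitude, then apply
    // Pythagoras's theorem. Only one trigonometric call is needed.
    static double equirectangular(double lat1, double lon1, double lat2, double lon2) {
        double x = Math.toRadians(lon2 - lon1) * Math.cos(Math.toRadians((lat1 + lat2) / 2));
        double y = Math.toRadians(lat2 - lat1);
        return EARTH_RADIUS_M * Math.sqrt(x * x + y * y);
    }
}
```

The approximation needs a single cosine and a square root per distance, whereas the Haversine version needs several trigonometric calls, which is consistent with the speed difference we measured. Over distances of a few hundred meters, as between two monuments in Venice, the two functions should agree very closely.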

In the coming weeks, we will first switch to the Streaming API and then continue with the tweet recognition task.

Resources:

@VenitianBot, www.twitter.com/venitianbot

Twitter Streaming API, www.dev.twitter.com/streaming/overview

Haversine formula, www.en.wikipedia.org/wiki/Haversine_formula

Equirectangular projection, www.en.wikipedia.org/wiki/Equirectangular_projection