As mentioned in the previous progress report, we are now dealing with the visualization of the entity network. For this part it was necessary to define how to compute two important measures: the importance of an entity, in terms of frequency, and the importance of the relationship between two entities, in terms of co-occurrence.
To compute the importance of an entity, we decided to count the number of articles in which the entity appears rather than its absolute frequency. Within a given article, what matters is not how many times an entity is cited but whether it is present at all; besides, the number of citations depends largely on the writer's style.
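As a sketch of this per-article counting (plain Scala over a toy list of (article, entity) mentions; the entity names are made up and the real pipeline runs on Spark SQL):

```scala
// Toy (article, entity) mentions; an entity may be cited several times in one article.
val mentions = List(
  ("art1", "Napoleon"), ("art1", "Napoleon"), ("art1", "Paris"),
  ("art2", "Napoleon"), ("art3", "Paris")
)

// Importance of an entity = number of distinct articles mentioning it,
// not the total number of mentions.
val importance: Map[String, Int] =
  mentions.distinct            // drop repeated mentions within the same article
          .groupBy(_._2)       // group by entity
          .map { case (entity, rows) => entity -> rows.size }

// importance("Napoleon") == 2 even though it is mentioned three times
```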
To compute the similarity of two entities based on their co-occurrence, many methods exist. As said before, the frequency of an entity is counted per article, so the number of co-occurrences is the number of articles in which the two entities appear together.
The first similarity measure we considered was the Sørensen–Dice coefficient, computed as (2 * |A and B|) / (|A| + |B|), where |A and B| is the number of distinct articles in which entities A and B both appear and |A| and |B| are the numbers of distinct articles in which A and B appear independently.
When an entity A is very frequent and another entity B is not, and B appears only together with A, the Sørensen–Dice coefficient will still give a low score because of A's high frequency. To capture this situation we decided to use the Overlap coefficient, computed as |A and B| / min(|A|, |B|). This way, if B is a subset of A or conversely, the coefficient is equal to one.
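The contrast between the two coefficients can be sketched directly from their definitions (article sets here are hypothetical):

```scala
// Sets of distinct articles in which each entity appears (illustrative data).
val a: Set[String] = Set("art1", "art2", "art3", "art4", "art5") // frequent entity A
val b: Set[String] = Set("art1")                                 // rare entity B, always with A

// Sørensen–Dice: 2 * |A and B| / (|A| + |B|)
def dice(x: Set[String], y: Set[String]): Double =
  2.0 * x.intersect(y).size / (x.size + y.size)

// Overlap: |A and B| / min(|A|, |B|)
def overlap(x: Set[String], y: Set[String]): Double =
  x.intersect(y).size.toDouble / math.min(x.size, y.size)

// Dice is dragged down by A's frequency: 2 * 1 / (5 + 1) ≈ 0.33
// Overlap captures that B occurs only with A: 1 / min(5, 1) = 1.0
```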
Once these measures have been defined, we can now compute the entity network, represented by a nodes table (entity, importance) and an edges table (entity A, entity B, coefficient), starting from the table of all entities described in the previous post. All the computation is done with Spark SQL by manipulating DataFrames.
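The actual pipeline does this with Spark SQL over DataFrames; the same join-and-aggregate logic can be sketched in plain Scala collections (data and names are illustrative):

```scala
// Distinct (article, entity) pairs, as produced by the entities table.
val mentions = List(
  ("art1", "Napoleon"), ("art1", "Paris"),
  ("art2", "Napoleon"), ("art2", "Paris"),
  ("art3", "Paris")
).distinct

// Nodes table: (entity, importance), importance = distinct article count.
val nodes: Map[String, Int] =
  mentions.groupBy(_._2).map { case (e, rows) => e -> rows.map(_._1).distinct.size }

// Articles per entity, used for the co-occurrence counts.
val articlesOf: Map[String, Set[String]] =
  mentions.groupBy(_._2).map { case (e, rows) => e -> rows.map(_._1).toSet }

// Edges table: (entity A, entity B, Overlap coefficient), one row per unordered pair.
val entities = nodes.keys.toList.sorted
val edges = for {
  i <- entities.indices
  j <- (i + 1) until entities.size
  inter = articlesOf(entities(i)).intersect(articlesOf(entities(j))).size
  if inter > 0
} yield (entities(i), entities(j),
         inter.toDouble / math.min(articlesOf(entities(i)).size,
                                   articlesOf(entities(j)).size))
```

In Spark this would correspond to a self-join of the entities table on the article column followed by a grouped count; the sketch keeps only the shape of the two output tables.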
We cannot show a reasonable preview of our experiments yet, but here are some examples of what the result will look like:
Planned features include specifying a time window (in years) to restrict the network visualization to a certain period, and the possibility to show only person entities or only location entities.
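A minimal sketch of that filtering, assuming each node record carries a year and an entity type (the `Node` case class and its field names are made up for illustration):

```scala
// Hypothetical node record; field names are illustrative only.
case class Node(name: String, entityType: String, year: Int, importance: Int)

val nodes = List(
  Node("Napoleon", "person", 1805, 12),
  Node("Paris", "location", 1805, 30),
  Node("Einstein", "person", 1915, 20)
)

// Keep nodes inside the [from, to] year window and, optionally,
// of a single entity type ("person" or "location").
def filterNodes(ns: List[Node], from: Int, to: Int,
                entityType: Option[String] = None): List[Node] =
  ns.filter(n => n.year >= from && n.year <= to &&
                 entityType.forall(_ == n.entityType))
```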
So far the project milestones have been respected; we are even slightly ahead of schedule. The next step is to finish the D3.js scripts so that the network is displayed in a meaningful way. Then we will build a user interface that gives access to the data through a few parameters; for this we are considering Scala Play, a web application framework, connected to our Spark application, which will handle the computations.