Venice seen from 200 years of Swiss newspapers – Progress Report 3

As mentioned in the previous progress report, we are now working on the visualization of the entity-network. For this part it was necessary to define how to compute two important measures: the importance of an entity, in terms of frequency, and the importance of the relationship between two entities, in terms of co-occurrence.

Entity Network

To compute the importance of an entity, we decided to count the number of articles in which the entity is present rather than its absolute frequency. We think that, in a given article, what matters is whether an entity is present or not, not whether it is cited more than once. Moreover, the number of citations depends heavily on the writing style of the author.
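The per-article count described above can be sketched in a few lines of plain Python. This is only an illustration: the input layout (one `(article_id, entity)` pair per mention), the sample data and the function name `entity_importance` are assumptions, not the project's actual Spark code.

```python
from collections import defaultdict

# Hypothetical input: one (article_id, entity) pair per mention.
# The same pair may appear several times when an entity is cited repeatedly.
mentions = [
    ("a1", "Venice"), ("a1", "Venice"), ("a1", "Napoleon"),
    ("a2", "Venice"),
    ("a3", "Napoleon"), ("a3", "Napoleon"), ("a3", "Napoleon"),
    ("a4", "Venice"),
]

def entity_importance(mentions):
    """Return {entity: number of distinct articles that mention it}."""
    articles_by_entity = defaultdict(set)
    for article_id, entity in mentions:
        # A set keeps each article once, however often the entity is cited in it.
        articles_by_entity[entity].add(article_id)
    return {entity: len(articles) for entity, articles in articles_by_entity.items()}

importance = entity_importance(mentions)
print(importance)
```

In this toy data both entities are cited four times in total, yet "Venice" appears in three distinct articles and "Napoleon" in two, so only the per-article count tells them apart.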

Many methods exist to compute the similarity of two entities based on their co-occurrence. As said before, the frequency of an entity is counted per article, so the number of co-occurrences of two entities is the number of articles in which they appear together.
The first similarity measure we considered was the Sørensen–Dice coefficient, computed as (2*|A and B|) / (|A| + |B|), where |A and B| is the number of distinct articles in which entities A and B both appear, and |A| and |B| are the numbers of distinct articles in which A and B, respectively, appear.
When an entity A is very frequent and another entity B is not, the Sørensen–Dice coefficient gives a low score even if B appears only in articles that also mention A, because the large |A| dominates the denominator. To capture this kind of relationship we decided to use the Overlap coefficient, computed as |A and B| / min(|A|, |B|). This way, if the articles of B are a subset of the articles of A, or conversely, the coefficient is equal to one.
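To make the difference between the two coefficients concrete, here is a minimal sketch that treats each entity as the set of article ids in which it appears. The function names and the sample sets are illustrative assumptions; returning 0.0 for empty sets is a convention chosen here, not something defined in the report.

```python
def dice(articles_a, articles_b):
    """Sørensen–Dice coefficient: 2*|A and B| / (|A| + |B|)."""
    total = len(articles_a) + len(articles_b)
    if total == 0:
        return 0.0  # convention for two empty sets (assumption)
    return 2 * len(articles_a & articles_b) / total

def overlap(articles_a, articles_b):
    """Overlap coefficient: |A and B| / min(|A|, |B|)."""
    smallest = min(len(articles_a), len(articles_b))
    if smallest == 0:
        return 0.0  # convention when one set is empty (assumption)
    return len(articles_a & articles_b) / smallest

# Frequent entity A appears in 100 articles; rare entity B appears in 3,
# all of which also mention A.
articles_a = set(range(100))
articles_b = {3, 17, 42}

dice_score = dice(articles_a, articles_b)        # 6 / 103, close to 0.06
overlap_score = overlap(articles_a, articles_b)  # 3 / 3 = 1.0
print(dice_score, overlap_score)
```

The Dice score stays close to zero because of the 100 articles of A, while the Overlap coefficient reaches one since every article of B also contains A.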

Once these measures have been defined, we can compute the entity-network, which is represented by a nodes table (entity, importance) and an edges table (entity A, entity B, coefficient), starting from the table of all entities described in the previous post. All the computation is done with Spark SQL by manipulating DataFrames.
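The shape of the two tables can be illustrated with a small plain-Python stand-in. The real pipeline uses Spark SQL DataFrames; the sketch below does not reproduce that code, and the input rows, the function name `build_network` and the tuple layouts are assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical entity table: one (article_id, entity) row per mention.
rows = [
    ("a1", "Venice"), ("a1", "Napoleon"), ("a1", "Venice"),
    ("a2", "Venice"), ("a2", "Trieste"),
    ("a3", "Venice"), ("a3", "Napoleon"),
    ("a4", "Napoleon"), ("a4", "Trieste"),
]

def build_network(rows):
    """Return (nodes, edges) as lists of tuples.

    nodes: (entity, importance), importance = number of distinct articles
    edges: (entity_a, entity_b, overlap coefficient)
    """
    entities_by_article = defaultdict(set)
    for article_id, entity in rows:
        entities_by_article[article_id].add(entity)

    article_count = Counter()  # |A| for each entity
    pair_count = Counter()     # |A and B| for each unordered pair
    for entities in entities_by_article.values():
        article_count.update(entities)
        # Sorting gives each unordered pair a single canonical key.
        pair_count.update(combinations(sorted(entities), 2))

    nodes = sorted(article_count.items())
    edges = [
        (a, b, shared / min(article_count[a], article_count[b]))
        for (a, b), shared in sorted(pair_count.items())
    ]
    return nodes, edges

nodes, edges = build_network(rows)
for node in nodes:
    print(node)
for edge in edges:
    print(edge)
```

Enumerating pairs per article is quadratic in the number of entities of an article, which is acceptable for a sketch but is one reason a distributed engine is useful on the full corpus.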

Visualization

The visualization of the entity-network is a delicate process in which many decisions need to be made. To display the network we decided to use D3.js, a popular JavaScript library for producing dynamic, interactive data visualizations in web browsers. D3 is very powerful and can handle a lot of data, but as we are not used to JavaScript we are still at an experimentation stage. We tried several existing examples and attempted to adapt them to our problem.

We noticed that it is difficult to capture and display in a meaningful way both the importance of the entities and the importance of their relationships, so we think it could be useful to offer more than one graph visualization style to make the information easier to understand. There are other aspects to consider as well, such as a maximal threshold on the number of displayed nodes, and possibly a minimal threshold on the importance of displayed edges to avoid showing very weak relationships. We also need to normalize the sizes of the nodes and the edges.
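The thresholds and the size normalization mentioned above could be prepared before the data reaches the browser. The following is a sketch under stated assumptions: the function name, the default thresholds, the radius range and the choice of a linear min-max scaling are all hypothetical, since none of these have been fixed yet.

```python
def prune_and_scale(nodes, edges, max_nodes=50, min_weight=0.2,
                    min_radius=4.0, max_radius=30.0):
    """Keep the most important nodes and the strong edges between them,
    then map node importance linearly onto a radius range.

    nodes: list of (entity, importance); edges: list of (a, b, weight).
    All default values are placeholders, not tuned settings.
    """
    kept = sorted(nodes, key=lambda node: node[1], reverse=True)[:max_nodes]
    if not kept:
        return [], []
    kept_names = {name for name, _ in kept}

    # An edge survives only if it is strong enough and both ends are displayed.
    kept_edges = [
        (a, b, w) for a, b, w in edges
        if w >= min_weight and a in kept_names and b in kept_names
    ]

    lowest = min(importance for _, importance in kept)
    highest = max(importance for _, importance in kept)
    span = highest - lowest

    def radius(importance):
        if span == 0:
            # All nodes equally important: use the middle of the range.
            return (min_radius + max_radius) / 2
        return min_radius + (importance - lowest) / span * (max_radius - min_radius)

    sized_nodes = [(name, radius(importance)) for name, importance in kept]
    return sized_nodes, kept_edges

sample_nodes = [("Venice", 120), ("Napoleon", 60), ("Trieste", 20), ("Zurich", 5)]
sample_edges = [("Napoleon", "Venice", 0.8), ("Trieste", "Venice", 0.1),
                ("Venice", "Zurich", 0.9)]
sized_nodes, kept_edges = prune_and_scale(sample_nodes, sample_edges, max_nodes=3)
print(sized_nodes)
print(kept_edges)
```

With the Overlap coefficient the edge weights already lie between 0 and 1, so they can be mapped to a stroke width directly; node importance is unbounded and therefore needs the explicit rescaling shown here.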

We are not yet able to show a reasonable preview of our own experiments, so we give here some examples of what the result will look like:

Some of the features that should be provided include the specification of a time window, to restrict the network visualization to a certain period (in years), and the possibility to show only person entities or only location entities.
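These two filters amount to a selection on the entity table before the network is computed. A minimal sketch follows, assuming each mention carries the publication year of its article and an entity type; the record layout, the type labels "PER" and "LOC" and the function name are hypothetical.

```python
# Hypothetical record layout: (article_id, year, entity, entity_type).
# The type labels "PER" and "LOC" are placeholders.
typed_mentions = [
    ("a1", 1797, "Napoleon", "PER"),
    ("a1", 1797, "Venice", "LOC"),
    ("a2", 1866, "Venice", "LOC"),
    ("a3", 1902, "Trieste", "LOC"),
]

def select_mentions(mentions, first_year, last_year, entity_type=None):
    """Keep mentions inside [first_year, last_year], optionally of one type."""
    return [
        mention for mention in mentions
        if first_year <= mention[1] <= last_year
        and (entity_type is None or mention[3] == entity_type)
    ]

# Only location entities between 1790 and 1870 (both bounds included).
selected = select_mentions(typed_mentions, 1790, 1870, entity_type="LOC")
print(selected)
```

Filtering first and computing importance and coefficients afterwards means that the network reflects only the chosen period, rather than showing global scores on a subset of nodes.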

Next Steps

So far the project milestones have been met, and we are even slightly ahead of schedule. The next step is to finish the D3.js scripts so that the network is displayed correctly and in a meaningful way. Then we will need to build a user interface that gives access to the data through a few parameters; for this we are considering Scala Play, a web application framework, which we would connect to our Spark application used for the computations.