The first goal of this project is to build a network of the entities related to Venice, in order to represent how this city is seen from a Swiss point of view. The entities are extracted from a dataset covering 200 years of Swiss newspaper articles from the Gazette De Lausanne (GDL) and the Journal De Genève (JDG), provided by Le Temps.
The second goal is to visualize this network by providing an interactive interface to the user in order to facilitate the reading and the interpretation of the data.
The initial dataset is a database of named entity mentions of type Person and Location. The data is structured as a Resource Description Framework (RDF) model and it can be queried by specifying subject-predicate-object triple expressions. For each single mention we extracted the following information:
- Name: the text string that refers to the named entity
- Type: the named entity mention type (Location or Person)
- Article: The article id in which the mention appears
- Journal: The newspaper in which the article appears (GDL or JDG)
- Date: The date of the article
Filtering Venice related articles
To capture the relationship with Venice we extracted all the named entity mentions from all the articles containing at least one “Venise” location mention. The resulting subset contains:
- 32’218 distinct articles
- 1’206’909 Person and Location entity mentions
- 106’000 distinct Person and Location entity mentions
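The filtering step can be sketched in plain Python as follows. This is only an illustration: the Mention record mirrors the five fields listed earlier, while the function name venice_mentions, the exact-match test on the string "Venise" and the sample rows are assumptions made for the example, not the project's actual code or data.

```python
import datetime
from typing import NamedTuple


class Mention(NamedTuple):
    name: str             # text string of the mention
    type: str             # "Location" or "Person"
    article: str          # article id
    journal: str          # "GDL" or "JDG"
    date: datetime.date   # date of the article


def venice_mentions(mentions, keyword="Venise"):
    """Keep every mention from articles that contain at least one
    Location mention whose name equals `keyword` (assumed exact match)."""
    venice_articles = {
        m.article
        for m in mentions
        if m.type == "Location" and m.name == keyword
    }
    return [m for m in mentions if m.article in venice_articles]


# Made-up rows, for illustration only.
day = datetime.date(1848, 3, 22)
sample = [
    Mention("Venise", "Location", "a1", "GDL", day),
    Mention("Manin", "Person", "a1", "GDL", day),
    Mention("Genève", "Location", "a2", "JDG", day),
]
subset = venice_mentions(sample)
```

Here both mentions of article a1 are kept, while the mention from article a2 is dropped because that article never mentions "Venise" as a location.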
Distributions for Location and Person mentions are shown in the following figures.
In order to construct the network of the entities we need to define how to measure the importance of an entity and the relationship between two entities.
To compute the importance of an entity, we decided to take into account the number of distinct articles in which this entity is present and not its absolute frequency. Therefore an entity that appears once in each of two different articles is twice as important as one that appears several times in a single article.
We think that, in a given article, what matters is whether an entity is present or not, rather than how many times it is cited. Moreover, the number of citations depends heavily on the writing style of the article's author.
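As a minimal sketch of this importance measure, assuming that mentions are available as (entity, article id) pairs (a simplification chosen for the example), the count of distinct articles can be obtained by collecting article ids into a set per entity:

```python
from collections import defaultdict


def importance(mentions):
    """Map each entity to the number of distinct articles it appears in.

    `mentions` is an iterable of (entity, article_id) pairs; repeated
    mentions of an entity inside the same article count only once.
    """
    articles_by_entity = defaultdict(set)
    for entity, article in mentions:
        articles_by_entity[entity].add(article)
    return {entity: len(ids) for entity, ids in articles_by_entity.items()}


# Toy data: A appears once in two articles, B three times in one article.
toy = [("A", "a1"), ("A", "a2"), ("B", "a1"), ("B", "a1"), ("B", "a1")]
scores = importance(toy)
```

With this toy input, A gets an importance of 2 and B an importance of 1, even though B is mentioned more often in absolute terms.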
The relationship between two entities is computed using the overlap coefficient. If an entity X is very frequent, another entity Y is not, and Y only ever appears together with X, then we would like to give this relationship a strong weight. The overlap coefficient captures this aspect: it is defined as the number of co-occurrences of two entities divided by the smaller of their two frequencies, and it is computed as follows:
overlap(X,Y) = |X ∩ Y| / min(|X|,|Y|)
Here |X ∩ Y| is the number of distinct articles in which X and Y both appear, and |X| and |Y| are their frequencies, computed as described before.
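The formula can be illustrated with a small Python function, assuming each entity is represented by the set of ids of the articles in which it appears. Returning 0.0 when one of the sets is empty is a convention chosen for this sketch, not something stated in the project.

```python
def overlap(articles_x, articles_y):
    """Overlap coefficient |X ∩ Y| / min(|X|, |Y|) of two sets of article ids."""
    smaller = min(len(articles_x), len(articles_y))
    if smaller == 0:
        return 0.0  # assumed convention for an entity with no articles
    return len(articles_x & articles_y) / smaller


x = {"a1", "a2", "a3", "a4"}   # frequent entity
y = {"a2", "a3"}               # rare entity that always appears with x
z = {"a4", "a5"}               # rare entity that appears with x half the time

strong = overlap(x, y)
weaker = overlap(x, z)
```

In this toy case the pair (x, y) reaches the maximum value of 1.0 because y never appears without x, while the pair (x, z) gets 0.5.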
Once these measures have been defined, we can compute the entity network, which is represented by a nodes table [entity, importance] and an edges table [entity X, entity Y, overlap coefficient]. All the computation is done with Spark SQL by manipulating DataFrames.
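The shape of the two tables can be sketched in plain Python. This is a stand-in written for illustration and not the Spark SQL code used in the project; the function name build_network and the (entity, article id) input format are assumptions. It enumerates every pair of entities, which is only practical for small inputs.

```python
from collections import defaultdict
from itertools import combinations


def build_network(mentions):
    """Build (nodes, edges) from an iterable of (entity, article_id) pairs.

    nodes: list of (entity, importance)
    edges: list of (entity_x, entity_y, overlap_coefficient),
           only for pairs sharing at least one article
    """
    articles = defaultdict(set)
    for entity, article in mentions:
        articles[entity].add(article)

    nodes = [(entity, len(ids)) for entity, ids in sorted(articles.items())]

    edges = []
    for x, y in combinations(sorted(articles), 2):
        shared = len(articles[x] & articles[y])
        if shared:
            smaller = min(len(articles[x]), len(articles[y]))
            edges.append((x, y, shared / smaller))
    return nodes, edges


toy = [
    ("A", "a1"), ("A", "a2"), ("A", "a3"),
    ("B", "a1"), ("B", "a2"),
    ("C", "a3"), ("C", "a4"),
]
nodes, edges = build_network(toy)
```

For the toy input, B and C never share an article, so no edge is produced for that pair.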
Since the number of nodes and links is very large, it is not possible to display them all in a single graph. For this reason we decided to limit the number of nodes to be visualized. In general the upper limit seems to be around 300-400 nodes, so only the 300 most frequent nodes are displayed.
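A possible way to apply such a limit, assuming nodes are (entity, importance) pairs and edges are (entity X, entity Y, overlap coefficient) triples, is sketched below. The helper names top_nodes and restrict_edges are hypothetical.

```python
import heapq


def top_nodes(nodes, limit=300):
    """Return the `limit` nodes with the highest importance, largest first."""
    return heapq.nlargest(limit, nodes, key=lambda node: node[1])


def restrict_edges(edges, kept_nodes):
    """Keep only the edges whose two endpoints are both among `kept_nodes`."""
    names = {name for name, _ in kept_nodes}
    return [edge for edge in edges if edge[0] in names and edge[1] in names]


nodes = [("A", 3), ("B", 2), ("C", 5)]
edges = [("A", "B", 1.0), ("A", "C", 0.5)]

kept = top_nodes(nodes, limit=2)
kept_edges = restrict_edges(edges, kept)
```

Edges touching a discarded node are dropped as well, since they could not be drawn without that node.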
This limitation is a severe constraint that leads to a significant loss of information over 200 years. To mitigate this problem we decided to allow filtering by a specific time window. However, the number of links can still be large and difficult to visualize, so we added to the application the ability to filter links by specifying a range for the overlap coefficient.
In addition, the user can filter by newspaper (JDG, GDL or both) and by mention type (Person, Location or both). It is also possible to highlight the directly connected neighbours of a node interactively by clicking on it, and to display information about nodes and links, which is useful for analyzing a particular relationship in detail.
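The filters described above can be sketched as three small functions. The record layout (dictionaries with the keys name, type, article, journal and date), the function names and the convention that None means "no restriction" are all assumptions made for this example; the actual interface may be implemented differently.

```python
import datetime


def filter_rows(rows, start=None, end=None, journal=None, mention_type=None):
    """Keep mention rows inside the time window and matching the optional
    newspaper and mention type. None disables the corresponding filter."""
    return [
        row for row in rows
        if (start is None or row["date"] >= start)
        and (end is None or row["date"] <= end)
        and (journal is None or row["journal"] == journal)
        and (mention_type is None or row["type"] == mention_type)
    ]


def filter_edges(edges, low=0.0, high=1.0):
    """Keep edges whose overlap coefficient lies in [low, high]."""
    return [edge for edge in edges if low <= edge[2] <= high]


def neighbours(edges, node):
    """Return the nodes directly connected to `node`."""
    return {y if x == node else x for x, y, _ in edges if node in (x, y)}


# Made-up rows and edges, for illustration only.
rows = [
    {"name": "A", "type": "Person", "article": "a1", "journal": "GDL",
     "date": datetime.date(1920, 5, 1)},
    {"name": "B", "type": "Location", "article": "a2", "journal": "JDG",
     "date": datetime.date(1960, 1, 1)},
    {"name": "C", "type": "Person", "article": "a3", "journal": "JDG",
     "date": datetime.date(1930, 1, 1)},
]
edges = [("A", "B", 1.0), ("A", "C", 0.5), ("B", "C", 0.2)]

persons_1900_1950 = filter_rows(
    rows,
    start=datetime.date(1900, 1, 1),
    end=datetime.date(1950, 12, 31),
    mention_type="Person",
)
strong_edges = filter_edges(edges, low=0.4, high=1.0)
around_a = neighbours(edges, "A")
```

In this sketch the mention filters are applied to the raw rows, so importance and overlap values would be recomputed on the filtered subset before the node limit is applied; that ordering is an assumption.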
The following figure represents the entity network of the top-300 Person mentions between 1900 and 1950 extracted from the GDL and the JDG newspapers. By clicking on the figure it is possible to interact with the visualization.
The visualization of the network highlights the OCR errors and the entity disambiguation problem, which of course have a strong impact on the final result. Furthermore, since the number of nodes is limited, we need to be aware that the network only captures the relationships between the most important (most frequent) nodes. As a consequence, selecting a wide time window is generally not a good choice if we are looking for particular relationships between particular nodes.
Keeping these limitations in mind, the visualization of the network can show groups of nodes that are more closely related than others. These clusters can represent different topics or fields. This phenomenon is observable, for example, when names of actors or directors are grouped together, highlighting the cinema domain related to the Venice Film Festival. A cluster can also represent a particular historical event related to Venice that was reported in Swiss newspapers, for example the insurrection against Habsburg rule in 1848.
In the previous figure we can see a cluster related to the Venice Film Festival and the Venice Biennale and another cluster related to World War I and II.
Several further developments could improve the quality and the meaning of the network, for example:
- Correction of OCR errors
This part is very important for reducing the noise in the data. OCR errors are especially frequent in the oldest newspapers because of the print quality, the font family used and the different use of some characters. This step may also significantly improve the Named Entity Disambiguation process.
- Disambiguation of the named entity mentions
This part is also very important as it will improve the quality of the network. By mapping each mention to the corresponding named entity, the network will display the relationships and the relevance of an entity more accurately.
- More accurate filtering of Venice related articles
This task will strengthen the relationship of the visualized entities with Venice. In particular, it is necessary to separate newspaper articles from other types of inserts. For example, we noticed that, especially in the 80s and 90s, the movie times page was often included as an article about Venice because it mentioned Venice Film Festival awards. This resulted in an invasion of names of actors and directors.
- Allowing the choice of the similarity measure
Relationships between entities can be computed in different ways, and the choice depends on what we would like to highlight the most. It would therefore be useful to let the user choose the similarity measure so as to improve the interpretation of the network.
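As an illustration of the last point, a selectable similarity measure could be sketched as below. The Jaccard index and the Dice coefficient are used here only as two well-known set-based alternatives to the overlap coefficient; they are examples chosen for this sketch and not measures evaluated in the project. The names MEASURES and similarity are hypothetical.

```python
def overlap(x, y):
    """|X ∩ Y| / min(|X|, |Y|)"""
    return len(x & y) / min(len(x), len(y))


def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y|"""
    return len(x & y) / len(x | y)


def dice(x, y):
    """2 |X ∩ Y| / (|X| + |Y|)"""
    return 2 * len(x & y) / (len(x) + len(y))


MEASURES = {"overlap": overlap, "jaccard": jaccard, "dice": dice}


def similarity(x, y, measure="overlap"):
    """Similarity of two sets of article ids under the chosen measure.

    Returns 0.0 if either set is empty (assumed convention)."""
    if not x or not y:
        return 0.0
    return MEASURES[measure](x, y)


x = {"a1", "a2", "a3", "a4"}   # frequent entity
y = {"a2", "a3"}               # rare entity that always appears with x

by_overlap = similarity(x, y, "overlap")
by_jaccard = similarity(x, y, "jaccard")
by_dice = similarity(x, y, "dice")
```

On this toy pair the overlap coefficient is 1.0, whereas the Jaccard index is 0.5 and the Dice coefficient is about 0.67, because both penalize the difference in frequency between the two entities. This shows how the choice of measure changes which relationships stand out.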
- Bordag, Stefan. A comparison of co-occurrence and similarity measures as simulations of context. In Computational Linguistics and Intelligent Text Processing (2008), A. Gelbukh, Ed., vol. 4919 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, pp. 52–63.
- Ramamonjison, Rindra. Apache Spark Graph Processing: Build, Process, and Analyze Large-scale Graphs with Spark. Birmingham: Packt, 2015.