This project uses 200 years of digitized Swiss newspapers, the Gazette de Lausanne and the Journal de Genève, provided by Le Temps, to reconstruct how Venice was seen from a Swiss point of view.
We have two main objectives and an optional one that will be pursued if there is enough time.
The first objective is to build a bot that performs named entity recognition (NER) on our dataset. This bot will label entities such as locations, persons and organizations in all the articles.
It will be used to extract the articles related to Venice and to construct the entity network.
The second objective is to construct an entity network that captures the importance of the entities contained in our dataset and the relationships between them.
If we have time, we'll consider a third objective: identifying the topics of the selected articles in order to follow how their relative importance evolves over time. The topics will also make it possible to filter the entity network.
Step 1: Initialization
The first step consists of setting up our framework. The newspapers are already digitized so we start from a text dataset.
As we need to process a lot of text, we decided to use Apache Spark, an open source cluster computing framework that simplifies parallel computing and reduces computation time. Even if the dataset could be processed in a reasonable time without parallel computing, this approach makes our project more scalable. The data will be stored in HDFS (Hadoop Distributed File System).
Step 2: Named entity recognition Bot
This step consists of building a bot that can do named-entity recognition (NER). The purpose of the bot is to identify, disambiguate and classify the words in the articles into pre-defined categories like person, location, organization, etc.
This is certainly the most important (and most time-consuming) part of the project. The bot must reach a sufficient quality on this task, not only because the bot itself is the first objective of this project but also because the subsequent objectives depend directly on the quality of its output.
There are two main approaches to NER: statistical and rule-based. Each has its own advantages and disadvantages. The statistical approach requires a sufficiently large corpus of annotated data on which the bot is trained. The rule-based approach requires grammar rules, defined by linguists, from which the entity classification is inferred.
Different approaches and techniques must therefore be considered and compared to determine the most effective and suitable method for the bot. A common measure of NER performance is a variant of the F1 score, which we will not explain in detail here; it is essentially the harmonic mean of precision and recall.
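As a rough illustration of that measure, here is a minimal sketch that assumes exact-match scoring: a predicted entity counts as correct only if its span and its category both match the reference annotation. The proposal does not say which variant will be used, and the function name `ner_f1` and the sample annotations are hypothetical.

```python
def ner_f1(gold, predicted):
    """Return (precision, recall, f1) for two collections of
    (start, end, label) entity annotations, using exact matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Precision: share of predicted entities that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of reference entities that were found.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: character offsets and labels are made up.
gold = {(0, 8, "Person"), (40, 42, "Location"), (45, 59, "Location")}
predicted = {(0, 8, "Person"), (40, 42, "Location"), (70, 76, "Location")}
precision, recall, f1 = ner_f1(gold, predicted)
```

Two of the three predictions match the reference, so precision, recall and F1 all come out at 2/3 for this sample.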
Once the bot is built, it will process all the articles to find and annotate the entities. Here is an example of what the bot output looks like. The text is an excerpt from an article in our dataset.
<Person>Cialdini</Person>, comme on le sait, après avoir passé le <Location>Pò</Location> à <Location>Pontelagoscuro</Location> s'est porté en avant sur la route de <Location>Venise</Location> jusqu'à <Location>Padoue</Location>. A la date du <Date>17 juillet</Date>, une avant-garde assez forte entrait à <Location>Mestre</Location> [...]
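Output in this inline-tag form can be turned back into a list of entities with a regular expression. The sketch below assumes that tags are never nested and that each closing tag matches its opening tag; `extract_entities` is a hypothetical helper name, not part of the bot.

```python
import re

# Matches <Label>text</Label>; the backreference forces the closing tag
# to carry the same label as the opening one.
ENTITY_TAG = re.compile(r"<(?P<label>\w+)>(?P<text>.*?)</(?P=label)>", re.DOTALL)


def extract_entities(annotated):
    """Return the (label, text) pairs found in an annotated article."""
    return [(m.group("label"), m.group("text")) for m in ENTITY_TAG.finditer(annotated)]


sample = (
    "<Person>Cialdini</Person>, comme on le sait, après avoir passé le "
    "<Location>Pò</Location> à <Location>Pontelagoscuro</Location> s'est porté "
    "en avant sur la route de <Location>Venise</Location> jusqu'à "
    "<Location>Padoue</Location>."
)
entities = extract_entities(sample)
```

For this excerpt the result is one Person entity (Cialdini) followed by four Location entities (Pò, Pontelagoscuro, Venise, Padoue).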
Step 3: Filter Venice related articles
Now that all the articles are annotated, we need to find those that are related to Venice. The target articles must contain geographical features that can be used for this classification. For example, articles that contain the entity 'Venise' will certainly be selected.
However, location entities alone might not capture the whole semantic field of Venice.
This step should be carefully considered.
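A first, deliberately naive version of this filter could keep every article that has a location entity from a seed list. In the sketch below only 'Venise' comes from the text above; the extended seed set, the article identifiers and the function name `is_venice_related` are assumptions made for illustration.

```python
# Seed locations, compared in lower case. Only "venise" comes from the proposal.
VENICE_SEEDS = {"venise"}


def is_venice_related(entities, seeds=VENICE_SEEDS):
    """True if any Location entity of the article belongs to the seed set."""
    return any(
        label == "Location" and text.lower() in seeds
        for label, text in entities
    )


# Hypothetical annotated articles: id -> list of (label, text) entities.
articles = {
    "a1": [("Person", "Cialdini"), ("Location", "Venise")],
    "a2": [("Location", "Genève")],
    "a3": [("Location", "Mestre")],
}

selected = [aid for aid, ents in articles.items() if is_venice_related(ents)]
# A broader seed set (an assumption) also catches articles that never name Venice.
broader = [
    aid for aid, ents in articles.items()
    if is_venice_related(ents, seeds={"venise", "mestre"})
]
```

With the single seed only "a1" is kept; the broader seed set also keeps "a3", which shows how sensitive the selection is to the choice of seeds.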
Step 4: Entity network creation
At this stage of the project we should have a set of Venice-related articles annotated with entities. In more practical terms, we have a document-to-entity matrix that captures the occurrences of each entity in each newspaper article (document).
The goal is now to use this information to build a network of those entities. To do this, we would like to extract two aspects from the data: the relative importance of each entity compared to the others, and the strength of the relationships that link these entities.
The way we compute these two aspects is important and not trivial, since this information should reflect reality as closely as possible. For example, when computing the strength of the relationship between two entities, it is not sufficient to count how often they occur in the same document; other measures, such as the average offset distance between the two entities within a document and the probability that they occur in the same document, should also be considered.
To compute the relative importance of an entity we could use the frequency with which it appears in the documents. In addition, we could use the relationships between entities, since an entity that has many relationships is likely to be more important.
At the end of this stage we should have an "entity-to-entity" matrix that contains the strength of the relationship between each pair of entities and an "entity-importance" vector that contains the relative importance of each entity.
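To make the two outputs concrete, here is a minimal sketch that uses only the simplest measures mentioned above: document frequency for importance and the empirical probability of co-occurrence for relationship strength. It is a baseline under those assumptions, not the final measure; offset distances are ignored, and `build_network` and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_network(doc_entities):
    """Build (importance, strength) from a mapping doc id -> list of entities.

    importance[e]    = share of documents that mention e
    strength[(a, b)] = share of documents that mention both a and b,
                       with a < b so each unordered pair is stored once
    """
    n_docs = len(doc_entities)
    doc_freq = Counter()
    cooccurrence = Counter()
    for entities in doc_entities.values():
        unique = sorted(set(entities))  # count each entity once per document
        doc_freq.update(unique)
        cooccurrence.update(combinations(unique, 2))
    importance = {e: c / n_docs for e, c in doc_freq.items()}
    strength = {pair: c / n_docs for pair, c in cooccurrence.items()}
    return importance, strength


# Hypothetical Venice-related documents and their entities.
docs = {
    "a1": ["Venise", "Cialdini", "Padoue"],
    "a2": ["Venise", "Padoue"],
    "a3": ["Venise"],
    "a4": ["Venise", "Cialdini"],
}
importance, strength = build_network(docs)
```

Here 'Venise' appears in every document and gets importance 1.0, while the pair ('Cialdini', 'Padoue') co-occurs in one document out of four and gets strength 0.25. The sparse dictionary plays the role of the symmetric entity-to-entity matrix.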
Step 5: Visualization
Finally, it would be interesting to visualize the entity network constructed in the previous step. It may be premature to discuss this step now, as we do not yet know what results we will find.
However, we can describe some general ideas.
Since we should have an entity-to-entity matrix and an entity-importance vector, we could visualize the network as an undirected graph where nodes represent entities and edges represent the relationships between them.
As we also have the importance of each entity and the strength of each relationship, we could show this information by scaling the size of the nodes and the width of the edges according to these values.
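One way to prepare such a graph for a drawing library is to rescale both quantities linearly into pixel ranges. The sketch below assumes non-empty inputs; the pixel ranges, the "nodes"/"links" layout and the names `scale` and `to_graph` are illustrative choices, since no visualization library has been selected yet.

```python
def scale(value, lo, hi, out_min, out_max):
    """Map value linearly from [lo, hi] to [out_min, out_max]."""
    if hi == lo:
        # All values are equal: use the middle of the output range.
        return (out_min + out_max) / 2
    return out_min + (value - lo) * (out_max - out_min) / (hi - lo)


def to_graph(importance, strength, node_px=(5.0, 30.0), edge_px=(1.0, 8.0)):
    """Turn importance and strength values into sized nodes and links.

    Assumes both dictionaries are non-empty.
    """
    imp_lo, imp_hi = min(importance.values()), max(importance.values())
    str_lo, str_hi = min(strength.values()), max(strength.values())
    nodes = [
        {"id": entity, "size": scale(value, imp_lo, imp_hi, *node_px)}
        for entity, value in importance.items()
    ]
    links = [
        {"source": a, "target": b, "width": scale(value, str_lo, str_hi, *edge_px)}
        for (a, b), value in strength.items()
    ]
    return {"nodes": nodes, "links": links}


# Hypothetical values for three entities and two relationships.
importance = {"Venise": 1.0, "Padoue": 0.5, "Cialdini": 0.5}
strength = {("Padoue", "Venise"): 0.5, ("Cialdini", "Padoue"): 0.25}
graph = to_graph(importance, strength)
```

The most important entity gets the largest node and the strongest relationship the widest edge; a logarithmic scale may read better if the values turn out to be very skewed.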
Extra Step: Topics
This is an extra step that will be considered only if we have enough time.
It consists of identifying the topics that arise from the filtered data obtained after step 3, using machine learning techniques (K-means, LDA). Our labeled data would then carry an extra piece of information: the topic of each article (War, Biennale, Commerce, etc.).
With this extra information we'll provide two extra features: viewing the temporal evolution of the importance of the different topics, and filtering the entity network by topic.
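Once each article has a topic label, the temporal evolution can be read from the share of each topic per period. The sketch below assumes one topic per article and yearly granularity; the years and counts are invented sample data, and `topic_shares_by_year` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def topic_shares_by_year(labelled_articles):
    """Return {year: {topic: share of that year's articles}}.

    labelled_articles is an iterable of (year, topic) pairs.
    """
    counts = defaultdict(Counter)
    for year, topic in labelled_articles:
        counts[year][topic] += 1
    return {
        year: {topic: n / sum(per_topic.values()) for topic, n in per_topic.items()}
        for year, per_topic in sorted(counts.items())
    }


# Invented sample data: (publication year, topic label).
articles = [
    (1866, "War"), (1866, "War"), (1866, "Commerce"),
    (1950, "Biennale"), (1950, "Commerce"),
]
shares = topic_shares_by_year(articles)
```

Plotting these shares year by year gives the evolution of each topic, and the same labels can be used to restrict the entity network to articles of a single topic.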
The deliverable will consist of a web application (Scala Play).
The user will be able to select a time window, and to visualize the related entity network as an undirected graph, using an existing library (the choice has not been made yet).
Week 1-2: Setting up the framework. Design choices.
Week 3-7: NER bot construction.
Week 7-8: Filtering Venice articles.
Week 8-11: Entity network creation.
Week 14: Presentation.
*Depending on time constraints, we'll try to integrate the extra objectives.