As mentioned in the previous progress report, we tried to run the mxS and nerc-fr NER bots on larger subsets of the data. During a discussion with the TAs, they told us that the DHLab had just finished performing NER on the whole corpus and that we could access this data instead of running the two bots. Since our primary goal is to build and visualize the network of entities, we decided to use the data provided by the DHLab so as to have more time to focus on the next steps. The provided data is encoded as an RDF graph, which is the representation we use for this application.
We wrote an application using Apache Jena to execute SPARQL queries on the graph. We then extracted all the mentions of persons and locations (each represented as an article component identifier followed by an offset locating the entity inside it) together with their name values, from the articles containing a mention of Venice (32545 article components). Note that a mention is only a reference to an entity: if the entity Lausanne appears twice in an article, we get two mentions of Lausanne with two different offsets.
We stored this data as a CSV file (a convenient input format for Spark SQL, the Spark module for structured data processing). In total we had 792377 person mentions and 457485 location mentions. Here is an example of an entry of this table:
Path_to_data/JDG_09-11-1957_Ar00507_ne/offset_95_114, Marco Polo
We encoded the newspaper names as ‘0’ for the Journal de Genève and ‘1’ for the Gazette de Lausanne. The date is the publication date of the article to which the article component belongs, and it is followed by the article component identifier. Then we have the start and end offsets of the entity mention relative to its article component. The type column indicates either a location mention (‘0’) or a person mention (‘1’).
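To illustrate the encoding described above, here is a minimal sketch of how a raw mention identifier could be parsed into these columns. The case class, the function name, and the regular expression are our own, and the regex assumes identifiers follow exactly the format of the example entry:

```scala
// Sketch: turn a mention identifier of the form shown above into the
// encoded columns described in the text. Names and regex are assumptions.
case class Mention(newspaper: Int, date: String, articleId: String,
                   start: Int, end: Int)

val MentionId =
  """.*?(JDG|GDL)_(\d{2}-\d{2}-\d{4})_(\w+)_ne/offset_(\d+)_(\d+)""".r

def parseMention(id: String): Option[Mention] = id match {
  case MentionId(paper, date, art, s, e) =>
    // '0' encodes the Journal de Genève, '1' the Gazette de Lausanne
    val code = if (paper == "JDG") 0 else 1
    Some(Mention(code, date, art, s.toInt, e.toInt))
  case _ => None
}

// Example: the entry shown in the text
val m = parseMention("Path_to_data/JDG_09-11-1957_Ar00507_ne/offset_95_114")
// m == Some(Mention(0, "09-11-1957", "Ar00507", 95, 114))
```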
By working with DataFrame objects we can query them using SQL-like statements and functional operators. For example, we can retrieve the ten most frequent location mentions with the following Scala code:
entities.filter(col("type") === 0)
  .groupBy(col("name"))
  .count()
  .orderBy(desc("count"))
  .show(10)
The output is the following:
As another example, we can query for the ten most frequent person mentions occurring before 1950:
entities.filter(col("type") === 1 &&
    col("date") < Timestamp.valueOf("1950-01-01 00:00:00"))
  .groupBy(col("name"))
  .count()
  .orderBy(desc("count"))
  .show(10)
The output is the following:
As we can see from the results above, one important issue we need to deal with is that entity names are not always written in the same way, due to OCR errors and newspaper formats. For example, the entity Genève appears as Geneve, genéue, GENEVE, Geheve, etc. To mitigate this problem we decided to remove accents and lowercase all names. To deal with small OCR errors, we plan to use the Levenshtein distance (edit distance) to compare names and eventually correct them. This raises several problems: we need to decide which spelling is the correct one, and two very similar names may refer to different entities. One approach is to output all pairs of names with a small distance and then inspect them manually (if there are not too many). We have started implementing the algorithm.
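The two cleaning steps just described can be sketched in plain Scala as follows. The function names are our own; the normalization uses Unicode decomposition to strip accents, and the distance is the standard dynamic-programming Levenshtein algorithm:

```scala
import java.text.Normalizer

// Cleaning step: strip accents via Unicode NFD decomposition, then lowercase.
def normalizeName(s: String): String =
  Normalizer.normalize(s, Normalizer.Form.NFD)
    .replaceAll("\\p{M}", "") // drop combining accent marks
    .toLowerCase

// Standard dynamic-programming Levenshtein (edit) distance.
def levenshtein(a: String, b: String): Int = {
  val dist = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dist(i)(j) = math.min(math.min(dist(i - 1)(j) + 1,  // deletion
                                   dist(i)(j - 1) + 1), // insertion
                          dist(i - 1)(j - 1) + cost)    // substitution
  }
  dist(a.length)(b.length)
}

// After normalization, OCR variants of the same entity end up close:
// "genéue" -> "geneue" (distance 1 to "geneve"),
// "GENEVE" -> "geneve" (distance 0), "Geheve" -> "geheve" (distance 1).
val variants = Seq("Geneve", "genéue", "GENEVE", "Geheve")
val distances = variants.map(v => levenshtein(normalizeName(v), "geneve"))
```

Pairs of names whose distance falls below a small threshold could then be collected for manual inspection, as described above.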
The next step is to finish cleaning the data. We will then need to clearly define how to measure the relationship between entities. After that we will start implementing the visualization part; for this we are considering GraphX, the Spark component for graphs and graph-parallel computation.