Venice seen from 200 years of Swiss newspapers – Progress Report 1

As soon as we had access to the dataset, we explored the data to understand how it could be manipulated. According to our milestones we should now be entering the NER bot construction phase; however, as explained below, we are considering another approach.

Data exploration

The original dataset has one folder per day. Each folder contains the digitized image of the article, stored in PDF format. Each article is then stored in one or more XML files. The XML files contain metadata, and every word of the article has been tokenized and tagged. The tags <W> and <QW> can be used to reconstruct the plain text of the article.
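
As an illustration of the reconstruction step, here is a minimal sketch in Python. It assumes only what is stated above, namely that the words of an article are held in <W> and <QW> elements. The surrounding element names in the sample (ARTICLE, LINE) are hypothetical and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET


def article_text(xml_string):
    """Join the text of every <W> and <QW> element, in document order."""
    root = ET.fromstring(xml_string)
    tokens = []
    for elem in root.iter():
        # Only the two word tags named in the report are used.
        if elem.tag in ("W", "QW") and elem.text:
            tokens.append(elem.text.strip())
    return " ".join(t for t in tokens if t)


# Hypothetical fragment: ARTICLE and LINE are made-up container names.
sample = "<ARTICLE><LINE><W>Venise</W><W>est</W><QW>belle</QW></LINE></ARTICLE>"
print(article_text(sample))  # prints: Venise est belle
```

Joining tokens with a single space puts a space before punctuation marks, which resembles the first of the derived structures described below rather than the one with correct punctuation.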

There are also two other structures derived from the one above, in which the plain text of each article is already stored in XML files. The difference between the two is that in one the punctuation has been tokenized and separated by blank spaces like normal words, while the other keeps the correct punctuation, which is more suitable for a NER bot. Both structures are also available in a month-clustered form, with one XML file per month, which could be useful.

We noticed that old articles contain many more OCR errors than recent ones, due to various factors (for example the quality of the print on the paper). Moreover, written French has evolved: today's ‘s’ was printed as a long s (‘ſ’), which the OCR reads as an ‘f’, as the 1808 article below shows. We also observed that articles are sometimes not well delimited in the dataset (a fragment of one article can be contained inside another). This problem will certainly add noise to the computation of the importance of the relationship between two entities. As an example, we give an article from 1808 and one from 1995, followed by the OCR results.

Scanned articles: Gazette de Lausanne 28/03/1995 and Gazette de Lausanne 25/03/1808.

Plain text after OCR of the article from 1808:

Les connoiflances qu'il s'eft acquifes dans toutes les parties de l'éducation pendant plufieurs années, jointes au zèle qu'il mettra dans le foin de l'éducation morale de fes élèves, lui donnent le droit d'efpérer, qu'il ne manquera de répondre « nlièn ment aux vœux des parens. Pour ce qui concerne enfin l'éducation phyfique, il croit, que fans contredit, l'air pur fain, les agrémens champêtres qu'offre le village, qui d'ailleurs n'eft qn'à 2 lieues de Berne,  fur la grande route, en promettent un bon fucces, a ne rien laiflerdéfirer. Les perfonnes refp. qui voudront l'honorer de leur confiance, qui défirent avoir lesrenfeignemens néceffaires fur les conditions du dit inftitut, font priées de s'adrefler, au plus tard au mois de Juin, à Mr. Fasnacbt, miniftre à Hindelbank.

Plain text after OCR of the article from 1995:

La Confédération se montre trop gourmande CASINOS Loi sous la loupe des futurs gestionnaires George Hàberling, représentant de la société Tivolino souhaiterait « voir les casinos suisses compter parmi les meilleurs du monde ». Les patrons potentiels des casinos suisses ont mis les points sur les « i » lundi lors de séances d'information à Berne et à Thoune. S'exprimant sur le projet de loi en consultation, ils ont relevé que l'impôt prévu par la Confédération toucherait aux limites de la rentabilité des maisons de jeu. La loi prévue par le Conseil fédéral vise davantage à inhiber qu'à encourager les casinos, a averti Peter Stauffer pour le conseil d'administration du Kursaal de Berne. D'après le montant prévu dans la loi, il ne resterait plus de marge pour un impôt prélevé par les collectivités locales, a-t-il relevé.

NER bots

As discussed with the TAs, we have decided to modify our schedule slightly. We think that building a NER bot from scratch is not worthwhile, because several valid solutions already exist for performing NER on French text. Within the very tight time frame of five weeks, we do not think we could do better than the existing bots. As a consequence, we will have more time to focus on the next tasks and to improve the quality of the entity network and its visualization, which are our main objectives. So far we have found and tried two NER bots, which take different approaches.

The first one, nerc-fr, uses Apache OpenNLP with a French model to perform supervised named entity recognition. The second one, mXS, is rule-based; its set of rules was learned on the ETAPE corpus. Depending on the article, one tool can work better than the other. However, we noticed that overall the rule-based one (mXS) seems to perform better on our corpus, in particular when the article is old. Here is an example of the output of the two bots with the article from 1995 as input:

nerc-fr output:

La Confédération se montre trop gourmande <START:organization> CASINOS Loi <END> sous la loupe des futurs gestionnaires <START:person> George Hàberling, <END> représentant de la société <START:organization> Tivolino <END> souhaiterait « voir les casinos suisses compter parmi les meilleurs du monde ». Les patrons potentiels des casinos suisses ont mis les points sur les « i » <START:date> lundi <END> lors de séances d'information à <START:location> Berne <END> et à Thoune.

mXS output:

La Confédération se montre trop gourmande CASINOS Loi sous la loupe des <func> futurs gestionnaires </func> <pers> George Hàberling </pers> , représentant de la société Tivolino souhaiterait « voir les casinos suisses compter parmi les meilleurs du monde » . Les <func> patrons potentiels des casinos suisses </func> ont mis les points sur les « i » lundi lors de séances d' information à <loc> Berne </loc> et à <loc> Thoune </loc>.
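
Since the two bots use different markup, comparing them requires bringing both outputs into a common form. The following is a minimal sketch, assuming flat (non-nested) tags exactly as they appear in the two examples above. The function name, the label mapping and the (label, text) representation are our own illustrative choices and are not part of either tool.

```python
import re

# nerc-fr marks entities as: <START:type> text <END>
NERC_RE = re.compile(r"<START:(\w+)>\s*(.*?)\s*<END>", re.DOTALL)
# mXS marks entities as: <type> text </type>
MXS_RE = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>", re.DOTALL)

# Assumed mapping of mXS labels onto nerc-fr labels; only labels seen
# in the examples above are mapped, the others are kept unchanged.
MXS_TO_NERC = {"pers": "person", "loc": "location"}


def extract_entities(tagged, pattern, label_map=None):
    """Return (label, text) pairs found in a tagged string."""
    label_map = label_map or {}
    entities = []
    for match in pattern.finditer(tagged):
        label = label_map.get(match.group(1), match.group(1))
        # Drop punctuation that a bot left attached to the entity.
        text = match.group(2).strip(" ,.;:")
        entities.append((label, text))
    return entities


nerc_sample = ("<START:person> George Hàberling, <END> représentant "
               "à <START:location> Berne <END> et à Thoune.")
mxs_sample = ("<pers> George Hàberling </pers> , représentant "
              "à <loc> Berne </loc> et à <loc> Thoune </loc>.")

print(extract_entities(nerc_sample, NERC_RE))
print(extract_entities(mxs_sample, MXS_RE, MXS_TO_NERC))
```

On these two fragments the sketch yields the same person and the location Berne for both bots, while only the mXS output contains Thoune. Tag names containing characters other than letters, digits and underscores, and nested tags, would need a more careful parser.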

Next steps

Next week we will run the two tools on a large subset that is representative in terms of age, in order to choose between the two approaches (or perhaps merge the two, if that significantly improves performance). Then we will have to define clearly how to store the entities, and we will run a NER bot on the whole corpus. After that we will need to filter the articles related to Venice.
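
To make these steps more concrete, here is a rough sketch of one possible way to merge the results of two bots and to filter articles related to Venice. It assumes that the entities of an article are available as (label, text) pairs with a "location" label. The union strategy, the function names and the list of spellings of Venice are assumptions made for illustration only; none of these choices has been decided yet.

```python
# Assumed spellings under which Venice may appear in the corpus.
VENICE_NAMES = {"venise", "venice", "venezia"}


def merge_entities(first, second):
    """Union of two entity lists, in first-seen order, without duplicates.

    Two entities are considered equal when they share the same label
    and the same text, ignoring case.
    """
    seen = set()
    merged = []
    for label, text in list(first) + list(second):
        key = (label, text.casefold())
        if key not in seen:
            seen.add(key)
            merged.append((label, text))
    return merged


def mentions_venice(entities):
    """True if at least one location entity is a known name of Venice."""
    return any(
        label == "location" and text.casefold() in VENICE_NAMES
        for label, text in entities
    )


bot_a = [("location", "Berne"), ("person", "Peter Stauffer")]
bot_b = [("location", "berne"), ("location", "Venise")]

merged = merge_entities(bot_a, bot_b)
print(merged)
print(mentions_venice(merged))
```

A union favours recall over precision; keeping only the entities found by both bots would be the opposite trade-off, and the comparison on the representative subset should indicate which one suits the corpus better.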

References
  • Pattern Mining for Named Entity Recognition. Damien Nouvel, Jean-Yves Antoine, Nathalie Friburger. LNCS/LNAI Series volume 8387i (post-proceedings LTC 2011), 2014.