All posts by Gökçen Nurlu

Automatic Dating of Documents – Progress Report 2

We continue to work on our DH project, which experiments with methods for estimating the date of a given text using 200 years of articles. Last time we were struggling to process the data on our low-powered laptops, but we have since found a way to work on the whole dataset. We were also trying TF-IDF analysis and some other methods on samples. We have partially succeeded with these methods and are still working on them.

To work faster with the data, we converted the XML files to plain tab-delimited CSV files, dropping unnecessary columns. This greatly reduced the file size and eliminated the XML parsing phase, which consumes a lot of CPU. At the same time, we wanted to see whether we could eliminate non-useful articles that are filled with an enormous number of digits, which are probably badly scanned tables, lists or advertisements. We crafted a regular expression to filter them out of the results, and we rejected articles that have a high ratio of broken characters to total characters as well as articles that are very short, such as titles. This method needs tweaking, since we still see articles that are not useful, but for now it has helped a lot. In the end, we have 2’139’833 articles in JDG, and the data size was reduced from 10 GB to 4 GB.
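The filtering idea can be sketched roughly as follows. This is only an illustration, not our exact filter: the thresholds `MIN_CHARS` and `MAX_NOISE_RATIO`, the `NOISE` pattern and the function names are all assumptions made for the example.

```python
import re

# Hypothetical thresholds chosen for illustration only.
MIN_CHARS = 200          # shorter texts are treated as titles and dropped
MAX_NOISE_RATIO = 0.3    # maximum share of digits and symbols we tolerate

# Digits, underscores and anything that is neither a letter nor whitespace
# count as "noise". In Python 3, \w also matches accented letters.
NOISE = re.compile(r"[^\w\s]|[\d_]")


def noise_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are digits or symbols."""
    visible = re.sub(r"\s+", "", text)
    if not visible:
        return 1.0  # an empty article is treated as pure noise
    return len(NOISE.findall(visible)) / len(visible)


def keep_article(text: str) -> bool:
    """Keep an article only if it is long enough and mostly made of letters."""
    return len(text) >= MIN_CHARS and noise_ratio(text) <= MAX_NOISE_RATIO
```

A filter like this is cheap enough to run once over the tab-delimited file, and the two thresholds are the obvious knobs to tweak when useless articles still slip through.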

The OCR errors are a big challenge for us. An example of an article is included below.

LE TEMPS OU’IL FAIT o h , soir y h taat . j h . soir 3 * liste . déc j MUÉn , t KV . 1 2 g h- soir 7 b . mat . * b . soir t ) Wi . Jar »* – i fli 9 . 0-3 f > . t > IO . O 3 , 0 Vent b SW 733 733 7 . 33 . Eaa eu 24 h . o CT 1 ! l . o , o NNfi 5 idinil . ; . 36 « 5 73 » 732 SW 5 SW 3 . NW 7 80 ï , 0 , 3 W . J | Eau en -24 h , 4 . S .

We clearly observe that this article doesn’t give any value to our model. Another problem is that the OCR errors are more frequent in the earlier years. The danger is that our model will use OCR errors for classification, and not the actual content of the articles. Our dataset was initially skewed towards more data in the later years, and the OCR errors make this an even bigger issue.

As we discussed in earlier articles, our problem shares similarities with authorship attribution. Lexical features as well as bag-of-words models can tell a lot about the data. AICBT Consulting Ltd has written a nice article[1] on the subject. We take inspiration from it, and we are currently testing similar methods on our dataset. One main difference is that our dataset is somewhat continuous: years that are close in history are probably also close in classification. This is very interesting. Initial analysis has shown some promising results, but it is still too early to present them. One difficulty we have faced is the French language. Many natural language tools are trained on English, so we have to find a way to solve this problem.
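One way to see why this continuity matters is to compare two ways of scoring predictions. The sketch below is an assumption on our side rather than our actual evaluation code, and the function names are made up for the example. It contrasts exact-match accuracy with the mean absolute error in years.

```python
def exact_accuracy(true_years, predicted_years):
    """Share of articles whose year was predicted exactly."""
    if not true_years or len(true_years) != len(predicted_years):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(t == p for t, p in zip(true_years, predicted_years))
    return hits / len(true_years)


def mean_absolute_year_error(true_years, predicted_years):
    """Average distance in years between the true and predicted dates."""
    if not true_years or len(true_years) != len(predicted_years):
        raise ValueError("need two non-empty lists of equal length")
    total = sum(abs(t - p) for t, p in zip(true_years, predicted_years))
    return total / len(true_years)
```

For true years 1900, 1850 and 1920, the predictions 1902, 1849 and 1921 score zero on exact accuracy even though each is within two years. The predictions 1900, 1950 and 1820 score better on accuracy while being a century off twice. A distance-based score reflects the fact that neighbouring years are nearly interchangeable.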

After numerous attempts, we realised that our approaches need a balance between pre-computation on the data and actual one-by-one comparison of two texts. The former leads to fast computation but might result in loss of information and lower accuracy. The latter might give good results but takes a very long time, which makes it risky to try every idea that comes to mind.

We ran a simple test using some of the methods described in this paper[2]. Jaccard distance is one of the methods they describe, although they did not achieve great results with it. One of the test articles we used is an article published on a webpage. We computed the Jaccard distance to all articles written between 1826 and 1949 in our corpus and sorted them by distance. The three best-matching articles were written in 1911, 1895 and 1914 (from lowest to highest distance). The test article describes an event that happened in 1900! At least this shows some promising results.
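For readers unfamiliar with it, the Jaccard distance between two texts is one minus the size of the intersection of their word sets divided by the size of their union. Below is a minimal sketch of the ranking step. The tokenisation (lowercased runs of word characters) and the names `tokens` and `closest_years` are assumptions made for illustration, not necessarily what we ran.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercased set of words; a deliberately naive tokeniser."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def closest_years(query: str, corpus: list[tuple[int, str]], k: int = 3):
    """Return the k (distance, year) pairs with the smallest distance."""
    q = tokens(query)
    scored = [(jaccard_distance(q, tokens(text)), year) for year, text in corpus]
    return sorted(scored)[:k]
```

Here `corpus` is assumed to be a list of `(year, text)` pairs. Comparing the query against every article one by one is exactly the slow, pairwise kind of approach mentioned earlier, so in practice the token sets would be worth pre-computing.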

There are still a lot of things that have to be done. The project is structured in such a way that it takes a lot of time to get results that actually show a good correlation between the years and the articles. We are still positive that we will get some good results in the end.