Automatic dating of documents – Progress Report 1

In our Digital Humanities project, we aim to continue the studies of our assistant Vincent Buntix and to explore new techniques that might help us estimate the year in which a given article was written. The data we are working with is 200 years of articles combined from two Swiss newspapers, “Journal de Geneve” and “Gazette de Lausanne”. Successfully modelling this data can give us better insight into which topics trend or fade over the years, and provide a starting point for estimating the date of an unknown text. We hope our analysis will reveal the features that distinguish one year or decade from another.

As planned in our project milestones, we have spent our first weeks obtaining and examining the data. The data is stored in XML files that contain a lot of non-essential information in addition to the parts we are interested in: the date and the content of each article. We also observed that many articles contain unrecoverable OCR errors. Since these are likely to disrupt our future calculations, finding an accurate way to exclude them is now one of our primary goals. In parallel, we have written a parser for the raw XML files and are now using a subset of the data to improve it; a sketch of the approach follows below. Another obstacle we have identified is the difficulty of processing such a large amount of data with the limited memory of our laptops, so we are also looking into best practices for storing and processing data of this size. In short, we have spent the majority of our time building a suitable framework, as stated in our project milestones, because careless preparation would certainly cause us trouble later.
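To illustrate the kind of parser we have in mind, here is a minimal Python sketch of how a large XML file could be streamed article by article without loading it entirely into memory. The tag names "article", "date" and "full_text" and the file name are placeholders for illustration, not the actual schema of the newspaper archive.

    import xml.etree.ElementTree as ET

    def extract_articles(xml_path):
        """Stream (date, text) pairs out of a large XML file."""
        # iterparse reads the file incrementally, so memory use stays bounded
        for _, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "article":  # placeholder tag name
                date = elem.findtext("date")
                text = elem.findtext("full_text")
                if date and text:
                    yield date, text
                elem.clear()  # discard the parsed element to keep the tree small

    # hypothetical usage:
    # for date, text in extract_articles("gazette_de_lausanne_1850.xml"):
    #     process(date, text)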

Nevertheless, this week we started to investigate methods on a small subset of the data. The first is TF-IDF [1] (term frequency – inverse document frequency), a statistical technique that reflects the relationship between words and documents. TF measures how frequently a term occurs in a document. IDF measures how much information a word carries: if a word is common and appears in many documents, its IDF is small; if it is rare and appears only in specific documents, its IDF is large. The TF-IDF score is simply the product of the two. In our project, given documents grouped by year and a dictionary, we can compute a TF-IDF score for each word and each year. For example, if the TF-IDF score of a word in a specific year is larger than in other years, it means that word is more likely to be found in articles from around that year. We hope this method will reveal significant trends and lead us towards a better understanding of how language differs across time.
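As a concrete illustration of the computation described above, the following Python sketch scores words by year on a toy corpus. The word lists are invented for the example; in our project each "document" would be the set of articles published in one year.

    import math
    from collections import Counter

    # toy corpus: one "document" per year (placeholder words, not real articles)
    docs_by_year = {
        1850: "chemin de fer vapeur locomotive vapeur".split(),
        1900: "chemin de fer automobile vapeur".split(),
        1950: "automobile television radio automobile".split(),
    }

    n_years = len(docs_by_year)
    # document frequency: in how many years does each word appear at least once?
    df = Counter(word for words in docs_by_year.values() for word in set(words))

    def tfidf(word, year):
        words = docs_by_year[year]
        tf = words.count(word) / len(words)   # frequency of the word within that year
        idf = math.log(n_years / df[word])    # rare across years -> large IDF
        return tf * idf

    # "vapeur" scores higher in 1850 than in 1900; "automobile" peaks in 1950
    print(tfidf("vapeur", 1850), tfidf("vapeur", 1900), tfidf("automobile", 1950))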

We are all motivated to continue our research. We have an amazing opportunity to investigate a large and unique dataset, and our findings can help others understand how language develops over time.

 

  [1] Manning, C. D. (2008). Introduction to Information Retrieval. Retrieved from http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html