
Automatic dating of documents – Final report

Introduction

Written language changes over time, both in content and style. Is it possible to date a document based on the linguistic features of an article? This is the main motivation for our project. We were given 200 years of digitised articles from the newspapers Journal de Genève and Gazette de Lausanne. Which differences in language distinguish articles from one period of time from another, and how can we detect those differences? Previous work has focused on calculating similarity and distance between two texts using surface matching methods, e.g. the Jaccard distance [1], but the results are not satisfactory. This study extends that work by making use of machine learning methods and presents our observations. We also present a simple browser-based application that uses the results of these methods and provides additional statistics about the corpus. At the end, we discuss limitations and possible improvements.

Project Details

There are multiple methods of text classification. While some are vast and complex, others use simple probability theory to predict the class of a document. The one we picked after several experiments and discussions is the Naive Bayes classifier.

Naive Bayes is a popular method for text classification because of its simplicity and efficiency. It is a probabilistic classifier based on Bayes' theorem, and it is "naive" because it assumes independence between the features. While this assumption is rarely true, it does not seem to matter much for the final results. For those interested in the details of the Naive Bayes classifier, this article [2] provides an excellent explanation.
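To make the idea concrete, here is a minimal, purely illustrative sketch of the Naive Bayes decision rule for word-count features. The classes, words and counts are made up for illustration; this is not our actual code or data.

```python
# A minimal, illustrative sketch of the Naive Bayes decision rule for word-count features.
# All counts, class labels and words below are made up for illustration only.
import math
from collections import Counter

# Toy training statistics: word counts per class (here, per year).
class_word_counts = {
    1850: Counter({"steamer": 4, "telegraph": 3, "canton": 5}),
    1950: Counter({"television": 4, "atomic": 3, "canton": 2}),
}
class_doc_counts = {1850: 10, 1950: 12}   # number of training documents per class
total_docs = sum(class_doc_counts.values())
vocabulary = {w for counts in class_word_counts.values() for w in counts}

def log_score(document_words, cls):
    """log P(class) + sum of log P(word | class), with add-one (Laplace) smoothing."""
    counts = class_word_counts[cls]
    total = sum(counts.values())
    score = math.log(class_doc_counts[cls] / total_docs)
    for word in document_words:
        score += math.log((counts[word] + 1) / (total + len(vocabulary)))
    return score

document = ["telegraph", "canton"]
predicted = max(class_word_counts, key=lambda cls: log_score(document, cls))
print(predicted)   # 1850 for this toy example
```

The class with the highest score wins; the naive independence assumption is what lets the per-word probabilities simply be multiplied (added in log space).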

Our data set for this project is large, and we need a way to represent the data. Most machine learning tasks represent the data in something called a feature matrix. A feature of a data point can be "price", "color" or "numberOfChildren", depending on what the data point represents. In our case, a data point is defined as a newspaper article. We decided to use the vocabulary as the features of the article. As we mention later, this is not the only way to represent the data.

A feature matrix is a matrix in which each row represents a data point, in our case an article. The columns represent the different features. An important decision is what goes in the cells. Some methods perform better with a binary representation: the cell contains a "1" if the word appears in the article and a "0" if it doesn't. Tf-idf is another representation. It stands for term frequency – inverse document frequency, and it measures how important a word is for a document in a corpus. The last possibility is to simply keep track of the count: we count how many times a word appears in a document. This is the representation we went for, and it looks something like this:

             the   red   on   beach   school
Document 1     3     0    3       0        3
Document 2     1     1    2       0        0
Document 3     5     1    1       3        0

The reason we picked this representation is the performance of the classifier: we classified almost 20% more documents correctly with the count-based representation than with the tf-idf or binary representations.
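As an illustration, a count matrix like the one above can be built with, for example, scikit-learn's CountVectorizer. The exact tooling is incidental here, and the documents below are placeholders rather than articles from the corpus.

```python
# Illustrative sketch: building a count-based feature matrix with scikit-learn's
# CountVectorizer. The documents below are placeholders, not articles from the corpus.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the beach is red and the beach is warm",
    "school starts on monday",
    "the red school on the beach",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)    # rows = articles, columns = vocabulary terms

print(vectorizer.get_feature_names_out())  # the word features (columns)
print(X.toarray())                         # raw counts per document, as in the table above
```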

A similar table is created for each of the classes we want to classify into. We therefore have to define what a class is. In this project, we decided that one year is one class, as we wanted to see how language changes with time. Another possibility would be to predict the month a text was written in; our method could easily be changed to do this.
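Switching the class definition is only a matter of deriving a different label from each article's publication date. The sketch below is hypothetical; the dates are made up.

```python
# Hypothetical sketch: deriving class labels from an article's publication date.
# The dates below are made up; in the project, one year is one class.
from datetime import date

publication_dates = [date(1826, 3, 1), date(1901, 7, 15), date(1987, 11, 2)]

year_labels = [d.year for d in publication_dates]              # one class per year
month_labels = [(d.year, d.month) for d in publication_dates]  # alternative: one class per month
```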

Another method we tried was to use n-grams to model the text features. An n-gram is a combination of n consecutive words put together to form a new term. This can capture differences between two texts better, but it suffers from the increased number of dimensions. We experienced a slightly decreased accuracy after introducing 2-grams to our model.
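The small sketch below shows how quickly the number of dimensions grows once 2-grams are added. Again, scikit-learn and the placeholder documents are assumptions made purely for illustration.

```python
# Sketch: adding 2-grams to the feature space. Placeholder documents; scikit-learn is
# assumed purely to illustrate how quickly the number of dimensions grows.
from sklearn.feature_extraction.text import CountVectorizer

documents = ["the red beach", "the red school"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(documents)
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(documents)

print(len(unigrams.vocabulary_))  # 4 features: the, red, beach, school
print(len(bigrams.vocabulary_))   # 7 features: the 4 unigrams plus "the red", "red beach", "red school"
```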

An alert reader might wonder why we do classification and not regression, since time is continuous rather than discrete. While this is true, we achieved better results with classification. The regression tended to predict the average over all the years rather than show a clear trend. Our hypothesis is that this is caused by the similarities between documents: the similarities outweigh the differences.

Pre-processing of the data

There are multiple things we can do to improve our results. Many of them are summarized in the paper Text Classification Using Machine Learning Techniques [3].

Stemming is a pre-processing method that reduces the size of the feature space by merging words with the same stem. For instance, "train", "trains", "trainer" and "training" all reduce to the same stem "train". Removing stop words is another pre-processing step that reduces the size of the feature space by removing very common words. The idea is that those words would not provide any useful information to the model anyway, so removing them could improve the final result as well as speed up the process. Spell-checking and repairing or removing misspelled words is an additional approach. In our case, we took advantage of open-source software: PyEnchant for spell-checking and the Snowball stemmer from the NLTK package for stemming. After processing our data, we observed a significant improvement in our results. A sketch of such a pipeline is shown below.
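The sketch below shows roughly how such a pipeline fits together with these libraries. The language codes, the stop-word list and the decision to drop (rather than repair) misspelled words are assumptions for the sake of illustration, not necessarily the exact settings we used.

```python
# Sketch of the pre-processing pipeline: spell-check with PyEnchant, remove stop words,
# and stem with NLTK's Snowball stemmer. Language codes and stop-word list are assumptions.
import enchant                                     # PyEnchant
from nltk.corpus import stopwords                  # requires nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer

checker = enchant.Dict("fr_FR")                    # the corpus is French-language newspapers
stop_words = set(stopwords.words("french"))
stemmer = SnowballStemmer("french")

def preprocess(tokens):
    """Keep correctly spelled, non-stop-word tokens and reduce them to their stems."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        if token in stop_words:                    # drop very common words
            continue
        if not checker.check(token):               # drop words the spell-checker rejects
            continue
        cleaned.append(stemmer.stem(token))        # merge inflected forms onto one stem
    return cleaned

print(preprocess(["les", "trains", "arrivent", "xqzt"]))   # e.g. ['train', 'arriv']
```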

Below we can see two figures. They show the correlation between the year an article was written and the predicted year of authorship. On the left we see the results before pre-processing of the data, and on the right the results after pre-processing. The accuracy improved from 35% to 57%, which is a significant improvement.

[Figures: predicted year vs. actual year, before pre-processing (left) and after pre-processing (right)]

Results and conclusion

[Figure: classification results for all the articles tested]
The diagram is a plot of all the articles tested. Each row is normalized so that it sums to one.

We have found a clear correlation between the year an article was written and the label we got through classification. We got an accuracy of 33% on the data set from JDG, where we looked at documents in five-year intervals. As we can see from the figure, the earlier years were difficult to classify, which lowers this result. Some years achieved an accuracy of more than 60%.

We have observed a slight skew in the classified results towards later years. We believe this is due to the poor quality of the data from the earlier years. It is also interesting to see that some years are more likely than others to have documents classified to them.

We are happy with the results, but we know there is room for improvement. Regression is one possible direction. We also saw that the amount of data matters a lot in this model. More powerful computers and cleaner data would enable a continuation of this project to achieve improved results.

The code and results of this project will be handed over to our teaching assistant Vincent Buntinx, and we are excited about what will follow.

Team:

  • Martin Hallén
  • Gökçen Nurlu
  • Liang Jensen
  • Junxion Wang

References

[1] Manning, C. D., & Schütze, H. (1999). "Foundations of Statistical Natural Language Processing". The MIT Press.

[2] Ray, S. (2015, September 13). “Naive Bayes Algorithm”. Retrieved April 1, 2016, from http://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/

[3] Ikonomakis, M., Kotsiantis, S., & Tampakas, V. (2005). "Text Classification Using Machine Learning Techniques". WSEAS Transactions on Computers, 4(8), 966-974. Retrieved April 1, 2016, from http://infoautoclassification.org/public/articles/Ikonomakis-et.-al._Text-Classification-Using-Machine-Learning-Techniques.pdf