We continue to work on our DH project, which experiments with methods for estimating the date of a given text using 200 years of articles. Last time we talked a lot about OCR errors, errors introduced before we received the data. These challenges are still present, but in this blog post we will describe some of the results we have obtained.
Naive Bayes is a popular method for text classification, valued for its simplicity and efficiency. It is a probabilistic classifier based on Bayes’ theorem, and it is called naive because it assumes that the features are independent of each other. While this assumption rarely holds in practice, it does not seem to hurt the final results much.
For those interested in the details of the Naive Bayes classifier, this article provides an excellent explanation: http://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/
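To make the idea concrete, here is a minimal multinomial Naive Bayes sketch for year classification, written from scratch with Laplace smoothing. The toy documents and vocabulary are invented for illustration; this is not our actual implementation, which runs on the newspaper corpus.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, year). Count documents and words per year."""
    year_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, year in docs:
        year_counts[year] += 1
        word_counts[year].update(tokens)
        vocab.update(tokens)
    return year_counts, word_counts, vocab

def classify_nb(tokens, year_counts, word_counts, vocab):
    """Pick the year maximizing log P(year) + sum of log P(word | year),
    with Laplace (add-one) smoothing for unseen words."""
    total_docs = sum(year_counts.values())
    best_year, best_score = None, float("-inf")
    for year, n_docs in year_counts.items():
        score = math.log(n_docs / total_docs)
        n_words = sum(word_counts[year].values())
        for w in tokens:
            score += math.log((word_counts[year][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_year, best_score = year, score
    return best_year

# Invented toy corpus: two "eras" with distinctive vocabulary.
docs = [(["wireless", "telegraph"], 1925), (["telegraph", "morse"], 1925),
        (["television", "broadcast"], 1955), (["broadcast", "antenna"], 1955)]
model = train_nb(docs)
print(classify_nb(["telegraph"], *model))  # → 1925
```

The naive independence assumption shows up in the inner loop: each word contributes its log-probability independently of the words around it.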
Since we are still developing our methods, we have only applied this one to a subset of the data, the years 1920 to 1960. The picture below shows the correlation between the year an article was actually written and the year assigned by our model. For simplicity, we trained the model on all articles in this range except those written in June; the test data therefore consists of all articles in this range written in June.
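The June holdout described above can be sketched in a few lines. The record layout (year, month, tokens) is a hypothetical stand-in for whatever structure the real corpus uses.

```python
# Hypothetical article records: (year, month, tokenized text).
articles = [
    (1923, 6,  ["grève", "usine"]),
    (1923, 11, ["élection", "chambre"]),
    (1947, 6,  ["reconstruction", "plan"]),
    (1947, 2,  ["hiver", "pénurie"]),
]

# June articles form the test set; everything else is training data.
train = [a for a in articles if a[1] != 6]
test = [a for a in articles if a[1] == 6]
print(len(train), len(test))  # → 2 2
```

Splitting by month rather than at random keeps whole issues together, so no single day's reporting leaks between the training and test sets.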
The diagram is a plot of all the articles tested. Each row is normalized so that it sums to one.
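Row normalization of this kind can be sketched as follows; the counts matrix is an invented example, not our actual data.

```python
def normalize_rows(matrix):
    """Scale each row of a nested-list matrix so it sums to one.
    Rows that are all zeros are left unchanged to avoid division by zero."""
    out = []
    for row in matrix:
        s = sum(row)
        out.append([v / s for v in row] if s else row[:])
    return out

# Toy 2x2 count matrix: rows are true years, columns are predicted years.
counts = [[8, 2], [1, 3]]
print(normalize_rows(counts))  # → [[0.8, 0.2], [0.25, 0.75]]
```

After normalization each row reads as the distribution of predicted years for articles actually written in that year, so rows with many and few articles become comparable.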
As we can see from the diagram, the classifier predicts many of the documents correctly: a large share of them fall on the main diagonal. In addition, we can see a band around this line. Documents are often classified one year off, which seems natural when we consider that real events are written about across adjacent years.
Another interesting discovery is that some years seem to attract documents more than others; examples are 1926, 1929 and 1952. We still have to investigate what causes this.
We need some kind of measure of how well our methods perform. In machine learning one often talks about precision and recall; we use a measure called accuracy, defined as (true positives + true negatives) / (number of all samples). For each year, we get a number indicating how good the classification is. Some years score almost 50 percent, while others score as little as 4 percent, which is close to random. As the diagram shows, 1940 is such a year.
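One way to obtain such a per-year score is to take, for each true year, the fraction of its documents that were classified correctly. This is a sketch of that idea only; the post's exact per-year measure may differ, and the year lists below are invented.

```python
from collections import defaultdict

def per_year_score(true_years, predicted_years):
    """For each true year, the fraction of its documents classified correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(true_years, predicted_years):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {y: correct[y] / total[y] for y in total}

# Invented example: 1940 documents scatter, 1952 documents land on target.
true = [1940, 1940, 1952, 1952]
pred = [1941, 1939, 1952, 1952]
print(per_year_score(true, pred))  # → {1940: 0.0, 1952: 1.0}
```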
There are multiple things we can do to improve our results. Text Classification Using Machine Learning Techniques summarizes many aspects of text classification.
Stemming is a pre-processing step that reduces the size of the feature space by collapsing words with the same stem into a single feature. For instance, “train”, “trains”, “trainer” and “training” all reduce to the same word, “train”.
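A deliberately naive suffix-stripping sketch illustrates the idea; real stemmers, such as the Snowball stemmers in NLTK, apply much more careful language-specific rules.

```python
def crude_stem(word, suffixes=("ing", "er", "s")):
    """Toy stemmer: strip the first matching suffix, keeping a minimum stem length.
    Illustrative only; not a substitute for a real stemming algorithm."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

print([crude_stem(w) for w in ["train", "trains", "trainer", "training"]])
# → ['train', 'train', 'train', 'train']
```

Four distinct surface forms become one feature, so the classifier sees their combined counts instead of four sparse ones.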
Removing stop words is another pre-processing step that reduces the size of the feature space by removing very common words. The idea is that these words do not provide much useful information to the model anyway, so removing them can improve the final result as well as speed up the process.
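In code, stop word removal is a simple filter against a fixed word list. The list below is a small illustrative subset of French stop words, not the full list one would use in practice.

```python
# A few common French stop words (illustrative subset, not a complete list).
STOP_WORDS = {"le", "la", "les", "de", "des", "et", "un", "une", "en"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["le", "train", "de", "nuit", "et", "la", "gare"]))
# → ['train', 'nuit', 'gare']
```

Since stop words are roughly equally frequent in every year, they carry little signal for date estimation, and dropping them shrinks the vocabulary considerably.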
As before, we created buckets of words for each year. This time we first ran PyEnchant’s French spell checker and discarded the words it did not recognize. We could try to correct them instead, but that takes too much processing time. We then ran each remaining word through the French Snowball stemmer from the NLTK package.
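The shape of that pipeline, filter against a dictionary and then stem, can be sketched as below. To keep the sketch self-contained, a small word set stands in for PyEnchant's French dictionary and a crude plural-stripper stands in for NLTK's French Snowball stemmer; the real pipeline uses those libraries.

```python
# Stand-in for PyEnchant's French dictionary check: a tiny set of known words.
FRENCH_WORDS = {"train", "trains", "gare", "ouvrier", "ouvriers"}

def crude_french_stem(word):
    """Stand-in for NLTK's SnowballStemmer("french"): strip a plural 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def preprocess(tokens):
    """Discard tokens the dictionary rejects (likely OCR errors), stem the rest."""
    return [crude_french_stem(t) for t in tokens if t in FRENCH_WORDS]

# "tra1n" mimics an OCR error and is filtered out by the dictionary check.
print(preprocess(["ouvriers", "gare", "tra1n", "trains"]))
# → ['ouvrier', 'gare', 'train']
```

Filtering before stemming matters here: the dictionary check is what removes OCR garbage, while stemming only merges the surviving valid forms.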
We will continue working on the pre-processing of the data. A great advantage is that the processed data can be plugged directly into some of the other machine learning methods. It will be interesting to see if this can provide even better results.