Automatic dating of documents

Written language changes over time, both in content and in style. Is it possible to date a document based on its linguistic features? This is the main motivation for our project. We have been given 200 years of digitized articles from the newspapers Journal de Genève and Gazette de Lausanne. Which differences in language distinguish articles of one period from those of another, and how can we detect those differences?

Previous work
The supervisor of this project, Vincent Buntinx, has already done some initial analysis of the data set. His method uses lexical features, such as the vocabulary of the texts, to classify them according to their publication year. Specifically, the Jaccard distance was calculated between articles and the whole year. This gave promising results, but they were not yet satisfactory. We will continue his work and also investigate different methods.
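
To make the idea concrete, here is a minimal sketch of dating by Jaccard distance. It is not the supervisor's implementation: the tokenizer, the function names and the toy data are our own assumptions, and each year is simply represented by the set of all words that occur in it.

```python
import re


def vocabulary(text):
    """Return the set of lowercased word tokens in a text (naive tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard_distance(a, b):
    """Jaccard distance 1 - |A & B| / |A | B| between two sets."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def closest_year(article, year_texts):
    """Return the year whose vocabulary is nearest to the article's.

    year_texts is a hypothetical mapping {year: all text published that year}.
    """
    article_vocab = vocabulary(article)
    return min(
        year_texts,
        key=lambda year: jaccard_distance(article_vocab, vocabulary(year_texts[year])),
    )


# Toy data, invented for illustration only.
toy_years = {
    1850: "le télégraphe arrive",
    1962: "la crise de cuba",
}
guess = closest_year("la crise continue", toy_years)
```

On real data, a whole year contains far more words than a single article, so the distance will be dominated by the size of the yearly vocabulary. This is one of the points we will have to examine.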

We also realize that the task of dating articles based on textual features is similar to the task of authorship attribution. This gives us the opportunity to learn from other research projects and to determine whether the same methods are useful for dating these texts.

Project plan and goals
Historical events have an effect on the content of the newspapers. We have already seen this in class, where some words were more frequent in some periods of time. For instance, articles discussing the Cuban crisis are more common in the years of the actual crisis. This is information we might use in the dating of the articles. Not only the content but also the writing style changes with time. Some of the features we expect to be significant are listed below.

  1. Lexical features
    • The average number of words per sentence
    • Sentence length variation
    • Lexical diversity, which is a measure of the richness of the authors’ vocabulary
  2. Punctuation features
    • Average number of commas, semicolons and colons per sentence
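
The features above could be extracted along the following lines. This is only a sketch under simple assumptions: sentences are split naively on ".", "!" and "?", lexical diversity is measured as the type-token ratio (one of several possible measures), and the function and key names are our own.

```python
import re
import statistics


def style_features(text):
    """Compute simple lexical and punctuation features for one text."""
    # Naive sentence split; abbreviations and numbers will be split incorrectly.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(re.findall(r"\w+", s)) for s in sentences]
    words = re.findall(r"\w+", text.lower())
    n = len(sentences)
    return {
        "avg_words_per_sentence": statistics.mean(lengths) if lengths else 0.0,
        # Population standard deviation of the sentence lengths.
        "sentence_length_stdev": statistics.pstdev(lengths) if lengths else 0.0,
        # Type-token ratio: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        "commas_per_sentence": text.count(",") / n if n else 0.0,
        "semicolons_per_sentence": text.count(";") / n if n else 0.0,
        "colons_per_sentence": text.count(":") / n if n else 0.0,
    }


features = style_features("Le chat dort. Le chien, lui, court vite.")
```

Note that the type-token ratio depends strongly on text length, so it would have to be computed on samples of comparable size before articles can be compared.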

The main goal of the project is not to assign a precise date to an article, but rather to explain what makes such a prediction possible. More realistically, our goal will be to uncover features of texts that can help us classify them with respect to time periods. We will ask ourselves the question: what are the principal components of the corpus over the given range of dates?

Another main goal of the project is to investigate which methods are effective for dating the corpora. Some methods in this problem domain are common and have proved effective, but recent advances in machine learning have opened the way to new and interesting methods. The field of artificial neural networks in particular has shown promising results. Our approach will be to try to extend these methods to our problem domain.

Project structure and milestones
Our project will be divided into four phases, some of which overlap. A more detailed plan will be developed as the project progresses, since this is an iterative process.

  1. Week 1-4: Analysis of the data
    An important first step will be to analyze our data and build a platform that makes it easy to extract the data in the format we want. This phase is highly dependent on the current state of the data set.
  2. Week 5-8: Investigating different methods
    One of the main goals of the project is to determine which methods are the most effective for dating documents. As our problem is not commonly researched, this is one of the most important phases of the project.
  3. Week 5-10: A working platform for classification of the corpora
    We would like to make a program module that has an easy-to-understand user interface and is easy to extend. The code for the project will be available to other researchers.
  4. Week 8-14: An online demo for everyone to use
    At the end of the project, we will provide an online demo that will be publicly available. The demo will take the form of a web interface where users can load existing texts from our database or upload their own text. They will then see relevant statistics about the text they used.

Deliverables

  1. Web interface with a running demo
    One of the main goals of the project is to make an interactive demo where everyone, whether students, researchers or the general public, can use our platform to understand which features make their text of choice distinguishable from others, especially in terms of date classification. We plan to make both the pre-processing and the web interface parts as modular and loosely coupled as possible, so that they can be modified easily. Overall, we want our analysis to be easy for people to use in order to understand the characteristics of a given text.
  2. Software
    We want to provide a code platform that people can extend to do their own analysis on their own data. We would like to make an open source project that will help other researchers tackle similar problems. The project should be easy to understand and extend.
  3. Research paper
    We will write a scientific research paper about our findings, discussing the methods used and relating them to existing research.

 

Group members: Gökçen Nurlu, Liang Jensen, Junxion Wang and Martin Hallén