Our project involves conducting topic modelling on articles from the Archivio Veneto journal, with the aim of extracting interesting and meaningful information about the history of Venice. We also aim to develop a visualization interface in order to make it easier to understand our results.
Goals for Week 2
According to our project timeline, we had planned to familiarize ourselves with the data source, work on planning and researching our algorithm, and begin pre-processing the data set. We also planned to have a partly completed Python Visualization Platform ready so that we can evaluate early results.
Project Progress by Week 2
Familiarization with Data Source
We have received the digitized version of every issue of the Archivio Veneto. However, this data set is segmented by journal issue rather than by article, and it contains a lot of metadata, such as page numbers, that needs to be removed. A clean version of the dataset will arrive next month; until then, we will work with what we have by extracting segments of the journals that look like articles for testing purposes. We will not have to repeat this step later with the new dataset, as we already know what the data will look like.
We are doing our development in Python, using the Gensim and Natural Language Toolkit (NLTK) libraries for the modeling algorithms and Natural Language Processing (NLP). Our current script runs Latent Dirichlet Allocation (LDA) on the data described above.
- We faced a problem with accented letters: they are non-ASCII characters, so processing the files without specifying an encoding raises an error. Reading the files as UTF-8 solved this, although some strings still display hex codes instead of the actual characters; the library is able to handle them.
- Many meaningless character sequences appear throughout the text, e.g. ‘p’ and ‘pp’, which probably mark page references, and ‘<<’ and ‘>>’ (likely renderings of the guillemets « and »). Since our current data is not very clean, we have added checks that remove these single and special characters from the word list.
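The cleanup described above can be sketched as follows. This is an illustrative simplification of our script, not the exact implementation; the noise set and patterns are examples, and files are assumed to be read with `open(path, encoding="utf-8")` so accented characters decode correctly.

```python
import re

# Tokens we treat as noise in the raw text: page markers such as "p"/"pp".
# This set is illustrative; our real script grows it as we inspect the data.
NOISE = {"p", "pp"}

def clean_tokens(text):
    """Split text into tokens, stripping punctuation (including << and >>),
    bare numbers, single characters, and known noise markers."""
    cleaned = []
    for tok in text.split():
        tok = re.sub(r"\W", "", tok)  # Unicode-aware: keeps accented letters
        if len(tok) <= 1 or tok.isdigit() or tok.lower() in NOISE:
            continue  # drop page numbers, single chars, and noise markers
        cleaned.append(tok)
    return cleaned
```

For example, `clean_tokens("Venezia pp 12 << storia >>")` keeps only the two content words `Venezia` and `storia`.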
- Using the current data, we have started on the key preprocessing steps:
- Stopword removal: we use NLTK’s Italian stopword list.
- Stemming: to strip morphological affixes from words, again using NLTK’s stemmer.
- We could not integrate Part-of-Speech (POS) tagging into our preprocessing steps, because NLTK does not include a tagger for Italian. We could train one with NLTK, but that would not be optimal. We found a few libraries that do support Italian and plan to try them:
- RDRPOSTagger – a rule-based (Ripple Down Rules) Part-of-Speech tagger
- UCTO – an advanced rule-based, Unicode-aware tagger
- The topics that LDA outputs depend on the probabilities of generating particular words. Hence, we plan to integrate POS tagging so that we can remove conjunctions, determiners, articles, and adjectives, and then validate whether the generated topics improve.
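Whichever tagger we adopt, the filtering step itself is straightforward. A sketch, assuming the tagger yields (token, tag) pairs; the tag names here are hypothetical, since the actual tagset depends on the library we choose:

```python
# Tags to discard before LDA; names are illustrative, not a real tagset.
DROP_TAGS = {"CONJ", "DET", "ADJ"}  # conjunctions, determiners/articles, adjectives

def filter_by_pos(tagged_tokens):
    """Keep only tokens whose POS tag is not in DROP_TAGS."""
    return [tok for tok, tag in tagged_tokens if tag not in DROP_TAGS]
```

For example, `filter_by_pos([("la", "DET"), ("storia", "NOUN"), ("e", "CONJ")])` returns `["storia"]`.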
- After the preprocessing step, we wrote a script to extract topics from the corpus. Since we do not yet have the final dataset, we could not produce final results; however, we wanted to prototype our application, so we set up a related experiment: instead of treating each article as a document, we treated each whole journal issue as one document.
- Our algorithm first builds a dictionary of the words present in the processed dataset.
- It then creates a bag-of-words representation, counting the number of occurrences of each word in each document. This representation is passed as input to LDA, which extracts the topics.
- The LDA model takes the number of topics to produce as a parameter. We experimented with a range of values and found 30 topics to work well; nevertheless, we will have to tune this parameter again at a later stage.
- The resulting topics can be seen in the following table. Each line of the table is one topic, and each topic comprises a set of words ordered by importance. The coefficients signify how much a word matters for the topic.
- Our current results are not yet adequate. Some special characters were not filtered out by the preprocessing step, so we have to refine it. Furthermore, some words (e.g. ‘venezia’) are predominant in every topic, making the distinction between topics difficult.
- We are currently working on applying the Hierarchical Dirichlet Process (HDP) to the dataset. HDP infers the number of topics from the data, which will help us pick the right number of topics for LDA.
Python Visualization Platform
So far, we are able to generate appropriate data visualizations as per the plan. After closer examination of the libraries we planned to use, we decided that Plotly provides us with all the features we need, and so we will no longer be using Graph-Tool or Matplotlib.
Our results are as follows:
- Article Composition
- We are able to generate a breakdown of the topics in an article, given the raw data.
- We are able to hover over each sector to see the raw data in a tooltip.
- Article Network
- We are able to see the articles as nodes, and the topics as edges in the network.
- The colour of a node varies with the number of articles it shares a topic with.
- Hovering over a node shows its number of connections. However, this is only temporary; we intend to display information about the article itself in a future step.
- Topic Tracker
- We have not yet implemented the topic tracker, as we currently have no meaningful data to track. However, we plan to implement it within the coming weeks.
Currently the graphs are generated correctly, but the data has to be entered manually. The next step is to generate the graphs directly by reading the data from a file. After we have completed the Python Visualization Platform, we might deviate from the project plan: we discovered that Plotly can generate JS files from Python code, so we are reconsidering whether we need to develop a separate JS visualization platform later, or whether we can use this feature and keep one platform throughout.
After the first 2 weeks of working on the project, we have been able to meet all the goals we had set. We have worked on pre-processing the data, and researching and developing early versions of the Topic Modelling algorithm and the visualization tool. However, since the data is not clean yet, it may take us some extra time to work on pre-processing it so that we can efficiently test our algorithm.