Citations of the archive

Abstract

This report studies the potential and limitations of citation organisation in the humanities today. We focus on analysing the citations that refer to the Archivio di Stato di Venezia (ASVe), the archive that conserves 80 km of documents from Venice's past centuries. Based on these citations, we present a novel tool for extracting and classifying citations of the ASVe. The tool determines how often such citations occur in the current literature and classifies them by their category in the archive. This lets us characterise the order in which the archive could be digitized.

Background

The Archivio di Stato di Venezia (ASVe) is a collection of documents on Venetian history. It is the largest archive in Italy and covers up to 70 km of shelves, so digitizing the whole collection would be time consuming and physically demanding.

To ease both the digitization process and the study of the archive, it would be more convenient to begin with the documents most cited by scholars rather than to create digital copies in chronological order. This project is therefore intended to reduce the workload and to improve accessibility.

Figure 1: Archivio di Stato di Venezia

The study of citations in the humanities has grown rapidly in recent years. At Digital Humanities 2013, Sula C.A. & Miller M. presented an online PDF extractor that studies the occurrences of citations within the same body of literature [12]. However, they also pointed out the challenge of low resolution caused by the diverse citation patterns across disciplines, which would require intensive manual checking and machine training. Their study sheds some light on the importance of our project: to develop a classifier, extractor and categorizer for ASVe citations that can eventually act as a model for interdisciplinary citation studies.

Project Objectives

The goal of the project is to identify the sections of the ASVe that are most cited by scholars. Our expected deliverables are as follows:

  1. A program for classifying text files according to whether they contain ASVe citations, and for extracting the ASVe-related citations from text documents.
  2. A program for parsing the extracted information and categorizing the citations for statistical citation analysis.

Method

  • Classifier

The classifier searches the source document for ASVe-related keywords to decide whether or not it contains citations. Because it only detects keywords such as ASVe, A.S.V. and Archivio di Stato di Venezia, as illustrated in the expression below, a file that uses ASV as an abbreviation of “anodic stripping voltammetry” was also accepted as a candidate, which is a wrong classification.

“((A|a)(R|r)(C|c)(H|h)(I|i)(V|v)(I|i)(O|o)(D|d)(I|i) (S|s)(T|t)(A|a)(T|t)(O|o) (D|d)(I|i)(V|v)(E|e)(N|n)(E|e)(Z|z)(I|i)(A|a)) |(ASV[e]?)”
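The keyword test described above can be sketched in Python as follows. This is a minimal sketch and not the project's own code: the function name `mentions_asve` and the sample sentences are hypothetical, and a case-insensitive group stands in for the per-letter alternatives of the expression above.

```python
import re

# Keywords named in the text: the full name of the archive (any letter case)
# and the abbreviations ASV, ASVe and A.S.V. The exact pattern is an
# illustrative assumption, not the project's expression.
ASVE_KEYWORDS = re.compile(
    r"(?i:archivio\s+di\s+stato\s+di\s+venezia)"  # full name
    r"|\bA\.?S\.?V\.?e?(?!\w)"                    # abbreviations
)

def mentions_asve(text):
    """Return True if the text contains at least one ASVe-related keyword."""
    return ASVE_KEYWORDS.search(text) is not None

# Hypothetical sample sentences.
historical = "Cfr. ASVe, Senato, Dispacci, f. 1002."
full_name = "Fonti conservate presso l'archivio di Stato di Venezia."
chemistry = "ASV peaks were recorded by anodic stripping voltammetry."
unrelated = "A study of Florentine guilds."

print(mentions_asve(historical), mentions_asve(full_name),
      mentions_asve(chemistry), mentions_asve(unrelated))
```

The chemistry sentence is accepted as well, which reproduces the wrong classification described above: a keyword test alone cannot tell the archive from other uses of the same abbreviation.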

  • Extractor

By further studying the citation format based on the findings from the previous stage, we noticed that the citations always end with one of the abbreviations f, n, b, c, cc or bb, followed by a number. The extractor searches a text document for strings that start with “ASVe” (in all the formats accepted by the classifier) and end with an abbreviation followed by one or more digits. This pattern is compiled as a regular expression and matched against each page of the text. The general expression below illustrates how the extractor fetches information from a citation.

“(A[\\.]?S[\\.]?V[\\.]?[e]?(.(\\()?Archivio.di.Stato.di.Venezia(\\))?)?,.+((([fbcn]|(bb)|(cc))\\.)|(filza)|(busta)).(.|(\\r?\\n))?[\\d-]+)”

At this stage, the extractor found 1546 citations in total in our collection of documents.
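For illustration, the extraction expression given above can be transcribed into Python syntax, with single backslashes and non-capturing groups. The transcription is assumed to be equivalent in intent; the function name `extract_citations` and the sample page are hypothetical, and the project's own implementation may differ.

```python
import re

# Transcription of the extraction expression: an archive prefix, a comma,
# any text on the same line, then a closing abbreviation and a number.
CITATION = re.compile(
    r"A\.?S\.?V\.?e?"                            # ASV, ASVe, A.S.V., ...
    r"(?:.\(?Archivio.di.Stato.di.Venezia\)?)?"  # optional full name
    r",.+"                                       # body of the citation
    r"(?:(?:[fbcn]|bb|cc)\.|filza|busta)"        # closing abbreviation
    r".(?:.|\r?\n)?[\d-]+"                       # separator and number
)

def extract_citations(page_text):
    """Return every substring of the page that matches the pattern."""
    return [match.group(0) for match in CITATION.finditer(page_text)]

# Hypothetical page text built around a citation quoted in this report.
page = "Cfr. ASV, Riformatori, f. 43, 6 febbraio 1781, c. 9, e altri documenti."
found = extract_citations(page)
print(found)
```

Because `.+` is greedy, the match runs to the last abbreviation-plus-number pair on the line, so the whole citation up to "c. 9" is returned as one string.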

  • Manual Extraction

Manually extracting citations from the documents is time consuming. Although we found around 100 PDF files for the program to process, extracting every ASVe-related citation from all of them by hand is not feasible with our limited time and workforce. We therefore randomly picked 20 of them to compare the results of the program with those of manual extraction.

Even at a professional level, manual data entry is far from perfectly accurate: a reader is very likely to miss some citations while going through the documents, or to make other entry errors. As a result, citations missed by hand may be discovered when comparing against the program's output. Conversely, a citation may be missed by both the reader and the program, which produces a silent miss that cannot be detected.

  • Program Extraction

Because some citations contain multiple references, the citation format becomes more complicated and some citations could not be extracted. For instance, in the reference shown below, f. stands at the end of a line and the number that follows it is on the next line.

ASVe, 1759a, b, c. Senato, Dispacci, Provveditori da Terra e da Mar, Provveditori Generali da Mar, f.
1002 (luglio 1759-maggio 1760): a) Lettera, Provveditore di Cefalonia, [Cefalonia] 20 giugno 1759;

Because of this end-of-line problem, the extractor was unable to extract this particular citation, as it did not match the regular expression in the program. The line break here is not a regular end-of-line character (such as “\n” on Unix or “\r\n” on Windows) but is encoded in another way. In the same document we also encountered problems with the encoding of spaces, for the same reason.
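One possible way to handle such input is to normalise line breaks and spaces before matching. The sketch below assumes that the unusual characters are Unicode line separators and Unicode spaces; the report does not say which characters were actually involved, and the function name `normalise_whitespace` and the sample string are hypothetical.

```python
import re

def normalise_whitespace(text):
    """Map unusual line breaks to "\n" and unusual spaces to " "."""
    # str.splitlines() recognises many line boundaries besides "\n" and
    # "\r\n", for example "\r", "\x85", "\u2028" and "\u2029".
    lines = text.splitlines()
    # On Python 3 strings, \s also matches Unicode spaces such as
    # "\xa0" (no-break space).
    lines = [re.sub(r"\s+", " ", line).strip() for line in lines]
    return "\n".join(lines)

# Hypothetical raw text: a no-break space and a Unicode line separator.
raw = "Provveditori Generali\u00a0da Mar, f.\u20281002 (luglio 1759)"
clean = normalise_whitespace(raw)
print(repr(clean))
```

After this step the text only contains the line breaks and spaces that a regular expression such as the extractor's is written for; whether wrapped lines should additionally be joined is a separate choice.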

  • Improving the citation extractor performance

By examining the mismatches between the manual and program extraction from the previous stage, we found that the regular expression did not include p. and reg., which contributed significantly to the poor performance of the extractor.

Another problem was related to the use of a semicolon between two citations, as shown below.

ASV, Governo, 1840-44, b. 6458, LIII 1/83; ASV, Presidio di Governo, 1840-44, b. 1087

Here, the two citations were returned as a single extraction output. We tried to make the regular expression treat the semicolon as the end of a citation, but the effect was not significant, or even made things worse, since the extractor then only stopped when it found a semicolon. We therefore decided to separate the citations linked by a semicolon manually in a post-processing step.
In addition, because the classifier ignores some “good” files, we decided not to filter the documents and to run the citation collector on them directly.
After the improvements mentioned above, the number of citations extracted increased from 1546 to 1731.
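The separation of semicolon-linked citations was done by hand in this project. As a hypothetical alternative, the same post-processing could be sketched in Python by splitting only at semicolons that are directly followed by a new archive abbreviation. The pattern and the function name `split_joined` are assumptions made for this sketch.

```python
import re

# Split at a semicolon only when a new archive abbreviation follows it,
# so that semicolons inside a single citation are left alone.
SPLIT_POINT = re.compile(r";\s*(?=A\.?S\.?V\.?e?\b)")

def split_joined(extracted):
    """Split one extraction output into its individual citations."""
    parts = SPLIT_POINT.split(extracted)
    return [part.strip() for part in parts if part.strip()]

# The joined output quoted in this section.
joined = ("ASV, Governo, 1840-44, b. 6458, LIII 1/83; "
          "ASV, Presidio di Governo, 1840-44, b. 1087")
separated = split_joined(joined)
for citation in separated:
    print(citation)
```

Splitting after extraction avoids changing the extraction expression itself, which is the step that did not work well according to the text above.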

  • Labelling the containing items of a citation

Because the format of ASVe citations varies with the source document in which they are cited, we decided to label the different items of a citation with a supervised machine learning technique, namely Conditional Random Fields (CRF). CRFs are a class of statistical modelling methods often applied in pattern recognition and machine learning, where they are used for structured prediction.

First, we manually annotated a corpus of 300 random samples from the citations collected so far. The tags used for the annotation are representative keywords for the content of a citation. They form a taxonomy defined by the following terms: Archive, Fond, Series, Folder, Registry, Box, Cartulation, Pagination, Foliation, Number, Year, Date, Conjunction, Notary, Period, Protocol, Other.

For instance, the citation [ASV, Riformatori, f. 43, 6 febbraio 1781, c. 9,] is annotated:

ASV, Archive
Riformatori, Fond
f. Folder
43, Folder
6 Date
febbraio Date
1781, Date
c. Cartulation
9 Cartulation

From this corpus, a set of representative boolean, numerical or categorical features is extracted, such as punctuation, shape, digits/Roman numerals, date, topology, and preceding/following words/tags. With these features, the CRF model [2] is trained and evaluated. After that, the model is used to label unknown sequences of items.
To present the results in a readable format, we reconstructed the annotated citations and wrote them to an XML file with their corresponding predicted tags.
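To make the kind of features listed above more concrete, the sketch below turns each token of a citation into a small feature dictionary. The feature names, the helper functions `word_shape` and `token_features`, and the exact feature set are hypothetical; the features actually used in the project are not specified beyond the list above.

```python
import re

ROMAN = re.compile(r"^[IVXLCDM]+$")
MONTHS = {"gennaio", "febbraio", "marzo", "aprile", "maggio", "giugno",
          "luglio", "agosto", "settembre", "ottobre", "novembre", "dicembre"}

def word_shape(token):
    """Map capitals to X, lower-case letters to x and digits to d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"\d", "d", shape)

def token_features(tokens, i):
    """Features of tokens[i], including its neighbouring words."""
    token = tokens[i]
    core = token.strip(".,;:()[]")  # token without surrounding punctuation
    return {
        "lower": token.lower(),
        "shape": word_shape(token),
        "has_digit": any(ch.isdigit() for ch in token),
        "is_roman": bool(ROMAN.match(core)),
        "is_month": core.lower() in MONTHS,
        "ends_with_punct": token[-1] in ".,;:",
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }

# The example citation from this section, split on whitespace.
tokens = "ASV, Riformatori, f. 43, 6 febbraio 1781, c. 9,".split()
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[5])
```

A sequence of such dictionaries, paired with the sequence of manually assigned tags, is the usual input format for CRF toolkits.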

Results

  • Performance of the classifier

To measure the performance of the classifier, we collected 100 files, of which 61 contain the keywords and 39 are dummy documents without any ASV keywords. With this tool we are able to eliminate unwanted files and shrink our data size.
The results show that the classifier achieves a precision of 91% and a recall of 70% in detecting the documents with useful citations.

  • Performance of the extractor

We compared the citations obtained by the extractor with those obtained by manual extraction for 20 randomly selected documents (all of which contain ASVe citations).

actual number of citations (manually found): 686
detection by using this algorithm: 577
wrong detection (compared manually): 2

The extractor achieves a precision of about 99% and a recall of 83.4%, where recall is the percentage of the actual relevant citations that were retrieved. With this performance, the result of the extractor is quite acceptable for a first-stage protocol.
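The two measures can be recomputed from the three counts listed above with the standard definitions of precision and recall [5]. The sketch assumes that every detection not counted as wrong is a correct one; the variable names are hypothetical.

```python
def precision_recall(true_positives, retrieved, relevant):
    """Standard definitions: TP / retrieved and TP / relevant."""
    return true_positives / retrieved, true_positives / relevant

manual_total = 686          # citations found manually
detected = 577              # citations returned by the extractor
wrong = 2                   # detections judged wrong on manual comparison
correct = detected - wrong  # assumption: all other detections are correct

precision, recall = precision_recall(correct, detected, manual_total)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```

With the counts as listed, this gives a precision of about 99.7% and a recall of about 83.8%, close to the reported figures; the small difference in recall may come from rounding or from a counting convention that is not stated in the report.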

  • CRF-model – labelling

The CRF model achieved a macro-average precision of 67.5%, a recall of 63.6% and an F1 of 64.3%, which is reasonable given that the training corpus contained only about 225 citations. In addition, the corpus contained some noisy information labelled with the Other tag, which confused the model. The per-item accuracy is 624 / 676 (92.31%). A more detailed overview of the CRF model's performance can be seen in the chart below.

Figure 2: CRF-model evaluation
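As a reminder of how macro-averaged scores relate to per-item accuracy, the sketch below computes both from a gold and a predicted tag sequence. The toy sequences are hypothetical and unrelated to the project's data, and the evaluation script actually used may differ in detail.

```python
def macro_scores(gold, predicted):
    """Return ((macro precision, macro recall, macro F1), accuracy)."""
    pairs = list(zip(gold, predicted))
    per_label = []
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_label.append((precision, recall, f1))
    # Macro average: every label counts equally, however rare it is.
    macro = tuple(sum(s[k] for s in per_label) / len(per_label)
                  for k in range(3))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    return macro, accuracy

# Hypothetical toy data using tags from the taxonomy above.
gold = ["Archive", "Fond", "Folder", "Folder", "Date", "Date"]
pred = ["Archive", "Fond", "Folder", "Date", "Date", "Date"]
macro, accuracy = macro_scores(gold, pred)
print(macro, accuracy)
```

Because every tag weighs the same in the macro average, a few poorly predicted rare tags can pull the macro scores well below the per-item accuracy, which is one plausible reading of the gap between the two kinds of figures reported above.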

  • Most cited fonds and series

To accomplish our mission of determining the most cited parts of the ASVe, we made histograms of the most frequently cited fonds and series in our collection of citations. As can be seen in the graphs below, the most cited fond is Arti (345 occurrences) and the most cited series is Supliche/S (62 occurrences).

Figure 3: The most cited fonds from the ASVe

Figure 4: The most cited series from the ASVe

Because many kinds of abbreviations are in frequent use, we also needed to determine exactly which fonds are meant by abbreviations such as IS, IRG, PdG, CC, PSM and DRt. Based on the publications used as our sources, we created a list of definitions of the abbreviations used in each document.
The abbreviations mentioned above have the following meanings:

IS = Inquisitori di Stato
IRG = Imperial Regio Governo
PdG = Presidio di Governo
CC = Camera di commercio
PSM = Procuratori di San Marco
DRt = Senato, Dispacci, Rettori

Consequently, DRt has to be counted as part of the Senato. With this, Senato becomes the third most cited fond, with 118 citations in total.
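The expansion of abbreviations and the counting behind the histograms can be sketched as follows. The sample labels are hypothetical, and a single global table is a simplification, since the definitions were collected per source document.

```python
from collections import Counter

# Abbreviations listed above, mapped to the fond they are counted under.
ABBREVIATIONS = {
    "IS": "Inquisitori di Stato",
    "IRG": "Imperial Regio Governo",
    "PdG": "Presidio di Governo",
    "CC": "Camera di commercio",
    "PSM": "Procuratori di San Marco",
    "DRt": "Senato",  # "Senato, Dispacci, Rettori" is counted under Senato
}

def count_fonds(fond_labels):
    """Expand known abbreviations, then count occurrences per fond."""
    return Counter(ABBREVIATIONS.get(label, label) for label in fond_labels)

# Hypothetical fond labels, as they might come out of the CRF labelling.
sample = ["Arti", "DRt", "Senato", "IS", "Arti", "DRt"]
counts = count_fonds(sample)
print(counts.most_common())
```

Merging DRt into Senato before counting is what changes the ranking of the fonds, as described above.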

Future directions

The citation extraction can be further improved by using a bigger collection of documents and by refining the regular expression to avoid mismatches. Moreover, a larger (and cleaned) annotated corpus would make the training of the CRF model more reliable. With this work on the system, the most cited sections of the ASVe can be identified with greater accuracy, which supports the digitization of the archive for the spread of knowledge.

References

[1] Archivio di Stato di Venezia. (2013, November 29). In Wikipedia, The Free Encyclopedia. http://it.wikipedia.org/wiki/Archivio_di_Stato_di_Venezia
[2] CRF-suite, http://www.chokkan.org/software/crfsuite/
[3] ParsCit, http://wing.comp.nus.edu.sg/parsCit/#d
[4] FreeCite, http://freecite.library.brown.edu/welcome
[5] Precision and recall, http://en.wikipedia.org/wiki/Precision_and_recall
[6] Storia di Venezia, http://storiadivenezia.net/sito/index.php?option=com_content&view=article&id=88&Itemid=66
[7] Università Ca' Foscari Venezia Online, http://dspace.unive.it/search?scope=%2F&query=asv&rpp=10&sort_by=0&order=DESC&submit=Go
[8] EC project “Review of Historical Seismicity in Europe” (RHISE) 1989-1993
[9] Support for referencing archival materials, Georgia State University Library, http://library.gsu.edu/search-collections/special-collections-archives/using-special-collections-and-archives/citing-archival-materials/
[10] Support for referencing archival materials, Dalhousie University Library, http://dal.ca.libguides.com/content.php?pid=184574&sid=1551252
[11] Technical documentation, Snowtide Informatics, http://snowtide.com/help
[12] Sula C. & Miller M., Citation studies in the humanities, Abstract, Digital Humanities 2013, http://dh2013.unl.edu/abstracts/ab-353.html