All posts by honeygreenteax

Process Update 1

Welcome back, fellow citation crawlers!

Our overall objective is to identify which sections of the Archive of Venice are most frequently used and cited by scholars. To detect such citations and write programs that recognize their forms, we first need to understand how papers and books cite the archive. Thus, as described in the milestones of our last post, the first task for these few weeks is to collect, by hand and from material accessible on open websites, the different citation formats and the meaning of the keywords they contain, so that we can build regular expressions from them.

What we have done so far

To find books that contain citations from the Archive, we began by looking for websites that let us download documents, because Google Books does not let us obtain the electronic version of a book free of charge. We found several websites, the most useful being [1] and [2], but we still need more sources before we can carry out a realistic case study of the citation problem.

Fig.1 Source of documents: Storia.di Venezia [1]

Fig.2 Source of documents: Universita Ca’Foscari Venezia Online [2]

We have collected, by hand, a large number of citations and tried to identify rules and keywords that apply only to ASVe citations.

Fig.3 List of citations from “Review of Historical Seismicity in Europe” (RHISE) 1989-1993 [3]

To uncover the format of ASVe citations used across the documents we found, we compared them with the citation systems that scholars generally use as a standard today. These show which key information a citation needs and how a collection is quoted as a reference. However, it is not easy to define one citation system for unpublished material that covers all archival citations, because some archives specify their own preferred formats for documents used as references [4, 5].

For instance, the archive of the Georgia State University Library asks that manuscripts be cited in the following format:

“ [item], [folder title], [series title (if any)], [collection name], [accession or manuscript number], [name of curatorial unit*], Special Collections and Archives, Georgia State University, Atlanta.” [4]

Dalhousie University Archives and Special Collections, in contrast, specifies its own preferred citation format:

“[Identification of item], [Name of fonds or collection], [Fonds or collection number], Box [box number], Folder [folder number], Dalhousie University Archives and Special Collections, Halifax, Nova Scotia, Canada.” [5]

Their formats do not completely match our examples of ASVe citations, and we cannot simply declare that the ASVe system should resemble them. Still, by combining them with the information from the Archive of Venice, we can conclude with some confidence that a reference lists its elements in this sequence:

1. Item (initial of the archive)
2. Folder title (year of issue)
3. Series title (position of writers or recipients)
4. The format (document type)
5. The collection name (accession position of responsible person)
6. Document number (shelf mark)
7. Place of issue
8. Page range (the number of the box or folder that contains the item)

Based on this sequence, we drew up in Fig.4 the format that scholars quoting from the Archive of Venice might use.

Fig.4 Brief summary of the ASVe citation format

Using the documents we found, we have developed a classifier that checks whether a document contains citations to the Archive of Venice by matching expressions such as ASV, ASVe or Archivio di Stato di Venezia. Matching is case insensitive to cover all spelling variants. The algorithm stops scanning the document as soon as one match is found and classifies the document accordingly.
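A minimal sketch of such a classifier is shown here. It is not our actual implementation: the class and method names are hypothetical, the pages are assumed to arrive as already extracted strings, and the word boundaries around the abbreviations are an extra assumption to keep "ASV" from matching inside longer words.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ArchiveCitationClassifier {

    // Case-insensitive match for ASV, ASVe or the full archive name.
    // The \b word boundaries are an assumption of this sketch.
    private static final Pattern ARCHIVE_MENTION = Pattern.compile(
            "\\b(?:ASVe?|Archivio\\s+di\\s+Stato\\s+di\\s+Venezia)\\b",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    /** Returns true as soon as one page mentions the archive. */
    static boolean citesArchive(Iterable<String> pages) {
        for (String page : pages) {
            if (ARCHIVE_MENTION.matcher(page).find()) {
                return true; // stop scanning at the first match
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical page texts, standing in for text extracted from a PDF.
        List<String> pages = Arrays.asList(
                "Introduction to the history of the lagoon",
                "cf. ASVe, Senato, Terra, filza 12");
        System.out.println(citesArchive(pages));
    }
}
```

Taking the pages as an Iterable fits page-by-page reading: the loop never needs the whole document in memory and can return before later pages are extracted.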

In addition to this classifier, we have written a very lightweight citation detector that matches citations starting with “ASVe” and followed by five items separated by commas.
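As a rough illustration of such a detector, the sketch below looks for “ASVe” followed by exactly five comma-separated items. The names are hypothetical, and it assumes that an item contains no comma, semicolon or line break, so the fifth item ends at the next such character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LightCitationDetector {

    // "ASVe", then five items, each introduced by a comma.
    // Assumption: an item holds no comma, semicolon or line break.
    private static final Pattern CITATION = Pattern.compile(
            "\\bASVe\\s*,\\s*[^,;\\r\\n]+(?:,\\s*[^,;\\r\\n]+){4}");

    /** Returns every substring of the text that looks like an ASVe citation. */
    static List<String> findCitations(String text) {
        List<String> citations = new ArrayList<>();
        Matcher matcher = CITATION.matcher(text);
        while (matcher.find()) {
            citations.add(matcher.group().trim());
        }
        return citations;
    }

    public static void main(String[] args) {
        // Hypothetical example text, not taken from a real document.
        String text = "See ASVe, Senato, Terra, filza 12, c. 3, 1570; and later notes.";
        for (String citation : findCitations(text)) {
            System.out.println(citation);
        }
    }
}
```

A citation with fewer than five items is skipped, which keeps the detector light but also means that shorter citation forms need their own patterns.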

The application is implemented in Java and uses the PDFTextStream library to extract the text from a PDF file [6]. This library can also read a document page by page, which avoids the memory problems that might occur with very large files.

Problems with citation format

Today there are different citation styles for books and other literature, such as APA, MLA and Chicago, and the same holds for citations of the Archivio di Stato di Venezia. During our search on the internet, we discovered that there is no well-defined or systematic citation style for the ASVe. For example, in one source the year of issue comes right after ASVe, while in another it is placed at the end or not mentioned at all. Some authors abbreviate the Archivio di Stato di Venezia as ASV, while others use ASVe.

Unlike a regular publication, the ASVe itself is a massive collection that contains diverse material, ranging from the records of occupational corporations to family archives. Each variation in the citation format multiplies the number of possible combinations, which makes deriving a citation signature from the references harder than it seems. More importantly, the limited number of references discussed above makes it even harder to establish a reliable and effective citation signature. With so much variation in such a small collection of documents, generalisation becomes questionable. In this case, a simple citation signature system that covers a large share of the ASVe sources is needed before developing any regular expression.

What we plan to do in the next phase

The next step is to create the regular expressions for the citation formats collected so far. If necessary, we will build specific rules for different institutions, like the following regex for citations of the senate (“Senato”):

ASVe, .*Senato,.*(filza|busta|f|b|bb)\. \d+, (n\. \d+)?

The regular expressions will be used by our program to extract the citations. After that, we will have to extract from each citation the useful information and store it in a structured manner.
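To give an idea of what this extraction step could look like, here is a sketch that uses named groups to turn a Senato citation into a small structured object. It is a simplified variant of the regex above, not the final one: the class name, the field names and the optional series part are our assumptions for illustration, and it makes the period after the unit optional.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SenatoCitationParser {

    /** One parsed citation; a field is null when the citation omits it. */
    static final class Citation {
        final String series;  // text between "Senato" and the unit, e.g. "Terra"
        final String unit;    // filza, busta, f, b or bb
        final int number;     // number of the filza or busta
        final Integer item;   // number after "n.", if present

        Citation(String series, String unit, int number, Integer item) {
            this.series = series;
            this.unit = unit;
            this.number = number;
            this.item = item;
        }
    }

    // Simplified, assumed pattern: "ASVe, Senato, [series,] unit number[, n. item]".
    private static final Pattern SENATO = Pattern.compile(
            "ASVe,\\s*Senato,\\s*"
            + "(?:(?<series>[^,]+),\\s*)?"
            + "(?<unit>filza|busta|bb|f|b)\\.?\\s*(?<number>\\d{1,9})"
            + "(?:,\\s*n\\.\\s*(?<item>\\d{1,9}))?");

    /** Extracts every Senato citation found in the text. */
    static List<Citation> parse(String text) {
        List<Citation> result = new ArrayList<>();
        Matcher m = SENATO.matcher(text);
        while (m.find()) {
            String series = m.group("series");
            String item = m.group("item");
            result.add(new Citation(
                    series == null ? null : series.trim(),
                    m.group("unit"),
                    Integer.parseInt(m.group("number")),
                    item == null ? null : Integer.valueOf(item)));
        }
        return result;
    }
}
```

Calling SenatoCitationParser.parse on the text of a page returns one Citation per match, which can then be counted per series or written to a table.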

Compared with the plan from the previous blog post, we have to admit that we cannot follow it strictly: we found only a few sources, that is, open websites with references to the Archive of Venice, so building an automatic crawling tool would not be feasible.


[1] Storia.di Venezia

[2] Universita Ca’Foscari Venezia Online

[3] EC project “Review of Historical Seismicity in Europe” (RHISE) 1989-1993

[4] Support for referencing archival materials by library of Georgia State University

[5] Support for referencing archival materials by library of Dalhousie University

[6] Technical documentation, Snowtide Informatics