Citations of the Archives (D3) – Project Overview

Introduction

The Archivio di Stato di Venezia (ASVe) is a collection of documents covering the history of Venice. It is the largest archive in Italy, spanning 70 km [1] of shelves, so digitizing the whole collection would be time-consuming and physically demanding.

To streamline digitization and the study of the archive, it would be more convenient to begin with the documents most cited by scholars rather than creating digital copies in chronological order. This project would therefore be beneficial both in reducing the required workload and in improving accessibility.

Figure 1: Archivio di Stato di Venezia

Project Objectives & Deliverables

The goal of the project is to identify the sections of ASVe that are most cited by scholars.

First, we will identify references to the archive within a single document and store their occurrences in a database.

Next, we will use a web crawler to find further documents that may contain citations of ASVe, and keep track of them with a specialized indexing tool so that they can be searched easily.

In our project, we plan to achieve the following deliverables:

  • a program that searches for references in a given document
  • a database recording the occurrences of the citations and the books or documents in which they are found
  • an index of the retrieved documents
  • statistics on the most frequently cited sections of the ASVe archive

Methodology

Figure 2: Design Diagram for the Project

1. Collecting regular expressions of the citations

In order to identify citations, we need to understand the rules for naming the documents in the archive. For example, the numbering system used in the archive section “Archivi propri degli ambasciatori” looks like this:
b. 1: nn. 1-4
b. 2: n. 5
b. 3: nn. 6-7
b. 4: nn. 8-10
b. 5: n. 11
………….
where b. is the abbreviation of the Italian word ‘busta’ (= envelope), n. stands for numero (= number), and nn. is the plural, numeri.

After collecting a set of different citations, also by looking into documents from Google Scholar and Google Books, we will extract and build several regular expressions (regexes) that recognize citations in a given document using a general rule.

A short extract of ASVe citations from a book is shown in Appendix 2.

For instance, from such references we can extract a regex like:
ASVe.+ b\. [0-9]+.*
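
As a minimal sketch of this step in Python (the sample strings below are illustrative, not quotations from a real book, and the pattern is a simplified placeholder for the regexes we will actually collect):

import re

# Simplified pattern for citations of the form "ASVe, <fondo>, b. <number> ..."
CITATION_RE = re.compile(r"ASVe.+?b\.\s*[0-9]+.*")

samples = [
    "ASVe, Archivi propri degli ambasciatori, b. 5, n. 11.",
    "the records (ASVe, b. 181) show that ...",
    "A sentence with no archival citation at all.",
]

for text in samples:
    match = CITATION_RE.search(text)
    if match:
        print("citation found:", match.group(0))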

2. Creating the database for the references

A database with two tables will be created to store our search results. The database will be populated while documents are processed by the program developed in the next step.

The first table will keep the correspondence between the references found in a book or other material that cites ASVe and that book or material. Specific information about the material, such as the title, author, and URL, will be recorded for future use.

The other table will store the ASVe citations found in the reference materials, together with their total number of occurrences.

The linkage between the two tables is illustrated in the entity relationship diagram below:

Figure 3: Entity Relationship Diagram of the Database
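
As a minimal sketch, assuming SQLite as the backing store (the table and column names below are our working assumptions; the final schema will follow Figure 3), the two tables could be created like this:

import sqlite3

conn = sqlite3.connect("asve_citations.db")

# Reference materials (books, articles, web pages) that cite ASVe.
conn.execute("""
    CREATE TABLE IF NOT EXISTS material (
        id     INTEGER PRIMARY KEY,
        title  TEXT,
        author TEXT,
        url    TEXT
    )""")

# Individual ASVe citations found in those materials.
conn.execute("""
    CREATE TABLE IF NOT EXISTS citation (
        id          INTEGER PRIMARY KEY,
        material_id INTEGER REFERENCES material(id),
        text        TEXT,    -- the citation as matched, e.g. "ASVe ... b. 181"
        occurrences INTEGER  -- total occurrences in that material
    )""")

conn.commit()
conn.close()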

3. Creating a program to detect references in given documents

Based on the regular expressions extracted in step 1, we will build the model that matches the citations. To search for all references to the ASVe archive in a given PDF document, we will implement a program in a high-level programming language (Python, Java, or C). The program will parse the text extracted from the input file, find the matching expressions, and store the information in the database.
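
A minimal Python sketch of this pipeline, assuming the pdfminer.six library for text extraction (the final language and library are still to be decided, and index_pdf is a hypothetical helper name):

import re
import sqlite3
from pdfminer.high_level import extract_text  # pip install pdfminer.six

CITATION_RE = re.compile(r"ASVe.+?b\.\s*[0-9]+.*")

def index_pdf(path, material_id, db="asve_citations.db"):
    # Extract the plain text of the whole PDF and count citation matches.
    text = extract_text(path)
    counts = {}
    for match in CITATION_RE.finditer(text):
        citation = match.group(0).strip()
        counts[citation] = counts.get(citation, 0) + 1
    # Store one row per distinct citation with its occurrence count.
    conn = sqlite3.connect(db)
    conn.executemany(
        "INSERT INTO citation (material_id, text, occurrences) VALUES (?, ?, ?)",
        [(material_id, c, n) for c, n in counts.items()],
    )
    conn.commit()
    conn.close()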

4. Finding tools for crawling, searching and indexing

Here, we will search online for automatic web crawlers and indexing tools. An automatic web crawler, also known as a search engine spider, is a program that most search engines use to discover new content on the Internet. By building an index over what the crawler collects, we can perform specific searches on the gathered information.

At this moment, we are considering starting with the Nutch crawler and the Solr indexer, because we have some prior experience with these tools.

After reviewing and documenting more crawling and indexing tools, we will select the most suitable ones, preferring tools known to be robust and easy to use.

5. Configuring the selected tools

After selecting the tools, we need to configure them according to their documentation to perform the crawling, indexing, and searching in the later stages.

6. Crawling, indexing and searching

The purpose of crawling is to collect the many documents, materials, and books which may contain references to the ASVe archive. We will start the crawler using the Google Books website as the initial seed. The indexing will be done in parallel with the crawling in order to facilitate searching through the collected documents.
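
Assuming we keep Solr as the indexer (see step 4), searching the index can go through Solr's HTTP API. A minimal Python sketch, where the core name asve_docs is a placeholder and the field names (content, title, url) depend on how Nutch is configured to feed Solr:

import requests

# Ask a local Solr core for documents whose content mentions "ASVe".
resp = requests.get(
    "http://localhost:8983/solr/asve_docs/select",
    params={"q": 'content:"ASVe"', "rows": 10, "wt": "json"},
)
for doc in resp.json()["response"]["docs"]:
    print(doc.get("url"), doc.get("title"))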

7. Summarization of the results and visualizing them

Finally, to make the collected data visible, we will summarize the results and create diagrams or plots. Among the various visualization methods, we will determine which one is the most suitable for our data.
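
For example, a bar chart of the most frequently cited sections can be produced directly from the database. A minimal sketch with matplotlib, assuming the schema sketched in step 2:

import sqlite3
import matplotlib.pyplot as plt

# Aggregate total occurrences per citation and keep the top 10.
conn = sqlite3.connect("asve_citations.db")
rows = conn.execute(
    "SELECT text, SUM(occurrences) AS total FROM citation "
    "GROUP BY text ORDER BY total DESC LIMIT 10"
).fetchall()
conn.close()

labels = [text for text, _ in rows]
totals = [total for _, total in rows]

plt.barh(labels[::-1], totals[::-1])  # largest at the top
plt.xlabel("Number of citations")
plt.title("Most cited ASVe sections")
plt.tight_layout()
plt.show()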

Planning

Milestones

The following table illustrates our plan for the next semester. 
Table 1: D3 Milestones

References

1.  Archivio di Stato di Venezia. (2013, November 29). In Wikipedia, The Free Encyclopedia. Retrieved 18:58, November 29, 2013, from http://it.wikipedia.org/wiki/Archivio_di_Stato_di_Venezia

Appendix

1.  An example of a Google Books search based on the archive reference b. 181

2.  An example of ASVe citations from a book