Based on the regular expression format we proposed last time and the citations collected so far, we have built a collection of documents that may contain citations. We then extracted the citations from the text of the documents classified as likely sources of citations.
For this purpose, we wrote a program that extracts the references with high accuracy (approximately 84%). To check the accuracy of the program, we compared its results against manual extraction.
What have we achieved so far?
Performance of the classifier
The classifier works by searching the source document for ASVe-related keywords to decide whether it contains citations or not. Since it only detects keywords such as ASVe, A.S.V. and Archivio di Stato di Venezia, as illustrated in the expression below, a file containing ASV as an abbreviation of “anodic stripping voltammetry” was also accepted as a candidate, which is a misclassification.
“((A|a)(R|r)(C|c)(H|h)(I|i)(V|v)(I|i)(O|o) (D|d)(I|i) (S|s)(T|t)(A|a)(T|t)(O|o) (D|d)(I|i) (V|v)(E|e)(N|n)(E|e)(Z|z)(I|i)(A|a))|(ASV[e]?)”
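The keyword check can be sketched in a few lines of Python. This is an illustrative reconstruction, not the actual program; in particular, the A.S.V. variant is added here on the assumption that it is matched the same way as the other keywords.

```python
import re

# Case variants of the full archive name (the (A|a)(R|r)... alternations
# in the expression above are equivalent to a case-insensitive match).
FULL_NAME = re.compile(r"archivio di stato di venezia", re.IGNORECASE)
# The abbreviations are matched case-sensitively, as in the expression above.
ABBREV = re.compile(r"ASVe?|A\.S\.V\.")

def contains_asve_keyword(text):
    """Return True if the document text mentions an ASVe-related keyword."""
    return bool(FULL_NAME.search(text) or ABBREV.search(text))
```

Note that `contains_asve_keyword("ASV (anodic stripping voltammetry)")` also returns True, which is exactly the false-positive case described above.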
To evaluate the classifier, we collected 100 files: 61 of them contain the keywords and 39 are dummy documents without any ASV keywords. With this tool, we are able to eliminate unwanted files and shrink our data size.
The results show that the classifier achieves a precision of 91% and a recall of 70% in detecting documents with useful citations.
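The two metrics follow directly from a confusion matrix. The TP/FP/FN split in the example below is hypothetical, chosen only to be consistent with the reported 91%/70%; the raw counts are not given above.

```python
def precision_recall(tp, fp, fn):
    """precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# One hypothetical split over the 61 keyword-bearing files:
# 43 true positives, 4 false positives, 18 false negatives.
p, r = precision_recall(tp=43, fp=4, fn=18)  # p ~ 0.91, r ~ 0.70
```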
Performance of the extractor
By further studying the citation format based on the findings from the previous stage, we noticed that citations always end with one of the abbreviations f, n, b, c, cc, or bb, followed by a number. The extractor works by searching a text document for strings that start with “ASVe” (in all the variant forms the classifier accepts) and end with an abbreviation followed by one or more digits. This pattern is compiled as a regular expression and matched against each page of the text. The general expression beneath illustrates how the extractor fetches information from the citation.
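A possible shape of such a pattern, sketched in Python; the exact expression used by the program is not reproduced here, so the non-greedy middle part and the optional dot after the abbreviation are assumptions.

```python
import re

CITATION = re.compile(
    r"(?:ASVe?|[Aa]rchivio di [Ss]tato di [Vv]enezia)"  # citation start
    r".*?"                                              # body (non-greedy)
    r"\b(?:cc|bb|f|n|b|c)\.?\s*\d+"                     # abbreviation + number
)

def extract_citations(page_text):
    """Return all citation-like strings found on one page of text."""
    return CITATION.findall(page_text)
```

For example, `extract_citations("ASVe, Senato, Dispacci, f. 102")` returns the whole citation as one string.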
Currently, the extractor has found 1546 of citations in total in our collection of documents.
We have performed a comparison between the citations obtained from the extractor and from manual extraction for 20 randomly selected documents (that contain ASVe citations).
actual number of citations (manually found) :
detection by using this algorithm :
wrong detection (compared by hands) :
The extractor achieves a precision of about 99% with a recall of 83.4%, i.e. it retrieves 83.4% of the truly relevant citations. With this performance, the extractor is quite acceptable as a first-stage tool.
What are the problems encountered?
Manual extraction of citations from the documents is time-consuming. While we have found around 100 PDF files for citation extraction by the program, extracting all ASVe-related citations from all the files by hand is not feasible given the limited time and manpower we have. We therefore randomly picked 20 of them for comparing the program's results with manual extraction.
Even at a professional level, the accuracy of manual data entry is not perfect: there is a high chance of missing some citations while reading through the documents, or of making other human entry errors. As a result, such missed citations may only be discovered when comparing against the program's output. Conversely, a citation might be missed by both the program and the human reader, causing a silent, undetectable miss.
Because some citations contain multiple references, the citation format becomes more complicated and some of them could not be extracted. For instance, in the reference shown below, the abbreviation f. sits at the end of one line while the number that follows it is on the next line.
ASVe, 1759a, b, c. Senato, Dispacci, Provveditori da Terra e da Mar, Provveditori Generali da Mar, f.
1002 (luglio 1759 - maggio 1760): a) Lettera, Provveditore di Cefalonia, [Cefalonia] 20 giugno 1759;
Because of this endline problem, the extractor was unable to extract this particular citation, since it does not match the regular expression in the program. In fact, this is not a regular endline character (such as “\n” on Unix or “\r\n” on Windows) but is encoded in another way. In the same document, we also encountered a problem with the encoding of spaces, for the same reason.
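One workaround is to normalise the text before matching: map unusual line separators and space characters to plain ones, then join broken lines. Which exact characters appear in these PDFs is an assumption here; U+2028/U+2029 and non-breaking spaces are common culprits in PDF-extracted text.

```python
import re

def normalize_text(text):
    """Map exotic separators to plain ones, then join broken lines."""
    # Unicode line/paragraph separators and other vertical whitespace -> \n
    text = re.sub(r"[\u2028\u2029\u000b\u000c\u0085]", "\n", text)
    # Non-breaking / narrow spaces -> ordinary space
    text = re.sub(r"[\u00a0\u2007\u202f]", " ", text)
    # Join lines so "f.\n1002" becomes "f. 1002" for the citation regex
    return re.sub(r"\s*\n\s*", " ", text)
```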
Classification of Citations based on institution
During our manual extraction, we discovered that some citations reference the same institutions. To cope with the complex citation formats used across various works, we came up with the idea of classifying our collection of citations by institution. In this way, we can build a more simplified system for citations and determine which institutions were referenced most often.
We developed a program to detect the institutions relevant to our citation collection. Contrary to our expectations, there were more than 100 institutions among our source documents, suggesting that this approach might not be optimal for gaining a rough picture of the citations through categorization.
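Ranking institutions by citation count is straightforward once the institution name has been isolated. The sketch below assumes the extraction stage yields (institution, citation) pairs; that data shape, and the "BNM" label in the example, are assumptions for illustration, not the program's actual output.

```python
from collections import Counter

def most_cited_institutions(citations, top=10):
    """Rank institutions by how often they appear in the citation list."""
    counts = Counter(institution for institution, _ in citations)
    return counts.most_common(top)

# Hypothetical pairs as they might come out of the extraction stage:
pairs = [("ASVe", "f. 102"), ("ASVe", "b. 12"), ("BNM", "c. 3")]
```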
Our plan for the next stage
In the coming weeks, we would like to improve our program's detection of citations in the source documents, for example by solving the endline problem and reducing missed extractions to improve recall. At the same time, we are planning to structure the information from the extracted citations into fields such as year, institution, person responsible for the document, type of document, and page information, depending on the source and the way the author referred to it, in order to better understand the archive and to enhance the categorization system.
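A first sketch of that field structuring, based on the example citation quoted earlier. The field names and the splitting heuristics (institution = first comma-separated token, year = first four-digit number with an optional letter suffix, folio = the number after f.) are assumptions that will need refinement per source.

```python
import re

def structure_citation(citation):
    """Split one extracted citation into rough fields (sketch only)."""
    parts = [p.strip() for p in citation.split(",")]
    fields = {"institution": parts[0] if parts else None}
    # Years like "1759" or "1759a" (letter suffix for multiple refs per year)
    year = re.search(r"\b1[0-9]{3}[a-z]?\b", citation)
    fields["year"] = year.group(0) if year else None
    # Folio number after the abbreviation "f."
    folio = re.search(r"\bf\.\s*(\d+)", citation)
    fields["folio"] = folio.group(1) if folio else None
    return fields
```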
Precision and recall: http://en.wikipedia.org/wiki/Precision_and_recall