Objective at this stage
In the previous weeks, we worked on the classification and extraction of citations based on the root-based system. At this stage, we tried to use machine learning algorithms that are already implemented to segment the current ASVe citation extracts into parts according to their underlying series/groups (e.g. collection, document, envelope, etc.).
What we have done
Study of different open-source CRF (Conditional Random Fields) toolkits
With CRF toolkits, citations are parsed by first training the program on citation features/elements from a training set; the program then extracts information according to the constraints it learned during training. A manually annotated test set can then be compared with the program's results. In this way, we can parse citation information more precisely despite the differences in reference format between source documents.
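The comparison between a manual test set and the program output can be sketched as below. This is only an illustration, assuming that both label sequences are aligned token by token; the function name `per_label_scores` and the example labels are hypothetical and do not come from any of the toolkits.

```python
from collections import Counter


def per_label_scores(gold, predicted):
    """Return {label: (precision, recall)} for two aligned label sequences."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token by token")

    true_pos, pred_count, gold_count = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        gold_count[g] += 1
        pred_count[p] += 1
        if g == p:
            true_pos[g] += 1

    scores = {}
    for label in set(gold_count) | set(pred_count):
        # Guard against labels that never occur on one of the two sides.
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / gold_count[label] if gold_count[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Hypothetical labels for a four-token citation.
manual = ["author", "author", "title", "year"]
program = ["author", "title", "title", "year"]
scores = per_label_scores(manual, program)
```

Reporting precision and recall per label, rather than one overall accuracy, makes it easier to see which fields a parser tends to confuse.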
There are various open-source applications that parse document citations, such as ParsCit (an open-source CRF reference string and logical document structure parsing package) and FreeCite.
Results obtained with ParsCit:
As can be observed from these examples, ParsCit confuses the author with the institution name because they are positioned one after the other, so both are grouped in the authors category. The booktitle is also wrongly categorized (see example 3); in our case it should be ASV(e). Another mistake is the page number, which is linked to the year or to another number inside the citation. The same issue occurs in the volume field.
Results obtained with FreeCite:
The key terms used by this tool are a bit different: Authors, Title, Journal, Booktitle, Editor, Volume, Publisher, Institution, Location, Number, Pages, Year, Tech, Note.
The categorization of the same citations with FreeCite shows that it makes the same confusions with the title, the authors and the numbers. The results are quite similar, except for the first citation, where this tool has poorer recognition results and matches the author with a number.
The first tool that we tried, ParsCit, allows a model to be trained on the format of our own citations, but we did not manage to use it because of compilation errors such as:
BEGIN failed--compilation aborted at parscit/bin/../lib/Omni/Omnipage.pm line 8.
Improving the citation extractor performance
After examining the mismatches between the manual and the program extraction from the previous stage, we found that the extractor did not include p. and reg. in its regular expression, which contributed significantly to its poor performance.
Another problem was related to the use of a semicolon between two citations, as shown below.
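The extractor's actual regular expression is not reproduced here. As a minimal sketch of the kind of fix involved, assume a hypothetical pattern that matches an abbreviation followed by a number; only p. and reg. are named in the text, and the other abbreviations in the list are assumptions.

```python
import re

# Hypothetical abbreviation list: "p" and "reg" are the two reported as
# missing; "b", "f" and "c" are assumed from the citation examples.
ABBREVIATIONS = ["b", "f", "c", "p", "reg"]

# An abbreviation at a word boundary, a period, optional spaces, a number.
LOCATOR = re.compile(r"\b(?:" + "|".join(ABBREVIATIONS) + r")\.\s*\d+")


def find_locators(citation):
    """Return every 'abbreviation. number' locator found in a citation."""
    return LOCATOR.findall(citation)


print(find_locators("ASV, Riformatori, f. 43, 6 febbraio 1781, c. 9"))
# A made-up citation, used only to exercise the two added abbreviations.
print(find_locators("ASV, Senato, reg. 12, p. 3"))
```

Keeping the abbreviations in a list rather than hard-coding them in the pattern makes it easier to add any further abbreviation that turns up in later mismatches.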
ASV, Governo, 1840-44, b. 6458, LIII 1/83; ASV, Presidio di Governo, 1840-44, b. 1087
Here, the two citations were returned as a single extraction output. We tried to add the semicolon to the regular expression as an end-of-citation marker. However, this brought little improvement and sometimes made things worse, because the extraction then stopped only when it found a semicolon. We therefore decided to post-process the output manually to separate the citations linked by a semicolon.
In addition, because the classifier ignores some “good” files, we decided not to filter the documents and to use the citation-collector directly.
After the above-mentioned improvements, the number of citations extracted increased from 1546 to 1731.
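Part of this separation could perhaps be automated. The sketch below is not what we used; it assumes that a new citation always begins with the archive abbreviation ASV or ASVe right after the semicolon, which is not guaranteed for every extract. The function name `split_citations` is hypothetical.

```python
import re

# Split on a semicolon only when it is followed by the archive abbreviation,
# so that semicolons inside a single citation are left untouched.
SEPARATOR = re.compile(r";\s*(?=ASVe?\b)")


def split_citations(extract):
    """Split one extraction output into individual citations."""
    return [part.strip() for part in SEPARATOR.split(extract) if part.strip()]


extract = ("ASV, Governo, 1840-44, b. 6458, LIII 1/83; "
           "ASV, Presidio di Governo, 1840-44, b. 1087")
for citation in split_citations(extract):
    print(citation)
```

Using a lookahead keeps the abbreviation at the start of the second citation instead of consuming it as part of the separator.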
What we plan to do next
In the following two weeks, we plan to train a CRF model with a format specialized for the ASVe citations. We will create the training dataset from the citations collected so far, annotating each token with its corresponding tag. The tags are representative keywords for the content of a citation and form a taxonomy with the following terms: Archive, Fond, Series, Folder, Registry, Box, Cartulation, Pagination, Foliation, Number, Year, Date, Conjunction, Notary, Period, Protocol, Other. For instance, the citation "ASV, Riformatori, f. 43, 6 febbraio 1781, c. 9" will be annotated:
From this corpus, several representative features, such as punctuation, shape, topology and preceding/following words, will be extracted to train and evaluate the model. Finally, we plan to convert the annotated citations produced by the evaluation into a more readable format such as XML.
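The feature extraction has not been implemented yet. The sketch below only illustrates what punctuation, shape and context features could look like for one token. The tokenizer, the feature names and the use of the token position as a stand-in for "topology" are all assumptions.

```python
import re
import string


def tokenize(citation):
    """Split a citation into words (keeping a trailing period) and punctuation."""
    return re.findall(r"\w+\.?|[^\w\s]", citation)


def word_shape(token):
    """Collapse a token into a shape: X upper, x lower, d digit, others kept."""
    shape = []
    for ch in token:
        if ch.isupper():
            code = "X"
        elif ch.islower():
            code = "x"
        elif ch.isdigit():
            code = "d"
        else:
            code = ch
        # Merge runs of the same class, so "1781" becomes "d".
        if not shape or shape[-1] != code:
            shape.append(code)
    return "".join(shape)


def token_features(tokens, i):
    """Build a feature dictionary for the token at index i."""
    tok = tokens[i]
    return {
        "token": tok.lower(),
        "shape": word_shape(tok),
        "is_punct": bool(tok) and all(ch in string.punctuation for ch in tok),
        "ends_with_period": tok.endswith("."),
        "position": i,  # assumed stand-in for the "topology" feature
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


tokens = tokenize("ASV, Riformatori, f. 43, 6 febbraio 1781, c. 9")
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Each dictionary would then be written out in whatever input format the chosen CRF toolkit expects, with one row of features per token.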
 ParsCit, http://wing.comp.nus.edu.sg/parsCit/#d
 FreeCite, http://freecite.library.brown.edu/welcome