Catholic pamphlets and the "Catholic Portal"

Nov

This posting outlines a possible workflow for getting digitized versions of Notre Dame's Catholic pamphlets into the "Catholic Portal".

The problem

The University of Notre Dame owns a significant number of Catholic pamphlets. These materials have been cataloged and denoted as destined for the "Portal" in their MARC records with the letters "CRRA" in field 590$u.

The University's library wants to digitize these materials, make the resulting PDF files freely available on the Web, apply optical character recognition against the PDF files, and support a text mining interface against the result. Bits and pieces of this work have already been done. The problem is gluing them together into a functional workflow.

The solution

Here is an outline of a proposed solution:

scan the documents
convert them into PDF files
give them meaningful names
store them in a Web-accessible location
update their corresponding MARC records

Each of the sections below elaborates on one of the steps above.

Scan documents

Scanning the documents is the actual process of digitizing them. It can be done in-house with brute force using something like the Bookeye hardware, or it could be outsourced. The technicalities of the process (dpi, or dots per inch; color versus black & white; TIFF versus JPEG versus PNG; etc.) will be driven by collection development policies and the intended use of the materials. Personally, I think the pamphlets ought to be scanned as black & white images at 300 dpi and saved as uncompressed TIFF images. Compared to original manuscripts or colorful out-of-copyright materials, I do not think the pamphlets warrant anything more substantial than that.

Convert to PDF

The original TIFF images function as archival surrogates. PDF files are intended for use. The next step is to concatenate the TIFF images representing a single pamphlet into a single PDF file. During this process OCR (optical character recognition) should be applied to the images and the resulting text saved inside the PDF files. This process can be done using our existing in-house facilities or an outside jobber. A word of caution: if a scanned image includes two pages, then care will need to be taken when doing the OCR. Specifically, the OCR process must not read all the way across the image, but rather all the way down one page first and then down the next. Otherwise the resulting plain text will not be ordered correctly.
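To make the two-page caution concrete, here is a minimal sketch of the ordering rule. It assumes the OCR software can report each word along with the position of its top-left corner, and that the gutter between the two pages falls in the middle of the image. The Word structure and the reading_order function are hypothetical names used only for illustration, not part of any particular OCR package.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """One recognized word and the position of its top-left corner (hypothetical structure)."""
    x: float  # distance from the left edge of the image, in pixels
    y: float  # distance from the top edge of the image, in pixels
    text: str


def reading_order(words: List[Word], image_width: float) -> List[str]:
    """Return the words of a two-page image: left page first, then right page."""
    gutter = image_width / 2  # assumption: the gutter sits in the centre of the image

    def key(word: Word):
        page = 0 if word.x < gutter else 1
        # Within a page, go top to bottom, then left to right.
        return (page, word.y, word.x)

    return [word.text for word in sorted(words, key=key)]


if __name__ == "__main__":
    spread = [
        Word(10, 10, "left-first"),
        Word(600, 10, "right-first"),
        Word(10, 50, "left-second"),
        Word(600, 50, "right-second"),
    ]
    print(reading_order(spread, image_width=1000))
```

Real OCR output is messier than this (words on the same line rarely share exactly the same vertical position, and the gutter is not always centred), so in practice it may be simpler to split each image into two single-page images before doing the OCR.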

Give the files meaningful names

There is plenty of room for interpretation in the previous two steps, but there is little or no room for interpretation in this step. Save the TIFF and PDF files with names corresponding to the pamphlets' MARC record field 001. The 001 field is a unique value across the library's collection, and by using this value it will be easy to match a file with its description. For example, suppose the 001 field of a given pamphlet has a value of 00023459. Then assign the TIFF files names such as 00023459-001.tiff, 00023459-002.tiff, 00023459-003.tiff where everything before the dash (-) is the 001 value, and everything after the dash is a page number. Similarly, and most importantly, save the PDF files with names such as 00023459.pdf. One could save files in a single directory with a name corresponding to the 001 value, but the principle is the same. Explicitly associate the saved files with a MARC record. If the 001 field is not used during the file naming process, then updating the MARC records with URIs, below, will be considerably more expensive.
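The naming convention above can be sketched in a few lines of Python. The function names are my own and purely illustrative; the sketch also assumes the 001 values never contain a dash, since the dash is what separates the 001 value from the page number.

```python
from pathlib import Path


def tiff_name(record_id: str, page: int) -> str:
    """Build a page-image name such as 00023459-001.tiff."""
    return f"{record_id}-{page:03d}.tiff"


def pdf_name(record_id: str) -> str:
    """Build a PDF name such as 00023459.pdf."""
    return f"{record_id}.pdf"


def record_id_from_name(file_name: str) -> str:
    """Recover the 001 value from either kind of file name.

    Assumption: the 001 value itself contains no dash.
    """
    stem = Path(file_name).stem  # drop the .tiff or .pdf extension
    return stem.split("-", 1)[0]  # drop the page number, if any


if __name__ == "__main__":
    print(tiff_name("00023459", 1))
    print(pdf_name("00023459"))
    print(record_id_from_name("00023459-003.tiff"))
```

Because the 001 value can be recovered from the file name alone, no separate lookup table is needed to match a file with its MARC record.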

Store files

The next step is to store the files in a Web-accessible location. This could be as simple as putting them on a computer's hard disk and providing access via the Web. Unfortunately, this approach does not scale well beyond tens of thousands of files. Consequently, it may be better to save the files in a repository-like application such as Fedora. The technology behind storing the files is not as important as the resulting URI/URL. It is very important to make sure the URL is constant and immutable. If it is not, then the risk of an ongoing maintenance nightmare is dramatically increased. It is also better if the URLs are shorter rather than longer. Put in the language of the Web, follow the principles of "cool URLs" when making the digitized content Web-accessible. [1]

Update MARC

Links now need to be made between the descriptions of the pamphlets and their locations on the Web. To do this one loops through all of the digitized pamphlets and updates the 856 fields of the corresponding MARC records with the URI/URL created in the previous step. This can be done manually, but it can also be done programmatically if the files have been saved with names based on MARC field 001 values. Once this process is completed it will be possible for the patron to search the library's catalog or "discovery system", identify a Catholic pamphlet of interest, and then choose to retrieve it from Special Collections or download it from the Web.
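As a rough illustration of the programmatic route, the sketch below adds an 856 field to a record expressed as MARCXML, using only Python's standard library. Several things here are assumptions: that the records have been exported as MARCXML, that the PDF files live under a base URL (http://example.org/pamphlets is a placeholder, not a real location), and that indicators "4" and "1" (HTTP, version of the resource) are the right choice for our cataloging practice. The function name add_856 is hypothetical.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # the MARCXML namespace
BASE_URL = "http://example.org/pamphlets"   # placeholder; substitute the real location


def add_856(record: ET.Element, base_url: str = BASE_URL) -> str:
    """Append an 856 field whose $u points to the PDF named after field 001."""
    control = record.find(f"{{{MARC_NS}}}controlfield[@tag='001']")
    if control is None or not (control.text or "").strip():
        raise ValueError("record has no 001 field")
    url = f"{base_url}/{control.text.strip()}.pdf"

    # Assumed indicators: 4 = HTTP, 1 = version of the resource.
    # The field is simply appended; it is not sorted into tag order.
    field = ET.SubElement(
        record,
        f"{{{MARC_NS}}}datafield",
        {"tag": "856", "ind1": "4", "ind2": "1"},
    )
    subfield = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": "u"})
    subfield.text = url
    return url


if __name__ == "__main__":
    sample = (
        f'<record xmlns="{MARC_NS}">'
        '<controlfield tag="001">00023459</controlfield>'
        "</record>"
    )
    record = ET.fromstring(sample)
    print(add_856(record))
    print(ET.tostring(record, encoding="unicode"))
```

Looping this function over an exported set of records, and then reloading them into the catalog, is the whole of the update. None of it works unless the file names were built from the 001 values in the first place.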

Optionally, update MARC again

The process outlined above provides access to the materials, but if we want to make it easier for the patron to use and manipulate the materials, then an additional URI/URL may need to be added to the MARC record. Specifically, an additional URI/URL in field 856 may need to be included pointing to a text mining interface. Just like the URI/URL pointing to the PDF file, this URI/URL needs to be as constant and immutable as possible. If this additional step is done, then not only will the patron have access to the materials in physical and digital form, but they will also be able to perform various text mining functions against them. Examples include: listing all the words starting with the letter z and the number of times they occur, listing the 50 most frequently used words, listing the 10 most frequently used two-word phrases, and concordancing the results of the previous examples, thus displaying them in context.
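To give a sense of what such text mining functions involve, here is a minimal sketch of the four examples, run against the plain text extracted from a pamphlet. It is not the text mining interface itself, and the function names are hypothetical. The tokenizer is deliberately naive (lowercased runs of the letters a through z), which is an assumption that will mishandle accented characters and hyphenated words.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Split text into lowercase words (naive: runs of a-z only)."""
    return re.findall(r"[a-z]+", text.lower())


def words_starting_with(tokens: List[str], letter: str) -> Counter:
    """Count every word beginning with the given letter."""
    return Counter(t for t in tokens if t.startswith(letter))


def most_common_words(tokens: List[str], n: int = 50) -> List[Tuple[str, int]]:
    """List the n most frequently used words and their counts."""
    return Counter(tokens).most_common(n)


def most_common_bigrams(tokens: List[str], n: int = 10):
    """List the n most frequently used two-word phrases and their counts."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)


def concordance(text: str, word: str, width: int = 30) -> List[str]:
    """Show each occurrence of a word with `width` characters of context on either side."""
    lines = []
    pattern = rf"\b{re.escape(word)}\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        lines.append(text[start:end].replace("\n", " "))
    return lines


if __name__ == "__main__":
    sample = "The zeal of the house. The zeal of the zealous is plain."
    tokens = tokenize(sample)
    print(words_starting_with(tokens, "z"))
    print(most_common_words(tokens, 3))
    print(most_common_bigrams(tokens, 3))
    print(concordance(sample, "zeal", width=10))
```

A production interface would likely also remove stop words (the, of, and so on) before counting, otherwise the most frequent words in every pamphlet will look much the same.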

Summary

If the process outlined above is implemented, then the "Catholic Portal" software will be able to regularly and systematically harvest the Catholic pamphlet MARC records and integrate them into the CRRA in exactly the same way all of the other CRRA content is harvested.

Providing access to, and use of, Catholic pamphlets is of interest to both the University of Notre Dame and the Catholic Research Resources Alliance (CRRA). Digitizing the materials, making them Web-accessible, and integrating their locations into our collection is a way of exploiting the current technological environment as well as meeting patron expectations. By providing text mining functionality we -- the Libraries and the CRRA -- will be exemplifying leadership in the wider community.

Notes

[1] "Cool URIs for the Semantic Web" -- http://www.w3.org/TR/cooluris/