Harvesting, updating, and re-indexing

This posting describes the automated process I am currently using to harvest, update, and re-index the MARC records of the "Catholic Portal".

Step #1 - Make a list

Librarians love lists, and I am no exception. The process begins with a list (a database) of CRRA members who have MARC metadata to share. Each item in the list includes the following fields:

  1. code - a unique three-letter identifier
  2. institution - the name of the CRRA member
  3. library - the name of the member's library
  4. URL - the location of the member's MARC records

Right now, the name of this list is libraries.db. It is created by hand.
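To make the shape of the list concrete, here is a minimal sketch in Python (not the Perl used by the actual scripts). The on-disk format of libraries.db is not described in this posting, so the sketch assumes one member per line with tab-delimited fields; the Member type, the read_members function, and the sample values are all hypothetical.

```python
import csv
import io
from typing import NamedTuple


class Member(NamedTuple):
    """One entry in the list; field names follow the four fields above."""
    code: str
    institution: str
    library: str
    url: str


def read_members(handle):
    """Read one Member per non-empty line (ASSUMPTION: tab-delimited)."""
    reader = csv.reader(handle, delimiter="\t")
    return [Member(*(field.strip() for field in row)) for row in reader if row]


# Hypothetical sample data standing in for libraries.db
sample = io.StringIO(
    "abc\tExample University\tExample Library\thttp://example.org/abc.mrc\n"
)
members = read_members(sample)
```

A line with the wrong number of fields raises a TypeError, which is a reasonable way to catch mistakes in a list that is created by hand.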

Step #2 - Harvest

The second step is to harvest content from each member library. This is done by looping through the list, extracting the URLs, and copying the remote MARC data sets to a local file system. This process is done with a script called harvest.pl.
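The harvesting loop might look something like the following Python sketch. It is an illustration only, not a translation of harvest.pl; the harvest function, the (code, URL) pair input, and the naming of local files after the institution code are assumptions.

```python
import shutil
import urllib.request
from pathlib import Path


def harvest(members, directory):
    """Copy each member's remote MARC file to <directory>/<code>.mrc.

    `members` is an iterable of (code, url) pairs. Naming the local file
    after the institution code is an ASSUMPTION made for this sketch.
    """
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for code, url in members:
        target = target_dir / f"{code}.mrc"
        # Stream the remote data set to disk without loading it all in memory
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        saved.append(target)
    return saved
```

Keeping one local file per institution code makes the later steps simple, because each file can be updated and re-indexed on its own.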

Step #3 - Update

Because each record in the underlying Solr index must have a unique identifier, it is necessary for me to make each 001 value in each MARC record unique. To do this I loop through each of the harvested MARC records and prepend the three-letter institution code to each 001 field. This is done with a script called add-code.pl.
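For illustration, the prepending step can be sketched in plain Python by reading the MARC (ISO 2709) record structure directly: a 24-byte leader, a directory of 12-byte entries, and the field data. This is not add-code.pl. The function names and the sample record are hypothetical, and a real script would more likely rely on a MARC library than on hand-rolled parsing.

```python
FIELD_END = b"\x1e"   # field terminator
RECORD_END = b"\x1d"  # record terminator


def parse_fields(record):
    """Return the 24-byte leader and a list of (tag, value) pairs."""
    leader = record[:24]
    base = int(leader[12:17])            # base address of data
    directory = record[24:base - 1]      # drop the directory terminator
    fields = []
    for i in range(0, len(directory), 12):
        tag = directory[i:i + 3]
        length = int(directory[i + 3:i + 7])
        start = int(directory[i + 7:i + 12])
        # length includes the field terminator, so leave that byte out
        fields.append((tag, record[base + start:base + start + length - 1]))
    return leader, fields


def assemble(leader, fields):
    """Rebuild a record, recomputing the directory and both lengths."""
    directory = b""
    data = b""
    for tag, value in fields:
        directory += tag + b"%04d%05d" % (len(value) + 1, len(data))
        data += value + FIELD_END
    directory += FIELD_END
    data += RECORD_END
    base = 24 + len(directory)
    length = base + len(data)
    new_leader = b"%05d" % length + leader[5:12] + b"%05d" % base + leader[17:24]
    return new_leader + directory + data


def prepend_code(record, code):
    """Prepend the institution code to the record's 001 value."""
    leader, fields = parse_fields(record)
    prefix = code.encode("ascii")
    updated = [(tag, prefix + value if tag == b"001" else value)
               for tag, value in fields]
    return assemble(leader, updated)


# Hypothetical two-field record for demonstration
sample = assemble(b"00000nam a2200000 a 4500",
                  [(b"001", b"12345"), (b"245", b"10\x1faA title")])
updated = prepend_code(sample, "abc")
```

Because the 001 value grows by three bytes, the directory offsets and the record length in the leader have to be recomputed, which is why the sketch rebuilds the whole record rather than splicing bytes in place.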

Step #4 - Re-index

The last step is to re-index the MARC records, making sure the Solr index is as current as possible. This is done with a script called re-index.pl, and it is the most complicated of the three. The script again loops through the database of CRRA members, reading each institution code. It then deletes all of the records from the index whose identifiers begin with that institution code. (Thanks, WebService::Solr!) Each of the records from each of the institutions' metadata files is then fed to Solr. Using this re-indexing process it is not necessary for me to manage overlays, duplicates, or deleted records. The whole index is wiped clean and refreshed anew.
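The delete-then-add cycle for one institution can be sketched as a series of Solr XML update messages. The Python below only builds the messages and does not talk to a server; it is not re-index.pl, which uses WebService::Solr. The function names are hypothetical, and the use of "id" as the unique key field and "title" as a sample field are assumptions.

```python
import xml.etree.ElementTree as ET


def delete_message(code):
    """Delete every record whose identifier begins with the code.

    ASSUMPTION: the unique key field in the index is named "id".
    """
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = f"id:{code}*"
    return ET.tostring(delete, encoding="unicode")


def add_message(documents):
    """Build one <add> message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for document in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in document.items():
            ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


def reindex_messages(code, documents):
    """Messages to send, in order, to refresh one institution's records."""
    commit = ET.tostring(ET.Element("commit"), encoding="unicode")
    return [delete_message(code), add_message(documents), commit]


# Hypothetical document for institution "abc"
messages = reindex_messages("abc", [{"id": "abc12345", "title": "A title"}])
```

Deleting by identifier prefix is what makes the prepended institution code from Step #3 so useful: one query removes everything a member contributed, and nothing else.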

VUFind

The whole process works pretty well, and because the Catholic Portal is based on VUFind, it is extraordinarily flexible. VUFind cares not about the ingestion process, allowing me to handle it in the manner I feel most useful. Let's hear it for open source software!
