Preparing EAD files for indexing

This posting outlines how I plan to prepare EAD files for indexing with Solr, the underlying indexing technology of VUFind.

The problem

I am aggregating sets of EAD files from Catholic Research Resource Alliance members. I am expected to index these files at the most granular level possible -- meaning at the did level. In order to satisfy both human and computer requirements, each indexed record needs at least a unique identifier, a human-readable descriptor, and a location code. The unique identifier can be taken from the unitid element. The human-readable descriptor can come from the unittitle element. The location code can be inferred from the url attribute of the eadid element.
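As a rough illustration of where these three values live, here is a minimal Python sketch using only the standard library. It is not the XSLT used in this project. The sample document, the example.org URL, and the function names are all made up for the illustration, and it assumes EAD files that do not declare an XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EAD sample; real finding aids are much larger.
SAMPLE = """<ead>
  <eadheader>
    <eadid url="http://example.org/findingaids/abc.xml">abc</eadid>
  </eadheader>
  <archdesc level="collection">
    <did>
      <unitid>ABC</unitid>
      <unittitle>Example Family Papers</unittitle>
    </did>
    <dsc>
      <c01 level="series">
        <did>
          <unittitle>Correspondence</unittitle>
        </did>
      </c01>
    </dsc>
  </archdesc>
</ead>"""


def text_of(element):
    """Return the whitespace-normalized text of an element, or None."""
    if element is None:
        return None
    return " ".join("".join(element.itertext()).split()) or None


def did_fields(root):
    """Yield one dict per did element holding the three values described above."""
    eadid = root.find("./eadheader/eadid")
    location = eadid.get("url") if eadid is not None else None
    for did in root.iter("did"):
        yield {
            "id": text_of(did.find("unitid")),        # may be missing
            "title": text_of(did.find("unittitle")),  # title of this level only
            "location": location,                     # same for the whole file
        }


records = list(did_fields(ET.fromstring(SAMPLE)))
for record in records:
    print(record)
```

In this sample the second did has no unitid, so its "id" comes back as None.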

Unfortunately, not all of the aggregated EAD files include unitid elements, and when they do, the values are not always unique. Additionally, the hierarchical nature of EAD files makes the values extracted from unittitle elements almost meaningless unless they are placed within the context of their parent unittitle values. In short, indexing EAD files without some preprocessing makes the indexing process all but useless. What to do?

The solution

The solution includes: 1) adding and/or normalizing the unitid values, 2) constructing a more complete "title" based on previously enumerated unittitle values, and 3) outputting the whole thing to an XML stream easily indexable by Solr.

Adding and/or normalizing the unitid values (Step #1) can be accomplished with a stylesheet called addunitid.xsl. Essentially an identity transformation, the stylesheet loops through an EAD file using the generate-id() function to create or replace unitid values. The result is an enhanced EAD file.
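The idea behind Step #1 can be sketched in Python as well. This is not addunitid.xsl, and a simple counter stands in for XSLT's generate-id(), so the identifiers look different from what the stylesheet produces. The "id" prefix, the sample data, and the function name are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one duplicated unitid and one did with no unitid at all.
SAMPLE = """<ead>
  <archdesc>
    <did><unitid>box 1</unitid><unittitle>Papers</unittitle></did>
    <dsc>
      <c01><did><unitid>box 1</unitid><unittitle>Letters</unittitle></did></c01>
      <c01><did><unittitle>Photographs</unittitle></did></c01>
    </dsc>
  </archdesc>
</ead>"""


def add_unitids(root, prefix="id"):
    """Create or replace the first unitid of every did with a sequential value."""
    # Copy the matches into a list first, because the tree is modified below.
    for number, did in enumerate(list(root.iter("did")), start=1):
        unitid = did.find("unitid")
        if unitid is None:
            unitid = ET.Element("unitid")
            did.insert(0, unitid)
        # Drop any child markup so only the new identifier remains.
        for child in list(unitid):
            unitid.remove(child)
        unitid.text = f"{prefix}{number:05d}"
    return root


root = add_unitids(ET.fromstring(SAMPLE))
ids = [did.find("unitid").text for did in root.iter("did")]
print(ids)
```

Everything outside the unitid elements is left alone, which is the same spirit as an identity transformation.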

Constructing more complete "titles" and outputting XML streams (Steps #2 and #3) is done by looping through each did element, extracting the necessary metadata, creating a record describing each did-level element, and sending to STDOUT a rudimentary XML stream of my own design. The heart of this second stylesheet (ead2solr.xsl) is the ancestor::*/did/unittitle selector, which finds the unittitle of the current did together with the unittitle values of all its parent levels.
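To show what the ancestor::*/did/unittitle selection amounts to, here is a hedged Python sketch. Python's standard library has no ancestor axis, so a recursive walk carries the titles down the tree instead. My own XML format is not reproduced here; the sketch emits Solr's add/doc/field update format as a stand-in, and the field names, the " / " separator, and the sample data are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, with unitid values already normalized by Step #1.
SAMPLE = """<ead>
  <archdesc>
    <did><unitid>id00001</unitid><unittitle>Example Family Papers</unittitle></did>
    <dsc>
      <c01>
        <did><unitid>id00002</unitid><unittitle>Correspondence</unittitle></did>
        <c02>
          <did><unitid>id00003</unitid><unittitle>Letters, <unitdate>1901</unitdate></unittitle></did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>"""


def text_of(element):
    """Return the whitespace-normalized text of an element, or None."""
    if element is None:
        return None
    return " ".join("".join(element.itertext()).split()) or None


def walk(element, inherited, records):
    """Append one record per did, titled with every enclosing unittitle."""
    titles = inherited
    did = element.find("did")
    if did is not None:
        title = text_of(did.find("unittitle"))
        if title:
            titles = inherited + [title]
        records.append({"id": text_of(did.find("unitid")),
                        "title": " / ".join(titles)})
    for child in element:
        if child.tag != "did":
            walk(child, titles, records)


def to_solr_xml(records):
    """Serialize the records as a Solr XML update message."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            if value is not None:
                ET.SubElement(doc, "field", {"name": name}).text = value
    return ET.tostring(add, encoding="unicode")


records = []
walk(ET.fromstring(SAMPLE), [], records)
solr_xml = to_solr_xml(records)
print(solr_xml)
```

The record for the innermost did ends up with the title "Example Family Papers / Correspondence / Letters, 1901", which is far more useful in a search result than "Letters, 1901" alone.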

Finally, I wrote a simple shell script (clean.sh) that makes it easy to run the above transformations from the command line.

(I would not have been able to do this work if it weren't for the XML4Lib mailing list and a few fine respondents to my pleas for help. Thanks go to MJ Suhonos, Tod Olson, Stefan Krause, and Alexander Johannesen. "Thank you!")

Next steps

Software is never done. If it were, then it would be called hardware. Therefore next steps include:

  • automatically adding the modified EAD files (the output of the first stylesheet) to Archon
  • enhancing the output of the second stylesheet with scope notes, abstracts, etc.
  • indexing the output of the second stylesheet

Fun with XSLT?
