Indexing EAD files, again

I have spent some time re-indexing EAD files for the "Catholic Portal", and you can see the fruits of these initial labors online. The text below describes one solution to the indexing challenges. This implementation is not the answer but rather a proposal.

The problem

The current process for indexing EAD files in the "Catholic Portal" is seen as causing more problems than it solves. Because things are indexed at the level of the EAD's did elements, too many search results are returned. These results, while unique, are too ambiguous and too similar in nature to be useful. Moreover, the current indexing process does not take advantage of an EAD file's rich metadata -- title, date, language, controlled vocabulary terms, biographical history, abstract, scope content notes, etc. Clearly, something has to change.

A mapping

To begin rectifying the problem I first reexamined the XML elements of my EAD files and asked myself, "What elements might be particularly useful to readers of the Portal?" Simultaneously I took a closer look at Vufind's Solr indexing schema, and I asked myself, "What fields are easily supported, and how can the schema be exploited?" After answering these questions, I constructed the following EAD to Vufind/Solr schema mapping:

EAD element(s)            Vufind/Solr schema field
titleproper & subtitle    title
publisher                 publisher
date                      publishdate
language                  language
abstract                  description
physdesc                  physical
subject                   topic
persname                  author2
corpname                  author2
the whole EAD file        allfields
url                       fullrecord
bioghist                  crra_bioghist_str
scopecontent              crra_scopecontent_str

The elements bioghist and scopecontent are unique to EAD files. The first is short for biographical history and is intended to describe the person or organization behind the collection. The element scopecontent is a description of the collection, a sort of extended abstract. By exploiting Vufind/Solr dynamic fields -- fields ending in _str -- I will be able to incorporate this interesting information into the record's detail display.
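To make the mapping concrete, here is a minimal sketch in Python of how the table above could drive the creation of a Solr document. It is not the ead-index.pl script. The sample finding aid, the function names, and the choice of which fields hold several values are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# EAD element name -> Vufind/Solr field, copied from the mapping table.
EAD_TO_SOLR = {
    "titleproper": "title",
    "subtitle": "title",
    "publisher": "publisher",
    "date": "publishdate",
    "language": "language",
    "abstract": "description",
    "physdesc": "physical",
    "subject": "topic",
    "persname": "author2",
    "corpname": "author2",
    "bioghist": "crra_bioghist_str",
    "scopecontent": "crra_scopecontent_str",
}

# Assumption: these fields hold several values; all others hold one string.
MULTI_VALUED = {"language", "topic", "author2"}

# Hypothetical, heavily abbreviated finding aid used only for this example.
SAMPLE = (
    "<ead><eadheader><filedesc>"
    "<titlestmt><titleproper>Papers of Jane Doe</titleproper>"
    "<subtitle>A finding aid</subtitle></titlestmt>"
    "<publicationstmt><publisher>Example Archives</publisher>"
    "<date>2012</date></publicationstmt>"
    "</filedesc></eadheader>"
    "<archdesc><did><abstract>Letters and diaries.</abstract></did>"
    "<bioghist><p>Jane Doe was a teacher.</p></bioghist>"
    "<controlaccess><persname>Doe, Jane</persname>"
    "<corpname>Example Society</corpname>"
    "<subject>Missions</subject></controlaccess>"
    "</archdesc></ead>"
)


def clean(element):
    """Return an element's text, children included, with whitespace collapsed."""
    return " ".join(" ".join(element.itertext()).split())


def ead_to_doc(xml_text):
    """Build a dictionary of Solr field values from one EAD file."""
    root = ET.fromstring(xml_text)
    collected = {}
    for element in root.iter():
        name = element.tag.rsplit("}", 1)[-1]  # drop any XML namespace
        field = EAD_TO_SOLR.get(name)
        text = clean(element)
        if field is not None and text:
            collected.setdefault(field, []).append(text)
    # Join the values of single-valued fields into one string.
    return {
        field: values if field in MULTI_VALUED else " ".join(values)
        for field, values in collected.items()
    }


if __name__ == "__main__":
    for field, value in ead_to_doc(SAMPLE).items():
        print(field, "=", value)
```

Because the sketch visits every element in the file, a persname nested inside a bioghist would also land in author2. A real indexer would probably restrict each lookup to a specific part of the finding aid.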

The "whole EAD file" is all of the content of all the elements in the finding aid. This includes everything from stop words, to integers, to repetitions of all the titles, etc. This is a free text field with no structure. It is the field searched by default when querying the index.
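As a rough illustration of what "all of the content of all the elements" means, the following Python sketch flattens a finding aid into a single string. The function name and the whitespace handling are assumptions; the actual indexer may build the field differently.

```python
import xml.etree.ElementTree as ET


def all_fields(xml_text):
    """Concatenate the text of every element into one free-text string."""
    root = ET.fromstring(xml_text)
    # itertext() walks the whole tree in document order; joining the pieces
    # with spaces keeps neighbouring elements from running together.
    return " ".join(" ".join(root.itertext()).split())
```

No stop words are removed and nothing is de-duplicated, which matches the unstructured nature of the field described above.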

The URL found in each EAD file's eadid/@url attribute is saved in the full record field. It is saved along with the URL pointing to the local copy of the EAD file transformed into HTML. The full record field also contains the human-readable labels for these URLs, currently "View finding aid at owning institution" and "View finding aid in Portal display". Ideally these URLs and their labels ought to be saved differently because this use of the full record field is misleading, and the labels ought to be configured in the user interface, not the indexing process.

With the exception of the persname and corpname elements, all of the mappings seem obvious. The MARC indexing process supports the inclusion of added entries, as shown in Vufind's marc.properties file:

author = 100abcd, first
author_fuller = 100q, first
author-letter = 100a,first
author2 = 110ab:111ab:700abcd:710ab:711ab
author2-role = 700e:710e
author_additional = 505r

In other words, author added entries (7xx fields) are mapped to a Vufind/Solr field named author2. The persnames and corpnames seem like added entries to me, so I mapped them accordingly.
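For readers unfamiliar with the notation, each colon-separated piece of the author2 line is a three-digit MARC tag followed by the subfield codes to index. The Python sketch below decodes such a value. It only illustrates how to read the notation and is not the parser Vufind uses; the function name is hypothetical, and modifiers such as ", first" are deliberately not handled.

```python
import re


def parse_marc_spec(spec):
    """Split a value such as '700abcd:710ab' into (tag, subfield codes) pairs."""
    pairs = []
    for part in spec.split(":"):
        match = re.fullmatch(r"\s*(\d{3})([a-z0-9]*)\s*", part)
        if match is None:
            raise ValueError(f"unrecognised field spec: {part!r}")
        pairs.append((match.group(1), match.group(2)))
    return pairs


AUTHOR2 = "110ab:111ab:700abcd:710ab:711ab"

# The 7xx tags are the added entries mentioned in the text.
ADDED_ENTRIES = [tag for tag, _ in parse_marc_spec(AUTHOR2) if tag.startswith("7")]
```

Here ADDED_ENTRIES holds the tags 700, 710, and 711, the fields that persname and corpname are being treated as analogous to.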

Finally, a few other bits of information need to be in the index:

Other bits     Vufind/Solr schema field
id             id
institution    institution
library        building
format         format
type           recordtype

Id is a unique identifier, and currently it is an incremental integer prefixed with a key assigned to each library contributing content to the Portal. Ideally this should be a value coming from the EAD files themselves, but I was not able to identify such a value across our collection.
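The identifier scheme can be sketched in a few lines of Python. The library key "xyz", the hyphen separator, and the starting number are all made up for this example; the text does not say how the real keys and integers are joined.

```python
from itertools import count


def id_generator(library_key, start=1):
    """Yield identifiers made of a library key plus an incremental integer."""
    for number in count(start):
        yield f"{library_key}-{number}"


ids = id_generator("xyz")        # "xyz" is a hypothetical library key
first, second = next(ids), next(ids)
```

Identifiers built this way depend on the order in which files are processed, which is one reason a value taken from the EAD files themselves would be preferable.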

Institution and library are the names of the hosting institution and of the contributing library, respectively.

Format is a constant ("Archival Material") but I can see how something more meaningful might be gleaned from the EAD file itself.

Lastly, type is another constant ("ead") and it will be used to run the appropriate Vufind record driver -- EadRecord.php -- during the search result process.
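The link between the recordtype value and the driver can be pictured as a simple naming convention. The Python sketch below assumes that a driver's name is the record type with its first letter capitalized plus "Record", which is what the pairing of "ead" with EadRecord.php suggests. The set of installed drivers and the fallback name are assumptions, not a description of Vufind's actual lookup code.

```python
# Hypothetical set of available drivers; only EadRecord is named in the text.
INSTALLED_DRIVERS = {"EadRecord", "MarcRecord"}

# Assumed generic driver used when no type-specific driver exists.
DEFAULT_DRIVER = "IndexRecord"


def driver_for(recordtype):
    """Map a recordtype value from the index to a record driver name."""
    candidate = recordtype.capitalize() + "Record"
    return candidate if candidate in INSTALLED_DRIVERS else DEFAULT_DRIVER
```

Under these assumptions, driver_for("ead") selects EadRecord, so every document indexed with the constant "ead" is rendered by the EAD-specific driver.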

Implementation

Given a mapping from EAD to Vufind/Solr, plus the few extra bits of information, I proceeded to re-write my indexing Perl script -- ead-index.pl. This was relatively easy. Next I had to update my EadRecord.php driver. Specifically, I enhanced the getExtendedMetadata function to assign values to extendedBiogHist and extendedScopeContent -- my EAD-specific information fields -- and added two functions to supply those values:

$interface->assign('extendedBiogHist', $this->getBiogHist());
$interface->assign('extendedScopeContent', $this->getScopeContent());

protected function getScopeContent() {
  // added by ELM (October 4, 2012)
  return $this->fields['crra_scopecontent_str'];
}

protected function getBiogHist() {
  // added by ELM (October 3, 2012)
  return $this->fields['crra_bioghist_str'];
}

(I also deleted the getHoldings function because it was not being used.)

Finally, I created a local version of the interface's extended.tpl and enhanced it to show the values of scope content and biographical history in the extended detail display:

<!-- added by ELM (October 3, 2012) -->
{if !empty($extendedScopeContent)}
{assign var=extendedContentDisplayed value=1}
<tr valign="top">
  <th>{translate text='Scope content'}: </th>
  <td>
	{$extendedScopeContent|escape}
  </td>
</tr>
{/if}

<!-- added by ELM (October 3, 2012) -->
{if !empty($extendedBiogHist)}
{assign var=extendedContentDisplayed value=1}
<tr valign="top">
  <th>{translate text='Biographical history'}: </th>
  <td>
	{$extendedBiogHist|escape}
  </td>
</tr>
{/if}

The result

The following screen shots illustrate the result of this work. For example, initial search results look just like before:

[Screen shot: Initial search results]

After clicking on a search result's title, the detail display is shown. Notice how it now includes "Other authors", "Published", and "Subject" fields:

[Screen shot: Detail display]

After clicking on the Description tab, even more detail is displayed, specifically the abstract, scope content, biographical history, and physical description:

[Screen shot: Extended details]

Next steps

This implementation is only a single person's perspective. What do you think? What EAD elements do you think ought to be mapped to which Vufind/Solr fields? And how do you think the results should be displayed? Inquiring minds would sincerely like to know. Please add your comments to the blog posting or share your ideas directly on any of the various Portal mailing lists.
