April 17, 2008

Update on work since January

I've done quite a bit of work since January 30; I've just been slow to update.

1. Worked with the core team to develop a set of browsable subject categories for the main page (which will be rolled out at the same time as the timeline and geographic facets). Once we had these, I cataloged each registry record with one or more of these subjects.

African Americans
Asian Americans
Civil War
Education
General Resources
Government
Immigration
Latinos
Music
Native Americans
Religion
States and Regions
Travel and Transportation
Women
World War I
World War II

Our repositories are quite specific, so it was difficult to come up with a decent strategy, and eventually a set of subjects, for this facet. The actual work to develop the subjects was done by Katherine and Perry, although I did add Education and Music as I was working through cataloging each record in the registry. Music may not seem like an obvious fit, since it's often an indication of format rather than subject, but in this context it worked well.

2. Worked on the UM side to add our MODS records from our old data provider into our new data provider (http://quod.lib.umich.edu/cgi/o/oai/oai). This involved some cleanup of the data (such as adding a new rights statement), but mostly involved matching the records to existing oai_dc records. In the course of doing this, we gained a number of records-- this is because our old MODS records had a "short URL" (e.g., ABZ1234), but in the new system, the matching oai_dc records had "long URLs" (e.g., ABZ1234.0001.001, ABZ1234.0002.001) indicating issues and volumes.
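As an illustration only, the short-to-long matching described above could be sketched in Python like this. It is not the code used at UM: the function name is hypothetical, and it assumes (going only by the examples in this entry) that a long identifier is the short identifier followed by a dot-separated suffix.

```python
from collections import defaultdict


def match_short_to_long(short_ids, long_ids):
    """Group long identifiers under the short identifier they start with.

    Assumption: a long identifier is the short identifier plus a
    dot-separated suffix, e.g. ABZ1234.0001.001.
    """
    by_prefix = defaultdict(list)
    for long_id in long_ids:
        # Everything before the first dot is treated as the short identifier.
        prefix = long_id.split(".", 1)[0]
        by_prefix[prefix].append(long_id)
    return {short_id: sorted(by_prefix.get(short_id, [])) for short_id in short_ids}


matches = match_short_to_long(
    ["ABZ1234"],
    ["ABZ1234.0001.001", "ABZ1234.0002.001"],
)
```

Here one old MODS record maps to two oai_dc records, which is how the record count can grow during the move.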

An additional side effect of this change is that records can exist in more than one set. This is indicated in the setSpec elements of each record's header, but because in the ASHO portal we are working on a set-by-set basis, records can be duplicated among harvested sets. Fortunately, they are exact duplicates, so Tom and Chick can de-dupe if desired.
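Since the duplicates are exact, de-duping across harvested sets can be as simple as the following Python sketch. The data layout (a mapping of set name to identifier/XML pairs) and the function name are assumptions made for the example, not a description of what Tom and Chick actually run.

```python
def dedupe_records(harvested_sets):
    """Keep the first copy of each record seen across all harvested sets.

    harvested_sets maps a set name to a list of (identifier, xml_text) pairs.
    Two records count as duplicates only if identifier and text both match.
    """
    seen = set()
    unique = []
    for set_name, records in harvested_sets.items():
        for identifier, xml_text in records:
            key = (identifier, xml_text)
            if key in seen:
                continue  # exact duplicate already kept from an earlier set
            seen.add(key)
            unique.append((set_name, identifier, xml_text))
    return unique


sample = {
    "setA": [("oai:example:1", "<mods>one</mods>")],
    "setB": [("oai:example:1", "<mods>one</mods>"), ("oai:example:2", "<mods>two</mods>")],
}
deduped = dedupe_records(sample)
```

The record is kept under whichever set is processed first, so the order of the sets decides where a shared record ends up.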

This leaves only the re-exposed MODS records, which we are making available to the ASHO portal, in the old provider. At some point in the future (maybe 6 months?) we will be turning off the old provider, at which point Tom will harvest directly using his provider at UIUC. We still need to work through what value-add we have provided on the UM side that they may want to duplicate.

3. Added each new repository/set from January's batch to the registry as a separate record. In the process, I found a picture for each repository/set and sent those to Tom and Susan. Susan helped create a generic text image for those that I couldn't find a picture for (some of the CDL repositories had no images).

4. Also, added one more set from Columbia: the Digital New York City set of 55 records. You can see these at http://quod.lib.umich.edu/a/aquifer/ at the moment. Tom will add them to the ASHO portal soon.

5. Working with the Services Working Group on how to test the MODS Levels of Adoption developed by the Metadata Working Group. That's still in the initial stages, so nothing to report.

Posted by khage at 01:40 PM | Comments (0)

January 30, 2008

Follow-up on Columbia's data provider

Columbia's data provider now appropriately delivers oai_dc and MODS.

This was kind of interesting, because they are using Jeff Young's oaicat data provider tool, a very popular and useful one. My guess is that the tool was designed before the OAI community really pushed to have data providers offer richer metadata formats in addition to oai_dc, and so the code needs to be tweaked to allow this.

Columbia created a version of oaicat so that it would not mix up the two formats they were trying to provide. I'm sure a fix from Jeff will be forthcoming very soon.

Look for Columbia's oai_dc, as a result, in OAIster by the end of this week...

Posted by khage at 12:20 PM | Comments (0)

January 24, 2008

Four new institutions in Aquifer/MODS portals

There are now four new institutions in the Aquifer/MODS portals at UM. This is very exciting; the last time we updated was four months ago.

* California Digital Library (CDL): 79 static repositories (corresponding to OAI sets) for both portals
* Columbia University Libraries Digital Program Division: 1 set for both portals
* Digitized Books from the University of Illinois at Urbana-Champaign: 19 sets in Aquifer, 33 in MODS portal
* Harvard University Library Virtual Collections: 2 sets in Aquifer, 4 in MODS portal

Also, records were added to Celebration of Women Writers and Southern Spaces.

Aquifer portal grew from 238,794 records to 288,615 records: http://quod.lib.umich.edu/a/aquifer/
MODS portal grew from 306,849 records to 369,197 records: http://quod.lib.umich.edu/m/mods

Most importantly, the character encoding, which was alarmingly still wrong the last time we updated in September (!), is now fixed to the best of our ability for all sets in both portals.

The Aquifer portal records are all available in the re-exposed file at:
http://quod.lib.umich.edu/cgi/b/broker20/broker20/?verb=ListRecords&metadataPrefix=mods&set=oaimods:aquifermodsR
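For anyone scripting a harvest of this file, here is a minimal Python sketch of how the request URL is put together. The function name is hypothetical; the verb and argument names (ListRecords, metadataPrefix, set, resumptionToken) come from the OAI-PMH protocol, and the base URL and set name are the ones given above. It only builds URLs and does not contact the server.

```python
from urllib.parse import urlencode

BASE_URL = "http://quod.lib.umich.edu/cgi/b/broker20/broker20/"


def list_records_url(base_url, metadata_prefix=None, set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: a follow-up
    request carries only the verb and the token.
    """
    params = {"verb": "ListRecords"}
    if resumption_token is not None:
        params["resumptionToken"] = resumption_token
    else:
        if metadata_prefix is None:
            raise ValueError("metadata_prefix is required for the first request")
        params["metadataPrefix"] = metadata_prefix
        if set_spec is not None:
            params["set"] = set_spec
    # safe=":" keeps the colon in the set name readable, as in the URL above.
    return base_url + "?" + urlencode(params, safe=":")


first_request = list_records_url(BASE_URL, "mods", "oaimods:aquifermodsR")
```

If the response ends with a resumptionToken, the next request would be built with `list_records_url(BASE_URL, resumption_token=token)` and repeated until no token comes back.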

There was more than usual back-and-forth with Harvard and Columbia-- Harvard's stylesheet was a bit awry, and Columbia was having trouble with their data provider. Columbia is still not an official repository because they were unable to keep the MODS and oai_dc records from mixing together in sets, so they removed the oai_dc records for the time being.

My next step is to create records for each set in the DLF Registry, and find pictures for each set. That should take me a few days.

Posted by khage at 12:35 PM | Comments (0)

October 22, 2007

Work in the last month

Oish. It's been a month since I added a new blog entry. Here's what I've been doing:

1. Developing the data processing page, out of recommendations from the MWG and the SWG. The "date" field was definitely, um, the most fun.

http://wiki.dlib.indiana.edu/confluence/display/DLFAquiferMeta/Data+Processing

2. Providing feedback on the designs provided by Citrus. I'll admit this has been difficult without seeing the search portal, as it currently works on UIUC's server, married to the design.

3. Cleaning up the 41 Collection Registry entries for Tom so that he could populate the "browse collections" page. Grabbed images for each collection (set) while I was at it for that page.

http://dlf.grainger.uiuc.edu/DLFCollectionsRegistry/

4. Cleaning up the rest of the registry... This will take me at least a couple more months. I hope to finish it by the end of the year. I am specifically looking at the following:

- title matches the main site page
- additional titles, sub-titles, etc. are added as necessary
- URL works? can't find it using Google? (if not, collection gets deleted)
- format types and objects represented added
- collection description tweaked to be as correct as possible
- tweaking subject fields, if already exist

Things I am not doing: adding language, number of digitized objects, geographic/temporal periods, extensive subjects. I would think that after I clean the records up, those responsible should access the registry and either a) suggest changes or b) make the changes themselves.

Posted by khage at 03:54 PM | Comments (0)

September 21, 2007

Cleaned-up Re-exposed MODS Records

Tom and Chick found some errors with the re-exposed records, which we have consequently fixed.

1. Nesting of metadata, mods and about containers was incorrect.
2. Language codes and language text were being normalized too aggressively (OAIster needs this normalization, but for MODS it is best left out).
3. The about container needed a field included so that workflow on Tom and Chick's end could be streamlined.

This last issue entailed Tom creating an Aquifer version of the provenance schema, currently hosted at:
http://dlf.grainger.uiuc.edu/dlfcollectionsregistry/aquifer_provenance.xsd

The official provenance schema does not contain a field for .
I will be contacting Simeon Warner, who still officially maintains the OAI protocol, to look into making this change to the official schema.

Oh, and the UM Lincoln records were fixed and more were added, so there are 8 more records in the current batch of re-exposed records.

Posted by khage at 02:46 PM | Comments (0)

September 05, 2007

New re-exposed records and new UM portal interfaces

This is the last step in our MODS work at UM, unless data issues are discovered. Many things are waiting in the wings for after we re-engineer BibClass, e.g., thumbnails, date normalization, Zotero integration.

There is a new re-exposed metadata file available for harvesting: http://quod.lib.umich.edu/cgi/b/broker20/broker20/?verb=ListRecords&metadataPrefix=mods&set=oaimods:aquifermodsR

The updated records in the portal contain these things:
- validation of MODS to BibClass, so elements and attributes are in all the correct places now
- tweaks to MODS display labels
- removal of the type normalization drop-down on the advanced search page, replacing it with the mods typeOfResource values
- change of default sort from title to weighted hit frequency
- addition of two new MODS sets: The Emancipator Newsletter from Tennessee and Southern Spaces from Emory

You can search and display all this here:
http://quod.lib.umich.edu/a/aquifer/

Also, we've updated the Google spreadsheet that reflects all these changes: http://spreadsheets.google.com/pub?key=pkPh8eRIHnlDi17JkWsaJJQ

Posted by khage at 02:13 PM | Comments (0)

August 24, 2007

And, re-exposed is ready...

Aquifer MODS records harvested by UM are now available for re-harvesting.

Value-add features:
- The concatenated repository name and OAI set name are available in the relatedItem/titleInfo/title MODS element.
- The original slurped thumbnails from Tom's original thumbnail service are available in the location/url@preview element.
- A provenance container has been added to indicate that we have modified the original harvested record.
- Any record without a location/url or identifier@uri is filtered out of the re-exposed records.

http://quod.lib.umich.edu/cgi/b/broker20/broker20/?verb=ListRecords&metadataPrefix=mods&set=oaimods:aquifermodsR
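The last value-add item, filtering out records that don't link to a digital object, could look roughly like this in Python. This is a sketch under assumptions, not the MODSTransform code: it reads "identifier@uri" as an identifier element with type="uri", uses the standard MODS v3 namespace, and runs on two made-up sample records.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}


def links_to_object(mods_record):
    """Return True if the record has a non-empty location/url or identifier[@type='uri']."""
    candidates = mods_record.findall("mods:location/mods:url", MODS_NS)
    candidates += mods_record.findall("mods:identifier[@type='uri']", MODS_NS)
    return any((el.text or "").strip() for el in candidates)


# Made-up sample data: one record with a URL, one without.
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods>
    <titleInfo><title>Has a link</title></titleInfo>
    <location><url>http://example.org/object/1</url></location>
  </mods>
  <mods>
    <titleInfo><title>No link</title></titleInfo>
  </mods>
</modsCollection>"""

collection = ET.fromstring(SAMPLE)
kept = [m for m in collection.findall("mods:mods", MODS_NS) if links_to_object(m)]
kept_titles = [m.findtext("mods:titleInfo/mods:title", namespaces=MODS_NS) for m in kept]
```

Only the first sample record survives the filter; the second has nothing to point a user to.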

I plan on re-harvesting, and then re-exposing, these records either the first week in September or after I return from England, the third week in September.

Big kudos to Josh Santelli who did all the work to get this done.
Also, many thanks to the MWG for recommendations on where to place the value-add fields.

Posted by khage at 12:08 PM | Comments (0)

August 23, 2007

Update on work in last 2 weeks

1. Re-exposing MODS Aquifer records. We are this close to being ready. Getting here has required a number of steps to date:

a) Figuring out which MODS elements our INST and URL a="thumb" elements should fit into.
b) Building a provenance container. (Help from the MWG on these two items was invaluable.)
c) Modifying the DTD and validating the MODS records so that we are positive we are creating the BibClass files correctly.
d) Creating a spreadsheet of MODS-->BibClass (which we should have done ages ago!).
e) Modifying MODSTransform so that it would build the BibClass file and the re-exposed MODS file.
f) Concatenating all these repository-specific MODS files into one, indexing this and making it available through our data provider. (This is the piece that isn't finished yet.)

I expect we'll be ready by early next week.

As a result of creating the re-exposed records, we are using the filter in MODSTransform to remove all records that do not have dc:identifiers, i.e., we are keeping only those that link to digital objects. I know this was a concern of Chick and some others, so using our re-exposed records has this one added benefit.

2. Drafting a wiki page on the potential data processing, indexing and display issues related to making MODS records available in the Aquifer portal. This really is still in draft form and I hope folks will edit it as necessary.

http://wiki.dlib.indiana.edu/confluence/display/DLFAquiferMeta/Data+Processing

3. Still trying to get Columbia's MODS records...

Posted by khage at 02:16 PM | Comments (0)

July 19, 2007

Status

I've been working on a few things here at UM in the past couple of weeks.

Levels of Adoption

After adding 4 new MODS sets to the Aquifer portal, we went through the just-released Levels of Adoption document to see how well our sets (5 in total) conform. It turns out that we land squarely between Levels 3 and 4. For instance, for Level 3 we don't have at least one <genre> element, and for Level 4 we almost never have abstracts in our records. But overall, we conform to the majority of requirements for both Levels 3 and 4. Consequently, which level are we really at? And how does this affect our inclusion in the new portal?
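One way to put numbers on "almost never" is to count how many records carry each element a level asks for. The Python below is only a sketch: the function name is hypothetical, the sample records are made up, and the two elements checked (genre and abstract) are just the ones mentioned above, not the full Levels of Adoption requirements.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}


def element_coverage(records, element_names):
    """For each top-level MODS element name, return the fraction of records that have at least one."""
    total = len(records)
    coverage = {}
    for name in element_names:
        have = sum(1 for r in records if r.find(f"mods:{name}", MODS_NS) is not None)
        coverage[name] = have / total if total else 0.0
    return coverage


# Made-up sample data: one record with a genre, neither with an abstract.
SAMPLE = """<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods><genre>letter</genre></mods>
  <mods><titleInfo><title>No genre here</title></titleInfo></mods>
</modsCollection>"""

records = ET.fromstring(SAMPLE).findall("mods:mods", MODS_NS)
coverage = element_coverage(records, ["genre", "abstract"])
```

A per-set breakdown like this would show whether a shortfall is spread across all sets or concentrated in one.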

Collections Registry

Tom Habing and I worked out a number of the kinks in the modified Collections Registry, to the point where I would be able to modify current records and add new ones. I'm keeping a running list of questions and feature requests for Tom for when he returns from vacation.

I got as far today as modifying or newly entering the records for UM's 5 collections. I'd appreciate it if someone could check my work-- there are a lot of fields I'm not sure about adding to (such as Topics), even for our own collections, and I'm positive I won't be able to add nearly as much information for collections I know next to nothing about.

Try looking at Making of America. If you'd like to see the specific ASHO administrative information, I can add you as a new editor.

Beating the Bushes

I think I'm correct in saying that we have the following collections pending:
- Columbia (last email sent 6/26; no response)
- Northwestern and CHM (Katherine will be contacting)

Otherwise, we're waiting for the CWG to determine who has, or will have, MODS available from the "short list".

In the meantime, we can do two things:
- check the Registry for potential collections/institutions
- check OAIster for good ASHO collections/institutions

Which one is best to start with?

However, I don't think all the collections that are currently in the Aquifer portal have submission agreements signed. Maybe someone could enlighten me so that I can modify the records. The institutions in the portal are:
- U Penn
- U Michigan (pending w/ modifications, correct?)
- Indiana
- Library of Congress (pending w/ modifications, correct?)
- U Tennessee Knoxville

Metadata Re-exposure

We're working on it. We just gave the work to our brand-new programmer. In the meantime, I've aggregated all the current harvested records and sent them to Chick (processed and originals).

Posted by khage at 01:30 PM | Comments (1)

July 02, 2007

4 New U. Michigan MODS Collections

And now there are four new MODS collections from U. Michigan: The Collected Works of Abraham Lincoln, The Public Papers of the Presidents of the United States, The Transportation History Collection, and Making of America. This brings all Aquifer records in the portal to 238,719.

Thanks to the MWG, we have determined that our High-Level Browse mappings should go into the Subject element, Topic sub-element. However, we need to wait for LoC to handle our request to make High-Level Browse a classification authority. And we haven't figured out the management of our mapping either.

We may try to add Making of America Journals and American Jewess as MODS collections. However, these are serials, and there are no examples of serials in the Aquifer profile, so we need some more advice from the MWG on that.

(The four new MODS collections are also in the MODS portal.)

Posted by khage at 08:27 PM | Comments (0)

June 28, 2007

Updated MODS Records from UM

We finally got around to updating the MODS records for the two collections we provide as MODS -- History of Math and Michigan Counties. The latter is in the Aquifer portal:

http://quod.lib.umich.edu/a/aquifer/
(search "The University of Michigan, University Library")

Mostly these were tweaks to conform to the new-ish Aquifer Guidelines. However, we are planning to do something interesting with the call numbers in the original MARC records we're mapping to MODS. We plan on attaching High-Level Browse topics, based on these call numbers, to the MODS records. If anyone has suggestions for what MODS field to place these topics in, that would be much appreciated. We were originally thinking Genre, but that doesn't seem quite right.

** See High-Level Browse in action at UM: http://lib.umich.edu/ejournals/

Posted by khage at 01:00 PM | Comments (0)

June 20, 2007

Web Stats for DLF Aquifer Portal

I added a Wiki page to the Core group describing the web stats to date. You'll see that June 2007 has been our busiest month to date!

http://wiki.dlib.indiana.edu/confluence/display/DLFAquiferCore/Web+Stats+for+DLF+Aquifer+Portal

Posted by khage at 01:53 PM | Comments (0)

June 19, 2007

Steps to Add an Aquifer Set

So, this is my attempt to list what it takes on my end to add records to the Aquifer portal. There are a bunch of stand-alone files, and if nothing else, perhaps this will encourage Chick and Tom to make this more invisible/automatic!

1. Assuming a successful harvest of the MODS records, I look to see what the set ID is, and if I need to modify the name (I try to avoid set IDs with a dash or underscore in them).

2. I add the set ID and appropriate set name (not necessarily what the data provider has indicated is the set name) to a couple of stand-alone files. One file associates the set ID with the set name, so that the left-hand column in the results list will be populated correctly. The second file associates the set IDs with the set names for the purpose of running the transform tool.

3. Once the files are populated, I run the MODSTransform tool. This runs very quickly, and I get a report of which records are in the file system, which records have URLs (i.e., those that end up being included in the portal), any data conditioning/massaging that needed to happen, etc.

4. I move the resulting [setID]_bib.xml file to the appropriate place in the filesystem, concatenate all set XML files and index the concatenated file.

5. While it's indexing, I change the web files so that they describe the new set added, and update the number of records and data contributors.

6. I rdist (move) all these files to the production server.
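To give Chick and Tom a head start on automating this, steps 1 and 4 above can be sketched in Python. Everything here is illustrative: the function names, the output file name, and the choice to simply strip dashes and underscores are assumptions, and the concatenation is a plain text join that may not match how the real set files are combined before indexing.

```python
import re
import tempfile
from pathlib import Path


def portal_set_id(harvested_set_id):
    """Return the set ID with dashes and underscores removed (step 1)."""
    return re.sub(r"[-_]", "", harvested_set_id)


def concatenate_set_files(directory, output_name="all_sets.xml"):
    """Join every [setID]_bib.xml file in directory into one file (step 4).

    Returns the names of the files that were joined, in the order used.
    """
    directory = Path(directory)
    parts = sorted(directory.glob("*_bib.xml"))
    output = directory / output_name
    with output.open("w", encoding="utf-8") as out:
        for part in parts:
            out.write(part.read_text(encoding="utf-8"))
    return [p.name for p in parts]


# Small demonstration in a throwaway directory with made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "setA_bib.xml").write_text("<rec>A</rec>\n", encoding="utf-8")
    Path(tmp, "setB_bib.xml").write_text("<rec>B</rec>\n", encoding="utf-8")
    joined_names = concatenate_set_files(tmp)
    combined = Path(tmp, "all_sets.xml").read_text(encoding="utf-8")
```

The remaining steps (running MODSTransform, indexing, updating the web files, and the rdist to production) depend on local tools and are left out.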

Posted by khage at 09:46 AM | Comments (0)

June 15, 2007

Institutions interested in creating MODS records

I was wondering how many institutions have shown interest in creating MODS records. We currently have 7: Digital Colls at UM, OCLC, Indiana, LoC, U of Chicago, Deep Blue at UM and Celebration of Women Writers at UPenn. Who has heard of others who are waiting on the finalization of the Aquifer MODS profile to create MODS records?

Can we encourage folks to create MODS records even if they are not perfectly conforming to the profile? I would imagine that the big hurdle is finding (or creating) the appropriate mapping and finding time to make sure the mapping is appropriate for a collection. Once the metadata is created, fitting into the profile at a level of conformance would seem a smaller hurdle. I know that's the case at UM.

Interested in your thoughts.

Posted by khage at 02:35 PM | Comments (6)

June 11, 2007

EDIT: Collection Solicitation Draft

Folks, I've edited the Collection Solicitation draft using the comments of the collection submission advisory group (Tim, Sarah, Jenn, Perry, me). Please comment: go ahead and just comment on this blog entry.

Edits for later: The Best Practices wiki URL will change at some point in the near future. Also, we might wish to point instead to the published versions on the DLF site.

Also, we need to determine who will be responsible for ingestion of collection submissions. I think we can hash that out at Friday's meeting.

****
DLF Aquifer seeks material that fits within an American culture and life theme, broadly defined. The Aquifer Collections Working Group will evaluate the metadata to determine whether it is in the Aquifer scope.

The current method for gathering material is by metadata harvesting using OAI-PMH. We ask data providers to make MODS records available according to the Digital Library Federation/Aquifer Implementation Guidelines for Shareable MODS Records. In addition, potential data providers may find the OAI Best Practices useful, in particular the Shareable Metadata and Data Provider Implementations sections.

We understand that many libraries and cultural heritage organizations have metadata in other formats and we are investigating the possibility of providing mapping support for transformations from MARC, EAD and possibly VRA formats to Aquifer-style MODS. The Aquifer team is modeling workflows for ingesting new collections and would be glad to work with collection providers and their technical contacts to create flexible methods to add content with metadata in other formats. We strongly recommend that these formats be as rich as or richer than MODS or MARC, in keeping with our efforts to provide as much detail as possible. It also bears noting that a requirement of OAI-PMH is that data providers make the oai_dc (simple Dublin Core) format available.

As per the Aquifer mission, the metadata we collect should point to digital objects. These digital resources should also be made available for additional purposes within our framework, e.g., as thumbnails for Asset Actions services; for manipulation of the resources themselves.

For organizations without OAI data providers or plans to set one up, we are also considering offering a Static Repository setup service. We welcome the opportunity to discuss this idea with collection providers. For those interested in this avenue, the requirement is XML-enabled metadata.

When collections are submitted, we ask that the provider ensure the continued availability of the resources through Aquifer by signing a submission agreement. The agreement also outlines the Digital Library Federation's right to use the digital material and metadata within the Aquifer project.

Send expressions of interest and questions to need group/individual.
****

Posted by khage at 04:05 PM | Comments (1)