Further tweaks to the OAI provider

December 21, 2009

We have fixed a bug in the UMProvider (our OAI provider) that left it with more Dublin Core format records than MARC format records. The cause was that we had not implemented the "deleted records" function for the MARC records. About 1,500 records were affected; a re-harvest will pick up these now-deleted records (for example, records whose rights status was changed to in-copyright). We apologize for the inconvenience.
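For harvesters who want to act on deleted records after a re-harvest, here is a minimal sketch in Python. It relies on the OAI-PMH 2.0 convention that a deleted record carries status="deleted" on its header element. The sample response and its identifiers are hypothetical and are not taken from the UMProvider.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical response, for illustration only.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:record-1</identifier>
      <datestamp>2009-12-01</datestamp>
    </header>
    <header status="deleted">
      <identifier>oai:example.org:record-2</identifier>
      <datestamp>2009-12-20</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""


def split_deleted(xml_text):
    """Return (live_ids, deleted_ids) found in an OAI-PMH response."""
    root = ET.fromstring(xml_text.strip())
    live, deleted = [], []
    for header in root.iter(OAI + "header"):
        identifier = header.findtext(OAI + "identifier")
        # A header with status="deleted" marks a record withdrawn by the provider.
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            live.append(identifier)
    return live, deleted
```

A harvester could remove the identifiers in the second list from its local store, so that records withdrawn from the provider do not linger locally.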

We have also discovered that there is still a discrepancy of about 1,800 records between the number of records in the provider and the number of records in our HathiTrust databases. This is partly because the provider runs a day or two behind the databases, but that is only a partial explanation, so we will keep an eye on the discrepancy.

As of today, there are 391,640 HathiTrust records (not volumes) in the provider and 882,409 total records.

Please do let us know if you have questions or comments about the provider. We're interested in how others are using it.

Updates to our UM OAI provider

October 05, 2009

We have been making improvements to our OAI provider (UMProvider). We host the metadata for HathiTrust public domain texts through the provider, as well as all the metadata for text and image collections in the UM Digital Library.

Our first improvement was to make the provider faster to harvest. The provider uses MySQL tables to store, sort, and provide access to the metadata, and our method for sorting the data was one of the causes of the slow harvesting.

Our second improvement comes from our investigation into the increasing number of deleted HathiTrust records that were showing up in the provider, and a discrepancy between the number of records in the provider and the number of records in our HathiTrust databases. We have not fully determined the cause of this, but we have been able to restore over 30,000 HathiTrust records that were marked as deleted in the provider.

Consequently, we recommend you harvest the provider from scratch, whether the entire metadata set or a particular set. It will be quick, and you'll get those missed records. We will keep you posted on further improvements.

(The UMProvider can be accessed via http://quod.lib.umich.edu/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc. There is useful information about the HathiTrust records in the provider at http://www.hathitrust.org/data.)
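As a rough illustration of harvesting from scratch, the sketch below pages through ListRecords responses by following the resumptionToken defined in OAI-PMH 2.0. The function names (build_url, fetch_xml, harvest) are our own for this example, and the fetch parameter is there only so the loop can be exercised without network access. It omits error handling and retries, so treat it as a starting point rather than a finished harvester.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "http://quod.lib.umich.edu/cgi/o/oai/oai"
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def build_url(base_url, metadata_prefix="oai_dc", set_spec=None, token=None):
    """Build a ListRecords request URL."""
    if token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it is sent with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urllib.parse.urlencode(params)


def fetch_xml(url):
    """Download one response page and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def harvest(base_url=BASE_URL, metadata_prefix="oai_dc", set_spec=None,
            fetch=fetch_xml):
    """Yield every <record> element, following resumption tokens to the end."""
    token = None
    while True:
        url = build_url(base_url, metadata_prefix, set_spec, token)
        root = ET.fromstring(fetch(url))
        for record in root.iter(OAI + "record"):
            yield record
        token_el = root.find(".//" + OAI + "resumptionToken")
        has_token = token_el is not None and token_el.text
        token = token_el.text.strip() if has_token else None
        if not token:
            break  # An absent or empty token means the list is complete.
```

Calling harvest() with no arguments requests the oai_dc format from the base URL above; passing set_spec restricts the harvest to a single set.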

Changes to the University of Michigan OAI data provider

January 07, 2009

We've made some changes to the University of Michigan OAI data provider.
baseURL = http://quod.lib.umich.edu/cgi/o/oai/oai

The data provider now reflects the fact that we are providing records from the HathiTrust Digital Library (http://www.hathitrust.org/), formerly called MBooks. From reading this blog, you probably know that the HathiTrust Digital Library contains Google-digitized books and journals from a consortium of institutions including the University of Michigan.

Consequently, OAI sets that were originally named "mbooks" are now named "hathitrust". If you harvest from us, please update the set names in your harvester's configuration to reflect this.
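One way to handle the rename in a harvester is sketched below: ask the provider for its current sets with the standard ListSets verb, and rewrite any stored "mbooks" set name. The helper names are our own, and we assume that any subsets follow the OAI-PMH colon-separated setSpec convention (a hypothetical "mbooks:pd" would become "hathitrust:pd"). Check the ListSets output for the actual names before relying on this.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def list_sets_url(base_url):
    """Build the request URL that asks a provider for its sets."""
    return base_url + "?" + urllib.parse.urlencode({"verb": "ListSets"})


def parse_set_specs(xml_text):
    """Return every setSpec value found in a ListSets response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(OAI + "setSpec")]


def migrate_set_name(old_spec):
    """Rename a stored 'mbooks' set to 'hathitrust'.

    Assumption: only the leading 'mbooks' component changed, and any
    subset suffix after a colon is kept as-is.
    """
    if old_spec == "mbooks" or old_spec.startswith("mbooks:"):
        return "hathitrust" + old_spec[len("mbooks"):]
    return old_spec
```

Comparing the migrated names against the list returned by parse_set_specs is a cheap way to confirm that each set still exists before harvesting it.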

In addition, we have modified the MARC and oai_dc formats to correct and amplify the information we are providing, based on feedback from those who have harvested us in the past. For instance, the 245 field now includes the statement of responsibility (subfield c). We hope the records will be more useful as a result.
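To show where the statement of responsibility sits in a harvested record, here is a small sketch that pulls subfields a and c out of the 245 field. It assumes the MARC records arrive serialized as MARCXML (the Library of Congress MARC21 "slim" schema), which this post does not confirm; the sample record is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the MARC21 "slim" XML schema (assumed serialization).
MARC = "{http://www.loc.gov/MARC21/slim}"

# Invented record, for illustration only.
SAMPLE_RECORD = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title /</subfield>
    <subfield code="c">by A. N. Author.</subfield>
  </datafield>
</record>
"""


def title_statement(record_xml):
    """Return (title, statement_of_responsibility) from the 245 field.

    Either value is None when the subfield (or the field) is absent.
    """
    root = ET.fromstring(record_xml.strip())
    for field in root.iter(MARC + "datafield"):
        if field.get("tag") == "245":
            subfields = {
                sf.get("code"): (sf.text or "")
                for sf in field.iter(MARC + "subfield")
            }
            return subfields.get("a"), subfields.get("c")
    return None, None
```

Records harvested before this change would return None for the second value, which is an easy way to spot records that need re-harvesting.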

We'd love to hear any feedback you might have on the changes.

Google Still Not Indexing Hidden Web URLs

July 22, 2008

Read our recent article in D-Lib Magazine:

This report is a follow-up to the McCown et al. article in IEEE Internet Computing two years ago [1], in which the researchers investigated the percentage of URLs from OAI records in Google, Yahoo and MSN search indexes. We were interested in whether Google in particular had increased the number of OAI-based resources in its search index.

Google's indexing does not seem to have retrieved more of the hidden web since the publication of the McCown et al. article in 2006. We would venture to conclude that Google has not endeavoured to increase its support for, or access to, OAI materials. Even taking into account the caveats in our report, we would also conclude that aggregations of OAI records are as valuable for user research purposes as they were at least two years ago.

[1] McCown, F., Liu, X., Nelson, M. L., and Zubair, M. "Search engine coverage of the OAI-PMH corpus." IEEE Internet Computing 10:2 (March/April 2006) pp. 66-73.

University of Chicago integrating MBooks in catalog using OAI

May 15, 2008

UM's online catalog, Mirlyn, is not the only way to access MBooks. You can also harvest the MBooks records directly via our OAI interface. The University of Chicago has done just that, and has integrated these records into its library catalog.

For books that Chicago holds, the catalog now gives users links to both MBooks and Google Books. Serials were excluded because they were more problematic to integrate.

As an example, go to http://lens.lib.uchicago.edu/ and search for "An historical sketch of the native states of India in subsidiary alliance with the British government". The second record provides the link to the MBooks full text of that book.

We're very interested in hearing from other libraries that are using MBooks records in their online catalogs.

Only records for MBooks available in the public domain are exposed through OAI. We have split these into sets containing public domain items according to U.S. copyright law, and public domain items worldwide. There are currently over 121,000 records available for harvesting. We anticipate having 1 million records available when the entire UM collection has been digitized by Google.

For more information, see http://www.lib.umich.edu/mdp/info/OAI.html.

