Advanced Search for HathiTrust full-text search
April 18, 2012
In February we released the first part of the advanced search interface for HathiTrust full-text search. Advanced search allows users to combine a full-text search with searches within specific metadata fields such as Title, Author, or Subject.
For example, if you want to find out where Charles Dickens used the phrase "the best of times", you can search for: [All of these words] [Dickens, Charles] in [Author] AND [This exact phrase] [the best of times] in [Just Full Text]
The advanced search interface also allows you to set limits by publication date, format, or language. Multiple languages or formats can be selected.
Today we released the second phase of advanced search. You can now combine up to four different fields connected by the "AND" or "OR" operators, and any limits you set are retained if you click the "Revise this advanced search" link on the search results page.
HathiTrust "Search in this text." Now with relevance ranking and better multilingual support!
August 29, 2011
Today we released the third high priority feature identified by the HathiTrust Full-Text Working Group: Relevance ranking for "Search in this text." When using the "Search in this text" feature, instead of having to scroll through numerous pages of results in page order, you now get results in relevance order with the most relevant pages at the top of the list. The default is to list only pages that contain all the words in a user's search (a Boolean "AND" search). However, there is also a link that will search for pages containing one or more of the search terms. If this option is selected, the pages containing more of the user's search terms are ranked higher.
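To make the difference between the default "AND" search and the "one or more terms" option concrete, here is a minimal Python sketch. It is not the ranking the production search engine uses; it assumes a plain count of matched terms as a stand-in for relevance, and the function name and sample pages are made up for illustration.

```python
def search_pages(pages, query, require_all=True):
    """Return page numbers ordered from most to least relevant.

    pages: dict mapping page number -> page text.
    require_all=True mimics the default Boolean "AND" search;
    require_all=False mimics the "one or more terms" link.
    """
    terms = set(query.lower().split())
    scored = []
    for number, text in pages.items():
        words = text.lower().split()
        matched = {t for t in terms if t in words}
        if not matched or (require_all and matched != terms):
            continue
        occurrences = sum(words.count(t) for t in matched)
        # Assumption: pages matching more distinct terms rank first,
        # then pages with more total occurrences, then page order.
        scored.append((len(matched), occurrences, number))
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [number for _, _, number in scored]


# Hypothetical sample pages.
pages = {
    1: "the best of times",
    2: "the worst of times and the best of luck",
    3: "a quiet morning",
    4: "best best best",
}

and_results = search_pages(pages, "best times")
or_results = search_pages(pages, "best times", require_all=False)
```

With the sample pages, the "AND" search returns only pages 1 and 2, while the "OR" search also returns page 4, ranked below the pages that contain both terms.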
In addition to relevance ranking, searching for non-Latin languages such as Hindi, Arabic, Hebrew, or Thai, now matches the capabilities of the Full text search of all 9 million volumes.
HathiTrust Full-Text search: Now with Facets!
August 10, 2011
On July 27th we went live with faceted search and relevance ranking based on both OCR and MARC metadata in Full-Text search. (www.hathitrust.org) These are the top two features identified by the HathiTrust Full-Text Working Group.
The relevance ranking now will give volumes that match a user's query terms in both the OCR and in the title or author or subject a higher ranking than a match in only the OCR. There is much more work to be done in tuning relevance ranking, but this is a first step.
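As a rough illustration of the idea, the following Python sketch scores a volume higher when query terms also appear in its metadata. The field weights are invented for the example and are not the values used in production; the real ranking is computed by the search engine, not by code like this.

```python
# Hypothetical weights: a metadata match counts for more than an OCR match.
FIELD_WEIGHTS = {"ocr": 1.0, "title": 3.0, "author": 2.0, "subject": 2.0}


def score_volume(volume, terms):
    """Sum a weight for every query term found in every field."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = volume.get(field, "").lower().split()
        score += weight * sum(1 for t in terms if t in words)
    return score


# Made-up records: both volumes match in the OCR, only "b" matches in metadata.
volumes = [
    {"id": "a", "title": "Bleak House", "author": "Dickens",
     "ocr": "fog everywhere in london"},
    {"id": "b", "title": "London Fog", "subject": "Fog London",
     "ocr": "fog everywhere in london"},
]

terms = ["fog", "london"]
ranked = sorted(volumes, key=lambda v: score_volume(v, terms), reverse=True)
ranked_ids = [v["id"] for v in ranked]
```

Volume "b" matches in the OCR, the title, and the subject, so it ranks above volume "a", which matches in the OCR only.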
Search results can now be refined by selecting facets such as subject, date, or author. Although selecting facets can help users drill down to narrow large result sets, using very specific terms, and especially using phrases in quotes, remains one of the best ways to get reasonably small result sets.
Over the next few months we will be releasing further improvements in ranking and more of the features identified by the task force.
HathiTrust Digital Library Functionality Enhancements
July 09, 2010
We have recently made a number of significant updates to the HathiTrust Digital Library:
* University of Michigan users (and a number of other HathiTrust partner institutions) can now login to the Digital Library using Shibboleth.
* All users can now download full PDFs of public domain volumes that were not digitized by Google. This currently includes nearly 100,000 Internet Archive-digitized volumes that were contributed by the University of California and thousands of volumes digitized locally by the University of Michigan.
* Authenticated users can now download full PDFs of ALL public domain volumes.
* All users can now add items to public or private collections via the full-text search results pages.
Questions or comments? Submit via the feedback link on the HathiTrust website or via DLPSemail@example.com.
Making Personal Collections from Large Scale Search Results
July 07, 2010
We just released a new feature in our full-text Large Scale Search. When you do a search, you will see check boxes next to each search result. You can select items you want from the search results and create a personal collection. This should make it much easier to do repeated, focused searches and explore a targeted subset of the HathiTrust volumes. If you are not logged in, the collection will be temporary. If you log in, you can save the collection permanently.
More Opportunities to buy UM Library Books from the HathiTrust
March 12, 2010
by Maria Bonn
On Wednesday, the Scholarly Publishing Office activated more than 225,000 "buy a reprint" links in the HathiTrust, increasing the number of public domain reprints available for purchase by more than 300%. Potential purchasers are now directed to our publishing partner, BookPrep, a service of Hewlett Packard, where they can preview the book prior to ordering. Over time, these books will also become available on Amazon.com (where you can already find tens of thousands of our titles) and through other distribution channels.
For those of you who are confused by the ever increasing avenues to reprints of UM Library books, know that you are not alone. SPO is working on some simple guides and materials to help you and our users understand the choices.
As we have seen throughout our years selling reprints, these books tend to sell in modest numbers (ones and twos) but in the aggregate we see considerable interest in our titles. Last month, a government document, The Sourcebook in Forensic Serology, Immunology, and Biochemistry enjoyed considerable interest, with sales of ten copies, and the 1907 edition of Dr. Chase's Third, Last and Complete Receipt Book and Household Physician... became a runaway bestseller, selling seventeen copies.
HathiTrust Reaches 5 Million Volumes
December 18, 2009
The HathiTrust repository has reached the 5 million volume mark! Included are texts from 5 institutions: Indiana University, Penn State University, University of California, University of Wisconsin, University of Michigan. Compared to 2007 ARL stats, HathiTrust ranks 29th for volumes held. Holdings are likely to reach 8 million in 2010, comparable to a top 15 ranking. Over 700,000 volumes are in the public domain! Search all 5 million at http://catalog.hathitrust.org/ .
HathiTrust Accessible Interface
October 20, 2009
For the past two years the University of Michigan Library has been making many of our digitized texts (including items that are in-copyright) available to persons with print disabilities through the HathiTrust Digital Library. Our Dean, Paul Courant, recently posted about this project on his blog, so I thought it might be nice to offer more background and some technical information about this project.
In order to determine the best method for the system, we began by conducting research in a number of areas. We explored the technology that users with print disabilities often use to access web content (primarily via screen reader applications and assistive technologies like digital Braille devices), researched accessibility related coding techniques, and met with the campus Services for Students with Disabilities (SSD). After weighing the pros and cons of different options, we decided to do a few things:
- Make our standard interfaces more accessible.
- Create a text-only book interface that is optimized for the specific needs of users with print disabilities (referred to internally as the "SSD interface").
- Create a system to grant additional access to the full-text of a digitized book for certain UM patrons, regardless of the book's copyright status.
Since HathiTrust is a multi-institutional and publicly available access system, we deemed it very important to improve the accessibility baseline. This was done fairly easily by making better use of web standards-based coding techniques like proper use of headings, separating style from content, etc. However, due to the structure of our current system, we were still only able to offer one book page at a time, which results in a less than ideal experience for actually reading a book. So, after talking to some print disabled users, we determined that what was really needed was a simplified text-only interface that could be coded for optimal accessibility and display the entire concatenated text from beginning to end on one single web page. In order to do this, we needed to establish an access policy and authentication mechanisms. We accomplished this through collaboration with SSD, digital library developers & systems administrators, library managers, and Jack Bernard, the UM Assistant General Counsel. Once we worked out the access mechanism, implementing the interface was actually fairly simple. Since the HathiTrust Digital Library uses XML & XSLT, we just had to write one single XSLT style sheet to generate the code for any book as it is requested.
Here's how it works:
- UM student/faculty registers for the program with SSD.
- SSD notifies the Library, and the Library enables the patron's record for access.
- SSD Patron checks out a book that has been digitized (public domain or in-copyright).
- Library Catalog system (Mirlyn) automatically sends them an email containing a URL linking them to the SSD interface.
- Student follows the URL and is prompted to login.
- The system checks their eligibility in the program and that the book is checked out to them and then they're given access to the SSD interface for as long as they have the book checked out.
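The eligibility check in the last step can be sketched in a few lines of Python. This is only an illustration of the rule described above; the class, field, and function names are hypothetical and do not reflect how Mirlyn or the HathiTrust code actually stores patron data.

```python
from dataclasses import dataclass, field


@dataclass
class Patron:
    username: str
    ssd_enabled: bool = False                      # set by the Library after SSD registration
    checked_out: set = field(default_factory=set)  # IDs of books currently checked out


def can_use_ssd_interface(patron, book_id, logged_in):
    """Access requires login, SSD eligibility, and a current checkout of this book."""
    return logged_in and patron.ssd_enabled and book_id in patron.checked_out


# Hypothetical patron and book identifiers.
student = Patron("student1", ssd_enabled=True, checked_out={"book-123"})

allowed = can_use_ssd_interface(student, "book-123", logged_in=True)
not_checked_out = can_use_ssd_interface(student, "book-999", logged_in=True)

# Returning the book ends access, because the check runs on each request.
student.checked_out.discard("book-123")
after_return = can_use_ssd_interface(student, "book-123", logged_in=True)
```

Because the check depends on the current checkout, access ends automatically when the book is returned.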
Here's what it looks like:
During every phase of this project we tried to get feedback from some potential users of this system. Services for Students with Disabilities put us in touch with a few students who piloted the project and provided feedback. We were also very lucky to be able to hire two UM School of Information graduate student interns to help work on this project. One student helped research coding techniques and drafted a set of departmental guidelines. The other, a blind student, conducted evaluations of the regular and SSD interfaces using a variety of assistive technologies. Additionally, we have worked with the National Federation of the Blind for input as well as a round of testing of the SSD interface which resulted in an official endorsement.
There are currently over 3 million University of Michigan volumes available via the HathiTrust. The use of the SSD interface is still relatively low, averaging about 35 pageviews a month but we hope this will increase as more books become available and more students learn about the service. Most of our HathiTrust Digital Library interfaces pass section 508 and WCAG priority 1 validation. We are currently working to get them all up to that standard and hopefully beyond.
- There is still much work to be done to ensure that our system is as accessible as possible. We are in a constant state of development, and we are now beginning to collaborate more with other institutions, so it is easy for edits to be made that cause the code to fail validation. As we continue to develop new tools and functionality, it will become more and more important to follow accessible coding conventions at all stages of development, as it is extremely easy to fall out of compliance with many people working on different parts of the system.
- We are working with OCR content that isn't as complete or flawless as we would like so hopefully one day we'll be able to improve the quality, add descriptions for images, and allow users to suggest corrections.
- The full-text viewing feature is currently only available to UM students and faculty but we hope that through continued collaborations, a similar program will be established at other HathiTrust partner institutions.
- There are always improvements to be made! We welcome any comments or feedback about how to improve our system.
Islamic Manuscripts Digitization
July 08, 2009
The University of Michigan Library has received a CLIR grant to provide specialized access to Islamic manuscripts. Users of the site will be able to contribute to the description/cataloging of each item. We are digitizing 1250 manuscripts as part of this project, and I thought it would be worthwhile to share the following summary of the process.
A combination of in-house and out-sourced digitization is used to digitize the Islamic Manuscripts. It is only recently that we began outsourcing Special Collections materials for digitization, and we are doing pre- and post-process evaluations of a representative sample in order to assess damage and measure risk. Results are not yet available. Conversion of the 1,250 Islamic Manuscripts will span the duration of the 3 year project.
Each manuscript is given an acid-free bookmark with a barcode. The barcode allows DCU to track the progress of each manuscript in-house or at the vendor. Because the manuscripts are unpaginated, DCU staff add page numbers in pencil to the upper outer corners of each page, both to make it easy to verify that all pages have been digitized and to let scholars refer to particular pages in their studies. The pages are numbered back-to-front, in a manner consistent with the reading of the text.
Approximately 70% of the volumes are to be done in-house because they deserve special handling. We use an overhead color scanner with relatively gentle lights to capture color page images as 400 ppi 24-bit color TIFFs. A scanning operator turns the pages by hand, with the utmost concern being to digitize the pages without harming the manuscript. When concern arises over the condition of the manuscript, the manuscript is taken to the Conservation department for consultation, and adjustments are made to the digitization process. Each manuscript is captured from cover to cover to ensure that every part of the item is represented. Each page is cropped to just outside the page edge, both to prevent cropping away the page numbers, and to give scholars the opportunity to see the condition of the page edges. We have a Zeutschel OS10000 overhead color scanner currently, and will be purchasing a CopiBook HD Book Scanner*** to bolster our capacity.
[*** Correction July 14, 2009: The CopiBook Book Scanner is one of the products being considered. A decision has not yet been made.]
The remaining 20% are outsourced to Trigonix Inc. in Montreal, Canada. We have a long working relationship with Trigonix and they do excellent work. The vended cost per page, for color scanning of bound volumes on their overhead color scanners manually, is about $0.30USD. For comparison, the per page cost of black and white scanning of unbound volumes is $0.09USD. We are very satisfied with the quality of the vended scans created by Trigonix for this project.
Preservation and access will be handled by HathiTrust in addition to the site mentioned above where users will be able to make contributions to the cataloging. Location to be determined.
Thanks to Larry Wentzel for his contributions to this write-up.
January 20, 2009
The launch of HathiTrust was #4 on Library Journal Academic Newswire's list of Top Ten Stories for 2008. As they say:
The project is worth watching for a few reasons; notably, its collaborative model, sharing everything from funding to technology and expertise, suggests that, after years of experimentation, libraries at last understand and grasp the needs and challenges facing them in the digital future.
I'd say we've understood that for a long time, but have found successful ways of acting on that understanding.
Changes to the University of Michigan OAI data provider
January 07, 2009
We've made some changes to the University of Michigan OAI data provider.
baseURL = http://quod.lib.umich.edu/cgi/o/oai/oai
The data provider now reflects the fact that we are providing records from the HathiTrust Digital Library (http://www.hathitrust.org/), formerly called MBooks. From reading this blog, you probably know that the HathiTrust Digital Library contains Google-digitized books and journals from a consortium of institutions including the University of Michigan.
Consequently, OAI sets that were originally named "mbooks" are now named "hathitrust". If you're harvesting us, please change your harvesting protocols to reflect this.
In addition, we have modified the MARC and oai_dc formats to correct and amplify the information we are providing, based on feedback from those who have harvested us in the past. For instance, the 245 field now includes the statement of responsibility (subfield c). We hope the records will be more useful as a result.
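For harvesters updating their configuration, here is a small Python sketch that builds a standard OAI-PMH ListRecords request against the baseURL above. The verb and parameter names come from the OAI-PMH protocol; the exact set identifier is an assumption based on the set name given in this post, so confirm it with a ListSets request before harvesting.

```python
from urllib.parse import urlencode

BASE_URL = "http://quod.lib.umich.edu/cgi/o/oai/oai"


def list_records_url(set_spec="hathitrust", metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL.

    set_spec is assumed to be "hathitrust" (formerly "mbooks");
    the provider's ListSets response is the authoritative list.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    return BASE_URL + "?" + urlencode(params)


def list_sets_url():
    """Build the ListSets URL, which shows the set names currently offered."""
    return BASE_URL + "?" + urlencode({"verb": "ListSets"})


records_url = list_records_url()
sets_url = list_sets_url()
```

The sketch only constructs the URLs; fetching them and handling resumption tokens is left to your harvesting software.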
We'd love to hear any feedback you might have on the changes.
Large-scale Full-text Indexing with Solr
December 18, 2008
A recent blog pointed out that search is hard when there are many indexes to search because results must be combined. Search is hard for us in DLPS for a different reason. Our problem is the size of the data.
The Library has been receiving page images and OCR from Google for a while now. The number of OCR'd volumes has passed the 2 million mark. This raises the question of whether it is possible to provide a useful full text search of the OCR for 2 million volumes. Or more. We are trying to find out.
Our primary tool is Solr, a full text search engine built on Lucene. Solr has taken the Library world by storm due to its ease of use. Many institutions have liberated MARC records from the confines of the OPAC, indexed them with Solr and created innovative, faceted search interfaces to aid discovery. However, these systems are based on searching the metadata.
Metadata search is fine as far as it goes but what if you'd like to find books containing the phrase fragment: "Monarchy, whether Elective or Hereditary"? You need full-text search.
These Solr-based systems have indexes of millions of MARC records and perform well. Their index size is in the range of a few tens of gigabytes or less. We've discovered that the index for our one million book index is over 200 gigabytes or 10 times the size of a metadata index for only one-tenth as many items. We hope to scale to 10 million books. Our data show index size is linearly proportional to data size so we expect to end up with a two terabyte index for 10 million books.
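The extrapolation is simple arithmetic. This Python sketch assumes the linear relationship holds all the way to 10 million volumes and uses 200 gigabytes per million volumes, which is the lower bound implied by "over 200 gigabytes".

```python
# Measured: the one-million-volume index is a little over 200 GB.
GB_PER_MILLION_VOLUMES = 200


def estimated_index_gb(volumes):
    """Linear estimate of index size in gigabytes for a given number of volumes."""
    return GB_PER_MILLION_VOLUMES * volumes / 1_000_000


size_gb = estimated_index_gb(10_000_000)  # 2000.0 GB
size_tb = size_gb / 1000                  # 2.0 TB
```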
The large size of the index is due to a variety of factors. One is the fact that the OCR is "dirty" and contains many terms that are useless non-words. Another is that we are indexing works in many languages, again introducing many unique terms. But the biggest contributor is simply the index of word positions. At some point most unique terms are already present in the index and the only growth is in the size of the list of documents containing that term. But every occurrence of every additional word adds its position to the index. The average size of the OCR for a book is about 680,000 bytes, so a single book will add a very large number of entries to the position index. In fact, we've found that this part of the index accounts for 85% of the total index size.
Why do we need word positions, anyway? The answer is that Lucene does phrase searching by comparing the proximity of the terms in the phrase and it uses positions to compute proximity. We absolutely want phrase searching. Our performance tests using a few terms with great informational weight, such as "english" and "constitution", return results in under one second. This holds true even for the phrase "english constitution". These terms are infrequent enough that the computational load to search for them and compare proximities is small. Unfortunately, phrases or term queries that contain words with low weight like "of", "and", "the", i.e. common terms, impose such a large computational load that it can take minutes for these queries to complete.
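To show why positions matter and why they take up so much space, here is a toy positional index in Python. It is a simplified illustration of the general technique, not Lucene's actual data structures or algorithm; the function names and sample documents are invented.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> doc id -> list of word positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return ids of documents where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents containing every term can contain the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            # Each following term must sit exactly one position further along.
            if all(start + offset in index[t][doc_id]
                   for offset, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


docs = {
    1: "the english constitution",
    2: "the constitution of the english",
}
index = build_positional_index(docs)
phrase_hits = phrase_search(index, "english constitution")

# Every word occurrence adds one position entry, so this grows with text length.
position_entries = sum(len(p) for postings in index.values() for p in postings.values())
```

Both documents contain both words, but only document 1 has them next to each other. Note that the number of position entries equals the total number of words indexed, and that a common word such as "the" has long position lists that must be scanned for every phrase containing it.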
But even queries containing many common terms perform adequately if the index size is just a few gigabytes. So what's the problem? It has to do with the speed of data retrieval from RAM vs. disk. RAM is orders of magnitude faster than disk. When most or all of the index fits in RAM, performance is optimal. When Lucene must retrieve pieces of an index from disk because it's too large to fit entirely in RAM, performance suffers. This is the case for our index. We are exploring several paths toward a solution.
One involves splitting the index into what Solr calls "shards" and distributing the shards to two or more machines. In this way, more of the index will fit into RAM. Solr supports result merging from several shards. So far it is unclear how many machines we will need to obtain good performance.
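The result-merging step can be pictured with a short Python sketch. It assumes each shard returns its own hits already sorted by descending score and that scores are comparable across shards; Solr's real distributed search handles more than this, and the function name and sample hits are hypothetical.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, limit=10):
    """Merge per-shard hit lists into one list of the top `limit` hits.

    Each shard list holds (score, doc_id) tuples sorted by descending score.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(islice(merged, limit))


# Made-up results from two shards.
shard_a = [(9.1, "a1"), (4.0, "a2")]
shard_b = [(7.5, "b1"), (6.2, "b2"), (1.0, "b3")]

top_three = merge_shard_results([shard_a, shard_b], limit=3)
```

Because each shard only has to search its own slice of the index, the slices can be small enough to stay in RAM, and the merge itself is cheap.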
The second is a special kind of indexing that creates bi-grams. These are tokens composed of pairs of words. This decreases the number of common terms that must be indexed.
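Here is a minimal Python sketch of the bi-gram idea. It assumes that only word pairs involving a common word are joined, and it uses a made-up three-word list of common terms and an underscore as the joining character; none of these details are stated above.

```python
# Hypothetical list of common (low-weight) words.
COMMON_WORDS = {"of", "and", "the"}


def common_bigrams(text, common=COMMON_WORDS):
    """Return bi-gram tokens for adjacent word pairs that include a common word."""
    words = text.lower().split()
    tokens = []
    for first, second in zip(words, words[1:]):
        if first in common or second in common:
            tokens.append(first + "_" + second)
    return tokens


dickens_tokens = common_bigrams("the best of times")
rare_tokens = common_bigrams("english constitution")
```

A token such as "best_of" occurs far less often than "of" alone, so a phrase query can look up the rarer bi-gram instead of comparing positions across the very long lists for "of" and "the". Phrases made only of uncommon words, like "english constitution", produce no bi-grams and are searched as before.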
Stay tuned for developments.
Historical Math Collection Incorporated into HathiTrust
December 01, 2008
Since the earliest days of MBooks, DLPS has been looking forward to ingesting our previously-digitized page image volumes into the repository. The release of Collection Builder had our colleagues interested in building collections that combined our existing content with the millions of books coming through the Google digitization process, increasing our interest.
Though it has taken longer than we had hoped, the Historical Math Collection is now available through the HathiTrust Collection Builder interface: http://babel.hathitrust.org/cgi/mb?a=listis;c=1730264573
Much of the credit for this goes to our colleagues in Core Services, who have had to develop an ingest process that takes into account the differences in files, formats, and metadata, and Library Systems, who have had to develop a process to create the automatic links in Mirlyn (our OPAC) without the use of the barcode scan that is central to the Google digitization workflow.
The Historical Math collection was selected because of its size -- neither too big nor too small at just under a thousand volumes -- and its age -- not so early in our digitization workflow that many anomalies were likely to be encountered. We will continue adapting the ingest process to incorporate all of our page-image volumes over the coming months.
More work remains to develop a standalone collection interface, similar to the one at http://quod.lib.umich.edu/u/umhistmath, and assess usability.
HathiTrust Search Experiment Goes Live!
November 04, 2008
Immediately after the HathiTrust announcement, one blog said that we'd built the digital library but forgot the front door. Why? Because there was no search functionality included in the initial release.
Large scale search has always been a goal (see http://www.hathitrust.org/large_scale_search for more), and we now have the first attempt at meeting that goal. Come through the door of our public beta at http://babel.hathitrust.org/cgi/ls
As an initial public beta of full text search functionality, we are offering a simple mechanism to search across all of the fully viewable works (both those in the public domain and those for which we have permissions) and a sprinkling of search-only works (i.e., in-copyright works where we may not show the text of the work). The size of the content indexed is approximately 500,000 volumes, and the majority of the works are fully viewable.
This implementation does not include functionality in use elsewhere in HathiTrust (e.g., no sorting or collection-related functionality), and does not have features like clustering of results which are likely to be in a fuller implementation. For this public beta, we have devoted limited system resources such as system processing speed. A full implementation with more robust resources is being planned. Although most searches produce results quickly, some may take several minutes and in fact fail to produce results.
A New HathiTrust Collection for the New Academic Year
September 09, 2008
Start the new year with a look at some Historical Advice to Undergrads. Covering the period from 1856 to 1941, these guides, handbooks, and (let's be frank) sermons offer advice on scheduling your time, choosing appropriate courses, being popular, and remaining virtuous in the face of the temptations that will surround you on campus.
There are books aimed at women (The American College Girl), at international students (Guide Book for Foreign Students in the United States), and at Illini freshmen (Facts for Freshmen Concerning the University of Illinois), as well as a number of general titles that promise to tell you how to get the most out of college.
MBooks is now HathiTrust
September 03, 2008
MBooks is becoming HathiTrust. See the new website for more information: http://www.hathitrust.org.
Roy Tennant has already commented.