New iPhone-friendly interface for religious texts

December 19, 2008

I've been researching mobile interface design for a few months now, so when we got an email yesterday from a user asking if we'd consider making an iPhone interface for her favorite collection, I jumped at the opportunity. The collection she was interested in (the Bible: Revised Standard Version) is one of our oldest "legacy" collections. Luckily, the regular interface is very simple and doesn't contain any tables or even graphics. The main problem with viewing the text on an iPhone is that the phone simply displays a much smaller version of what you would see on a full-size monitor, so the user must zoom in to make the text big enough to read and then do lots of horizontal and vertical scrolling.

Before: [screenshot]

So, in order to get the text to fill the allotted space and wrap nicely, all we had to do was add one meta tag to the <head> of the HTML.

<meta name="viewport" content="width=device-width">

This tells Mobile Safari to lay the page out at the width of the iPhone's screen rather than at its desktop-sized default, so the text wraps to fit without any zooming.

Of course, "simple" is never just that. Because we aren't dealing with static html pages, collection manager extraordinaire Chris Powell had to spend some time trying to figure out how to insert the meta element, attributes, and attribute values & then figure out how to pass them to the CGI perl module.

We have 2 other similar legacy collections, so we went ahead and made the same changes to them as well:

After: [screenshot]


Next on the agenda is to try this with one of the more complicated collections, or with the HathiTrust Digital Library, using mobile-specific style sheets.

Posted by Suzanne Chapman at 03:01 PM.

Large-scale Full-text Indexing with Solr

December 18, 2008

A recent blog post pointed out that search is hard when there are many indexes to search because results must be combined. Search is hard for us in DLPS for a different reason. Our problem is the size of the data.

The Library has been receiving page images and OCR from Google for a while now. The number of OCR'd volumes has passed the 2 million mark. This raises the question of whether it is possible to provide a useful full text search of the OCR for 2 million volumes. Or more. We are trying to find out.

Our primary tool is Solr, a full text search engine built on Lucene. Solr has taken the Library world by storm due to its ease of use. Many institutions have liberated MARC records from the confines of the OPAC, indexed them with Solr and created innovative, faceted search interfaces to aid discovery. However, these systems are based on searching the metadata.

Metadata search is fine as far as it goes but what if you'd like to find books containing the phrase fragment: "Monarchy, whether Elective or Hereditary"? You need full-text search.

These Solr-based systems have indexes of millions of MARC records and perform well. Their index size is in the range of a few tens of gigabytes or less. We've discovered that our one-million-book index is over 200 gigabytes: ten times the size of a metadata index, for only one-tenth as many items. We hope to scale to 10 million books. Our data show that index size is linearly proportional to data size, so we expect to end up with a two terabyte index for 10 million books.

The large size of the index is due to a variety of factors. One is the fact that the OCR is "dirty" and contains many terms that are useless non-words. Another is that we are indexing works in many languages, which again introduces many unique terms. But the biggest contributor is simply the index of word positions. At some point most unique terms are already present in the index, and the only growth is in the size of the list of documents containing each term. But every occurrence of every additional word adds its position to the index. The average size of the OCR for a book is about 680,000 bytes, which works out to something on the order of 100,000 word occurrences, so a single book adds a very large number of entries to the position index. In fact, we've found that this part of the index accounts for 85% of the total index size.

Why do we need word positions, anyway? The answer is that Lucene does phrase searching by comparing the proximity of the terms in the phrase, and it uses positions to compute proximity. We absolutely want phrase searching. Our performance tests using a few terms with high informational weight, such as "english" and "constitution", return results in under one second. This holds true even for the phrase "english constitution". These terms are infrequent enough that the computational load to search for them and compare proximities is small. Unfortunately, phrase or term queries that contain low-weight words like "of", "and", and "the", i.e. common terms, impose such a large computational load that they can take minutes to complete.
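As a rough illustration of how positions make phrase matching possible (this is not Lucene's actual implementation, and the postings below are invented), a positional index lets the engine check whether two terms occur side by side:

# Toy positional index: term -> {doc_id: [word positions]}.
index = {
    "english":      {1: [3, 40], 2: [17]},
    "constitution": {1: [41, 95], 2: [60]},
}

def phrase_match(index, first, second):
    """Return doc ids where `second` occurs immediately after `first`."""
    hits = []
    for doc, first_positions in index.get(first, {}).items():
        second_positions = set(index.get(second, {}).get(doc, []))
        if any(pos + 1 in second_positions for pos in first_positions):
            hits.append(doc)
    return hits

print(phrase_match(index, "english", "constitution"))  # -> [1]

For rare terms like "constitution" the position lists are short and the check is cheap; for terms like "the" the lists are enormous, which is exactly where the cost comes from.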

But even queries containing many common terms perform adequately if the index size is just a few gigabytes. So what's the problem? It has to do with the speed of data retrieval from RAM vs. disk. RAM is orders of magnitude faster than disk. When most or all of the index fits in RAM, performance is optimal. When Lucene must retrieve pieces of the index from disk because it is too large to fit entirely in RAM, performance suffers. This is the case for our index. We are exploring several paths toward a solution.

One involves splitting the index into what Solr calls "shards" and distributing the shards to two or more machines. In this way, more of the index will fit into RAM. Solr supports result merging from several shards. So far it is unclear how many machines we will need to obtain good performance.
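In practice a distributed query looks like an ordinary Solr query with a "shards" parameter listing the machines whose results should be merged. A minimal sketch, assuming hypothetical shard host names and the JSON response writer:

import urllib.parse
import urllib.request

# Hypothetical shard hosts; Solr merges the results from each shard for us.
params = urllib.parse.urlencode({
    "q": '"english constitution"',
    "shards": "solr-shard1:8983/solr,solr-shard2:8983/solr",
    "wt": "json",
})
with urllib.request.urlopen("http://solr-shard1:8983/solr/select?" + params) as resp:
    print(resp.read().decode("utf-8"))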

The second is a special kind of indexing that creates bi-grams: tokens composed of pairs of adjacent words. Because a pair like "of the" occurs far less often than "of" or "the" on their own, queries that contain common terms have much less index data to plow through.
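As a rough sketch of the idea (not the exact Solr filter we will end up using), common words can be glued to their neighbors at index time so that a phrase query can be answered from rare combined tokens instead of the huge posting lists for "of" and "the":

COMMON = {"of", "and", "the", "a", "an"}

def common_bigrams(tokens):
    """Emit a combined token whenever a word pair includes a common term,
    while keeping the ordinary single-word tokens as well."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens) and (tok in COMMON or tokens[i + 1] in COMMON):
            out.append(tok + "_" + tokens[i + 1])
    return out

print(common_bigrams("history of the english constitution".split()))
# ['history', 'history_of', 'of', 'of_the', 'the', 'the_english',
#  'english', 'constitution']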

Stay tuned for developments.


Posted by Phillip Farber at 05:21 PM.

Historical Math Collection Incorporated into HathiTrust

December 01, 2008

Since the earliest days of MBooks, DLPS has been looking forward to ingesting our previously digitized page-image volumes into the repository. The release of Collection Builder, which got our colleagues interested in building collections that combine our existing content with the millions of books coming through the Google digitization process, only increased that interest.

Though it has taken longer than we had hoped, the Historical Math Collection is now available through the HathiTrust Collection Builder interface: http://babel.hathitrust.org/cgi/mb?a=listis;c=1730264573

Much of the credit for this goes to our colleagues in Core Services, who had to develop an ingest process that takes into account the differences in files, formats, and metadata, and to Library Systems, who had to develop a process to create the automatic links in Mirlyn (our OPAC) without the barcode scan that is central to the Google digitization workflow.

The Historical Math collection was selected because of its size (neither too big nor too small at just under a thousand volumes) and its age (not so early in our digitization workflow that many anomalies were likely to be encountered). We will continue adapting the ingest process to incorporate all of our page-image volumes over the coming months.

More work remains to develop a standalone collection interface, similar to the one at http://quod.lib.umich.edu/u/umhistmath, and to assess usability.

Posted by Chris Powell at 03:37 PM.