PictureIt Rare Book Reader

March 08, 2010

We bring old books to life. See them again for the first time.

by Catherine Soehner

It is my pleasure to announce the public debut of PictureIt Rare Book Reader (http://www.lib.umich.edu/pictureit). A collaborative effort between several Library units, the product is now available for shared use after 18 months in development.

PictureIt is a web-based animation program that gives users the sensation of turning the pages of digitized rare materials that would otherwise be difficult, if not impossible, to view or obtain. Volume 1 of John James Audubon’s Birds of America was selected as the inaugural PictureIt book for several reasons. Foremost, the eight-volume set has special meaning as the first purchase made for the Library by the Board of Regents of the University of Michigan. In addition, the University of Pittsburgh had already digitized all volumes of the Birds of America set and was willing to share the images with us. Finally, the illustrated plates of this set are intricately executed, making them as much artwork as scientific work, and their complexity demonstrates the product’s embedded magnification tool, which allows users to get up close and view the details of each illustration.

While the Library is excited to share Volume 1 of Audubon’s Birds of America within the University of Michigan community, the scope of the PictureIt project is much larger. The animation programming for PictureIt was designed as a template to allow for the easy and quick insertion of other digitized rare materials. The PictureIt project is also under a Creative Commons License (http://creativecommons.org/licenses/by/3.0/deed.en), which allows others to use and change the programming with proper attribution to the University of Michigan. As a result, we hope many institutions will post their digitized rare materials using PictureIt, creating a growing collection of primary source materials available for worldwide viewing.

I wish to express my deep gratitude to the many people who participated in bringing PictureIt from idea to finished product, including Lilienne Chan, Peggy Daub, Sara Henry, Karen Jordan, Melissa Levine, Ken Varnum, John Weise, and John Merlin Williams. I also would like to extend a special thanks to Eric Maslowski, who provided the programming skills and the vision of a template for this product.

Update April 2, 2010: Application and Source Now Available


Posted by John Weise at 11:17 AM. Permalink | Comments (8)

Searching for MBooks in Mirlyn

August 15, 2008

There are three ways to find MBooks in Mirlyn, the U-M online catalog:

1. Click on "Find Other Library Catalogs" in the upper right side of the Mirlyn screen, and you'll see the entry for MBooks/HathiTrust in the center of the page.

2. Limit searches in Advanced Search to "MBooks only" using the checkbox.

3. In Command Language, search for "wct=mdp"

You may have noticed that many MBooks records contain this reproduction note:

Electronic text and image data Ann Arbor, Mich. : University of Michigan Library 2008 Includes both image files and keyword searchable text. [Michigan Digitization Project]

These notes are going away. Searching on the phrase "michigan digitization project" in Mirlyn no longer retrieves all MBooks. Instead, use one of the methods described above.

Finally, we have gotten questions about items in Mirlyn with links to Google Book Search, but no link to MBooks. This occurs when Google digitizes a book from another source before they digitize our copy. An example of this can be found (for the moment) in this record. The link to GBS is created using Google's API. Eventually, Google will digitize the U-M copy and a link to MBooks will appear in Mirlyn.

Posted by Perry Willett at 08:14 AM. Permalink | Comments (0)

Languages in MBooks

August 01, 2008

Many people have asked us about the languages available in MBooks. In particular, they want to know if Google is providing searchable text for non-Western languages or difficult scripts. Most Western European languages have been available from the beginning of the project, but here are some examples of books in languages that Google has added in the past few years:

* Chinese: http://hdl.handle.net/2027/mdp.39015055131992
* Japanese: http://hdl.handle.net/2027/mdp.39015067188378
* Hebrew: http://hdl.handle.net/2027/mdp.39015019327512
* German/Fraktur: http://hdl.handle.net/2027/mdp.39015070866887
* Russian: http://hdl.handle.net/2027/mdp.39015028011768
* Czech: http://hdl.handle.net/2027/mdp.39015026722820
* Polish: http://hdl.handle.net/2027/mdp.39015055374857
* Greek: http://hdl.handle.net/2027/mdp.39015047659472

The process used to convert from page images to text is called Optical Character Recognition, or OCR. (You can view the OCR text of any of the pages by switching to "text" under "view page as" on the left-hand menu in the pageturner.) Without good OCR, there's no way to search the books. Google is all about search, and they're working to improve the OCR they produce. However, the multitude of languages, scripts, and fonts in this collection poses a serious problem for OCR, and it's likely that Google won't be able to OCR all languages as they encounter them. In addition, the quality of the page itself is critical to good OCR. In many older books, particularly those published between 1850 and 1950, the paper has deteriorated and discolored, resulting in lower-quality OCR.

I can read German, so I know that the OCR for the Fraktur script in the above example isn't perfect. However, given that there isn't much OCR software that can handle Fraktur, it's not bad. I don't read any of the other languages, so I can't make any judgment about the accuracy of the OCR in the rest of the list.

I think that this is an area that Google will continue to improve. You will be able to find examples of books in these languages with very poor OCR. Google is reprocessing texts and will send us new and improved versions, so we will get better OCR as the project progresses.

One of the complexities of this work is assessing the quality of OCR in languages that you don't read. I don't read Italian or Spanish, but they use the same alphabet and Latin roots as other western European languages, so I'm able to at least verify the words without knowing the exact meaning. Chinese, Japanese, Korean, Hebrew, Russian and Greek present many more problems for me. For instance, the text in most of the books in Chinese that I've seen runs from top to bottom (including the example in Chinese above), but the OCR goes left to right. Is that right? Are all the characters there, in the correct order?

The Greek title in the list shows another complexity with OCR: the pages alternate between Latin and Greek, but the text has Greek characters throughout. It's difficult for most OCR software to handle multiple languages in the same book.

We don't have a lot of experience in dealing with non-Western languages in the Digital Library Production Service department, and we'll be reaching out to experts--in the library, in the university, in our consortium--to help us answer questions.

Hebrew and other languages that read right-to-left present special problems for us. In looking at the example in Hebrew in the above list, it looks like the glyphs have been converted correctly, but we're using a right-justified margin rather than left-justified. Here's a sample page image:

MBooks Language page image

And here's the OCR:

MBooks Language text example

We'd welcome hearing from users about these issues.

Posted by Perry Willett at 10:26 AM. Permalink | Comments (0)

Google Still Not Indexing Hidden Web URLs

July 22, 2008

Read our recent article in D-Lib Magazine:

This report is a follow-up to the McCown et al. article in IEEE Internet Computing two years ago [1], in which the researchers investigated the percentage of URLs from OAI records in Google, Yahoo and MSN search indexes. We were interested in whether Google in particular had increased the number of OAI-based resources in its search index.

Google's indexing does not seem to have retrieved more of the hidden web since the publication of the McCown, et al. article in 2006. We would venture to conclude that Google has not endeavoured to increase their support and access to OAI materials. Even taking into account the caveats in our report, we would also conclude that aggregations of OAI records are as valuable for user research purposes as they were at least two years ago.

[1] McCown, F., Liu, X., Nelson, M. L., and Zubair, M. "Search engine coverage of the OAI-PMH corpus." IEEE Internet Computing 10:2 (March/April 2006) pp. 66-73.

Posted by Kat Hagedorn at 04:38 PM. Permalink | Comments (0)

Top Ten MBooks Collections

July 21, 2008

Three weeks after launch, we can say a little bit about MBooks collection builder usage. Right now, there are 47 public collections (more than half were created by LIT staff) and 170 personal collections.

I've done a little bit of rough assessment, and can report on the ten most-used MBooks collections (they are all public collections). Collection usage includes viewing the collection page, searching the collection, sorting the books in the collection, and copying items to another collection. It does not include searching or viewing the items within that collection -- tracking use of a book from a collection vs. from Mirlyn vs. from links from blogs was outside the scope of my quick-and-dirty analysis. Usage from our network range was not included in this assessment.

Here they are:

  1. Abraham Lincoln: Fact and Fable
  2. Great Britain
  3. Ann Arbor History
  4. How to be a Domestic Goddess
  5. Gothic literature
  6. Historical Bicycling
  7. Adventure Novels: G.A. Henty
  8. What It Was, Was Football
  9. Patents
  10. French Texts

Abraham Lincoln: Fact and Fable is twice as popular as the next-most popular collection, Great Britain, which is almost twice as popular as Ann Arbor History. As far as I can tell, none of these collections is linked from anywhere else except for the G. A. Henty Adventure Novels, which is included as a link in Henty's Wikipedia entry. Even with the minimal metadata presently available on the Public Collections page, people are finding and using collections that are interesting to them.

Posted by Chris Powell at 03:17 PM. Permalink | Comments (0)

New MBooks!

July 01, 2008

As previously mentioned, we've been working on expanding the functionality of our MBooks system.

The new interface now allows users to create their own collections of MBooks items and view public collections created by others. Users can also do full text searching across all items within a collection.

So, check it out! MBooks Public Collections Page

We have quite a few more enhancements planned down the road that include adding MTagger and making it easier to find MBooks items in Mirlyn.

We quietly released it last week so we could discover any remaining bugs and (my personal nemesis) browser display problems. We hope we caught them all, but please let us know if you experience any weird behavior. You can contact us via mdp-help@umich.edu or the feedback form linked to from the top of every MBooks page.

And please take a few minutes to fill out our quick survey to help us decide what features to add next.

Posted by Suzanne Chapman at 12:57 PM. Permalink | Comments (2)

Browsing in MBooks?

June 18, 2008

Last month I attended the annual Digital Library Federation spring meeting and David Rumsey, renowned for his collection of historical maps, was one of the keynote speakers. Prompted by David Rumsey’s map ticker (http://www.davidrumsey.com/ticker.html) and what he said in passing about "moving among the maps" in Second Life, I’ve been brooding about the perceived lack of browsability in the digital library context. How would we "move among the books" in MBooks?

Presumably, one way we could do it would be to make a book ticker – perhaps with covers or title page thumbnails, arranged in call number order (as one would browse a shelf).

That raises a few immediate practical questions:

1. Do we have identified title pages or cover thumbnails for all the books? What do we do for cases where we don’t?
2. Should we precompute thumbnails or try to derive them on the fly?
3. Can we use the Mirlyn call number to browse? They aren’t in the MARC record per se.

These practical questions raise a number of other usability issues, of course. Some are about thumbnails – what size would the thumbnails have to be to make them useful? When you clicked on them, where would you end up? Could you hover over them and see some volume metadata? Can we show thumbnails for in-copyright items? Others are about call number browsing – would you really want to browse all items by call number, or just those from a given library? That is, browse the "real" stacks for a holding location, like Shapiro Undergraduate Library, or the superset of all libraries, the stacks as they’ve never been in the physical world?

To me, the latter choice seems like the best one – it’s something that is only possible in a digital library, as we’d be drawing together items that are housed in separate buildings yet may be related. How do you imagine browsing in the digital library?
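One way to picture the "superset of all libraries" shelf is as a single list sorted by call number. The following is a minimal sketch, assuming Library of Congress-style call numbers. The sample records, the `shelf_key` helper, and the simplified parsing rule are illustrative assumptions, not a description of how Mirlyn or MBooks stores or sorts call numbers.

```python
import re

# Simplified pattern (an assumption): class letters, class number, then the rest.
_CALL_NUMBER = re.compile(r"^\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)\s*(.*)$")


def shelf_key(call_number):
    """Return a sort key that orders call numbers roughly as on a shelf."""
    match = _CALL_NUMBER.match(call_number.upper())
    if match is None:
        # Call numbers we cannot parse sort last, in plain string order.
        return (1, call_number, 0.0, "")
    letters, number, rest = match.groups()
    # Comparing the class number numerically puts QA9 before QA76.9,
    # which a plain string sort would get wrong.
    return (0, letters, float(number), rest)


# Hypothetical records drawn from two holding locations.
books = [
    {"location": "Shapiro Undergraduate Library", "call_number": "QA76.9 .D3"},
    {"location": "Another library (hypothetical)", "call_number": "QA9 .B6"},
    {"location": "Shapiro Undergraduate Library", "call_number": "PR4874 .A1"},
    {"location": "Another library (hypothetical)", "call_number": "Q175 .K95"},
]

# The combined shelf ignores location and orders everything by call number.
virtual_shelf = sorted(books, key=lambda book: shelf_key(book["call_number"]))
shelf_order = [book["call_number"] for book in virtual_shelf]
```

Browsing a single holding location would be the same sort applied after filtering on `location`.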

Posted by Chris Powell at 04:23 PM. Permalink | Comments (3)

Google Book Search links in Mirlyn

June 13, 2008

You may have noticed that the links to Google Books in Mirlyn have a little more information lately. We have always provided links to online copies in both Google Book Search and MBooks. We're now using the Google API to provide links to any book in Mirlyn that is also in Google Book Search.

We provide a thumbnail image of the cover or title page (although there's been some controversy about this lately). In addition, we also tell you what level of access you can expect if you follow the link to Google Book Search. Google Books has three levels of access, while MBooks has only two:

Google Book Search terms | MBook terms
Snippet view             | Search Only
Limited view             |
Full-text                | Full Text

In Google Book Search, "Snippet view" means that you cannot view the full-text, but can see up to three text snippets; "Search Only" in MBooks means that you can search for keywords, and discover where all the matches occur, but can't view the pages. (See this previous post for more about "Search Only.") "Limited view" means that the book is part of Google's Publisher Partnership, and a limited number of pages is available for reading. You won't be able to see the entire book, but you will have access to a significant number of pages. "Full-text" in Google Book Search means that you can view the entire text, and get a PDF file of the entire text, while "Full Text" in MBooks means that you can view the page images using the MBooks pageturner, and get a 10-page PDF excerpt.

If you look at very many records for MBooks in Mirlyn, you will soon note that in some cases the access levels differ between MBooks and Google Book Search.

In this last example you'll have full-text in either Google Books or MBooks, so you can decide which interface you prefer. Knowing how to read the Mirlyn record will help you find the best access for any given book. Happy reading!

--Perry Willett
--Head, Digital Library Production Service

Posted by Perry Willett at 09:13 AM. Permalink | Comments (11)

Preview of the new Collection Builder tool

June 09, 2008

Over the past year we've been developing a new collection building tool to be used in conjunction with the MBooks "page-turning" application already available. This tool will allow users to create their own collections of MBooks items and view public collections created by others. Users will also be able to do full text searching across all items within a collection.

We're still working out some bugs and interface issues but hope to release soon. Check back in July!

MBooks preview

Posted by Kat Hagedorn at 11:36 AM. Permalink | Comments (5)

Page numbers and URLs in MBooks

June 06, 2008

We get questions from MBooks users (most recently from dfulmer in the comments to this post) about how to link to pages, what the URL parameters such as "num" and "seq" mean, and other questions about links and page numbers.

There are a couple of issues. The first is about URLs. The most stable and persistent URL is the one that we include in the Mirlyn record, and also at the top of the pageturner with other descriptive metadata. It's called a "handle" and is a robust persistent identifier managed by CNRI (more on handles at http://www.handle.net/). They look like this:


and this is the URL that we encourage people to use and save. However, since they all start with http://hdl.handle.net/2027, people don't recognize them as belonging to the University of Michigan. Users are much more familiar with URLs that include the umich.edu domain. Nevertheless, since these handles are persistent and robust ("2027" is registered with CNRI as belonging to us) these are the URLs that should be used.
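As a small illustration of the pattern, the sketch below builds a handle URL from a volume identifier. The `handle_url` and `is_um_handle` helpers are hypothetical; the resolver address and the "2027" prefix come from the text above, and the sample identifier is taken from one of the MBooks links elsewhere on this blog.

```python
HANDLE_RESOLVER = "http://hdl.handle.net"
UM_PREFIX = "2027"  # registered with CNRI as belonging to the University of Michigan


def handle_url(volume_id):
    """Build the persistent handle URL for a volume identifier."""
    volume_id = volume_id.strip()
    if not volume_id:
        raise ValueError("volume_id must not be empty")
    return f"{HANDLE_RESOLVER}/{UM_PREFIX}/{volume_id}"


def is_um_handle(url):
    """Return True if the URL is a handle under the University of Michigan prefix."""
    return url.startswith(f"{HANDLE_RESOLVER}/{UM_PREFIX}/")


example = handle_url("mdp.39015055131992")
```

A check like `is_um_handle` is one way a script could recognize these links as ours even though they do not include the umich.edu domain.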

Other URLs will be less stable. The sharper-eyed among our readers will have noted that our URLs recently changed from starting with "mdp.lib.umich.edu" to "sdr.lib.umich.edu". We will redirect users any time they use a URL starting with "mdp.lib.umich.edu" but these local domain names will change over time. The same is true for the URL parameters such as "page," "num," "seq," "orient," etc. Phil Farber's response to the same post noted above provides documentation on what these mean, but be aware that these will change without warning. URL hacking will lead to tears before bedtime.

The other related issue has to do with page numbers and other metadata. People will notice that many MBooks include a table of contents with page numbers on the left-hand side, such as this one. You may also notice that some books lack this table of contents, and use "sequence" instead of page numbers. Here's an example of a book for which we do not have page numbers.

It all has to do with the metadata. At a minimum, we know the sequence in which the pages of any given book should be displayed. The pageturner buttons for forward and backward use this information to work properly, but for some books, this is all the information we have. Since the sequence of pages starts with the front cover, it's unlikely that the sequence number will match the actual page number. (And as Suzanne noted in her comments to this post, if someone has a better term than "sequence" please let us know!) Many of these books without page numbers were early efforts by Google; they are sending us newer, better versions of these books, so eventually the entire collection will include page numbers.

In many (soon most or all) cases we will have page numbers, along with additional metadata identifying title pages, tables of contents, first pages of sections, and other page features. We get these metadata from Google. We don't know how Google generates them, but it's undoubtedly an automated method. This means that they won't be perfect. When we do have metadata indicating the title page, we will open the book to the title page as a default. If we don't have any metadata about the title page, we will open to the first image (usually the front cover).
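The default-page rule described above can be sketched in a few lines. This is only an illustration: the page records, the `features` field, and the "TITLE" label are assumptions made for the example, not the actual metadata format we receive from Google.

```python
def default_sequence(pages):
    """Return the sequence number the pageturner should open to.

    `pages` is a list of dicts in display order. Each has a 'seq' number
    and may have a 'features' collection of labels (hypothetical format).
    """
    if not pages:
        raise ValueError("a volume needs at least one page image")
    for page in pages:
        if "TITLE" in page.get("features", ()):
            # Metadata identifies a title page: open there.
            return page["seq"]
    # No title page metadata: fall back to the first image (usually the front cover).
    return pages[0]["seq"]


with_title = [
    {"seq": 1, "features": {"FRONT_COVER"}},
    {"seq": 7, "features": {"TITLE"}},
    {"seq": 9, "features": {"TABLE_OF_CONTENTS"}},
]
without_title = [{"seq": 1}, {"seq": 2}, {"seq": 3}]
```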

Page numbers are, to quote the kids, whack. In some books, they are out of sequence, or repeated, or misnumbered, or missing. With many journals, the library has bound together two or more issues, each with its own pagination from 1 to whatever. Therefore, the online volume could have multiple pages numbered 207, as in the example that David points to in his comments to the post mentioned above. Right now, MBooks will take you to the first instance of p. 207 if you type that into the "goto" box. We could probably do something to alert people to the fact that there are multiple pages numbered 207, and give them links to each of them.
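To make the idea of alerting readers to duplicate page numbers concrete, here is a rough sketch. It assumes each page record carries a sequence number and an optional printed page number; the field names and the `goto` helper are hypothetical and do not describe the current MBooks code.

```python
from collections import defaultdict


def build_page_index(pages):
    """Map each printed page number to every sequence number that carries it."""
    index = defaultdict(list)
    for page in pages:
        number = page.get("page_number")
        if number is not None:
            index[number].append(page["seq"])
    return index


def goto(index, page_number):
    """Return (first match, other matches) for a printed page number.

    The first match mirrors the current behavior of the "goto" box;
    the other matches are what an alert could link to.
    """
    matches = index.get(page_number, [])
    if not matches:
        return None, []
    return matches[0], matches[1:]


# Hypothetical bound volume: two issues, each with its own page 207.
pages = [
    {"seq": 1, "page_number": None},
    {"seq": 212, "page_number": "207"},
    {"seq": 530, "page_number": "207"},
]
index = build_page_index(pages)
first, others = goto(index, "207")
```

Page numbers are kept as strings here because printed pagination is not always numeric (roman numerals, for example).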

We need to consider having persistent URLs to individual pages. People want to refer to individual pages, and we should have a method with a stable URL to allow them to do it. We could also do more to have a predictive method of referring to a page. Ed Vielmetti recently wrote some ideas about this in his blog.

We will look at this more carefully soon, once we get through the current round of development for collection builder and other new features.

--Perry Willett
--Head, Digital Library Production Service

Posted by Perry Willett at 10:38 AM. Permalink | Comments (0)

Full-Text MBook Searches from the Library Catalog

May 30, 2008

At the University of Michigan Library, in partnership with Google, we have been busily scanning our collections. This opens up lots of possibilities, including an exciting one that launches today: search the full text of a book from within Mirlyn, the library's catalog.

If a book has been scanned by Google, there is a "search in this book" field within the library catalog record. Depending on the particular book, a search will return full-text results (if the book is in the public domain) or a search-term-only view (if the book is in copyright).

Here is an example of an out-of-copyright book (with full-text results available): 1931: A Glance at the Twentieth Century. The record in the catalog looks like this:

Screen Shot of Mirlyn Record with "Search in this Book" Option

And here are the results of that search:

Screen Shot of MBook Search Results

All books that have been scanned -- one million and counting -- are searchable. Search results are linked to the full text for those works that are in the public domain. Search results for books that are still under copyright are shown in brief view. Brief view displays a phrase or two on either side of the search term, but doesn't include full-text display of the page. In either case, the "search in this book" tool will help you decide whether you want the actual book off the shelf before you visit the library or make a delivery request.
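The general idea of showing a few words on either side of a search term can be sketched as follows. This is a generic keyword-in-context illustration, not the code behind the catalog or MBooks; the `brief_view` function and the `context_words` parameter are invented for the example.

```python
import re


def brief_view(page_text, term, context_words=5):
    """Return short snippets showing `term` with a few words of context."""
    words = page_text.split()
    target = term.lower()
    snippets = []
    for i, word in enumerate(words):
        # Strip punctuation so "fox," still matches "fox".
        if re.sub(r"\W+", "", word).lower() == target:
            start = max(0, i - context_words)
            end = i + context_words + 1
            snippets.append(" ".join(words[start:end]))
    return snippets


sample = "The quick brown fox, it is said, jumps over the lazy dog"
snippets = brief_view(sample, "fox", context_words=2)
```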

Try these sample records:

Full-text: The Miscellaneous Writings of Lord Macaulay

Search only: 500 Bracelets: An Inspiring Collection of Extraordinary Designs

Posted by Ken Varnum at 09:25 AM. Permalink | Comments (8)

What to do with books in copyright

May 20, 2008

As is well known, we are digitizing all the bound volumes in our library, including books in copyright. I don't want to address the legal issues surrounding the digitization itself, but instead discuss uses of these materials after digitization. We do not show any part of in-copyright books in MBooks, leading people to wonder why we even bother to digitize them. We can answer that question in a number of ways:

1) Keyword searches. People can still conduct keyword searches within the book. We don't show snippets like Google does in copyrighted works, but we do display how many matches occur in the volume. Also, Google only shows a maximum of three snippets per volume, whereas we list all of the pages on which matches occur. We believe that this is useful information for people deciding whether they want to take the next step and retrieve the book from the library shelf.

2) Access for students with visual impairments. For many years, students with disabilities could request to have books digitized by the UM Office of Services for Students with Disabilities (OSSD). Many universities have similar services. The students could then use the digitized books with screen readers such as JAWS. This is explicitly allowed under section 121 of U.S. Copyright law: http://www.copyright.gov/title17/92chap1.html#121

We now have a system in place for students with visual impairments to use MBooks in much the same way. Once a student registers with OSSD, any time she checks out a book already digitized by Google, she will automatically receive an email with a URL. Once the student selects the link, she is asked to log in. The system checks whether the student is registered with OSSD as part of this program, and whether she has checked out this particular book. If the student passes both of those tests, she will get access to the full text of the book, whether it is in copyright or not, in an interface that is optimized for use with screen readers.

Currently, this system is available to UM students with visual impairments. We are investigating the possibility of including students with learning disabilities as well.

3) Establishing copyright status. One of the conundrums of digitization is knowing the copyright status of any given volume. U.S. copyright law has changed over the years, and many books published after 1923 are actually in the public domain. For instance, under previous copyright law, books needed to have a copyright symbol and statement in order to be eligible for copyright. The terms of copyright were much shorter, but copyright holders could renew their copyright after 28 years.

There is a unit in the library, headed by Judy Ahronheim, that is investigating the copyright status of U.S. works published between 1923 and 1964. They check whether the book contains a copyright statement and symbol, and also whether the copyright was renewed (using the Stanford Copyright Renewals Database at http://collections.stanford.edu/copyrightrenewals/). Over the past year, Judy's staff have examined over 26,000 volumes, and identified almost 15,000 that are in the public domain. These books are now freely available through MBooks as a result of their work.
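The checks described above amount to a small decision procedure, sketched below. It is a deliberate simplification for illustration: the `Volume` fields and the `review_status` function are hypothetical, and the real review involves judgment that a few lines of code cannot capture.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    year: int                   # year of U.S. publication
    has_copyright_notice: bool  # copyright symbol and statement present in the book
    renewal_found: bool         # a renewal was found in the renewals database


def review_status(volume):
    """Classify a volume using the two checks described in the text."""
    if not 1923 <= volume.year <= 1964:
        # Outside the range this review covers.
        return "out of scope"
    if not volume.has_copyright_notice:
        # Under the earlier law, a notice was required to be eligible for copyright.
        return "public domain"
    if not volume.renewal_found:
        # Copyright had to be renewed after 28 years.
        return "public domain"
    return "in copyright"
```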

Thus, there are multiple reasons for us to include copyrighted works in MBooks. Even though we cannot provide access to most of them for the majority of users, we can provide important services that make our collections much more accessible.
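The two tests applied when a student follows the emailed link (item 2 above) can be sketched as a single check. The data structures and the `can_read_full_text` function are assumptions for the sake of the example; the actual system draws on OSSD registration and circulation records.

```python
def can_read_full_text(student_id, volume_id, registered_students, checkouts):
    """Return True only if the student passes both tests.

    registered_students: set of student ids registered with OSSD (hypothetical)
    checkouts: dict mapping a student id to the set of volume ids checked out
    """
    is_registered = student_id in registered_students
    has_checked_out = volume_id in checkouts.get(student_id, set())
    return is_registered and has_checked_out


# Hypothetical sample data.
registered = {"student-a"}
loans = {
    "student-a": {"mdp.39015055131992"},
    "student-b": {"mdp.39015055131992"},
}
```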

Posted by Perry Willett at 01:16 PM. Permalink | Comments (2)

University of Chicago integrating MBooks in catalog using OAI

May 15, 2008

There is an alternative way to access MBooks other than through UM's online catalog Mirlyn. You can harvest the MBooks records directly via our OAI interface. The University of Chicago has done just that, and integrated these records into their library catalog.

Excluding serials, which were more problematic to integrate, Chicago now gives its users links to MBooks and Google Books from the catalog for books that it holds.

As an example, go to http://lens.lib.uchicago.edu/ and search for "An historical sketch of the native states of India in subsidiary alliance with the British government". The second record provides the link to the MBooks full text of that book.

We're very interested in hearing from other libraries that are using MBooks records in their online catalogs.

Only records for MBooks available in the public domain are exposed through OAI. We have split these into sets containing public domain items according to U.S. copyright law, and public domain items worldwide. There are currently over 121,000 records available for harvesting. We anticipate having 1 million records available when the entire UM collection has been digitized by Google.
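For anyone planning a harvest, an OAI-PMH request is just a URL with a few standard arguments. The sketch below builds `ListRecords` requests; the base URL and the set name are placeholders (the real values are in the documentation linked from this post), and `list_records_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Placeholder only: substitute the real OAI base URL from the documentation.
BASE_URL = "http://example.org/oai"


def list_records_url(set_spec=None, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it replaces metadataPrefix and set on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{BASE_URL}?{urlencode(params)}"


# "pd-us-example" is a made-up set name standing in for one of the two sets.
first_request = list_records_url(set_spec="pd-us-example")
next_request = list_records_url(resumption_token="token123")
```

A harvester would fetch the first URL, read the resumption token from the response, and keep requesting until no token is returned.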

For more information, see http://www.lib.umich.edu/mdp/info/OAI.html.

Posted by Kat Hagedorn at 12:30 PM. Permalink | Comments (2)

What is MBooks?

May 14, 2008

MBooks is a partnership between the University of Michigan and Google, Inc. to digitize the entire print collection of the University Library. The digitized collection, called MBooks, is searchable in the library catalog, Mirlyn, as well as in Google Book Search. The full text of works that are out of copyright or in the public domain is available.

For more information about MBooks, see http://www.lib.umich.edu/mdp/.

Posted by Kat Hagedorn at 09:30 AM. Permalink | Comments (0)