Searching for MBooks in Mirlyn

August 15, 2008

There are three ways to find MBooks in Mirlyn, the U-M online catalog:

1. Click on "Find Other Library Catalogs" in the upper right side of the Mirlyn screen, and you'll see the entry for MBooks/HathiTrust in the center of the page.

2. Limit searches in Advanced Search to "MBooks only" using the checkbox.

3. In Command Language, search for "wct=mdp"

You may have noticed that many MBooks records contain this reproduction note:

Electronic text and image data Ann Arbor, Mich. : University of Michigan Library 2008 Includes both image files and keyword searchable text. [Michigan Digitization Project]

These notes are going away. Searching on the phrase "michigan digitization project" in Mirlyn no longer retrieves all MBooks. Instead, use one of the methods described above.

Finally, we have gotten questions about items in Mirlyn with links to Google Book Search, but no link to MBooks. This occurs when Google digitizes a book from another source before they digitize our copy. An example of this can be found (for the moment) in this record. The link to GBS is created using Google's API. Eventually, Google will digitize the U-M copy and a link to MBooks will appear in Mirlyn.

Posted by Perry Willett at 08:14 AM. Permalink | Comments (0)

Languages in MBooks

August 01, 2008

Many people have asked us about the languages available in MBooks. In particular, they want to know if Google is providing searchable text for non-Western languages or difficult scripts. Most Western European languages have been available from the beginning of the project, but here are some examples of books in languages that Google has added in the past few years:

* Chinese: http://hdl.handle.net/2027/mdp.39015055131992
* Japanese: http://hdl.handle.net/2027/mdp.39015067188378
* Hebrew: http://hdl.handle.net/2027/mdp.39015019327512
* German/Fraktur: http://hdl.handle.net/2027/mdp.39015070866887
* Russian: http://hdl.handle.net/2027/mdp.39015028011768
* Czech: http://hdl.handle.net/2027/mdp.39015026722820
* Polish: http://hdl.handle.net/2027/mdp.39015055374857
* Greek: http://hdl.handle.net/2027/mdp.39015047659472

The process used to convert from page images to text is called Optical Character Recognition, or OCR. (You can view the OCR text of any of the pages by switching to "text" under "view page as" on the left-hand menu in the pageturner.) Without good OCR, there's no way to search the books. Google is all about search, and they're working to improve the OCR they produce. However, the multitude of languages, scripts, and fonts in this collection poses a serious problem for OCR, and it's likely that Google won't be able to OCR all languages as they encounter them. In addition, the quality of the page itself is critical to good OCR. In many older books, particularly those published between 1850-1950, the paper has deteriorated and discolored, resulting in lower quality OCR.

I can read German, so I know that the OCR for the Fraktur script in the above example isn't perfect. However, given that there isn't much OCR software that can handle Fraktur, it's not bad. I don't read any of the other languages, so I can't make any judgment about the accuracy of the OCR in the rest of the list.

I think that this is an area that Google will continue to improve. You will be able to find examples of books in these languages with very poor OCR. Google is reprocessing texts and will send us new and improved versions, so we will get better OCR as the project progresses.

One of the complexities of this work is assessing the quality of OCR in languages that you don't read. I don't read Italian or Spanish, but they use the same alphabet and Latin roots as other western European languages, so I'm able to at least verify the words without knowing the exact meaning. Chinese, Japanese, Korean, Hebrew, Russian and Greek present many more problems for me. For instance, the text in most of the books in Chinese that I've seen runs from top to bottom (including the example in Chinese above), but the OCR goes left to right. Is that right? Are all the characters there, in the correct order?

The Greek title in the list shows another complexity with OCR: the pages alternate between Latin and Greek, but the text has Greek characters throughout. It's difficult for most OCR software to handle multiple languages in the same book.

We don't have a lot of experience in dealing with non-Western languages in the Digital Library Production Service department, and we'll be reaching out to experts--in the library, in the university, in our consortium--to help us answer questions.

Hebrew and other languages that read right-to-left present special problems for us. In looking at the example in Hebrew in the above list, it looks like the glyphs have been converted correctly, but we're using a right-justified margin rather than left-justified. Here's a sample page image:

MBooks Language page image

And here's the OCR:

MBooks Language text example
We'd welcome hearing from users about these issues.

Posted by Perry Willett at 10:26 AM. Permalink | Comments (0)