Languages in MBooks
August 01, 2008
Many people have asked us about the languages available in MBooks. In particular, they want to know if Google is providing searchable text for non-Western languages or difficult scripts. Most Western European languages have been available from the beginning of the project, but here are some examples of books in languages that Google has added in the past few years:
* Chinese: http://hdl.handle.net/2027/mdp.39015055131992
* Japanese: http://hdl.handle.net/2027/mdp.39015067188378
* Hebrew: http://hdl.handle.net/2027/mdp.39015019327512
* German/Fraktur: http://hdl.handle.net/2027/mdp.39015070866887
* Russian: http://hdl.handle.net/2027/mdp.39015028011768
* Czech: http://hdl.handle.net/2027/mdp.39015026722820
* Polish: http://hdl.handle.net/2027/mdp.39015055374857
* Greek: http://hdl.handle.net/2027/mdp.39015047659472
The process used to convert from page images to text is called Optical Character Recognition, or OCR. (You can view the OCR text of any of the pages by switching to "text" under "view page as" on the left-hand menu in the pageturner.) Without good OCR, there's no way to search the books. Google is all about search, and they're working to improve the OCR they produce. However, the multitude of languages, scripts, and fonts in this collection poses a serious problem for OCR, and it's likely that Google won't be able to OCR all languages as they encounter them. In addition, the quality of the page itself is critical to good OCR. In many older books, particularly those published between 1850-1950, the paper has deteriorated and discolored, resulting in lower quality OCR.
I can read German, so I know that the OCR for the Fraktur script in the above example isn't perfect. However, given that there isn't much OCR software that can handle Fraktur, it's not bad. I don't read any of the other languages, so I can't make any judgment about the accuracy of the OCR in the rest of the list.
I think that this is an area that Google will continue to improve. You will be able to find examples of books in these languages with very poor OCR. Google is reprocessing texts and will send us new and improved versions, so we will get better OCR as the project progresses.
One of the complexities of this work is assessing the quality of OCR in languages that you don't read. I don't read Italian or Spanish, but they use the same alphabet and Latin roots as other western European languages, so I'm able to at least verify the words without knowing the exact meaning. Chinese, Japanese, Korean, Hebrew, Russian and Greek present many more problems for me. For instance, the text in most of the books in Chinese that I've seen runs from top to bottom (including the example in Chinese above), but the OCR goes left to right. Is that right? Are all the characters there, in the correct order?
The Greek title in the list shows another complexity with OCR: the pages alternate between Latin and Greek, but the text has Greek characters throughout. It's difficult for most OCR software to handle multiple languages in the same book.
We don't have a lot of experience in dealing with non-Western languages in the Digital Library Production Service department, and we'll be reaching out to experts--in the library, in the university, in our consortium--to help us answer questions.
Hebrew and other languages that read right-to-left present special problems for us. In looking at the example in Hebrew in the above list, it looks like the glyphs have been converted correctly, but we're using a right-justified margin rather than left-justified. Here's a sample page image:
And here's the OCR:
We'd welcome hearing from users about these issues.