Rebecca Dunkle, librarian at the University of Michigan (UM), and Ben Bunnell from Google spoke about UM's experience working with Google as they begin what will be a roughly 6-year project to digitize 7 million volumes at UM. (Abigail Potter, a recent grad of the Information School at Michigan now working at NPR, who worked on the Google project while still at UM, was also on hand to answer questions.)
The general outlines of the project are familiar to most of us, having been presented previously: Google, as if by magic, since the technology is proprietary and they can't tell us about it, is non-destructively digitizing the entire bound print collection at UM (and also portions of 4 other research libraries: New York Public Library, Stanford, Harvard, and Oxford). The scanning is producing page images and OCR files, all up to agreed-upon digital preservation standards established by the library community. Michigan will receive its own copies of all the files created, and will be able to host them on its own servers and build them into new digital library services.
Dunkle made it clear that Google and UM are partners in this project – Google is not forcing the library to do anything it doesn't want to do. She also pointed out that even though there are unresolved issues, such as the full impact this dual digital/print collection will have on UM staff, the advantages of getting this huge corpus of digital texts are enormous.
Bunnell showed screenshots of the Google Print interface for public domain books, where users can page through the entire book online, and for in-copyright books, where they can only see 3 (un-printable) snippets but also get a count of how many times their search terms occur in the whole book; e.g., we are only showing you three, but your words occur in this book 57 times. Both the public domain and in-copyright views allow users to find the book in libraries or buy it. Bunnell also showed the interface for books submitted directly by publishers, which allows users to access considerably more than the snippets.
The questions were the best part of the program, since attendees brought up lots of pertinent points, such as: the interface for books submitted directly by publishers does NOT include the "find in library" link (wonder why?); UM and Google's approach of simply doing everything is going to result in the scanning of a lot of bound journals, many of which certainly exist digitally already; although Bunnell assures us (as other Google representatives also have) that Google Print is NOT in competition with the existing, robust digitizing programs at many libraries and cultural institutions, surely funding for local digitizing projects is going to diminish as a result of Google's massive efforts; and no, they won't let us buy their technology.

Also mentioned was recent OCLC collection analysis work on the Google 5 (reported in the September D-Lib, http://www.dlib.org/dlib/september05/lavoie/09lavoie.html), which shows that about 60% of the books to be digitized in the project are held by only ONE of the Google 5, only 20% are held by two, and only 3% are held by all five. The shockingly high number is that 80% of the books in the Google 5 libraries are still in copyright, so even though the full text will be digitized, only the snippets will be available.

It is probably more productive, though, if we stop thinking of the visible part of in-copyright books as snippets and start thinking of it as indexing – a point brought up by danah boyd in her keynote and echoed by Rebecca Dunkle. danah said that she couldn't wait for Google to finish digitizing so that more of the volumes lying around her house would be indexed. Dunkle related her experience of handing over books retrieved from offsite storage to users who take one look at them and say "not what I expected"; she expects the snippets and keyword counts to reduce the number of times this scenario plays out, and feels this is just one of several outcomes that make the whole project worthwhile.