1/1/2008
Processing Book Files
Image compression isn't a small issue in mass digitization projects, says Mark McKinney, VP of business development for LuraTech, a company that produces compression software. Besides consuming storage, the files developed out of scanned books need to be delivered across the web with no perceivable delays. LuraTech powers the work done by the Open Content Alliance, which applies the JPEG 2000 format to compression; JPEG 2000 is a powerful long-term archival format that reduces a large color file to about a hundredth of its original size, says McKinney. In that effort, he says, workers run the process from digitization stations called "Scribes" that take the picture from a page, color-correct it, and then "OCR" it (apply optical character recognition) so it becomes searchable. Once the operator has captured all the individual pages, metadata is added to the book through a user interface. But the metadata (title, author, copyright, description, etc.) isn't necessarily added via human effort.
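To put McKinney's hundredfold figure in perspective, here is a rough back-of-the-envelope sketch in Python. Only the 100-to-1 ratio comes from the article; the page dimensions, scan resolution, color depth, and page count are illustrative assumptions, not figures from LuraTech or the Open Content Alliance.

```python
def raw_page_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Uncompressed size of one scanned page.

    bytes_per_pixel=3 assumes 24-bit color (an assumption for illustration).
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


def book_storage(pages, page_bytes, ratio=100):
    """Return (raw_bytes, compressed_bytes) for a whole book.

    ratio=100 reflects the "about a hundredth" figure quoted in the article.
    """
    raw = pages * page_bytes
    return raw, raw // ratio


if __name__ == "__main__":
    # Hypothetical book: 300 letter-size pages scanned at 300 dpi.
    page = raw_page_bytes(8.5, 11, 300)
    raw, compressed = book_storage(300, page)
    print(f"one page, uncompressed: {page / 1e6:.1f} MB")
    print(f"whole book, uncompressed: {raw / 1e9:.2f} GB")
    print(f"whole book at 100:1: {compressed / 1e6:.1f} MB")
```

Under these assumptions a single book drops from several gigabytes to tens of megabytes, which is the difference between a file that is impractical to deliver over the web and one that is not.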
According to Brian Tingle, a technical leader for the Digital Special Collections (part of UC's CDL), much of the metadata is already cataloged as part of the online public access catalog (OPAC), known in the pre-digital era as the card catalog. Tingle's team works with a metadata object format, a standard for encoding and transmission of digital objects. "Those objects get turned into those formats and that's how we ingest them into our system," he explains. It's a different level of metadata that enables the linking together of objects, such as the pages of a book.
The automation of data capture is certainly something in which a former NASA AI expert like Google's Clancy would excel. "If you look on our book reference page, you'll find related works identified; books with some relationship to the book you're looking at," he points out. "Or you'll find something we call 'Popular Passages,' where we've extracted passages that are seminal or popular and mentioned in a number of different books. We use that as a way to link some of these books together." Achieving those connections, he says, is a programming job. "It may not be perfect, but this is 100 percent how we've done it. We don't have people picking out related books; we use lots of different signals. We just don't talk about which signals we use."
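Google does not say how it finds these passages, so the following is not its method. It is only a minimal Python sketch of one naive way to surface text shared between books: break each book into overlapping runs of words and keep the runs that turn up in more than one title. The function names, the five-word window, and the sample texts are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def shingles(text, n):
    """Return the set of overlapping n-word runs in text (lowercased)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_passages(books, n=5, min_books=2):
    """Map each n-word run to the titles containing it, keeping only
    runs that appear in at least min_books different books.

    books: dict of title -> full text (hypothetical input format).
    """
    seen = defaultdict(set)
    for title, text in books.items():
        for run in shingles(text, n):
            seen[run].add(title)
    return {run: sorted(titles)
            for run, titles in seen.items()
            if len(titles) >= min_books}


if __name__ == "__main__":
    sample = {
        "A": "It was the best of times, it was the worst of times.",
        "B": "As Dickens wrote, it was the best of times indeed.",
        "C": "Nothing in common here at all.",
    }
    for run, titles in sorted(shared_passages(sample).items()):
        print(f"{run!r} appears in {titles}")
```

A production system would need far more than this, such as tolerance for OCR errors and some way to rank passages, but the sketch shows why linking books by shared passages is, as Clancy puts it, a programming job rather than an editorial one.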
Beyond Automation
Not surprisingly, when the wizard behind the curtain is Google, the same kind of secrecy applies to search. Where UC's Tingle is highly forthcoming about the search product his team has developed at CDL (eXtensible Text Framework, or XTF, which is based on Lucene, an Apache open source search engine), Google's Clancy prefers to focus on search outcomes. "We're all familiar with how search works on the web," says Clancy. "You type in a keyword phrase and suddenly it seems to find just the document you want; people create link structures that relate two things together. Well, as soon as you do that [with a book], you're giving us more information about that book. Eventually, you can imagine people linking to books from their web pages and other things. A book should be like the web: People should link directly into the book when it's relevant to them."