May 12, 2003

Digitizing books

Interesting article in today's NY Times about robotic book scanners. Unfortunately, the story -- which focuses on Stanford's digitization programs, not all of which rely on robots -- doesn't reveal how much the new scanners cost, nor how accurate they might be.

The article also makes reference to the problems posed by recent extensions of copyright protection, but without spelling out quite how great the difficulties can be. As things now stand, there is a huge quantity of ephemeral material -- magazine articles, advertisements, etc. -- that is anywhere from 30 to 80 years old, commercially obsolete, out of print, and yet potentially still not in the public domain. Collectors' publications make liberal use of this material without bothering to investigate copyright, but libraries and other institutions are already on notice and can't assume anything. One thing I would love to see taken up is Lawrence Lessig's suggestion that, if copyright protection is to be permitted for such a span of years, it should at least have to be renewed periodically -- precisely to ensure that material of no further commercial value promptly enters the public domain.

Posted by David on May 12, 2003 9:13 PM

Comments

I read it in the Herald Tribune yesterday. There was a line about the machines not being cost effective for projects under 5.5 million pages -- and I think the implication was that Stanford wanted to scan its entire 8 million volume collection, so they'll need a couple.

I wondered about accuracy, too -- especially with foreign-language inclusions, where things get very messy on a scanned page, unless they're just doing IMAGES of pages. Those are pretty useless.

I tried scanning a bunch of stuff a few years ago and gave up -- old typefaces were packed too tightly for the fairly high-quality scanner Emory had to read. And I had to redo all foreign words and phrases manually.

The people who produced the CD-ROM of Migne's Patrologia Latina went to Bermuda (I seem to remember), where one can hire clerk/typists who have 7 years of good high school Latin but still need the work. I thought it was a positive legacy of colonialism, myself.

Posted by: Michael Tinkler on May 13, 2003 3:42 AM

I remember attending the informational seminars for the PL project. It was explained then that manual entry was the only practical method -- not surprising considering the fuzzy printing and speckly paper of all copies of Migne I've ever used.

I do believe OCR (optical character recognition) software has improved tremendously since then (it's been over ten years, after all!), as has the availability of cheap computing power.

There was a blog post back when the Iraqi government turned over the supposed documentation of its weapons programs. Alas, I did not bookmark it, but the gist was that the Iraqis were mere pikers when it came to overwhelming adversaries with piles of paper: American lawyers handling big corporate cases have apparently become accustomed to burying the other side with literally millions of pages of documents during discovery -- and accustomed to handling such a flood promptly by means of staggeringly fast scanners which can reduce even a truckload of documents to machine-searchable digital form in a matter of a few days.

Posted by: David on May 13, 2003 9:39 AM