March 4, 2009
Digitization gets a boost via reCAPTCHA
Luis von Ahn . . . helped develop the first captchas in 2000. Apparently, he had a revelation a few years later while sitting on a plane. . . He realized that he had unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles.
With the help of a MacArthur "genius" grant, von Ahn set out to make amends. Now a growing number of websites, from e-commerce (Ticketmaster) to social networking (Facebook) to blogging (Wordpress), have implemented the precocious professor's new tool, dubbed recaptcha. If you've visited those sites, your squiggly-letter-reading ability has been harnessed for a massive project that aims to scan and make freely available every out-of-copyright book in the world, by deciphering words from old texts that have stumped scanning software.
Full article here, which also includes a nice bit on the digitization project at the University of Toronto, overseen by our friend Jonathan Bengtson:
U of T is currently adding about 1,500 books a week -- and at that rate there's no need to be choosy about which ones to scan. "It's a real beast to feed, actually," says Jonathan Bengtson, the librarian who oversees the university's role. Entire subject areas are scanned by sorting for pre-1923 works (in accordance with US copyright laws), eliminating duplicates, and taking everything that's left. Scholars from around the world can also request books for ten cents a page, and typically see them online in less than twenty-four hours.
The most popular Toronto contribution, Juszel reports, is a 1475 edition of St. Augustine's De civitate Dei, downloaded a baffling 75,911 times (at press time). . .
For the newer books, OCR is about 90 percent accurate. But that success rate drops to as low as 60 percent for older texts, which often contain fonts that are blurry and less uniform. These troublesome scans are sent on to the reCAPTCHA servers at Carnegie Mellon University in Pittsburgh.
Hat tip to Mirabilis.
Posted by David on March 4, 2009 12:50 PM