“At the moment it looks as if [disbinding] is the cheapest way to do things,” Lesk told me. He is even bolder in a paper entitled “Substituting Images for Books: The Economics for Libraries,” where he argues for the outright trashing of a better copy of a book over one that is worn out or very brittle, simply because it’s less expensive to destroy the book in better condition. “It is substantially cheaper to scan a book if the paper is strong and can be fed through a stack feeder, rather than requiring manual handling of each page,” he writes; thus “it may turn out that a small library located in a rural and cold mountain location with few readers and clean air has a copy in much better shape, and one that can be scanned more economically.”

Of course, that small mountain library, having done such a fine job of safekeeping all those years, may have “less motivation to scan a book which is not yet deteriorating”—hence the need, in Lesk’s central-plannerly view, for a nationwide cooperative authority that will order that library to guillotine its copy and feed it to the scanner for the greater good.

Lesk’s candor is impressive: he acknowledges that the resolution of today’s scanned offerings may be crude by tomorrow’s standards, or even by comparison with today’s microfilm. “I would like to see, as soon as possible, a lot of scanning, so that momentum builds for doing this job,” he told me. “It is likely that to build support for a conversion of this sort, what matters much more is that a lot of stuff gets done, than that the stuff that gets done is of the ultimate highest quality.” Better to have to scan some things twice, in Lesk’s view, than not to scan at all — assuming, of course, that there is still a physical copy left to destroy when it comes time for the retake. Lesk also recognizes that in a cooperative project involving millions of volumes, there will be errors and omissions. “The odds are that there will be things lost,” he said. Some projects, such as JSTOR, have the money to do a careful preliminary check to be sure that no pages or issues are missing, but most places, he says, “won’t be able to afford the JSTOR quality standards.”

I was interested to hear Lesk offer JSTOR as a paragon of quality. JSTOR is the most successful of the large-scale digitization projects; it has big money and big names behind it (including lifelong library automator Richard De Gennaro, former chief librarian at Harvard and, before that, at the New York Public Library); it can be marvelously helpful in finding things that you didn’t know existed, or that you do know exist but don’t have handy. Its intent, however, is not supplemental but substitutional: back issues of scholarly journals are, in the words of its creator, William G. Bowen, ex-president of the Andrew W. Mellon Foundation and of Princeton, “avaricious in [their] consumption of stack space”; JSTOR will allow libraries “to save valuable shelf space on the campus by moving the back issues off campus or, in some instances, by discarding the paper issues altogether.” Taking this cue, Barbara Sagraves, head of preservation at the Dartmouth library, wrote in an online discussion group in 1997 that questions about weeding the collection had “bubbled up” at her library. “The development of JSTOR and the promise of electronic archiving creates the possibility of withdrawing paper copies and relying solely on the electronic version,” she wrote. Although she wanted to make clear that Dartmouth was “in no way considering that option,” she said that construction planning had made librarians there “step back and question retention decisions in light of new means of information delivery.” In a survey conducted by JSTOR in 1999, thirteen percent of the respondents had already “discarded outright” bound volumes of which electronic copies exist on JSTOR, and another twenty-five percent had plans to do so; twenty-four percent had stopped binding incoming issues.

Lesk likes JSTOR for that very reason. He wants to divert capital funds from book-stack square footage into database maintenance, to create a habit of dependence on the electronic copy over the paper original, to increase the market share of digital archives. And he is right that JSTOR’s staff takes pains in the preparation of what they reproduce: they make sure that a given run of back issues is as complete as possible before they scan and dump it.

What about quality, though? The printable, black-and-white page-pictures that JSTOR stores are good — their resolution is six hundred dots per inch, about the same as what you would get using a photocopier. (What you see on-screen is less good than that, because the images are compressed for faster loading, and the computer screen imposes its own limitations.) But the searchable text that JSTOR derives from these page-pictures is, by normal nineteenth- and twentieth-century publishing standards, intolerably corrupt. OCR (optical character recognition) software, which has the job of transmuting a digital picture of a page into a searchable series of letters, has made astonishing improvements, but it can’t yet equal even a middling typesetter, especially on old fonts. Thus JSTOR’s OCR accuracy rate is held (with editorial intervention) to 99.95 percent. This may sound exacting, but the percentage measures errors per hundred letters, not per hundred words or pages. A full-text electronic version of a typical JSTOR article will introduce into the clickstream a newly minted typo every two thousand characters — that is, one every page or two. For instance, I searched JSTOR for “modem life” and got hits going back to the April 1895 issue of Mind: the character-recognition software has difficulty distinguishing between “rn” and “m” and hasn’t yet been told that there were no modems in 1895.
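To make that arithmetic concrete, here is a minimal sketch of the calculation, assuming a round two thousand characters to a printed page (an illustrative figure, not a JSTOR specification):

```python
# A minimal sketch of the error-rate arithmetic: 99.95 percent
# character accuracy (the figure quoted above) translated into
# typo frequency. CHARS_PER_PAGE is an assumed, illustrative value.

ACCURACY = 0.9995        # fraction of characters the OCR gets right
CHARS_PER_PAGE = 2000    # assumption: rough count for one printed page

errors_per_char = 1 - ACCURACY            # 0.0005 errors per character
chars_per_error = 1 / errors_per_char     # one typo every 2,000 characters
pages_per_error = chars_per_error / CHARS_PER_PAGE

print(f"one OCR typo every {chars_per_error:,.0f} characters")
print(f"roughly one typo every {pages_per_error:.1f} printed page(s)")
```

A denser page simply spreads the same error rate over more characters, which is why the estimate is “one every page or two” rather than exactly one per page.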

It’s easy to fix individual flukes like this, once they are pointed out, but the unpredictable OCR misreads of characters in proper names, in dates, in page numbers, in statistics, and in foreign quotations are much costlier to control. That’s why JSTOR allows you to see only the image of the page, and prevents you from scrolling through its searchable text: if scholars were free to read the naked OCR output, they might, after a few days, be disturbed by the frequency and strangeness of its mistakes, especially in the smaller type of footnotes and bibliographies, and they might no longer be willing to put their trust in the scholarly integrity of the database.

Half joking, I pointed out to Michael Lesk that if a great many libraries follow his advice by scanning everything in sight and clearing their shelves once they do, the used-book market will collapse. Lesk replied evenly, “If you’ve ever tried taking a pile of used books to a local bookseller, you know that for practical purposes, most used books are already worthless. Certainly old scientific journals are worse than worthless. You will have to pay somebody to cart them away, in general.” (Online used-book sites, such as abebooks.com, Bibliofind, and Alibris, where millions of dollars’ worth of ex-library books and journals change hands, might contest that statement.) I asked Lesk whether he owned many books. He said he had several thousand of them — most of them printed on “crummy paper.”