Elephants and dung-trucks

By RICHARD POYNDER

Nicholson Baker’s controversial book Double Fold shone an unwelcome spotlight on preservation activities in libraries. In the last issue of Information Today, however, I suggested that rather than shoot the messenger, librarians should recruit him to their cause.

What is certain is that preservation is set to become ever more controversial. Baker’s primary concerns relate to the use of microfilm for reformatting brittle books and newspapers, but as librarians begin to grapple with digital preservation they will become the target of far greater and more critical public scrutiny.

Why? Because preserving digital materials is far more difficult than dealing with brittle paper. Moreover, no one yet knows how to do it effectively. As Jeff Rothenberg, a senior research scientist at the RAND Corporation, pointed out in a 1999 report for the Council on Library and Information Resources (CLIR), unless the matter is addressed urgently “our increasingly digital heritage is in grave risk of being lost.”

The fear is that we could be precipitated into a new Dark Ages — where significant data, records and publications simply disappear. While libraries are not responsible for all this material, if a large part of the “published record” were to be lost the finger of blame would undoubtedly point at them, and the mistakes and errors catalogued in Double Fold would seem minor by comparison.

Far more fragile

The nub of the matter is that digital materials are far more fragile than brittle paper. As Abby Smith, director of programs at CLIR, puts it: “Brittle paper was a problem of a fragile medium, but you can rescue the information by reformatting it on to another medium. Digital information, by contrast, has many dependencies that paper-based materials do not. You have hardware that is always becoming obsolete; and you have software that is also becoming obsolete.”

That is, unlike paper or microfilm — where the meaning is transparently inscribed on the surface of the medium — digital documents are opaque bit streams, understandable to humans only when interpreted by a machine. The hardware and software needed to do this interpretation, however, are constantly superseded. More than 200 digital storage formats, for instance, have been deployed since the 1960s, with none lasting more than 10 years.

As Baker expresses it: “If you put some books and papers in a locked storage closet and come back fifteen years later, the documents will be readable without the typesetting systems and printing presses and binding machines that produced them; if you lock up computer media for the same interval (some once-standard eight-inch floppy disks from the mid-eighties, say), the documents they hold will be extremely difficult to reconstitute.”

And it is not just text: images, video and multimedia files are all at risk. The extent of this threat was graphically demonstrated to the British Broadcasting Corporation when, fifteen years ago, it decided to celebrate the 900th anniversary of the 1086 Domesday Book by creating a huge digital archive depicting life in the 1980s. Costing £2.5 million, the project involved around a million people in Britain.

Once complete, the results were stored on 12-inch videodiscs designed to be read by the Acorn BBC computer. A decade and a half later, however, the discs were obsolete and unreadable, and the Acorn computer a museum piece. While the data was eventually recovered, doing so required a time-consuming exercise in digital archaeology.

Moreover, had this work been put off indefinitely, at some point the BBC data would have become irrecoverable — since without constant nurturing, digital files eventually become as unreadable as Linear B and cuneiform scripts were to the modern age before archaeologists deciphered them. As Baker put it in Double Fold: “We will certainly get more adept at long-term data storage, but even so, a collection of live book-facsimiles on a computer network is like a family of elephants at a zoo: if the zoo runs out of money for hay and bananas, for vets and dung-trucks, the elephants will sicken and die.”

The good news is that governments and libraries are beginning to act. In February, for instance, the Librarian of Congress received approval for the National Digital Information Infrastructure and Preservation Program (NDIIPP) — a project for which Congress appropriated $100 million in funding.

The aim of the NDIIPP, says Guy Lamolinara, confidential assistant to the associate librarian for strategic initiatives at the Library of Congress, is to “develop a national strategy to collect, archive and preserve the burgeoning amounts of digital content, especially materials that are created only in digital formats, for current and future generations.”

Similar programmes have been introduced in other countries too, reports Neil Beagrie, secretary of the UK-based Digital Preservation Coalition. “There is the JISC Digital Preservation Strategy in the UK for instance; and in Australia the National Library is doing some valuable work with its Preserving Access to Digital Information [PADI] initiative.”

The bad news is that there is still no known long-term solution for preserving digital resources, although the quantity of material is growing day by day. Moreover, despite initiatives like the NDIIPP, there remains a serious funding shortage.

The fundamental question, says Helen Shenton, head of collection care at the British Library, is: “How do we future-proof this material so that it is still accessible for future readers in 100 to 300 years’ time?”

Many issues still to be solved

For libraries there are two primary issues, says Johan Steenbakkers, deputy director of the National Library of the Netherlands (Koninklijke Bibliotheek, or KB): firstly, maintaining the digital objects (the publications); and secondly, guaranteeing permanent access to the information within those digital objects.

“These issues are at a different state of development,” he explains. “The first issue — which one could describe as building the necessary infrastructure, or ‘digital stack’, to enable a library to effectively manage the digital objects and their formats — can already be fully solved using existing technology: libraries simply have to find the necessary resources and then act appropriately. Ensuring permanent access to the information, however, requires new techniques and approaches that are still under development. And since the technology is continuously developing this is not a one-time effort, but has to be ongoing.”

Today a number of approaches are being used to try to ensure permanent access to digital materials. At the most basic level, a digital bit stream can be regularly “refreshed” by transferring it to newer media before the old media deteriorate to the point where the information can no longer be retrieved.
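As a rough illustration of what “refreshing” can involve in practice, the sketch below copies files from ageing media to new media and verifies that each bit stream arrived intact. It is written in Python; the mount points and the use of SHA-256 checksums are illustrative assumptions, not a description of any particular library’s workflow.

```python
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    """Compute a SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def refresh(old_volume: Path, new_volume: Path) -> None:
    """Copy every file from ageing media to new media and confirm that
    the bit stream survived the transfer unchanged."""
    for src in old_volume.rglob("*"):
        if not src.is_file():
            continue
        dst = new_volume / src.relative_to(old_volume)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        if sha256(src) != sha256(dst):
            raise IOError(f"Checksum mismatch after refreshing {src}")

# Hypothetical volumes; the mount points are placeholders only.
# refresh(Path("/mnt/old_tape"), Path("/mnt/new_disk"))
```

The essential point is the verification step: copying files without checking them risks silently carrying corrupted bit streams forward on to the new media.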

In addition, more sophisticated techniques are needed. These include migration (“porting”), in which files are updated or sometimes entirely rewritten; emulation, where older hardware is mimicked in order to allow old software and files to run on new machines without having to be rewritten; and encapsulation, where electronic files are wrapped in a digital envelope that describes how the files are stored, and how to re-create the software, hardware or operating systems needed to decode the contents. None of these somewhat complicated approaches, however, has yet been successfully implemented.
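To give a flavour of the encapsulation idea, here is a minimal sketch in which a digital object is bundled into a container alongside human-readable metadata describing how it is stored. The ZIP container and the metadata fields are illustrative assumptions only, and do not reflect DIAS or any preservation standard.

```python
import json
import zipfile
from datetime import date
from pathlib import Path

def encapsulate(document: Path, package: Path) -> None:
    """Bundle a digital object with a plain-text description of how it
    is stored, so a future reader has a starting point for decoding it."""
    metadata = {
        # Illustrative fields only; real preservation metadata schemes
        # used by deposit libraries are far richer than this.
        "filename": document.name,
        "format": "PDF 1.4",  # assumed format of the example file
        "created": date.today().isoformat(),
        "rendering_notes": "Requires a PDF 1.4-capable viewer; "
                           "specification published by Adobe.",
    }
    with zipfile.ZipFile(package, "w") as container:
        container.write(document, arcname=document.name)
        container.writestr("metadata.json", json.dumps(metadata, indent=2))

# Hypothetical usage with placeholder paths:
# encapsulate(Path("report.pdf"), Path("report_package.zip"))
```

In a real deposit system the descriptive metadata would be far more extensive, but even this toy example shows the principle: the information needed to interpret the object travels with the object itself.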

Some, therefore, are exploring alternative solutions. Raymond Lorie, a research fellow at IBM’s Almaden Research Center, for instance, has proposed the development of a “universal virtual computer”. This would require that every time a digital file was saved (whatever hardware or software was being used), a separate copy was simultaneously saved in a format understandable to the universal computer. To stave off obsolescence, the specifications of the universal computer would be compacted into around 10 to 20 pages of text and distributed as widely as possible. The universal computer, however, also remains at the concept stage.

“There are many technological issues still to be solved,” says Janet Gertz, director for preservation at Columbia University Libraries (speaking in a personal capacity), “and just as many fiscal issues, since maintaining and preserving digital files is much more expensive than maintaining and preserving paper copies.”

In the meantime, adds Gertz: “Information already has been lost and will continue to be lost.”

Huge catastrophes

That digital preservation is a huge challenge for libraries is incontrovertible. Yet today the KB is the only library in the world with an operational system focused on the deposit and long-term preservation of digital publications. The library has been collecting electronic material since 1994, and at the end of last year introduced a full-scale electronic deposit process using IBM’s newly developed Digital Information Archiving System (DIAS) as a “dedicated archiving environment” into which electronic materials can be transferred. There is, however, a huge amount of work still to be done before the library can feel confident about the future, says Steenbakkers.

And as we head at breakneck speed into the digital future, and more and more library resources are born digital (with no analogue equivalent), the lack of effective preservation techniques will become an increasingly serious problem. Not only is there an explosion of data on the web, but also a flood of new eBooks and e-journals — and increasingly the latter have no hard-copy version. A report commissioned last year by the UK’s Joint Committee on Voluntary Deposit (JCVD) estimated that by 2005 the number of “pure” e-serials (with no print equivalent) will have grown from 3,220 to 7,032 (equating to 192,672 e-serial issues annually).

At the same time, libraries are getting involved in the creation of institutional archives — an activity enthusiastically promoted by organisations like the Scholarly Publishing and Academic Resources Coalition (SPARC). The challenge here, points out Michael Day of the UK-based UKOLN in a recent JISC-funded report, is that “institutions that set up repositories may not always be aware of their responsibility to ensure the long-term preservation of content. Even when they are, they may not have the organisational infrastructure or technical knowledge to do this successfully.”

What is undeniable is that an increasing proportion of a library’s role will consist of the provision and maintenance of digital data. Yet too few have adequately grasped the preservation nettle. What libraries must appreciate, says Steenbakkers, is that “the effective preservation of born-digital resources is not a matter of choice, but necessity.”

“I anticipate huge catastrophes as we realise that we’ve collectively encoded all sorts of information in all sorts of obsolete formats,” says Jacob Nadal, acting head of preservation at the E. Lingle Craig Preservation Laboratory, Indiana University Libraries.

Plenty of potential material here, perhaps, for a public exposé even more damning than Double Fold?

Preserving the published record

But this is not about the preservation of library holdings alone. As Double Fold demonstrated, the public also expects libraries to preserve the published record.

Indeed, national governments specifically task certain libraries with doing this. All works under copyright protection published in the US, for instance, are subject to mandatory deposit — a requirement similar to those in most other countries.

While historically these deposit requirements were confined to print publications, many countries are extending the law to cover electronic materials too. In the UK, for instance, a Private Member’s Bill was passed unopposed in March that will make the deposit of digital materials compulsory.

Leaving aside the absence of durable preservation solutions, does the traditional deposit model migrate seamlessly to a digital world? What, for instance, constitutes the published record in the age of the web?

“One thing we are very, very heavily tied up with thinking about at the moment is web archiving,” says Shenton. “After all, the remit of the British Library is to preserve and care for the national published archive, and the web is a form of quasi-publishing.”

Some, such as the National Library of Australia, are taking a selective approach. Every few months the library archives “significant” national websites, most notably the site of the 2000 Sydney Olympics. The Royal Library in Sweden, by contrast, has adopted a more comprehensive approach. Since 1996, its Kulturarw3 project has been regularly archiving everything with a Swedish web address.

But can web preservation be treated as an isolated national activity? How can a seamless, linked record of our times be salami-sliced by geographical borders for archival purposes?

Ironically, the only organisation focused on archiving the entire web is not a library but a not-for-profit organisation called the Internet Archive. Founded by Brewster Kahle in 1996, it is attempting to capture and maintain a permanent archive of the continuously changing web, and currently contains over 100 terabytes of data.

Certainly some believe that libraries should preserve far more material than they have historically. After all, argued web guru Stewart Brand in a 1999 Library Journal article, in a digital world physical space is no longer a constraint. “There is more room to store stuff than there is stuff to store,” he wrote, adding that, therefore, “we need never again throw anything away.”

Librarians, however, prefer a more traditional approach. “At the moment there is a debate as to scope,” agrees Shenton, but adds: “No society has ever collected absolutely everything. So what we are trying to do is to get collection development, selection, and retention policies in place.”

To avoid another Double Fold, however, librarians might be advised to consult with the public before finalising their digital-age selection policies.

New responsibilities

And what about the preservation of electronic publications such as eBooks and e-journals? Traditional legal deposit assumes publishers file new publications with the deposit library and walk away, leaving libraries to care for them.

But as Baker points out, the digital environment introduces new responsibilities. Rather than simply warehousing digital material, someone is going to have to actively manage it. After all, what benefit would it be to society if deposited electronic publications were unreadable within ten years?

Some publishers clearly understand this. In March, Kluwer Academic Publishers (KAP) announced an agreement with the KB to ensure long-term archiving of its journals. This will mean supplying the KB with digital copies of all Kluwer journals and books made available on its web platform, Kluwer Online. Both new publications and digitised backfiles will be included in the deposit arrangements, which currently cover 235,000 articles from 670 journals and more than 600 eBooks.

Elsevier Science entered into a similar agreement with KB last September. As Elsevier’s Karen Hunter commented at the time: “Journals have been called ‘the minutes of science’. As we move toward journals being available only in electronic form and being held centrally on publishers’ computers, the public has the right to be assured that, should a publisher go out of business, these files will not be lost.”

But who, I asked Leo Voogt, director of global library relations at Elsevier, will be responsible for the vital preservation work? “The National Library will be making the technical preservation decisions,” he replied. “That is really both their expertise and their mission.”

What these agreements signal is that it will not be publishers who take on the responsibility of nurturing Baker’s elephants, or emptying their dung-trucks, but librarians. Commenting on the agreements with KAP and Elsevier, Steenbakkers says: “The parties have agreed that each will bear their own costs with regard to the depositing and the preservation of the electronic publications.”

For the KB these costs are daunting. Already the library has spent four million euros acquiring its system. “More significant,” says Steenbakkers, “are the yearly costs of running it, and financing the ongoing research and development.”

Ideally, publishers would contribute to these additional costs; the reality is that this is unlikely. As Brand pointed out in his 1999 article: “While contemporary information has economic value and pays its way, there is no business case for archives, so the creators or original collectors of digital information rarely have the incentive — or skills, or continuity — to preserve their material. It’s a task for long-lived non-profit organisations such as libraries, universities, and government agencies, which may or may not have the mandate and funding to do the job.”

Nor is digital preservation a matter only for deposit libraries. As more and more libraries become involved in the creation of institutional archives, so they too will need to engage in the nurturing and dung-shifting activities demanded by digital preservation.

Too little, too late?

But given the financial pressures libraries already labour under, how will they fund these new activities? The NDIIPP’s $100 million suddenly begins to look like too little, too late. Certainly libraries are going to need a far greater injection of funding if they are to avert the threatened digital Dark Ages.

That funding will also need to be appropriately targeted, says Steenbakkers. “To date governments have provided very little money for the library and archive community to get to grips with this issue, either in Europe or the US. In addition, agencies granting research subsidies are simply playing safe, and providing only small amounts of money for consultants to write more and more reports on the problem, rather than allocating the substantial funds that are essential if we are to develop real-life practical solutions.”

To add to the urgency, financial pressures are increasingly causing libraries to abandon print in favour of digital products. “As prices continue to sky-rocket,” explains Gertz, “it becomes more and more difficult financially to justify the cost of keeping a paper copy in addition to a digital copy when almost no one is using the paper copy.”

This will inevitably exacerbate the preservation challenge, since it will accelerate the growth of digital-only publications, leading to more and more of the published record being produced on media for which there is currently no long-term preservation solution. A development best characterised as alarming.

Librarians should not forget that it is on their heads that public ire will fall if things go awry. As Smith commented to Technology Review last October: “People count on libraries to archive human creativity. It’s important for people to know, though, that libraries are at a loss about how to solve this problem.”

To obtain the money they need to find a solution, libraries are going to have to engage more with the public — as well as with governments and other funding bodies — and convince them of the severity of the threat we face.

Who better to lead the campaign than Nicholson Baker? Someone better pick up the phone and call him!

This article has been reprinted in its entirety from the September 2003 issue of Information Today with the permission of Information Today, Inc., 143 Old Marlton Pike, Medford, NJ 08055. 609/654-6266, http://www.infotoday.com.
