Web Archiving

A Web page that is here today may not be around tomorrow. A new standard, ISO 28500:2009, Information and Documentation – WARC File Format, aims to ensure that the vast and often valuable information posted on the Web is not lost when a page changes or disappears.

ISO 28500 provides a file format known as WARC (Web ARChive), which offers a convention for concatenating multiple data objects into one long file. The format can be used to build applications for harvesting, managing, accessing, and exchanging content.
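To make the idea of concatenated records concrete, here is a minimal Python sketch. It is a simplified illustration, not a complete or validated implementation of ISO 28500. It assumes each record consists of a "WARC/1.0" version line, a few named header fields, a blank line, the content block, and two trailing line breaks. The helper names make_record and read_records and the example.org URLs are hypothetical and exist only for this example.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def make_record(record_type: str, payload: bytes, target_uri: str) -> bytes:
    """Build one simplified WARC-style record: headers, blank line, payload."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        b"WARC/1.0",
        b"WARC-Type: " + record_type.encode("ascii"),
        b"WARC-Record-ID: <urn:uuid:" + str(uuid.uuid4()).encode("ascii") + b">",
        b"WARC-Date: " + now.encode("ascii"),
        b"WARC-Target-URI: " + target_uri.encode("utf-8"),
        b"Content-Length: " + str(len(payload)).encode("ascii"),
    ]
    # Two CRLFs after the payload mark the end of the record.
    return CRLF.join(headers) + CRLF + CRLF + payload + CRLF + CRLF


def read_records(data: bytes):
    """Yield (headers, payload) for each record in a concatenated byte string."""
    pos = 0
    while pos < len(data):
        header_end = data.index(CRLF + CRLF, pos)
        lines = data[pos:header_end].split(CRLF)
        headers = {}
        for line in lines[1:]:  # lines[0] is the version line
            name, _, value = line.partition(b": ")
            headers[name.decode("ascii")] = value.decode("utf-8")
        # Content-Length tells us where the payload ends, so the payload
        # itself may safely contain blank lines.
        length = int(headers["Content-Length"])
        start = header_end + len(CRLF) * 2
        yield headers, data[start:start + length]
        pos = start + length + len(CRLF) * 2


# Two captured objects concatenated into one long byte string.
archive = (
    make_record("resource", b"<html>page one</html>", "http://example.org/one")
    + make_record("resource", b"<html>page two</html>", "http://example.org/two")
)
records = list(read_records(archive))
```

Because every record carries its own length, a reader can step through the file one record at a time without any external index, which is what makes simple concatenation workable for very large collections.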

For a long time, keeping track of the staggering number of Web sites and pages posed a difficult challenge for digital curators and archivists, and a great deal of data was lost as a result. With WARC, ISO 28500 takes Internet archiving to the next level by enabling the effective management, structure, and storage of billions of resources collected from the Web and elsewhere.

Standardization on the WARC format offers a guarantee of durability and will help Web archiving become a mainstream activity of heritage institutions by, for example, fostering the development of new tools and ensuring interoperability between collections.

The WARC format is an extension of the ARC file format, which has been used by the Internet Archive since 1996, and by numerous heritage institutions, to store “Web crawls,” which represent extracts of entire Web pages and their links.

The motivation to extend the ARC format arose from the discussions and experiences of the organizations within the International Internet Preservation Consortium (IIPC), whose core mission is to acquire, preserve, and make accessible knowledge and information from the Internet for future generations. IIPC members were finding it increasingly difficult to store and manage the growing volume of information coming from the Internet.

According to the ISO technical committee that developed ISO 28500, several applications are already WARC compliant, such as the Heritrix crawler for harvesting, the WARC tools for data management and exchange, the Wayback Machine, and NutchWAX. I also noticed that the Library of Congress is already using the WARC format to harvest Web sites.

You can order ISO 28500:2009 from the ANSI Standards Store.