Web Archiving in Libraries and Archives (and at Penn State)


Penn State’s Partner Page on Archive-It

This post was written by Ben Goldman, Digital Records Archivist.

In honor of Preservation Week, I wanted to provide an introductory post (what I hope will be the first of several) on web archiving. An enormous amount of cultural heritage is now published online, which presents both opportunities and challenges for archivists charged with collecting, preserving, and making accessible documentation for future generations. An entire blog post (or more) could be devoted to describing these opportunities and challenges in greater depth, but the New Yorker has already done so with an elegance I could not hope to surpass. I highly recommend spending some time with “The Cobweb,” by Jill Lepore, on newyorker.com (I even sent this article to my parents as a way of explaining what I do).

The Internet Archive is nearly 20 years old, but it’s only been in the last ten years or so that libraries and archives have started to devote infrastructure and resources to this effort. In 2006, the Internet Archive created Archive-It, a subscription service that allows institutions like libraries and archives to select and curate their own web archive collections, which are then preserved by the Internet Archive. Since 2006, over 400 institutions have preserved 10 billion unique URLs and created nearly 3,000 web archive collections using this service. A recent conversation on the CIC Archivists listserv found that 13 of the 15 CIC Libraries now have active web archiving programs. Collectively, in the CIC, we have archived over 5,000 unique seed URLs (and likely many thousands more websites).

In 2012, Penn State’s Special Collections Library began its subscription to Archive-It. Our initial goal was to advance the mission of the University Archives by capturing websites in the psu.edu domain. We also attempted to retrospectively capture websites that documented the fast-moving events related to the Jerry Sandusky scandal. Our web archiving mission is slowly evolving to include the capture of websites that relate to existing manuscript collections, as well as new collecting areas. To date we have captured over 1.5 terabytes of data. A glimpse into our collecting efforts can be found on Penn State’s Archive-It partner page. I’ll explore some of our web archiving work in more detail in future posts.

The Archive-It team likes to say “the web is a mess,” which is a good mantra for describing the work of selecting, crawling, reviewing, and monitoring websites. Websites come and go. The average lifespan of a web page is said to be around 100 days. Websites change. Penn State, for example, is in the midst of an extended redesign period that is seeing many of its nearly 2,000 psu.edu subdomains receive an overhaul, sometimes including a migration to new platforms. For a notable example, compare the Arboretum’s website from two years ago to the Arboretum’s website today. Websites are also complex. In their simplest iteration, websites were little more than HTML with text and maybe low-resolution imagery (for an example of how simple websites were at the start of the World Wide Web, check out the oldest U.S. website, which is available online thanks to web archivists at Stanford University). But now, websites are often dynamic, database-driven, and/or reliant on code that pulls from disparate parts of the web. Rich web content is also increasingly embedded in social media platforms that present challenges to the crawling tools Archive-It uses. All of which is to say that curating websites for archiving is a process of constant checking, reporting, reviewing, and testing. And even then, given the scale of website capture, it’s often difficult to verify that all the most critical content has been archived.
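To give a flavor of the crawl process described above, here is a minimal sketch of a crawler's link-discovery step: parse fetched HTML, resolve the links it finds, and keep only those within a scoping domain. This is a toy illustration using only Python's standard library (the page content and URLs are hypothetical); production crawlers such as the ones behind Archive-It do far more.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags -- a toy stand-in for the
    link-discovery step of a web crawl."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's URL.
                    self.links.append(urljoin(self.base_url, value))

def in_scope(url, domain):
    """Keep the crawl inside one domain, e.g. everything under psu.edu."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)

# A hypothetical page with one in-scope and one out-of-scope link.
html = '<a href="/about/">About</a> <a href="https://example.com/">Out</a>'
parser = LinkExtractor("https://www.psu.edu/")
parser.feed(html)
to_crawl = [u for u in parser.links if in_scope(u, "psu.edu")]
print(to_crawl)  # ['https://www.psu.edu/about/']
```

Even this toy version hints at why scoping decisions matter: a seed URL implies a boundary, and everything a crawler discovers must be tested against it.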

Web archiving also presents some unique rights challenges that the whole community has attempted to navigate carefully. Archive-It is currently building tools that allow for embargo periods on access to web archives (to accommodate some of the professional thinking on websites and copyright). As a policy, we have also decided to respect robots.txt files, which are seen as the clearest reflection of a web administrator’s wishes regarding the crawling of their sites (whether by the Archive-It service or by search engines like Google).
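Honoring robots.txt is mechanically simple: before fetching a URL, a polite crawler checks it against the site's published rules. A minimal sketch using Python's standard-library `urllib.robotparser` (the robots.txt content and URLs here are hypothetical, and the rules are parsed from a string rather than fetched over the network):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: crawlers may fetch anything except /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler checks each candidate URL before fetching it.
print(rp.can_fetch("*", "https://example.edu/news/"))      # True
print(rp.can_fetch("*", "https://example.edu/private/x"))  # False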

Despite the scale of web content that has been archived by libraries and archives to date, this work is still in its infancy. Most notably, access remains rudimentary, and the best practices for preservation (which entail aggregating website objects into the WARC file format) do not necessarily support access. Likewise, the kind of assessment tools that might allow researchers to comb through what are essentially large datasets are still lacking. But there are many exciting areas of innovation happening now. Archive-It recently announced two grants: one to improve the archiving of time-based media on the web, and one to provide better access to social media content. Archive-It also recently announced new tools that would support data mining of web archives. There are also projects underway, such as Perma.cc and Memento, that use web archiving tools and strategies to mitigate link rot, a problem that recently received a lot of press as a result of a study of links to web resources in scholarly articles published between 1997 and 2012 (TL;DR: 1 in 5 links had rotted).
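For readers curious what a WARC file actually looks like: it is a sequence of records, each pairing a small header block with a payload (typically the raw HTTP response for one captured URL). The sketch below builds a single minimal WARC/1.0 "response" record using only Python's standard library; it illustrates the container format only, and real tools (the warcio library, for instance) also handle payload digests, warcinfo records, and per-record compression. The URL and HTTP response here are hypothetical.

```python
import io
import uuid
from datetime import datetime, timezone

def warc_response_record(url, http_bytes):
    """Build a minimal WARC/1.0 'response' record for one captured URL."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date",
         datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", url),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_bytes))),
    ]
    buf = io.BytesIO()
    buf.write(b"WARC/1.0\r\n")
    for name, value in headers:
        buf.write(("%s: %s\r\n" % (name, value)).encode())
    buf.write(b"\r\n")           # blank line ends the header block
    buf.write(http_bytes)        # the raw HTTP response, stored verbatim
    buf.write(b"\r\n\r\n")       # records are separated by two CRLFs
    return buf.getvalue()

# A hypothetical captured HTTP response.
record = warc_response_record(
    "http://example.edu/",
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>",
)
print(record[:8])  # b'WARC/1.0'
```

Because each record carries its own length and metadata, millions of captures can be concatenated into large WARC files for efficient storage, which is exactly why the format is good for preservation but awkward for direct access.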

With all the good work happening around web archiving, it’s also important to reflect on the sustainability of this preservation model. The primary benefit of Archive-It as a service is that it allows libraries and archives to focus their resources on traditional curatorial activities while offloading the technical burden to a central organization with dedicated expertise and infrastructure. But with the California Digital Library’s recent decision to discontinue its Web Archiving Service (WAS), some have questioned the wisdom of consolidating all that expertise and infrastructure under one organization. The Library of Alexandria analogy is perhaps imprecise, but not by much (and centralization is certainly not our only sustainability challenge in this domain). It does suggest that archivists and librarians will need to take a more active and vocal role in shaping these services in order to effectively support our long-term stewardship mission(s).

