In the last several years, the Library of Congress Web archiving program has grown to include large sites that publish news. Over more than a year, we learned that these sites present serious challenges. After thinking through the use cases for archived online news sites, we realized that completeness of harvest was paramount. As we developed our understanding of the deficiencies in the completeness of harvests of these kinds of sites, we began to test the use of RSS feeds to build customized seed lists for shallow crawls as the primary way these sites are crawled. Over time we discovered that, while completeness of harvest was greatly improved, we had a new problem with the ability to browse to all harvested content. This article is a case study describing these iterative experiences, which remain a work in progress.

Author Biography

Gina M. Jones and Michael Neubert are digital specialists in the Digital Collections Management & Services Division at the Library of Congress. Gina focuses on crawl engineering issues, while Michael works on collection development.


