In the last several years, the Library of Congress Web archiving program has grown to include large sites that publish news; over more than a year of work we learned that these sites present serious challenges. After thinking through the use cases for archived online news sites, we realized that completeness of harvest was paramount. As we developed our understanding of the deficiencies in the completeness of harvests of these kinds of sites, we began testing the use of RSS feeds to build customized seed lists for shallow crawls as the primary way these sites are crawled. Over time we discovered that while completeness of harvest was greatly improved, we had a new problem with the ability to browse to all harvested content. This article is a case study describing these iterative experiences, which remain a work in progress.
Gina M. Jones and Michael Neubert are digital specialists in the Digital Collections Management & Services Division at the Library of Congress. Gina focuses on crawl engineering issues while Michael works on collection development.
Jones, Gina M. and Neubert, Michael. "Using RSS to Improve Web Harvest Results for News Web Sites," Journal of Western Archives: Vol. 8: Iss. 2, Article 3.
Available at: http://digitalcommons.usu.edu/westernarchives/vol8/iss2/3