Using RSS to Improve Web Harvest Results for News Web Sites

Gina M. Jones, Library of Congress
Michael Neubert, Library of Congress

Abstract

In the last several years, the Library of Congress Web archiving program has grown to include large sites that publish news. Over more than a year, we learned that these sites present serious challenges. After thinking through the use cases for archived online news sites, we realized that completeness of harvest was paramount. As we developed our understanding of the deficiencies in the completeness of harvests of these kinds of sites, we began to test the use of RSS feeds to build customized seed lists for shallow crawls as the primary way these sites are crawled. Over time we discovered that while completeness of harvest was greatly improved, we had a new problem with the ability to browse to all harvested content. This article is a case study describing these iterative experiences, which remain a work in progress.

Author Biography

Gina M. Jones and Michael Neubert are digital specialists in the Digital Collections Management & Services Division at the Library of Congress. Gina focuses on crawl engineering issues while Michael works on collection development.

Recommended Citation

Jones, Gina M. and Neubert, Michael (2017) "Using RSS to Improve Web Harvest Results for News Web Sites," Journal of Western Archives: Vol. 8: Iss. 2, Article 3.
DOI: https://doi.org/10.26077/0bb9-5e53
Available at: https://digitalcommons.usu.edu/westernarchives/vol8/iss2/3

Included in

Archival Science Commons

Journal of Western Archives
