Log-data Visualization Tool for Analyzing and Improving Performance of Data De-duplication Tool in CHARM-II

Daniel Erickson, Utah State University

This work was made publicly available electronically on December 9, 2011.

Abstract

CHARM-II uses a de-duplication tool, the CHARM Matcher, which produces log files recording why it decides that two records are or are not a match. Properly analyzed, this data could help CHARM developers improve the Matcher over time by tuning its configuration. However, the log data is complex and is recorded chronologically rather than in a form that aids analysis, so studying the raw logs by eye is laborious and difficult. This report describes a tool that parses and organizes the raw log data and then produces graphical reports that summarize key performance indicators. These indicators give CHARM developers the information they need to improve the Matcher’s specificity and sensitivity for any particular data source. A significant contribution of this report, and a prerequisite to creating a meaningful tool, was the investigation of possible performance indicators and the determination of which would be best suited to the existing CHARM Matcher. In anticipation of further evolution of the CHARM Matcher, the proposed tool is designed to be extensible, so that additional indicators and reports can be added as the need arises.
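
As a rough illustration of the two indicators named above, the following sketch computes sensitivity and specificity from a list of match decisions. It is not the report's tool: the abstract does not show the Matcher's log format, so the input here is a hypothetical list of (predicted_match, actual_match) pairs, where the actual status is assumed to come from some manually reviewed reference set.

```python
from collections import Counter


def confusion_counts(decisions):
    """Tally true/false positives and negatives.

    `decisions` is a hypothetical iterable of (predicted, actual) booleans:
    `predicted` is the matcher's decision, `actual` is the assumed
    reviewer-confirmed answer. Neither name comes from the CHARM logs.
    """
    counts = Counter()
    for predicted, actual in decisions:
        if predicted and actual:
            counts["tp"] += 1   # correctly declared a match
        elif predicted and not actual:
            counts["fp"] += 1   # declared a match that is not one
        elif not predicted and actual:
            counts["fn"] += 1   # missed a real match
        else:
            counts["tn"] += 1   # correctly declared a non-match
    return counts


def sensitivity(counts):
    """Share of real matches that were found: TP / (TP + FN)."""
    denom = counts["tp"] + counts["fn"]
    return counts["tp"] / denom if denom else None


def specificity(counts):
    """Share of real non-matches that were rejected: TN / (TN + FP)."""
    denom = counts["tn"] + counts["fp"]
    return counts["tn"] / denom if denom else None


if __name__ == "__main__":
    sample = [
        (True, True), (True, True), (False, True),
        (False, False), (False, False), (False, False), (True, False),
    ]
    counts = confusion_counts(sample)
    print("sensitivity:", sensitivity(counts))
    print("specificity:", specificity(counts))
```

Low sensitivity would suggest the configuration is too strict (real duplicates are missed), while low specificity would suggest it is too permissive (distinct records are merged). Computing both per data source is one plausible way to guide the tuning the abstract describes.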