Date of Award:

12-2010

Document Type:

Thesis

Degree Name:

Master of Science (MS)

Department:

Computer Science

Committee Chair(s)

Stephen W. Clyde

Committee

Stephen W. Clyde

Committee

Vicki Allan

Committee

Stephen Allan

Abstract

This thesis presents two deduplication techniques that overcome the following critical and long-standing weaknesses of rule-based deduplication: (1) traditional rule-based deduplication requires significant manual tuning of the individual rules, including the selection of appropriate thresholds; (2) the accuracy of rule-based deduplication degrades when there are missing data values, significantly reducing the efficacy of the expert-defined deduplication rules.

The first technique is a novel rule-level match-score fusion algorithm that employs kernel-machine-based learning to discover the decision threshold for the overall system automatically. The second is a novel clue-level match-score fusion algorithm that addresses both problems (1) and (2). It provides robustness against missing or incomplete record data by selecting a best-fit support vector machine. Empirical evidence shows that, in combination, the two techniques overcome both weaknesses and give accurate, robust results for rule-based deduplication.
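As a rough illustration of the rule-level fusion idea, the sketch below learns a single decision boundary over per-rule match scores instead of hand-tuning one threshold per rule. It is not the thesis's implementation: the toy scores, the function names `train_linear_svm` and `is_duplicate`, the hyperparameters, and the choice of a linear support vector machine trained by subgradient descent are all assumptions made for the example, since the abstract does not specify the kernel or the training procedure.

```python
# Illustrative sketch only: toy scores and a linear SVM stand in for the
# thesis's rule scores and kernel machine, which the abstract does not specify.

def train_linear_svm(samples, labels, lr=0.05, lam=0.01, epochs=5000):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:  # sample lies inside the margin or is misclassified
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def is_duplicate(rule_scores, w, b):
    """Fuse the per-rule scores with the learned weights and threshold."""
    return sum(wi * si for wi, si in zip(w, rule_scores)) + b > 0.0


# Hypothetical match scores in [0, 1] from two rules, for labeled record pairs.
train_scores = [
    (0.90, 0.80), (0.80, 0.90), (0.70, 0.95), (0.95, 0.60),  # duplicates
    (0.20, 0.30), (0.10, 0.40), (0.40, 0.10), (0.30, 0.20),  # distinct records
]
train_labels = [1, 1, 1, 1, -1, -1, -1, -1]

weights, bias = train_linear_svm(train_scores, train_labels)

print(is_duplicate((0.9, 0.9), weights, bias))  # both rules agree strongly
print(is_duplicate((0.1, 0.1), weights, bias))  # both rules disagree
```

The learned bias plays the role of the system-wide decision threshold, so no per-rule cutoff has to be chosen by hand. A clue-level variant that tolerates missing values would need one such model per pattern of available clues, which this sketch does not attempt.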

Checksum

0ee48fb872a4892622ce127c96273580

Comments

This work was made publicly available electronically on November 29, 2010.
