Date of Award:

12-2010

Document Type:

Thesis

Degree Name:

Master of Science (MS)

Department:

Computer Science

Committee Chair(s)

Stephen W. Clyde

Committee

Stephen W. Clyde

Committee

Vicki Allan

Committee

Stephen Allan

Abstract

This thesis presents two deduplication techniques that overcome the following critical and long-standing weaknesses of rule-based deduplication: (1) traditional rule-based deduplication requires significant manual tuning of the individual rules, including the selection of appropriate thresholds; (2) the accuracy of rule-based deduplication degrades when there are missing data values, significantly reducing the efficacy of the expert-defined deduplication rules.

The first technique is a novel rule-level match-score fusion algorithm that employs kernel-machine-based learning to discover the decision threshold for the overall system automatically. The second is a novel clue-level match-score fusion algorithm that addresses both weaknesses (1) and (2). This solution provides robustness against missing or incomplete record data by selecting a best-fit support vector machine. Empirical evidence shows that the combination of these two solutions eliminates both long-standing problems, providing accurate and robust results for rule-based deduplication.

Checksum

0ee48fb872a4892622ce127c96273580

Comments

This work made publicly available electronically on November 29, 2010.

Recommended Citation

Dinerstein, Jared, "Learning-Based Fusion for Data Deduplication: A Robust and Automated Solution" (2010). All Graduate Theses and Dissertations, Spring 1920 to Summer 2023. 787.
https://digitalcommons.usu.edu/etd/787

Included in

Computer Sciences Commons

Copyright for this work is retained by the student. If you have any questions regarding the inclusion of this work in the Digital Commons, please email us at .

DOI

https://doi.org/10.26076/ddfd-6aaf

All Graduate Theses and Dissertations, Spring 1920 to Summer 2023

Learning-Based Fusion for Data Deduplication: A Robust and Automated Solution
