Date of Award:

Summer 2017

Document Type:

Dissertation

Degree Name:

Doctor of Philosophy (PhD)

Department:

Computer Science

Advisor/Chair:

Stephen Clyde

Co-Advisor/Chair:

Xiaojun Qi

Third Advisor:

Curtis Dyreson

Abstract

Without a valid assessment of accuracy there is a risk of data users coming to incorrect conclusions or making bad decision based on inaccurate data. This dissertation proposes a theoretical method for developing data-accuracy metrics specific for any given person-centric integrated system and how a data analyst can use these metrics to estimate the overall accuracy of person-centric data.

Estimating the accuracy of Personal Identifiable Information (PII) creates a corresponding need to model and formalize PII for both the real-world and electronic data, in a way that supports rigorous reasoning relative to real-world facts, expert opinions, and aggregate knowledge. This research provides such a foundation by introducing a temporal first-order logic language (FOL), called Person Data First-order Logic (PDFOL). With its syntax and semantics formalized, PDFOL provides a mechanism for expressing data- accuracy metrics, computing measurements using these metrics on person-centric databases, and comparing those measurements with expected values from real-world populations. Specifically, it enables data analysts to model person attributes and inter-person relations from real-world population or database representations of such, as well as real-world facts, expert opinions, and aggregate knowledge. PDFOL builds on existing first-order logics with the addition of temporal predicated based on time intervals, aggregate functions, and tuple-set comparison operators. It adapts and extends the traditional aggregate functions in three ways: a) allowing any arbitrary number free variables in function statement, b) adding groupings, and c) defining new aggregate function. These features allow PDFOL to model person-centric databases, enabling formal and efficient reason about their accuracy.

This dissertation also explains how data analysts can use PDFOL statements to formalize and develop formal accuracy metrics specific to a person-centric database, especially if it is an integrated person- centric database, which in turn can then be used to assess the accuracy of a database. Data analysts apply these metrics to person-centric data to compute the quality-assessment measurements, YD. After that, they use statistical methods to compare these measurements with the real-world measurements, YR. Compare YD and YR with the hypothesis that they should be very similar, if the person-centric data is an accurate and complete representations of the real-world population.

Finally, I show that estimated accuracy using metrics based on PDFOL can be good predictors of database accuracy. Specifically, I evaluated the performance of selected accuracy metrics by applying them to a person-centric database, mutating the database in various ways to degrade its accuracy, and the re-apply the metrics to see if they reflect the expected degradation.

This research will help data analyst to develop an accuracy metrics specific to their person-centric data. In addition, PDFOL can provide a foundation for future methods for reasoning about other quality dimensions of PII.

Checksum

2757092b46df243935b0eeb142f537d8

Share

COinS