Date of Award:

8-2017

Document Type:

Dissertation

Degree Name:

Doctor of Philosophy (PhD)

Department:

Computer Science

Committee Chair(s)

Stephen Clyde

Committee

Stephen Clyde

Xiaojun Qi

Curtis Dyreson

Nicholas Flann

Bedri Cetiner

Abstract

Both government agencies and private companies rely on the collection of personal data on an ever-increasing scale. Out of necessity, person data include Personally Identifiable Information (PII), which is information that could potentially identify a specific individual. Much of this data is integrated so that data analysts, policy makers, or corporate officers can use it to make decisions or draw conclusions. Integrating data in a heterogeneous database environment creates a need to estimate the accuracy of that data; without a valid assessment of accuracy, there is a risk of reaching incorrect conclusions or making bad decisions based on inaccurate data. Confidentiality issues and the inaccessibility of the real individuals raise the question of how to measure the accuracy of person data, and specifically PII. The problem therefore becomes one of estimating data accuracy using real-world facts, expert opinions, or aggregate knowledge about the represented population.

Estimating the quality of PII creates a corresponding need to model and formalize PII for both the real world and electronic data, in a way that supports rigorous reasoning relative to real-world facts, rules from domain experts, and rules about expected data patterns. This research presents an extended first-order logic (FOL) language, called PDFOL (Person Data First-order Logic), that can express these kinds of facts and rules, as well as relevant person attributes and inter-person relations. The salient features of PDFOL are: 1) temporal predicates based on time intervals, 2) aggregate functions, and 3) tuple-set comparison operators. I adapt and extend the traditional aggregate functions to allow an arbitrary number of free variables in a function statement, add a grouping feature to aggregate functions, and define new aggregate functions. These features allow PDFOL to model person-centric databases, enabling formal and efficient reasoning about their accuracy and helping to provide methods for reasoning about the accuracy of PII.
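To make two of these features concrete, the following Python sketch imitates an interval-based temporal predicate and an aggregate function with grouping. It is only an illustration and is not PDFOL syntax: the `Residence` record, the `holds_during` and `count_by_city` functions, and the sample data are all hypothetical names and values introduced here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Residence:
    """Hypothetical fact: a person lived in a city over a closed year interval."""
    person_id: int
    city: str
    start: int  # first year of residence (inclusive)
    end: int    # last year of residence (inclusive)


def holds_during(fact, start, end):
    """Temporal predicate: True when the fact's interval overlaps [start, end]."""
    return fact.start <= end and start <= fact.end


def count_by_city(facts, start, end):
    """Grouped aggregate: number of distinct persons per city within the interval."""
    seen = {(f.city, f.person_id) for f in facts if holds_during(f, start, end)}
    return Counter(city for city, _ in seen)


# Made-up sample data, used only to exercise the functions above.
records = [
    Residence(1, "Logan", 2001, 2005),
    Residence(2, "Logan", 2004, 2010),
    Residence(2, "Provo", 2011, 2015),
    Residence(3, "Provo", 1999, 2003),
]
counts = count_by_city(records, 2004, 2004)
```

Grouping by city plays the role of the grouping feature described above, and the interval test stands in for a time-interval predicate; an actual PDFOL statement would express both declaratively.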

I also propose a method that describes how data analysts can use PDFOL statements to formalize and develop accuracy metrics specific to a person-centric database, especially an integrated one, which can then be used to assess the accuracy of that database. Data analysts apply these metrics to person-centric data to compute quality-assessment measurements. They then statistically compare these measurements with the corresponding real-world measurements, under the hypothesis that the two should be very similar if the person-centric data are an accurate and complete representation of the real-world population.
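The comparison step can be sketched in Python as follows. The abstract does not name a specific statistical test, so the chi-square goodness-of-fit statistic used here is an assumption chosen for illustration, as are the function name `chi_square_statistic`, the age groups, and all of the numbers.

```python
def chi_square_statistic(observed_counts, expected_proportions):
    """Chi-square goodness-of-fit statistic between database counts and
    real-world proportions (an assumed choice of test, not the source's)."""
    total = sum(observed_counts.values())
    stat = 0.0
    for group, proportion in expected_proportions.items():
        expected = total * proportion
        observed = observed_counts.get(group, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical measurement computed from a person-centric database.
db_counts = {"0-17": 20, "18-64": 60, "65+": 20}
# Hypothetical aggregate knowledge about the real-world population.
population = {"0-17": 0.25, "18-64": 0.60, "65+": 0.15}

stat = chi_square_statistic(db_counts, population)
```

A statistic near zero indicates that the database distribution closely matches the real-world distribution; in practice the value would be compared against a chi-square critical value for the appropriate degrees of freedom.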

I evaluated the performance of the developed accuracy metrics and their predictive capability, and I show that the metrics are applicable and can easily be used to estimate the accuracy of person-centric data. The evaluation shows how the proposed methodology can estimate the accuracy of person-centric data, giving an accuracy value that is almost equal to the real accuracy, with some deviations.

Checksum

97b2b1bd464d48e0b0e95dac33ebe230
