Date of Award:
8-2017
Document Type:
Dissertation
Degree Name:
Doctor of Philosophy (PhD)
Department:
Computer Science
Committee Chair(s)
Stephen Clyde
Committee
Stephen Clyde
Committee
Xiaojun Qi
Committee
Curtis Dyreson
Committee
Nicholas Flann
Committee
Bedri Cetiner
Abstract
Both government agencies and private companies rely on the collection of personal data on an ever-increasing scale. Out of necessity, person data include Personal Identifiable Information (PII), which is information that could potentially identify a specific individual. Many of these data are integrated so that data analysts, policy makers, or corporate officers can use them to make decisions or draw conclusions. Integrating data in a heterogeneous database environment creates a need to estimate the accuracy of the integrated data; without a valid assessment of accuracy, there is a risk of reaching incorrect conclusions or making bad decisions based on inaccurate data. Confidentiality issues and the inaccessibility of the real individuals raise the question of how to measure the accuracy of person data, and specifically PII. The problem thus becomes one of estimating data accuracy using real-world facts, expert opinions, or aggregate knowledge about the represented population.
Estimating the quality of PII creates a corresponding need to model and formalize PII for both the real world and electronic data, in a way that supports rigorous reasoning relative to real-world facts, rules from domain experts, and rules about expected data patterns. This research presents an extended first-order logic (FOL) language, called PDFOL (Person Data First-Order Logic), that can express these kinds of facts and rules, as well as relevant person attributes and inter-person relations. The salient features of PDFOL are 1) temporal predicates based on time intervals, 2) aggregate functions, and 3) tuple-set comparison operators. I adapt and extend the traditional aggregate functions to allow an arbitrary number of free variables in a function statement, add a grouping feature to aggregate functions, and define a new aggregate function. These features allow PDFOL to model person-centric databases, enabling formal and efficient reasoning about their accuracy and providing methods for reasoning about the accuracy of PII.
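As a rough illustration of the first two features named above, the following Python sketch models a temporal predicate over a closed time interval and an aggregate count with grouping. It is not PDFOL syntax, which this abstract does not show; the `Fact` record, the `holds_at` and `count_by_group` helpers, the `lives_in` predicate, and all sample records are hypothetical and exist only for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    """Hypothetical temporal fact: predicate(subject, value) holds on [start, end]."""
    predicate: str
    subject: str
    value: str
    start: date
    end: date


def holds_at(fact: Fact, day: date) -> bool:
    """Temporal predicate: true when `day` falls inside the fact's interval."""
    return fact.start <= day <= fact.end


def count_by_group(facts, predicate, day):
    """Aggregate with grouping: count distinct subjects per value on `day`."""
    groups = defaultdict(set)
    for fact in facts:
        if fact.predicate == predicate and holds_at(fact, day):
            groups[fact.value].add(fact.subject)
    return {value: len(subjects) for value, subjects in groups.items()}


# Invented sample records: where each person lived, and during which interval.
FACTS = [
    Fact("lives_in", "p1", "Logan", date(2010, 1, 1), date(2014, 12, 31)),
    Fact("lives_in", "p1", "Provo", date(2015, 1, 1), date(2017, 8, 1)),
    Fact("lives_in", "p2", "Logan", date(2012, 6, 1), date(2017, 8, 1)),
]

residents_2013 = count_by_group(FACTS, "lives_in", date(2013, 7, 1))
residents_2016 = count_by_group(FACTS, "lives_in", date(2016, 7, 1))
```

Because each fact carries its own interval, the same grouped count gives different answers for different dates, which is the kind of time-dependent statement a purely atemporal predicate cannot express.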
I also propose a method that describes how data analysts can use PDFOL statements to develop formal accuracy metrics specific to a person-centric database, especially an integrated one, which can in turn be used to assess the accuracy of that database. Data analysts apply these metrics to person-centric data to compute quality-assessment measurements. They then statistically compare these measurements with the corresponding real-world measurements, under the hypothesis that the two should be very similar if the person-centric data are an accurate and complete representation of the real-world population.
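The comparison step can be pictured with a small Python sketch. The abstract does not say which statistical test the dissertation uses, so the chi-square goodness-of-fit statistic here is an assumed stand-in, and the age groups, counts, and reference proportions are invented for illustration. The critical value 5.991 is the standard chi-square threshold for 2 degrees of freedom at the 0.05 significance level.

```python
def chi_square_statistic(observed, expected_proportions):
    """Pearson chi-square statistic of observed counts against reference shares."""
    total = sum(observed.values())
    stat = 0.0
    for group, share in expected_proportions.items():
        expected = total * share  # count the reference population would predict
        stat += (observed.get(group, 0) - expected) ** 2 / expected
    return stat


# Invented measurement: people per age group found in the database.
OBSERVED = {"0-17": 230, "18-64": 620, "65+": 150}
# Invented real-world reference shares for the same groups (must sum to 1).
REFERENCE = {"0-17": 0.25, "18-64": 0.60, "65+": 0.15}
# Chi-square critical value for 3 groups (2 degrees of freedom), alpha = 0.05.
CRITICAL_DF2_ALPHA05 = 5.991

stat = chi_square_statistic(OBSERVED, REFERENCE)
# A statistic below the critical value means the database measurement is not
# significantly different from the real-world reference at this level.
consistent = stat < CRITICAL_DF2_ALPHA05
```

A small statistic only indicates that this one measurement is compatible with the reference population; it does not by itself establish that individual records are correct.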
I evaluated the performance of the developed accuracy metrics and their predictive capability, and showed that the metrics are applicable and can easily be used to estimate the accuracy of person-centric data. The evaluation shows how the proposed methodology can estimate the accuracy of person-centric data and yield an accuracy value that is almost equal to the real accuracy, with some deviation.
Checksum
97b2b1bd464d48e0b0e95dac33ebe230
Recommended Citation
Shatnawi, Amani "Mohammad Jum'h" Amin, "Estimating Accuracy of Personal Identifiable Information in Integrated Data Systems" (2017). All Graduate Theses and Dissertations, Spring 1920 to Summer 2023. 6103.
https://digitalcommons.usu.edu/etd/6103
Copyright for this work is retained by the student.