Date of Award: 2017
Doctor of Philosophy (PhD)
Without a valid assessment of accuracy, data users risk drawing incorrect conclusions or making bad decisions based on inaccurate data. This dissertation proposes a theoretical method for developing data-accuracy metrics specific to any given person-centric integrated system and shows how a data analyst can use these metrics to estimate the overall accuracy of person-centric data.
Estimating the accuracy of Personal Identifiable Information (PII) creates a corresponding need to model and formalize PII, for both the real world and electronic data, in a way that supports rigorous reasoning relative to real-world facts, expert opinions, and aggregate knowledge. This research provides such a foundation by introducing a temporal first-order logic (FOL) language called Person Data First-order Logic (PDFOL). With its syntax and semantics formalized, PDFOL provides a mechanism for expressing data-accuracy metrics, computing measurements using these metrics on person-centric databases, and comparing those measurements with expected values from real-world populations. Specifically, it enables data analysts to model person attributes and inter-person relations from real-world populations or database representations of them, as well as real-world facts, expert opinions, and aggregate knowledge. PDFOL builds on existing first-order logics by adding temporal predicates based on time intervals, aggregate functions, and tuple-set comparison operators. It adapts and extends the traditional aggregate functions in three ways: a) allowing an arbitrary number of free variables in a function statement, b) adding groupings, and c) defining new aggregate functions. These features allow PDFOL to model person-centric databases, enabling formal and efficient reasoning about their accuracy.
This dissertation also explains how data analysts can use PDFOL statements to develop formal accuracy metrics specific to a person-centric database, especially an integrated one, and how those metrics can then be used to assess the accuracy of the database. Data analysts apply these metrics to person-centric data to compute the quality-assessment measurements, YD. They then use statistical methods to compare these measurements with the corresponding real-world measurements, YR. The working hypothesis is that YD and YR should be very similar if the person-centric data is an accurate and complete representation of the real-world population.
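The comparison of YD with YR can be illustrated with a small sketch. This is not the dissertation's method or PDFOL itself. It assumes a hypothetical `Person` record, a hypothetical metric (the share of mother-child links in which the recorded mother is less than 15 years older than the child), and a simple two-sided one-proportion z-test as the statistical comparison. The expected real-world rate passed in as YR is a placeholder value chosen by the caller, not a figure from the source.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    # Hypothetical, simplified person record; not the dissertation's schema.
    person_id: int
    birth_year: int
    mother_id: Optional[int] = None


def metric_young_mother(people: list[Person], min_gap: int = 15) -> tuple[int, int]:
    """Return (flagged, linked): children whose recorded mother is fewer than
    `min_gap` years older, out of all children with a resolvable mother link."""
    by_id = {p.person_id: p for p in people}
    flagged = 0
    linked = 0
    for p in people:
        mother = by_id.get(p.mother_id) if p.mother_id is not None else None
        if mother is None:
            continue  # no link, or the link points outside the database
        linked += 1
        if p.birth_year - mother.birth_year < min_gap:
            flagged += 1
    return flagged, linked


def one_proportion_z_test(successes: int, n: int, expected: float) -> tuple[float, float]:
    """Two-sided z-test of the observed proportion (YD) against the expected
    real-world proportion (YR). Returns (z, p_value)."""
    observed = successes / n
    std_err = math.sqrt(expected * (1.0 - expected) / n)
    z = (observed - expected) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    people = [
        Person(1, 1950),
        Person(2, 1975, mother_id=1),
        Person(3, 1960, mother_id=1),  # implausible 10-year gap
    ]
    flagged, linked = metric_young_mother(people)
    # 0.01 is an assumed placeholder for YR, for illustration only.
    print(one_proportion_z_test(flagged, linked, expected=0.01))
```

A small p-value would suggest that YD differs from YR by more than sampling variation explains, which under the hypothesis above points to inaccurate or incomplete data. The normal approximation is only reasonable for large samples, so a three-record example like this one shows the mechanics and nothing more.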
Finally, I show that accuracy estimates produced by PDFOL-based metrics can be good predictors of database accuracy. Specifically, I evaluated the performance of selected accuracy metrics by applying them to a person-centric database, mutating the database in various ways to degrade its accuracy, and then re-applying the metrics to see whether they reflect the expected degradation.
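The mutate-and-remeasure evaluation can be sketched on toy data. Everything here is an assumption made for illustration: the two-generation toy database, the `implausible_gap_rate` metric, and the `mutate` function that shifts a chosen fraction of child birth years 20 years too early. The dissertation's actual database, metrics, and mutation operators are not described in this abstract.

```python
import random


def build_clean_db(n_families: int) -> list[dict]:
    """Toy database: each mother is exactly 25 years older than her one child."""
    db = []
    for i in range(n_families):
        db.append({"id": i, "birth_year": 1950, "mother_id": None})
        db.append({"id": n_families + i, "birth_year": 1975, "mother_id": i})
    return db


def implausible_gap_rate(db: list[dict], min_gap: int = 15) -> float:
    """Hypothetical metric: share of linked children born fewer than
    `min_gap` years after their recorded mother."""
    by_id = {r["id"]: r for r in db}
    children = [r for r in db if r["mother_id"] in by_id]
    if not children:
        return 0.0
    flagged = sum(
        1
        for r in children
        if r["birth_year"] - by_id[r["mother_id"]]["birth_year"] < min_gap
    )
    return flagged / len(children)


def mutate(db: list[dict], error_rate: float, rng: random.Random) -> list[dict]:
    """Return a degraded copy: a random `error_rate` fraction of child
    records get a birth year that is 20 years too early."""
    mutated = [dict(r) for r in db]  # copy so the clean database is untouched
    children = [r for r in mutated if r["mother_id"] is not None]
    for r in rng.sample(children, round(error_rate * len(children))):
        r["birth_year"] -= 20
    return mutated


if __name__ == "__main__":
    rng = random.Random(0)
    clean = build_clean_db(200)
    for rate in (0.0, 0.1, 0.3, 0.5):
        print(rate, implausible_gap_rate(mutate(clean, rate, rng)))
```

In this toy setup the metric tracks the injected error rate exactly, because the mutation was designed to trigger it. With real data and more varied mutations the relationship would be noisier, and measuring how closely a metric follows the known degradation is the point of the evaluation.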
This research will help data analysts develop accuracy metrics specific to their person-centric data. In addition, PDFOL can provide a foundation for future methods of reasoning about other quality dimensions of PII.
Shatnawi, Amani "Mohammad Jum'h" Amin, "Estimating Accuracy of Personal Identifiable Information in Integrated Data Systems" (2017). All Graduate Theses and Dissertations. 6103.