Date of Award:
8-2022
Document Type:
Dissertation
Degree Name:
Doctor of Philosophy (PhD)
Department:
Mathematics and Statistics
Committee Chair(s)
Kevin R. Moon
Committee
Kevin R. Moon
Committee
Adele Cutler
Committee
Jürgen Symanzik
Committee
Matt Harris
Committee
Jacob Gunther
Abstract
Many machine learning algorithms use calculated distances or similarities between data observations to make predictions, cluster similar data, visualize patterns, or generally explore the data. Most distance or similarity measures do not incorporate known data labels and are thus considered unsupervised. Supervised methods for measuring distance exist; these incorporate data labels and thereby exaggerate the separation between data points of different classes, which tends to distort the natural structure of the data. Instead of following similar approaches, we leverage random forests, a popular algorithm for making data-driven predictions, to naturally incorporate data labels into similarity measures known as random forest proximities. In this dissertation, we explore previously defined random forest proximities and demonstrate their weaknesses in popular proximity-based applications. Additionally, we develop a new proximity definition that can be used to recreate the random forest’s predictions. We call these Random Forest-Geometry- and Accuracy-Preserving proximities, or RF-GAP. We show by proof and empirical demonstration that RF-GAP proximities can be used to perfectly reconstruct the random forest’s predictions and, as a result, we argue that they provide a truer representation of the random forest’s learning when used in proximity-based applications. We provide evidence to suggest that RF-GAP proximities improve applications including imputing missing data, detecting outliers, and visualizing the data. We also introduce a new random forest proximity-based technique for generating 2- or 3-dimensional data representations that serve as a tool for visually exploring the data. We show that this method does well at portraying the relationship between the data variables and the data labels, and we show quantitatively and qualitatively that it surpasses existing methods for this task.
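As a rough illustration of the "previously defined" proximities mentioned in the abstract, the sketch below computes the classic random forest proximity: the fraction of trees in which two observations fall in the same terminal node. This is not RF-GAP, whose definition is not given on this page. The function name `classic_rf_proximity` and the toy leaf assignments are hypothetical; in practice the leaf indices would come from a fitted forest (for example, scikit-learn's `apply` method on a fitted random forest returns such a matrix).

```python
from typing import List, Sequence


def classic_rf_proximity(leaves: Sequence[Sequence[int]]) -> List[List[float]]:
    """Classic random forest proximity (illustrative sketch, not RF-GAP).

    leaves[i][t] is the terminal-node id that observation i reaches in tree t.
    The proximity of i and j is the share of trees where those ids match.
    """
    n_obs = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n_obs for _ in range(n_obs)]
    for i in range(n_obs):
        for j in range(i, n_obs):
            # Count trees in which both observations land in the same leaf.
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            prox[i][j] = prox[j][i] = shared / n_trees
    return prox


# Hypothetical leaf assignments: 4 observations (rows), 3 trees (columns).
toy_leaves = [
    [1, 4, 2],
    [1, 4, 3],
    [2, 4, 3],
    [2, 5, 7],
]
proximity = classic_rf_proximity(toy_leaves)
```

Here observations 0 and 1 share a leaf in two of the three trees, so their proximity is 2/3, while observations 0 and 3 never share a leaf and have proximity 0. The matrix is symmetric with ones on the diagonal, which is what lets it be passed to proximity-based tools such as outlier detection or multidimensional scaling.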
Checksum
4763a111ca70bc594b674f0d584fd407
Recommended Citation
Rhodes, Jake S., "Geometry- and Accuracy-Preserving Random Forest Proximities with Applications" (2022). All Graduate Theses and Dissertations, Spring 1920 to Summer 2023. 8598.
https://digitalcommons.usu.edu/etd/8598
Copyright for this work is retained by the student. If you have any questions regarding the inclusion of this work in the Digital Commons, please email us at .