Date of Award:


Document Type:


Degree Name:

Master of Science (MS)


Mathematics and Statistics

Committee Chair(s):

D. Richard Cutler


D. Richard Cutler


Kevin Moon


John R. Stevens


A major focus in statistics is building and improving computational algorithms that can use data to predict a response. Two fundamental camps of research arise from such a goal. The first camp is researching ways to get more accurate predictions. Many sophisticated methods, collectively known as machine learning methods, have been developed for this very purpose. One such method, widely used in industry and many other areas of investigation, is Random Forests.

The second camp of research is that of improving the interpretability of machine learning methods. This is worthy of attention when analysts desire to optimize current systems or processes so that superior response values may be obtained. It also matters if analysts wish to characterize how variables relate to one another and which associations between the response and predictor variables are strongest. Variable importance measures are one powerful tool that has been developed to help meet these objectives. The focus of my research is the widely used permutation variable importance algorithm that is part of Random Forests.

Statisticians have found that models can be tuned to improve predictive accuracy. However, comparatively little research is available regarding how these same adjustments impact the interpretability of the models, and any meaningful information that is available on this subject is not widely considered by analysts when fitting models. My thesis explores how a model ought to be adjusted to obtain superior variable importance information. My research has led me to build a software package that provides model developers with better tools and graphics for evaluating variable associations and for comparing importance values and prediction accuracies across different models. Ultimately, I hope to provide analysts with insights and tools to build better Random Forests for predictive accuracy and for assessing and interpreting the importance of variables for prediction.