Title
Tuning Random Forests for Interpretability
Class
Article
College
College of Science
Department
Mathematics and Statistics Department
Faculty Mentor
D. Richard Cutler
Presentation Type
Poster Presentation
Abstract
Random Forests are a widely used predictive technique in the modern data analyst’s toolkit. As with other machine learning methods, Random Forests have hyper-parameters that should be tuned both for the best predictive accuracy and for interpretation. Variable importance measures give users valuable insight into which features are most informative for prediction. The subject of my research is the commonly used permutation importance algorithm for Random Forests. The key results of my research are: 1. When predictive features are highly correlated, importance values can be misleading. 2. The best choice of the Random Forests hyper-parameter mtry for importances may be quite different from the best mtry for prediction, especially when features are highly correlated; when correlated features are byproducts of one another, larger values of mtry give superior importance values. 3. The square root of importance values is a better measure than the raw values. 4. A collection of importance, accuracy, and association measures is more helpful than a single tuning measure. I implemented plots and measures associated with these results in a package for the R programming language to assist users of Random Forests. Ultimately, the package helps analysts tune Random Forests based on variable importance information as well as predictive accuracy.
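The ideas in the abstract can be illustrated with a short sketch. The R code below is a minimal example, not the author's package (whose name is not given here): it fits randomForest models at several mtry values on simulated data with highly correlated predictors, extracts permutation importances, and applies the square-root scaling from result 3. The simulated variables and grid of mtry values are illustrative assumptions.

```r
# Minimal sketch (assumed setup, not the author's package): how permutation
# importances from the randomForest package shift with mtry under correlation.
library(randomForest)

set.seed(42)

# Simulated data (illustrative): x2 and x3 are near-copies ("byproducts") of x1.
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- x1 + rnorm(n, sd = 0.1)
x4 <- rnorm(n)
y  <- x1 + 0.5 * x4 + rnorm(n)
dat <- data.frame(y, x1, x2, x3, x4)

# Fit a forest at each candidate mtry and extract permutation importance
# (type = 1 is the permutation-based mean decrease in accuracy, %IncMSE).
mtry_grid <- 1:4
imp <- sapply(mtry_grid, function(m) {
  fit <- randomForest(y ~ ., data = dat, mtry = m, importance = TRUE)
  importance(fit, type = 1, scale = FALSE)[, 1]
})
colnames(imp) <- paste0("mtry=", mtry_grid)

# Square-root scaling of the (nonnegative) importances, per result 3.
imp_sqrt <- sqrt(pmax(imp, 0))
print(round(imp_sqrt, 3))

# Compare how importance values and rankings change across mtry.
matplot(mtry_grid, t(imp_sqrt), type = "b", pch = 1:4, col = 1:4, lty = 1:4,
        xlab = "mtry", ylab = "sqrt(permutation importance)")
legend("topleft", legend = rownames(imp_sqrt), pch = 1:4, col = 1:4, lty = 1:4)
```

With near-duplicate predictors such as x1, x2, and x3 in this simulation, the importance values at small mtry typically differ from those at large mtry, which is the tuning behavior the abstract describes.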
Location
Logan, UT
Start Date
4-12-2023 12:30 PM
End Date
4-12-2023 1:30 PM