Degree Name

Master of Science (MS)


Mathematics and Statistics


Ecologists increasingly use machine learning algorithms to model and predict the distributions of individual species and of entire assemblages of sites. Accurate prediction of species distributions is essential to such modeling. We compared the prediction accuracy of four machine learning algorithms (random forests, classification trees, support vector machines, and gradient boosting machines) with that of a traditional method, linear discriminant models (LDMs), on a large set of stream invertebrate data collected at 728 reference sites in the western United States. Classifications were constructed for individual species and for assemblages of sites clustered a priori by similarity in biological characteristics. Predictive accuracy of the classifications was evaluated by computing the percentage of sites correctly classified, sensitivity, specificity, kappa, and the area under the receiver operating characteristic curve from 10-fold cross-validated predictions of each classification method for each individual species and assemblage of sites. The predictions from each type of classification were used to estimate the Observed over Expected (O/E) index of taxa richness. Random forests generally produced the most accurate individual species models. However, none of the machine learning algorithms showed significant improvement over LDMs for classifications of assemblages of sites or for the precision of the O/E index. Support vector machines performed particularly poorly in classifying individual species and assemblages of sites, and resulted in greater bias in the O/E index. We believe that the performance of models developed for species at such large spatial scales may depend more on the available predictor variables than on the classification technique.
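
The accuracy measures named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's code: the function names, the 0/1 coding of absence/presence, and the toy data are assumptions made for illustration, and the inputs are assumed to be held-out predictions already pooled across the cross-validation folds for a single species. AUC is computed here with the standard pairwise (Mann-Whitney) formulation.

```python
# Illustrative sketch only: names and toy data are hypothetical.
# Assumes labels are coded 1 = present, 0 = absent, and that both
# classes occur at least once (otherwise the ratios are undefined).

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy_measures(y_true, y_pred):
    """Percent correctly classified, sensitivity, specificity, and kappa."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    n = tp + fp + tn + fn
    p_obs = (tp + tn) / n  # observed agreement
    # Agreement expected by chance, from the row and column totals.
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "pcc": 100.0 * p_obs,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "kappa": (p_obs - p_exp) / (1.0 - p_exp),
    }


def auc(y_true, scores):
    """Area under the ROC curve: the fraction of (present, absent) site
    pairs in which the present site has the higher predicted score,
    counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data for eight hypothetical sites (not from the thesis).
observed = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
predicted = [1 if p >= 0.5 else 0 for p in predicted_prob]

if __name__ == "__main__":
    print(accuracy_measures(observed, predicted))
    print(auc(observed, predicted_prob))
```

The first four measures depend on the threshold used to convert predicted probabilities into presence or absence (0.5 in this sketch, an assumption), whereas AUC is computed from the probabilities directly and does not.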

Included in

Mathematics Commons