Date of Award:

5-2013

Document Type:

Dissertation

Degree Name:

Doctor of Philosophy (PhD)

Department:

Mathematics and Statistics

Committee Chair(s)

Adele Cutler

Committee

Adele Cutler

Committee

Donald Cooley

Committee

Christopher Corcoran

Committee

Daniel Coster

Committee

Jürgen Symanzik

Abstract

Statistical classification is widely used in many areas where there is a need to make a data-driven decision, or to classify complicated cases or objects. For instance: disease diagnostics (is a patient sick or healthy, based on the blood test results?); weather forecasting (will there be a storm tomorrow, based on today's atmospheric pressure, air temperature, and wind velocity?); speech recognition (what was said over the phone, based on the caller's voice level and articulation); spam detection (can the unsolicited commercial e-mails be identified by their content?); and so on.

Classification trees help to answer such questions by constructing a tree-like structure, where the features of the objects are analyzed consequently one at a time in a step-by-step fashion, e.g., if a patient is coughing – measure his/her temperature, if the temperature is above 100.4°F (38.0°C) – listen to the lungs, if there are crackles or rattling noises – suspect pneumonia. The classification results become more reliable if the decision is made by aggregating many trees created from randomly sampled data into a Random Forest, similarly to consulting several doctors with different training backgrounds before stating a subtle diagnosis.

In this work the tree classification algorithm was enhanced with the ability to consider the objects' features in pairs, similarly to considering a patient's body mass index (weight together with height) before diagnosing obesity; or considering a customer's debt-to-income ratio (income together with debt) before approving him/her for a loan. The trees created with the new method are called oblique, because they separate the objects with oblique lines when looking at the pairwise features plots.

Since the new method is able to focus on pairs of features, it can be used to determine which of the pairs are more useful for classification (chosen more often than others), how the features relate and interact with each other.

This work contains theoretical argumentation for the new method, as well as the detailed description of the classification algorithm, which was implemented in a computer software package (download links are provided). The properties and performance of oblique trees were investigated using numerical simulations and real data examples. Comparison with other popular classification methods was also performed.

Checksum

0549aac85e01712d9f9b808e8c0f5f18

Share

COinS