Date of Award:


Document Type:


Degree Name:

Master of Science (MS)


Civil and Environmental Engineering

Committee Chair(s)

Mac McKee


Mac McKee


Alfonso Torres-Rua


Bushra Zaman


The identification of bad data, that is, data from outdoor (in-situ) environmental sensors that are not representative of the target subject, is a topic that has been explored in the past. Many tools (such as data filters and computer models) have correctly identified bad data for the end user over 95% of the time. However, with the continuous increase in the use of automated data collection, a simple indication of the bad data may no longer give the end user enough information to reduce the time that must be spent on manual quality control (QC). The purpose of this research was to devise and test a data classification technique capable of determining when and why water quality data are incorrect in an environment that experiences seasonal and daily fluctuations. Such a technique should reduce or eliminate the need for manual QC in a large-volume data system where the range of good data is wide and changes often. The objectives of this project were (1) to train a learning machine that could identify local maximum and minimum values as well as dulled signals, and (2) to form a multi-class classifier that accurately placed sensor temperature data into three categories: good, bad (because of exposure of the temperature probe to ambient air temperature), and bad (because the sensor has become buried in sediment). This involved the development of a model using a Multi-Class Relevance Vector Machine (MCRVM) and the identification of model parameters that would provide at least 90% removal of false negatives for Classes 2 and 3 (the bad data) using only 100 data points from each class to train the learning machine.
These objectives were met using the following methods: (1) manual QC of the water temperature sensor data, (2) an iterative process that involved selecting inputs for the model and then optimizing these values based on the RVM's performance, and (3) evaluation of the best-performing machines by testing them first on a small group of data and then on a full year of data.
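As an illustration only, the 100-points-per-class training sets and the 90% target described above can be sketched in Python. This is not the thesis code: the function names `sample_per_class` and `detection_rate`, the integer class codes, and the toy labels are all hypothetical. The sketch also assumes that "90% removal of false negatives" means that at least 90% of the points truly belonging to a bad class are assigned to that class. The MCRVM itself is not shown.

```python
import random
from collections import defaultdict

# Hypothetical class codes matching the three categories in the abstract.
GOOD, AIR_EXPOSED, BURIED = 1, 2, 3


def sample_per_class(labels, n_per_class=100, seed=0):
    """Return sorted indices holding at most n_per_class examples of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    chosen = []
    for indices in by_class.values():
        chosen.extend(rng.sample(indices, min(n_per_class, len(indices))))
    return sorted(chosen)


def detection_rate(y_true, y_pred, cls):
    """Fraction of points truly in cls that the classifier also labeled cls."""
    total = sum(1 for t in y_true if t == cls)
    if total == 0:
        return float("nan")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    return hits / total


# Toy labels (not thesis data): one air-exposed point is missed.
y_true = [GOOD] * 4 + [AIR_EXPOSED] * 10 + [BURIED] * 10
y_pred = [GOOD] * 4 + [AIR_EXPOSED] * 9 + [GOOD] + [BURIED] * 10

for cls in (AIR_EXPOSED, BURIED):
    rate = detection_rate(y_true, y_pred, cls)
    print(f"class {cls}: detection rate {rate:.2f}, meets 90% target: {rate >= 0.90}")
```

Under this reading, a candidate machine would be trained on the indices returned by `sample_per_class` and kept only if `detection_rate` is at least 0.90 for both bad-data classes on the held-out data.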