Effect of missing data on performance of learning algorithms for hydrologic predictions: Implications to an imputation technique

Journal/Book Title/Conference

Water Resources Research



A common practice when preprocessing data for hydrological modeling is to discard any observation that has a missing variable value at a given time step, even if only one of the independent variables is missing. In most cases, these rows of data are labeled incomplete and are not used in either model building or subsequent model testing and verification. We argue that this is not necessarily an optimal approach for dealing with missing data because significant information can be lost when incomplete rows are discarded. Learning algorithms are affected by this problem more than physically based models because they rely heavily on data to learn the underlying input/output relationships of the systems being modeled. In this study, the extent to which missing data degrade the performance of learning algorithms is explored in a field-scale application. To do so, we employed two well-known learning algorithms, artificial neural networks (ANNs) and support vector machines (SVMs), for short-term prediction of groundwater levels at a well field. Their performance is compared by subjecting both algorithms to various levels of missing data. In addition to clarifying the relative strengths of these algorithms in dealing with missing data, an imputation methodology for filling the data gaps is proposed and tested against observed data. The utility of this approach is further demonstrated by analyzing model runs obtained with and without imputed data. It is shown that as the percentage of missing data increases, the forecasting accuracy of ANNs is compromised more than that of SVMs. However, ANNs also derive the greater benefit from the use of imputed data.
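
The contrast the abstract draws between discarding incomplete rows and filling the gaps can be illustrated with a minimal sketch. The abstract does not describe the imputation methodology itself, so column-mean imputation is used here purely as a stand-in. The data are synthetic, and the function names `drop_incomplete_rows` and `impute_column_means`, the 10% missing rate, and the four-input layout are all assumptions made for illustration, not details from the study.

```python
import numpy as np

def drop_incomplete_rows(X, y):
    """Listwise deletion: keep only rows in which every input is observed."""
    complete = ~np.isnan(X).any(axis=1)
    return X[complete], y[complete]

def impute_column_means(X):
    """Stand-in imputation: fill each gap with the mean of the observed
    values in the same column (NOT the method proposed in the paper)."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

rng = np.random.default_rng(0)

# Synthetic inputs and target; these are not data from the study.
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -1.0, 0.3, 0.8])

# Knock out roughly 10% of the individual input values at random (assumed rate).
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.10] = np.nan

X_kept, y_kept = drop_incomplete_rows(X_missing, y)
X_filled = impute_column_means(X_missing)

print(f"rows usable after deletion:   {len(X_kept)} of {len(X)}")
print(f"rows usable after imputation: {len(X_filled)} of {len(X)}")
```

Because a row is dropped when any one of its inputs is missing, the share of rows lost to deletion grows faster than the share of individual values that are missing, which is the information loss the abstract refers to. Either array could then be passed to a learning algorithm to compare model runs with and without imputed data.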
