Date of Award:

5-2011

Document Type:

Dissertation

Degree Name:

Doctor of Philosophy (PhD)

Department:

Computer Science

Committee Chair(s)

Xiaojun Qi

Committee

Xiaojun Qi

Committee

Changhui Yan

Committee

Minghui Jiang

Committee

Adele Cutler

Committee

Vicki Allan

Abstract

Machine learning techniques are now widely used to extract knowledge from data in a large number of bioinformatics problems. In many such problems, data observations can be naturally represented by discrete structures such as graphs, networks, trees, or sequences. For example, a protein can be seen as a cloud of interconnected atoms in three-dimensional space. The focus of this dissertation is on the development and application of machine learning techniques to bioinformatics problems in which the data can be represented by graphs. In particular, we focus on proteins, which are essential elements in the life process. The study of their underlying structure and function is one of the most important subjects in bioinformatics. Because proteins can be naturally represented by graphs, we consider kernel functions that operate directly on data observations in the form of graphs. Kernel functions are the basic building block of a powerful family of machine learning algorithms called kernel methods.

Concretely, we propose a novel approach for predicting the function of proteins. We model proteins as graphs, and we predict function using support vector machines and graph kernels. We evaluate our approach on two function prediction tasks: the discrimination of proteins as enzymes or non-enzymes, and the recognition of DNA-binding proteins. In both cases, the resulting performance is higher than that of existing methods.

In addition, given the establishment of ontologies as a popular topic in biomedical research, we propose two novel semantic similarity measures between pairs of proteins. First, we introduce a novel semantic similarity method between pairs of Gene Ontology terms. Second, we propose an instance of the shortest path graph kernel for calculating the semantic similarity between proteins. This latter approach, when compared with state-of-the-art methods, yields improved performance.
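
As an illustration of the general idea behind a shortest path graph kernel, the following is a minimal sketch and not the dissertation's implementation. It assumes undirected, unweighted graphs with one discrete label per node, and it uses the simplest (delta) comparison: two shortest paths match when they have the same length and the same pair of endpoint labels. The graph encoding as a `(labels, edges)` pair and all function names are hypothetical choices made for this example; the specific kernel instances used in the dissertation are not described in this abstract.

```python
from collections import Counter
from itertools import product

INF = float("inf")


def shortest_path_lengths(n, edges):
    """All-pairs shortest path lengths (Floyd-Warshall), unweighted and undirected."""
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
    # product() varies the leftmost index slowest, so k is the outer loop.
    for k, i, j in product(range(n), repeat=3):
        if dist[i][k] + dist[k][j] < dist[i][j]:
            dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def path_features(labels, edges):
    """Count shortest paths by (unordered endpoint labels, path length)."""
    n = len(labels)
    dist = shortest_path_lengths(n, edges)
    feats = Counter()
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] != INF:  # skip disconnected node pairs
                endpoints = tuple(sorted((labels[i], labels[j])))
                feats[(endpoints, dist[i][j])] += 1
    return feats


def shortest_path_kernel(g1, g2):
    """Number of matching shortest-path pairs between two labeled graphs.

    Each graph is a hypothetical (labels, edges) pair; the value equals the
    dot product of the two graphs' path-feature count vectors.
    """
    f1 = path_features(*g1)
    f2 = path_features(*g2)
    return sum(count * f2[key] for key, count in f1.items())
```

Because the kernel value is a dot product of feature counts, a matrix of pairwise values over a set of graphs is symmetric and positive semidefinite, which is the property a support vector machine requires of a kernel matrix.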

Checksum

28b98aef641f0a7a10501bb2e07ef547

Comments

This work was made publicly available electronically on April 12, 2012.
