Date of Award:


Document Type:


Degree Name:

Master of Science (MS)


Department

Mathematics and Statistics

Committee Chair(s)

John Stevens


John Stevens


Richard Cutler


Yan Sun


Many diseases and health conditions are now known to be genetically driven. These problems often manifest only when a particular set of genes is either active or inactive. Recent technology allows us to measure the activity level of genes in cells, which we call gene expression. It is of great interest to be able to statistically compare the expression of a large number of genes between two or more groups. For example, we may want to compare gene expression in a group of cancer patients with that in a group of non-cancer patients to better understand the genetic causes of that particular cancer. Understanding these genetic causes could lead to improved treatment options.

Initially, each gene was tested individually for a statistically significant difference in expression between groups. In more recent years, it has been recognized that grouping genes by biological process into gene sets, and comparing groups at the gene set level, is likely to be more meaningful biologically. A number of gene set test methods have since been developed. It is critically important to know whether these gene set test methods are accurate.

In this research, we compare the accuracy of a group of popular gene set test methods across a range of biologically realistic scenarios. To measure accuracy, we need to know whether each gene set is truly differentially expressed. Since the truth cannot be known in real gene expression data, we use simulated data. We develop a simulation framework that generates data representative of actual gene expression data, and we use it to test each gene set method over a range of biologically relevant scenarios. We then compare the power and false discovery rate of each method across these scenarios.
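To make the two accuracy measures concrete, the following is a minimal sketch of how power and false discovery rate could be computed for one method in one simulated scenario, once the true status of every gene set is known. The function name `power_and_fdr` and the toy inputs are hypothetical and chosen only for illustration; they are not taken from the simulation framework developed in this research. Power is taken here as the fraction of truly differentially expressed gene sets that the method declares significant, and false discovery rate as the fraction of declared gene sets that are not truly differentially expressed.

```python
def power_and_fdr(truly_de, called_de):
    """Return (power, FDR) for one method in one simulated scenario.

    truly_de  : list of bool, True if the gene set was simulated as
                differentially expressed (known because data are simulated)
    called_de : list of bool, True if the method declared the gene set
                significant
    """
    if len(truly_de) != len(called_de):
        raise ValueError("inputs must have the same length")

    # Gene sets correctly declared significant.
    true_pos = sum(t and c for t, c in zip(truly_de, called_de))
    # Gene sets declared significant that are not truly differentially expressed.
    false_pos = sum((not t) and c for t, c in zip(truly_de, called_de))

    n_de = sum(truly_de)
    n_called = true_pos + false_pos

    # Power is undefined when no gene set is truly differentially expressed.
    power = true_pos / n_de if n_de else float("nan")
    # Illustrative convention: FDR is 0 when nothing is declared significant.
    fdr = false_pos / n_called if n_called else 0.0
    return power, fdr


# Toy example: 10 gene sets, 4 of them truly differentially expressed.
truth = [True, True, True, True, False, False, False, False, False, False]
calls = [True, True, True, False, True, False, False, False, False, False]
power, fdr = power_and_fdr(truth, calls)
print(power, fdr)  # 0.75 0.25
```

In the toy example the method finds 3 of the 4 truly differentially expressed gene sets (power 0.75), and 1 of its 4 declarations is false (false discovery rate 0.25). In a full comparison these quantities would be averaged over many simulated data sets per scenario.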