Date of Award:

5-2012

Document Type:

Dissertation

Degree Name:

Doctor of Philosophy (PhD)

Department:

Mathematics and Statistics

Committee Chair(s)

Christopher Corcoran

Committee

Christopher Corcoran

Committee

Adele Cutler

Committee

Kady Schneiter

Committee

John Stevens

Committee

Ronald Munger

Abstract

The mapping of the human genome and the completion of the Human HapMap project over the past decade have significantly altered how research is conducted with respect to the genetic epidemiology of human disease. Study designs and analytic approaches have evolved rapidly from investigations involving relatively few targeted candidate genes to hypothesis-free genome-wide association studies, where thousands – and now even millions – of single molecular mutations are simultaneously analyzed to identify regions of the genome that may influence disease. As laboratory techniques continue to improve and costs decrease, the volume of genetic data will inexorably rise, and robust tools for data management, statistical analysis, and computation will likewise need to keep pace.

Multiple hypothesis testing is the core problem in analyzing data from a genome-wide association study (GWAS). A conventional GWAS, focused on genetic risk factors leading to disease incidence, samples some number of diseased and non-diseased subjects, genotypes these subjects for a common set of genetic markers, and then carries out an individual hypothesis test of the association between each marker and disease status. Correction for multiple testing in GWAS typically relies upon the Bonferroni multiple testing procedure. With ever-growing panels of markers (the standard panel currently employs one million markers), this approach engenders numerous problems. First, it is overly conservative, both because of the sheer number of tests and because of the Bonferroni ideal that all tests are mutually independent. The growing density of marker panels results in marker loci that are more physically proximate, yielding hypothesis tests that have some dependence structure. Second, the commonly used corrected significance level, on the order of 10^-8, defines an extreme critical region for which the relative error of asymptotic approximations is large. Third, while approximations can be avoided by using a permutation distribution, such an approach is computationally challenging and has not been widely implemented or used. This is particularly critical in the context of alternative multiple testing correction procedures that address the dependence problem, for which permutation distributions are available in principle but in practice are seldom, if ever, used. Fourth, the distribution of test statistics across the various multiple testing approaches depends on additional features of the data, most prominently on what is referred to as the minor allele frequency (MAF), or the proportion of genetic loci for a given marker within the sampled population that carry the least frequent marker variant.
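As a rough illustration of two quantities named in the paragraph above, the following Python sketch computes a Bonferroni per-test significance level and a minor allele frequency. It is not taken from the dissertation. The family-wise level of 0.05, the 0/1/2 genotype coding (copies of one allele per diploid subject), and the function names `bonferroni_level` and `minor_allele_frequency` are all assumptions made for this example.

```python
def bonferroni_level(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction.

    alpha is the desired family-wise error rate (assumed 0.05 below);
    n_tests is the number of markers tested.
    """
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def minor_allele_frequency(genotypes):
    """Minor allele frequency for one marker.

    genotypes is a list with one entry per diploid subject, coded as the
    number of copies (0, 1, or 2) of an arbitrarily chosen allele.
    This coding is an assumption of the sketch.
    """
    if not genotypes:
        raise ValueError("genotypes must not be empty")
    n_alleles = 2 * len(genotypes)          # two allele copies per subject
    freq = sum(genotypes) / n_alleles       # frequency of the counted allele
    return min(freq, 1.0 - freq)            # the rarer allele defines the MAF


# One million markers at an assumed family-wise level of 0.05
level = bonferroni_level(0.05, 1_000_000)

# Six hypothetical subjects genotyped at a single marker
maf = minor_allele_frequency([0, 1, 2, 2, 2, 1])
```

With one million markers and an assumed family-wise level of 0.05, the per-test level is 0.05 / 1,000,000, which is consistent with the "order of 10^-8" figure quoted in the abstract.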

This research project has led to the development and implementation of a parallel processing algorithm that allows exceptionally rapid computation of the permutation distribution for multiple testing procedures that correct for dependence between tests. This eliminates the need for large-sample approximations, which have been found in prior studies to have poor operating characteristics under some common circumstances. This parallel processing approach relies upon existing hardware and software commonly available in desktop personal computers, making efficient and cost-effective computational tools available to the research community. In addition, we have leveraged these efficient permutation tools to implement MAF-corrected exact tests, eliminating the bias in multiple testing procedures that arises in particular when the MAF is small. We have further extended these tools to other analytic problems in large-scale genetic association settings, such as tests for gene-environment interactions.
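The abstract does not say which dependence-correcting procedure or which parallel hardware the dissertation uses. Purely as an illustration of the general idea, the Python sketch below uses a single-step max-statistic permutation adjustment, a well-known technique that is swapped in here as an assumption. Case and control labels are shuffled, the largest statistic across all markers is recorded for each shuffle, and each observed statistic is compared against that distribution. Permutations are split into independent batches that are submitted to a worker pool. The statistic `allele_diff`, the function names, the batch layout, and the toy data are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def allele_diff(genotypes, labels):
    """Absolute difference in mean allele count, cases (1) vs controls (0)."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def max_stat_batch(markers, labels, n_perm, seed):
    """For each of n_perm label shuffles, return the maximum statistic over markers."""
    rng = random.Random(seed)        # independent generator per batch
    shuffled = list(labels)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)        # break any marker-disease association
        maxima.append(max(allele_diff(m, shuffled) for m in markers))
    return maxima


def permutation_adjusted_pvalues(markers, labels, n_perm=200, n_workers=4, seed=0):
    """Single-step max-statistic adjusted p-values, one per marker."""
    observed = [allele_diff(m, labels) for m in markers]
    # Split the permutations as evenly as possible across the workers.
    sizes = [n_perm // n_workers + (1 if i < n_perm % n_workers else 0)
             for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(max_stat_batch, markers, labels, k, seed + i)
                   for i, k in enumerate(sizes) if k > 0]
        maxima = [m for f in futures for m in f.result()]
    # Add-one estimator keeps every p-value strictly positive.
    return [(1 + sum(m >= obs for m in maxima)) / (1 + len(maxima))
            for obs in observed]


# Toy data: 10 cases followed by 10 controls, two markers coded 0/1/2.
labels = [1] * 10 + [0] * 10
associated = [2] * 10 + [0] * 10     # perfectly separates cases and controls
null_marker = [0, 1] * 10            # same mean in both groups
pvals = permutation_adjusted_pvalues([associated, null_marker], labels)
```

Because every marker is evaluated under the same shuffled labels, the correlation between nearby markers is preserved in the reference distribution, which is what lets this kind of adjustment account for dependence between tests. A thread pool is used here only to keep the sketch portable. In CPython, threads will not speed up this pure-Python arithmetic, so a real implementation would dispatch the same batches to separate processes or to other parallel hardware.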

Checksum

b6507b26d10e9f334e5ca8145fc1b130

Comments

This work was made publicly available electronically on April 12, 2012.
