Date of Award:
5-2012
Document Type:
Dissertation
Degree Name:
Doctor of Philosophy (PhD)
Department:
Mathematics and Statistics
Committee Chair(s)
Christopher Corcoran
Committee
Christopher Corcoran
Committee
Adele Cutler
Committee
Kady Schneiter
Committee
John Stevens
Committee
Ronald Munger
Abstract
The mapping of the human genome and the completion of the Human HapMap project over the past decade have significantly altered how research is conducted with respect to the genetic epidemiology of human disease. Study designs and analytic approaches have evolved rapidly from investigations involving relatively few targeted candidate genes to hypothesis-free genome-wide association studies, where thousands – and now even millions – of single molecular mutations are simultaneously analyzed to identify regions of the genome that may influence disease. As laboratory techniques continue to improve and costs decrease, the volume of genetic data will inexorably rise, and robust tools for data management, statistical analysis, and computation will likewise need to keep pace.
Multiple hypothesis testing is the core problem in analyzing data from a genome-wide association study (GWAS). A conventional GWAS, focused on genetic risk factors leading to disease incidence, samples some number of diseased and non-diseased subjects, genotypes these subjects for a common set of genetic markers, and then carries out an individual hypothesis test of the association between each marker and disease status. Correction for multiple testing in GWAS typically relies upon the Bonferroni multiple testing procedure. With ever-growing panels of markers (the standard panel currently employs one million markers), this approach engenders numerous problems. First, it is overly conservative, both because of the sheer number of tests and because the Bonferroni procedure implicitly treats all tests as mutually independent. The growing density of marker panels results in marker loci that are more physically proximate, yielding hypothesis tests that have some dependence structure. Second, the commonly used corrected significance level, on the order of 10⁻⁸, defines an extreme critical region in which the relative error of asymptotic approximations is large. Third, while approximations can be avoided by using a permutation distribution, such an approach is computationally challenging and has not been widely implemented or used. This is particularly critical in the context of alternative multiple testing correction procedures that address the dependence problem, for which permutation distributions are hypothetically available but in practice are seldom, if ever, used. Fourth, the distribution of test statistics across the various multiple testing approaches depends on additional features of the data, most prominently on the minor allele frequency (MAF), or the proportion of genetic loci for a given marker within the sampling population that carry the least frequent marker variant.
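As a rough illustration of the two kinds of correction contrasted above, the following Python sketch computes a Bonferroni per-test threshold and single-step max-statistic permutation p-values for a toy marker panel. The function names, the mean-allele-count statistic, and the toy data are assumptions made for illustration only; they are not taken from the dissertation, which may rely on different test statistics and procedures.

```python
import random

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests

def allele_count_diff(genotypes, status):
    """Absolute difference in mean minor-allele count (0, 1 or 2 per subject)
    between cases (status 1) and controls (status 0). Illustrative statistic only."""
    cases = [g for g, s in zip(genotypes, status) if s == 1]
    controls = [g for g, s in zip(genotypes, status) if s == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def max_stat_permutation_pvalues(markers, status, n_perm=1000, seed=0):
    """Adjusted p-values from the permutation distribution of the maximum
    statistic over all markers. Shuffling the disease labels keeps each
    subject's genotypes together, so dependence between markers is preserved."""
    rng = random.Random(seed)
    observed = [allele_count_diff(m, status) for m in markers]
    exceed = [0] * len(markers)
    labels = list(status)
    for _ in range(n_perm):
        rng.shuffle(labels)
        perm_max = max(allele_count_diff(m, labels) for m in markers)
        for j, obs in enumerate(observed):
            if perm_max >= obs:
                exceed[j] += 1
    # Count the observed labelling as one permutation so no p-value is zero.
    return [(e + 1) / (n_perm + 1) for e in exceed]

# Toy data (hypothetical): 10 cases followed by 10 controls, two markers.
status = [1] * 10 + [0] * 10
markers = [
    [2] * 10 + [0] * 10,   # allele counts differ sharply between groups
    [1, 0] * 10,           # identical mean allele count in both groups
]
print(bonferroni_threshold(0.05, 1_000_000))
print(max_stat_permutation_pvalues(markers, status, n_perm=500))
```

With a family-wise level of 0.05 and one million markers, the Bonferroni threshold is about 5 × 10⁻⁸, which is the order of magnitude cited above. The permutation version needs no large-sample approximation, but its cost grows with the number of permutations times the number of markers, which is the computational burden the abstract refers to.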
This research project has led to the development and implementation of a parallel processing algorithm that allows exceptionally rapid computation of the permutation distribution for multiple testing procedures that correct for dependence between tests. This eliminates the need for large-sample approximations, which have been found in prior studies to have poor operating characteristics under some common circumstances. The parallel processing approach relies upon existing hardware and software commonly available in desktop personal computers, making efficient and cost-effective computational tools available to the research community. In addition, we have leveraged these efficient permutation tools to implement MAF-corrected exact tests, which eliminate the bias in multiple testing procedures that arises in particular when the MAF is small. We have further extended these tools to other analytic problems in large-scale genetic association settings, such as tests for gene-environment interactions.
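The abstract does not describe the internals of the parallel algorithm, so the following Python sketch is only a generic illustration of why permutation distributions parallelize well: the permutations are split into batches that share no state and whose results are simply pooled. Everything here is an assumption made for illustration, including the function names, the statistic, the seeding scheme, and the use of a thread pool (chosen only to keep the sketch portable; threads give little speed-up for pure-Python arithmetic, and a process pool or other hardware would be needed for real gains).

```python
import random
from concurrent.futures import ThreadPoolExecutor

def mean_diff(genotypes, labels):
    """Absolute difference in mean allele count between cases (1) and controls (0)."""
    cases = [g for g, s in zip(genotypes, labels) if s == 1]
    controls = [g for g, s in zip(genotypes, labels) if s == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def permutation_batch(markers, status, n_perm, seed):
    """Maximum statistic over all markers for each of n_perm label shuffles."""
    rng = random.Random(seed)  # one generator per batch: reproducible, distinct streams
    labels = list(status)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        maxima.append(max(mean_diff(m, labels) for m in markers))
    return maxima

def parallel_null_maxima(markers, status, n_perm=1000, n_workers=4,
                         base_seed=0, executor_cls=ThreadPoolExecutor):
    """Split n_perm permutations into n_workers batches and pool their maxima."""
    sizes = [n_perm // n_workers + (1 if i < n_perm % n_workers else 0)
             for i in range(n_workers)]
    with executor_cls(max_workers=n_workers) as pool:
        futures = [pool.submit(permutation_batch, markers, status, size, base_seed + i)
                   for i, size in enumerate(sizes)]
        # Collect in submission order so the result is deterministic.
        return [value for f in futures for value in f.result()]

# Toy data (hypothetical): 10 cases followed by 10 controls, two markers.
status = [1] * 10 + [0] * 10
markers = [[2] * 10 + [0] * 10, [1, 0] * 10]
null_maxima = parallel_null_maxima(markers, status, n_perm=400)
observed = mean_diff(markers[0], status)
adjusted_p = (sum(m >= observed for m in null_maxima) + 1) / (len(null_maxima) + 1)
print(adjusted_p)
```

Because each batch depends only on its own seed and on read-only inputs, the same decomposition carries over to other kinds of workers without changing the pooled result's interpretation.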
Recommended Citation
Welbourn, William L. Jr., "Robust Computational Tools for Multiple Testing with Genetic Association Studies" (2012). All Graduate Theses and Dissertations, Spring 1920 to Summer 2023. 1172.
https://digitalcommons.usu.edu/etd/1172
Copyright for this work is retained by the student.
Comments
This work was made publicly available electronically on April 12, 2012.