Interpreting SNP-in-gene associations from GWAS studies

For most SNPs identified in GWAS studies, is the underlying assumption that if a SNP is indeed associated with a phenotype (and let's assume it's associated because it affects protein function), you don't need two knocked-out copies of that gene for it to confer susceptibility risk to the individual?

In other words, is my assumption that GWAS is likely uncovering alleles that confer risk in single copy, correct?

In short, yes.

If a GWAS links a SNP to a particular phenotype then yes, it is an effect of a single copy. Bear in mind, however, that a SNP is not necessarily a knockout or even a knockdown. It can be, but that is not always the case. SNPs can change the protein sequence or the regulation of that protein's production. Both types of variation can result in a phenotypic change.

In any case, SNP information is always about a single allele. I imagine that you can find cases where a cell is homozygous for a particular SNP, but that is not necessarily, or even often as far as I know, the case.

A population genetic interpretation of GWAS findings for human quantitative traits

Human genome-wide association studies (GWASs) are revealing the genetic architecture of anthropometric and biomedical traits, i.e., the frequencies and effect sizes of variants that contribute to heritable variation in a trait. To interpret these findings, we need to understand how genetic architecture is shaped by basic population genetics processes, notably mutation, natural selection, and genetic drift. Because many quantitative traits are subject to stabilizing selection and because genetic variation that affects one trait often affects many others, we model the genetic architecture of a focal trait that arises under stabilizing selection in a multidimensional trait space. We solve the model for the phenotypic distribution and allelic dynamics at steady state and derive robust, closed-form solutions for summary statistics of the genetic architecture. Our results provide a simple interpretation for missing heritability and why it varies among traits. They predict that the distribution of variances contributed by loci identified in GWASs is well approximated by a simple functional form that depends on a single parameter: the expected contribution to genetic variance of a strongly selected site affecting the trait. We test this prediction against the results of GWASs for height and body mass index (BMI) and find that it fits the data well, allowing us to make inferences about the degree of pleiotropy and mutational target size for these traits. Our findings help to explain why the GWAS for height explains more of the heritable variance than the similarly sized GWAS for BMI and to predict the increase in explained heritability with study sample size. Considering the demographic history of European populations, in which these GWASs were performed, we further find that most of the associations they identified likely involve mutations that arose shortly before or during the Out-of-Africa bottleneck at sites with selection coefficients around s = 10⁻³.

Promises and challenges in human genetics of psychiatric disorders

Psychiatric disorders are highly polygenic and show a continuous range of variation influenced by both environmental and genetic factors [1]. A major goal of psychiatric genetic research is to better understand the molecular mechanisms through which genetic variants act to influence liability to these traits. The identification of novel genetic variants provides a foothold into the complex genetic architecture that undergirds psychiatric traits. Model organisms provide an avenue into understanding the biological mechanisms that are impacted by genetic variation. In this review, we outline big data approaches that efficiently weave the vast amounts of convergent genomic data from other species into human genetic findings to elevate the likelihood of uncovering biologically meaningful pathways for further experimental follow-up and therapeutic discovery.

The utility of genome-wide association studies (GWAS) in psychiatry

GWAS of psychiatric traits have generated an outpouring of recent discoveries in risk variant identification and polygenic prediction. From highly heritable traits, such as schizophrenia (for which […] common loci have been reported with N = […],064 [2]) to common but less heritable conditions such as problematic alcohol use (for which 29 independent loci have been reported with N = […],563 [3]) and major depression (for which 102 common loci were detected with N = […],553 [4]), as well as for liability across psychiatric disorders (109 loci with N = […],126 [5]) progress abounds. In addition, for substance use, a recent large GWAS of tobacco smoking (N for smoking initiation = […],232,091) and typical drinking (N for drinks/week = […],280) has identified over 400 loci [6]. The increased power accumulated across studies of major psychiatric disorders, arising from collaborative research, has revealed clues into novel mechanisms of susceptibility to mental illnesses and substance use disorders. These large-scale GWAS have also revealed patterns of genetic variation associated with multiple disorders as well as disorder-specific loci, e.g., CADM2 has been linked to multiple substances and common addiction mechanisms (e.g., risk-taking cognition), while the alcohol dehydrogenase genes remain alcohol-specific (e.g., [7, 8]).

Challenges and opportunities within GWAS for psychiatric genetic studies

The recent gains in psychiatric genetic studies outlined above amplify the need to address several enduring challenges within GWAS. First, at a variant level, the bulk of GWAS “hits” fall in noncoding regions of the genome. A major advantage of GWAS as a means of discovering the biological basis of psychiatric disorders is that the lack of a priori, gene-centric hypotheses enables discovery of trait regulatory variants in enhancer and promoter regions, lncRNAs, microRNAs, and any other molecular entity that is part of the gene-regulatory mechanism. However, in contrast to variants within coding genes, it is far more difficult to link statistically significant genetic associations to the gene products and biological mechanisms through which they act [9]. Interpretations of significant GWAS findings are complicated by patterns of related inheritance (e.g., linkage disequilibrium), such that the most strongly associated genetic variant in a locus may not be “causal” but could “tag” a true causal variant. This, coupled with long distance genomic regulation, poses challenges for unveiling specific genes and variants underlying human traits via GWAS [10]. In this review, we highlight how regulatory genetic variants can be integrated coherently with coding genes within and across species using unifying data structures.

A second challenge with GWAS is that power analyses reveal that the massive polygenicity underlying psychiatrically relevant traits and illnesses requires larger sample sizes for additional discoveries from GWAS data alone [11]. Likewise, the predictive power of a polygenic risk score (PRS), an index of aggregated genetic susceptibility to a disorder, for psychiatric disorders is also directly linked to the current statistical power of discovery GWAS [12]. However, the identification of additional trait-associated variants continues to substantially augment SNP-heritability estimates, especially in the case of rare variants, suggesting that there is more signal to be found in GWAS and sequencing studies [13], provided that higher sample sizes continue to be attained. In this review, we highlight approaches that exploit complementary data resources from model organisms that, when placed in an integrative framework with GWAS data, are showing some promise in prioritizing variants that are detected.

Third, consistent with indications from early family and twin studies, there is evidence for pleiotropy among psychiatric traits to a degree suggestive of an underlying dimension of genetic liability that parallels the general factor model of psychopathology [5, 14]. Thus, it is important to consider variants in context of both the underlying neurobiological mechanisms in which they function, and the multiple traits that are influenced by that variation to find the specific, as well as the overlapping biological mechanisms underlying behavioral traits.

A landmark contribution to our current ability to annotate GWAS signals arises from FUMA [15], a platform for functional and regulatory annotation of variants. Summary statistics from a GWAS can easily be aligned with tissue and cell-type-specific expression data and to a variety of regulatory and chromatin signatures with no computational burden on the user, making FUMA widely accessible. As an alternative to gene-based mapping techniques, software tools can also map variants to the noncoding transcriptome (e.g., LincSNP 3.0 [16]). Beyond variant mapping, multiple sources of omics data can be harnessed in a multivariate framework to implicate “causal” gene sets for a disease state (e.g., SMR [17], iRIGs [18], PAINTOR [19], FOCUS [20]). Efforts are also underway, with varying degrees of success, to demonstrate to what extent similar regulatory enrichment of PRSs could enhance prediction (e.g., AnnoPred [21], LDpred-funct [22]). However, most of these approaches have been limited to human genetics and genomics data. In this review, we highlight approaches that bring together the breadth and depth of well-controlled model organism studies that place genetic and genomic findings in biobehavioral context that can expand on this or other interpretive tool sets.


Functional effect scores

We analyzed a cohort derived from the UKBB. Of ~18K analyzed protein-coding genes, 17,843 were affected by at least one non-synonymous variant reported in the UKBB. On average, each of these genes was affected by 35.9 such variants (Fig. 2a).

Predicted genetic functional effect scores in the UKBB cohort. a The distribution of the number of non-synonymous variants per gene that affect its coding sequence (CDS), according to the (imputed) genetic data of the UKBB. Presented in a log scale. b The distribution of the ~640K variant effect scores. Each score is a number between 0 (complete loss of function) and 1 (no damage to the protein product). c, d Aggregated gene scores according to the dominant (c) and recessive (d) inheritance models. Top panels: the mean (solid line) and standard deviation (shaded area) of the effect scores of the 18,053 analyzed protein-coding genes across the entire UKBB cohort (sorted by the mean score). Bottom panels: z values of the gene effect scores across 10 randomly selected samples (of the entire ~500K samples in the UKBB). Each of the 10 samples is shown in a distinct color.

The derivation of the gene effect score matrices consists of two steps. First, FIRM is used to predict an effect score for each protein-affecting variant (Fig. 2b). Intuitively, these predicted effect scores can be interpreted as the probability that the variant-affected protein retains its function. The variant scores are then integrated with the cohort genotypes and aggregated to derive per-sample dominant and recessive effect scores at the gene level (Fig. 2c, d). As expected, dominant genetic effects (capturing single hits) are more prevalent than recessive effects (of double hits). The derived gene scores capture genetic variability in the UKBB population that is observable even within a small number of samples. The objective of PWAS is to test whether this functional genetic variability correlates with phenotypes.
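The distinction between single hits (dominant) and double hits (recessive) can be illustrated with a toy aggregation. This is only a sketch and not the published PWAS aggregation formula, which is not given in this text. It assumes that each carried alternate allele independently damages the protein with probability 1 minus its variant effect score, and that any two damaging alleles count as a double hit (phasing is ignored). The function names are hypothetical.

```python
def hit_count_distribution(scores, genotypes):
    """Return probs, where probs[k] = P(exactly k damaging alleles).

    scores    -- per-variant effect scores in [0, 1] (probability of retained function)
    genotypes -- per-variant alternate-allele counts (0, 1, or 2)
    Assumption (toy model): every carried alternate allele is an independent
    Bernoulli "hit" with probability 1 - score.
    """
    probs = [1.0]
    for score, count in zip(scores, genotypes):
        damage = 1.0 - score
        for _ in range(int(count)):
            updated = [0.0] * (len(probs) + 1)
            for k, p in enumerate(probs):
                updated[k] += p * (1.0 - damage)   # this allele is harmless
                updated[k + 1] += p * damage       # this allele is a hit
            probs = updated
    return probs


def toy_gene_scores(scores, genotypes):
    """Toy dominant and recessive gene scores (1 = intact, 0 = lost)."""
    probs = hit_count_distribution(scores, genotypes)
    dominant = probs[0]                                            # no hit at all
    recessive = probs[0] + (probs[1] if len(probs) > 1 else 0.0)   # fewer than two hits
    return dominant, recessive


# One heterozygous damaging variant lowers the dominant score only.
print(toy_gene_scores([0.2], [1]))
# The same variant in homozygous form lowers both scores.
print(toy_gene_scores([0.2], [2]))
```

Under this toy model the recessive score can never be lower than the dominant score for the same sample, which matches the observation that single-hit effects are more prevalent than double-hit effects.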

Simulation analysis

To examine the discovery potential of PWAS compared to GWAS and SKAT, we conducted a simulation analysis (Fig. 3). The simulation was carried out on real genetic data (from the UKBB cohort), with phenotypes simulated by mixing genetic signal and noise. To test the sensitivity of PWAS to the inevitable inaccuracies of FIRM, we examined the effect of a noise parameter (ϵ) influencing its predictions. Specifically, we distorted the variant effect scores predicted by FIRM (in the range between 0 and 1) with additive Gaussian noise of standard deviation ϵ. It appears that under the modeling assumptions of the simulation, PWAS is not very sensitive to limited inaccuracies of the underlying machine learning predictor.

Simulation analysis. Results of a simulation analysis comparing GWAS, SKAT, and PWAS. The statistical power of each method is shown as a function of cohort size (1000, 10,000, 50,000, 100,000, or all 332,709 filtered UKBB samples, shown in a log scale). Estimated values are shown as solid lines, with flanking 95% confidence intervals as semi-transparent area bands. Each iteration of the simulation considered a single protein-coding gene affecting a simulated continuous phenotype of the form y = βx + σ, where x is the effect of the gene on the phenotype (normalized to have mean 0 and standard deviation 1 across the UKBB population), β ∈ {0.01, 0.05} is the gene’s effect size, and σ ~ N(0, 1) is random Gaussian noise. The gene effect x was simulated according to the PWAS model, with either a dominant, recessive, or additive inheritance. A noise parameter ϵ ∈ {0, 0.25} was introduced to FIRM, the underlying machine learning model that estimates the damage of variants. Gene architectures, genotyping data, and the 173 included covariates were taken from the UKBB cohort.

Based on the simulation results, we expect the advantage of PWAS to be the most substantial when dealing with recessive inheritance. We find that with small effect size (β = 0.01), at least 100K samples are required to obtain sufficient statistical power (given 173 covariates). When the effect size is higher (β = 0.05), cohorts of 10K samples could be sufficient.
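The dependence of power on cohort size and effect size can be reproduced in spirit with a much simpler simulation of the phenotype model y = βx + σ. The sketch below is not the authors' pipeline: it assumes a standard-normal stand-in for the gene effect x (rather than real UKBB gene scores), ignores the 173 covariates, tests the association with a simple correlation test using a normal approximation, and uses 5E−07 as an assumed significance threshold. The function name `simulate_power` is hypothetical.

```python
import math

import numpy as np


def simulate_power(n_samples, beta, alpha=5e-7, n_iter=200, seed=0):
    """Estimate the power to detect y = beta * x + noise at level alpha."""
    rng = np.random.default_rng(seed)
    detected = 0
    for _ in range(n_iter):
        # Assumed stand-in for a gene effect score, normalized to mean 0, sd 1.
        x = rng.standard_normal(n_samples)
        x = (x - x.mean()) / x.std()
        y = beta * x + rng.standard_normal(n_samples)
        r = np.corrcoef(x, y)[0, 1]
        # t statistic of the correlation; for large n it is close to a z score.
        z = r * math.sqrt((n_samples - 2) / (1.0 - r * r))
        p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
        detected += p_value < alpha
    return detected / n_iter


for n in (1_000, 10_000, 50_000):
    print(n, simulate_power(n, beta=0.05))
```

Because the expected test statistic grows roughly as β times the square root of the cohort size, a fivefold smaller effect size needs about 25 times more samples for comparable power in this simplified setting.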

It is important to state that phenotypes were simulated from the genetic data by a modeling scheme compatible with the assumptions of PWAS. Therefore, these results should not be seen as evidence for the dominance of PWAS over GWAS or SKAT in the real world. Rather, these simulations simply examine the method’s range of applicability and assess the amount of data required for sufficient statistical power under the settings for which it was designed. In addition to this protein-centric modeling scheme, we also examined phenotypes simulated under a standard linear model, as well as binary phenotypes (Additional file 1: Fig. S1).

Case study: colorectal cancer

To examine PWAS on real phenotypes, we begin with a case study of colorectal cancer. A cohort of 260,127 controls and 2822 cases was derived from the UKBB to detect predisposition genes leading to increased risk of colorectal cancer through germline variants.

To exemplify how PWAS works, we begin with a demonstration of the analysis over a specific gene—MUTYH (Fig. 4a), a well-known predisposition gene for colorectal cancer [23]. In the studied cohort, there are 47 non-synonymous variants affecting the gene’s protein sequence. When considered by standard per-variant GWAS, the most significant of these variants yields a p value of 1.2E−03. Even if the entire flanking region of the gene is considered (up to 500,000 bp from each side of its open reading frame), the strongest significance obtained is still only p = 6.3E−04, far from the exome-wide significance threshold (5E−07). When analyzed by PWAS, on the other hand, this association exhibits overwhelming significance (FDR q value = 2.3E−06), far beyond the commonly used FDR significance threshold (q < 0.05).

Colorectal cancer case study. a Demonstration of a specific gene-phenotype association: MUTYH and colorectal cancer. Variants that affect the protein sequence are shown on top of the gene’s exons. As expected, variants within domains tend to be more damaging. While none of the variants that affect the protein is close to the exome-wide significance threshold (p < 5E−07), the association is very significant by PWAS (FDR q value = 2.3E−6). The full summary statistics of the 47 variants are presented in Additional file 2: Table S1. b PWAS QQ plot of all 18,053 genes tested for association with colorectal cancer

PWAS was able to uncover the association by aggregating signal spread across a large number of different variants, with 5 of the 47 protein-affecting variants showing mild associations (p < 0.05). As these 5 variants show consistent directionality (all risk increasing), and as most of them are predicted to be likely damaging, they were effectively aggregated into gene scores that significantly differ between cases and controls. Specifically, the MUTYH gene is significantly more damaged in cases than in controls according to the PWAS framework. The association is only significant according to the recessive model, with an estimated effect size of d = − 0.079 (standardized mean difference in the gene effect scores between cases and controls). This observation is consistent with previous reports about MUTYH, claiming a recessive inheritance mode [23].
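The standardized mean difference quoted above can be illustrated with a short calculation. This is a minimal sketch assuming the conventional pooled-standard-deviation form of Cohen's d and a cases-minus-controls sign convention, so that a negative value means lower (more damaged) gene scores in cases. The exact estimator used by PWAS is not specified in this text, and the example scores are made up.

```python
import math

import numpy as np


def cohens_d(case_scores, control_scores):
    """Standardized mean difference (cases minus controls) with pooled SD."""
    cases = np.asarray(case_scores, dtype=float)
    controls = np.asarray(control_scores, dtype=float)
    n1, n2 = len(cases), len(controls)
    pooled_var = ((n1 - 1) * cases.var(ddof=1)
                  + (n2 - 1) * controls.var(ddof=1)) / (n1 + n2 - 2)
    return (cases.mean() - controls.mean()) / math.sqrt(pooled_var)


# Made-up gene effect scores: cases are shifted towards lower (more damaged) values.
print(cohens_d([0.8, 0.9, 1.0], [0.9, 1.0, 1.1]))
```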

To recover all protein-coding genes associated with colorectal cancer according to PWAS, we analyzed 18,053 genes (Fig. 4b), discovering 6 significant associations (Table 1). Of these 6 associations, 5 are supported by some literature evidence, 3 of them with a level of evidence we consider strong. In 4 of the 5 supported associations, the directionality of the association reported in the literature (i.e., protective or risk gene) agrees with the effect size (Cohen’s d) detected by PWAS (only in POU5F1B is it reversed). Of the 6 genes, only POU5F1B is affected by a variant exceeding the exome-wide significance threshold (rs6998061, p = 1.4E−07). The 5 other genes are not discovered by GWAS, even when considering all the variants in the gene’s region (up to 500,000 bp away from the gene). Notably, while GWAS determines significance by the Bonferroni-corrected significance level (p < 5E−07 for coding regions), PWAS determines significance by FDR (q < 0.05), like other gene-based methods.
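The difference between a fixed Bonferroni-style threshold and an FDR threshold can be made concrete with a small example. The text does not state which FDR procedure PWAS uses; the sketch below assumes the common Benjamini-Hochberg procedure, and the function name `bh_q_values` and the p values are illustrative only.

```python
import numpy as np


def bh_q_values(p_values):
    """Benjamini-Hochberg q values for a 1-D array of p values."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # p_(i) * m / i for the i-th smallest p value.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: each q is the minimum over all larger ranks.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q_sorted = np.minimum(q_sorted, 1.0)
    q = np.empty(m)
    q[order] = q_sorted
    return q


p = [0.01, 0.04, 0.03, 0.005]
q = bh_q_values(p)
print(q)                            # FDR-adjusted values
print(np.asarray(p) < 0.05 / 4)     # Bonferroni at a family-wise level of 0.05
print(q < 0.05)                     # FDR at q < 0.05
```

In this made-up example only two of the four tests pass the Bonferroni cutoff, while all four pass the FDR cutoff, which is why the two significance criteria are not directly comparable.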

Applicability of PWAS across 49 different phenotypes

Having examined PWAS in a case study of a specific phenotype, we turn to consider its applicability for a diverse set of 49 prominent phenotypes (Fig. 5a). We applied both standard GWAS and PWAS across the 49 phenotypes on the same UKBB cohort (~330K samples), obtaining a rich collection of associations (Fig. 5b, c). Altogether, PWAS discovered 12,444 gene-phenotype associations, only 5294 of which (43%) contain a GWAS-significant non-synonymous variant in the gene’s coding region (Fig. 5b). In other words, although PWAS considers the exact same set of variants, in 57% of the associations, it is able to recover an aggregated signal that is overlooked by GWAS when considering each of the variants individually. Even when considering all the variants in the proximity of the gene to account for LD (up to 500,000 bp to each side of the coding region), 2743 of the 12,444 PWAS associations (22%) are still missed by GWAS (Fig. 5c, d).

PWAS enriches GWAS discoveries across phenotypes. a We analyzed 23 binary phenotypes, 25 continuous phenotypes, and 1 categorical phenotype (male-balding patterns) derived from ~330K UK Biobank samples. Within binary phenotypes, the number of cases spans orders of magnitude (from only 127 in systemic sclerosis to 62K in hypertension). b, c Partition of the significant protein-coding genes, across the different phenotypes, that were detected by GWAS, PWAS, or both. The total number of significant genes is shown in brackets. In b, a gene was considered significant by GWAS if a non-synonymous variant within the coding region of the gene passed the exome-wide significance threshold (p < 5E−07). In c, a relaxed criterion was taken, considering all variants within 500,000 bp to each side of the coding region of the gene (here showing only the PWAS significant genes). d The number of significant genes per phenotype found by PWAS alone, according to the relaxed criterion of GWAS, as defined in c (i.e., without any significant variant within 500,000 bp).

A full summary of all 49 tested phenotypes, with complete per-gene summary statistics, is available in Additional file 3: Table S2 (for all the significant PWAS associations) and Additional file 4: Table S3 (with all 18,053 tested protein-coding genes). QQ plots of all 49 phenotypes are available in Additional file 1: Fig. S2.

To confirm the importance of the predicted functional effect scores assigned to variants, we tested the performance of a version of PWAS where the effect scores of non-synonymous variants were shuffled prior to their aggregation into gene scores. Indeed, we find that the original version of PWAS (capturing gene function) outperforms the shuffled version (Additional file 1: Fig. S3).

Comparison with SKAT

Having established the discovery power of PWAS beyond standard GWAS, we also compare it to SKAT [18], the most commonly used method for detecting genetic associations at the gene level. Importantly, whereas SKAT attempts to recover all existing genetic associations, PWAS focuses specifically on protein-coding genes that are associated with a phenotype through protein function.

We find that PWAS is superior to SKAT in the number of discovered associations for most phenotypes (Fig. 6a). We also examined the extent of overlap between the results reported by each of the two methods (see the “consensus” bars in Fig. 6a). It appears that PWAS and SKAT tend to recover distinct sets of genes, so the two methods can be considered as largely complementary.

PWAS and SKAT provide complementary results. a Number of significant genes detected by PWAS, SKAT, and the consensus of both, across the 49 tested phenotypes (over the same cohorts derived from the UKBB). Phenotypes are sorted by the highest of the three numbers. b An evidence score of gene-phenotype associations (derived from Open Targets Platform) is shown across phenotypes by its average over the significant genes detected by PWAS, SKAT, or the consensus of both. The numbers of significant genes (over which the averaging is performed) are shown over the bars. c Comparison of the FDR q values obtained by PWAS and SKAT over 4944 gene-phenotype associations with strong support by Open Targets Platform. d A similar comparison over 202 associations reported by OMIM to have a known molecular basis. The right plot (marked by red frames) is a zoom-in of the left

To assess the quality of discoveries, we appeal to Open Targets Platform (OTP) [32], an exhaustive resource curating established gene-disease associations based on multiple layers of evidence, and OMIM [33], the most prominent catalog of human genes implicated in genetic disorders. We compared the quality of associations discovered by the two methods, according to OTP-derived evidence scores, across the 24 tested diseases that are recorded in OTP (Fig. 6b). According to this metric, the results of PWAS and SKAT appear to be largely comparable, with consensus genes showing stronger evidence.

We further investigate how the two methods (PWAS and SKAT) recover externally validated associations provided by OTP (Fig. 6c) and OMIM (Fig. 6d). Of 4944 associations with strong support by OTP, 9 were recovered by SKAT compared to 6 recovered by PWAS. In the case of OMIM, which provides an even more restricted list of 202 high-quality gene-disease associations with known molecular basis, PWAS was somewhat superior (12 compared to 7 recovered associations, with the 7 being a subset of the 12). We observe no obvious trend between the types of phenotypes (e.g., cancer or other diseases) and the significance of associations obtained by the two methods (see colors in Fig. 6c, d).

Based on this comparative analysis, we conclude that PWAS and SKAT are complementary and that it may be advantageous to use both in association studies. We stress that the two methods are very distinct in the type of associations they seek and how they model them.

Highly significant associations not dominated by single variants

Among all the discovered associations, we seek to highlight those that are particularly characteristic of our new method, namely results that are uniquely discovered by PWAS and show strong evidence of being causal. To this end, we filtered associations by highly strict criteria: (i) strong significance (FDR q value < 0.01), (ii) no other significant genes in the region, and (iii) no single dominating variant association. Of the 2743 gene-phenotype associations uniquely found by PWAS (Fig. 5d), 48 meet these criteria and are referred to as “PWAS-exclusive” associations (Table 2; the full list is provided in Additional file 5: Table S4).

As expected, the PWAS-exclusive genes show no GWAS signal at all, and the PWAS associations are constrained to the associated genes (Fig. 7a). When considered by SKAT, none of the 48 associations comes up as significant (Fig. 7b), even though SKAT was not included in the criteria for defining those associations. Interestingly, most of the PWAS-exclusive associations are driven by recessive inheritance. Among the ten genes listed in Table 2, only one (SLC39A8) shows a dominant inheritance pattern. This suggests that the modeling of recessive inheritance is a unique advantage of PWAS over GWAS.

PWAS-exclusive associations. a Exemplifying the 48 PWAS-exclusive associations with the 3 genes associated with the intraocular pressure phenotype. The 3 genes demonstrate a complete lack of any GWAS pattern in the proximity of the genes (up to 500,000 bp to both directions of each gene). Each of the 3 depicted gene regions was divided into 200 bins, displaying the most significant variant in each bin. Also shown are the PWAS FDR q values of all analyzed protein-coding genes in those chromosomal regions. b Comparison of the FDR q values obtained by PWAS and SKAT for the 48 associations

Some of the listed associations are strongly supported by the literature. For example, interleukin 6 (IL6), here implicated with high light scatter (HLS) reticulocyte percentage of red blood cells with overwhelming significance (PWAS FDR q value = 1.8E−126), is known for its capacity to impair hemoglobin production and erythroid maturation. A connection of IL6 to erythroid maturation, anemia, and inflammation through impairment of mitochondrial function was also established [34]. Moreover, IL6 plays a role in the development of anemia of chronic kidney disease in children (CKD anemia). This IL6-dependent pathology is induced by the destruction of red blood cells through its effects on the erythropoietin (Epo) axis, confirming a direct link of IL6 to the percentage of red blood cells [35].

Similarly, MLLT3, which appears to be associated with red blood cell distribution width through recessive inheritance according to PWAS (FDR q value = 8.5E−06, r = − 0.01), was indeed reported to be a key regulatory gene in the bone marrow [36]. Among the 49 phenotypes tested in this work, we found the gene to also be significant in numerous other blood cell traits, as well as hand grip strength (Additional file 3: Table S2). Likewise, CD80, which PWAS associates with eosinophil counts through recessive inheritance (FDR q value = 1.1E−06, r = − 0.01), indeed has an important role in antigen presentation by eosinophils [37]. FOXP1 is another gene associated with eosinophil counts through recessive inheritance according to PWAS (FDR q value = 9.8E−17, r = − 0.016). While no direct evidence for this association is reported, FOXP1 is known to affect monocyte differentiation and macrophage function [38].

In other examples, while there is no clear indication for the reported association, there does exist strong molecular plausibility. Another transcription factor belonging to the forkhead family is FOXG1, which plays a key role in the development of the retina (a function conserved in all vertebrates) [39]. The gene was shown to be associated with visual impairment in both mouse and human cohorts [40]. However, it has never been directly linked to intraocular pressure, an association that we observe here with outstanding significance according to the recessive model of PWAS (FDR q value = 2.6E−15). Specifically, the normal function of the gene (i.e., lack of damaging variants) appears to be positively correlated (r = 0.031) with intraocular pressure.

Another example is INPP1, which encodes the enzyme inositol polyphosphate-1-phosphatase. In the existing literature, it is mainly reported in the context of autism and mood disorders [41], while reported genetic associations in the Open Targets Platform [32] focus mainly on autoimmune disorders and blood characteristics. Nonetheless, it has not been reported to be linked to lymphocyte counts, an association we observe here (recessive FDR q value = 1.9E−12, r = −0.014). In general, genetic study of blood phenotypes appears to be somewhat neglected, and it is often uncertain how such associations relate to clinical outcomes.

In some instances, we find little to no literature evidence for reported PWAS-exclusive associations. For example, GAPT and CLVS2 are found to be associated with intraocular pressure. GAPT (growth factor receptor-bound protein 2-binding adapter protein, transmembrane) plays a role in regulating B cell activation and proper maintenance of the marginal zone [42]. CLVS2 (clathrin vesicle-associated Sec14 protein 2) is involved in cell membrane trafficking [43]. In both cases, a link to intraocular pressure is not yet reported. Another significant PWAS association lacking literature support is FAM160B1 with respect to leukemia. Despite the lack of existing literature support for those connections, the strong associations established by PWAS provide strong evidence for potential links which deserve further exploration.


Binary Effects Assumption

In our framework, we use a simplified model to describe heterogeneity among the studies, which makes two assumptions. The first assumption is that the effect is either present or absent in each study. This differs from the traditional assumption of normally distributed effect sizes [27]–[29]. Our assumption is inspired by the observation that effect sizes are sometimes much smaller in some studies than in others. Population differences have been reported to cause this phenomenon [19], [20], [30], [31]. For example, homozygosity for the APOE ε4 variant is known to confer fivefold smaller risk of Alzheimer disease in African Americans than in Asians [19], [30]. The HapK haplotype spanning the LTA4H gene has been shown to confer threefold smaller risk of myocardial infarction in populations of European descent than in African Americans [31]. The HNF4A P2 promoter variants have been shown to be associated with type 2 diabetes in Ashkenazi populations, and the results have been replicated [20]. However, in the same study, the same variants did not show associations in four different cohorts from the UK population, suggesting a heterogeneous effect. Gene-environment interactions can also cause this phenomenon. If a study lacks an environmental factor necessary for the interaction, the observed effect size can be much smaller in that study. It is generally agreed that gene-environment interactions exist in many diseases, such as cardiovascular diseases [32], respiratory diseases [33], and mental disorders [34].

The second assumption is that if the effect exists, the effect sizes are similar between studies. We call these two assumptions together the binary effects assumption. While other types of heterogeneity structures are possible such as arbitrary effect sizes, for identifying which studies have an effect and which studies do not have an effect, we expect that this model will be appropriate.


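We propose a statistic called the m-value, which is the posterior probability that the effect exists in each study of a meta-analysis. Suppose that we analyze $n$ studies together in a meta-analysis. Let $E_i$ ($i = 1, \ldots, n$) be the observed effect size of study $i$ and let $V_i$ be the estimated variance of $E_i$. It is a common practice to consider $V_i$ the true variance. In current GWASs, the distribution of $E_i$ is well approximated by a normal distribution because of the large sample sizes. Let $E = [E_1, \ldots, E_n]$ denote the observed data.

If there is no effect in study $i$,

$$P(E_i \mid \text{no effect}) = N(E_i; 0, V_i),$$

where $N(x; a, b)$ is the probability density function of a normal distribution whose mean is $a$ and whose variance is $b$. If there is an effect in study $i$,

$$P(E_i \mid \text{effect}) = N(E_i; \mu, V_i),$$

where $\mu$ is the unknown true effect size.

Since we want a posterior probability, the Bayesian framework is a good fit. We assume that the prior for the effect size is $\mu \sim N(0, \sigma^2)$. A possible choice for $\sigma$ in GWASs is 0.2 for small effect and 0.4 for large effect [35], [36].

Let $T_i$ be a random variable that has value 1 if study $i$ has an effect and value 0 if study $i$ does not have an effect. Let $\pi$ be the prior probability that each study will have an effect, such that $P(T_i = 1) = \pi$. Then we assume a beta prior on $\pi$, $\pi \sim \mathrm{Beta}(\alpha, \beta)$. Throughout this paper, we use the uniform distribution prior ($\alpha = 1$ and $\beta = 1$), but other priors can also be chosen.

Let $T = [T_1, \ldots, T_n]$ be the vector indicating the existence of effect in all studies. $T$ can have $2^n$ different values. Let $U$ be the set of those values.

Our goal is to estimate the m-value $m_i = P(T_i = 1 \mid E)$, the posterior probability that the effect exists in study $i$. By Bayes' theorem,

$$m_i = \frac{\sum_{t \in U_i} P(E \mid T = t)\, P(T = t)}{\sum_{t \in U} P(E \mid T = t)\, P(T = t)} \qquad (1)$$

where $U_i$ is the subset of $U$ whose elements' $i$th value is 1. Thus, we only need to know for each $t$ the posterior probability of $T = t$, consisting of the probability of $E$ given $T = t$ and the prior probability of $T = t$.

The prior probability of $T = t$ is

$$P(T = t) = \int_0^1 \pi^{|t|} (1 - \pi)^{n - |t|}\, p(\pi)\, d\pi = \frac{B(|t| + \alpha,\; n - |t| + \beta)}{B(\alpha, \beta)},$$

where $|t|$ is the number of 1's in $t$ and $B$ is the beta function.

And the probability of $E$ given $T = t$ is

$$P(E \mid T = t) = \int_{-\infty}^{\infty} \prod_{i \in t_0} N(E_i; 0, V_i) \prod_{i \in t_1} N(E_i; \mu, V_i)\, p(\mu)\, d\mu \qquad (2)$$

where $t_0$ is the indices of 0 in $t$ and $t_1$ is the indices of 1 in $t$. We can analytically work on the integration to obtain

$$P(E \mid T = t) = \bar{C}\, N(\bar{E}; 0, \bar{V} + \sigma^2) \prod_{i \in t_0} N(E_i; 0, V_i),$$

where

$$\bar{E} = \frac{\sum_i W_i E_i}{\sum_i W_i}, \qquad \bar{V} = \frac{1}{\sum_i W_i},$$

and $W_i = V_i^{-1}$ is the inverse variance or precision. The summations are all with respect to $i \in t_1$.

$\bar{C}$ is a scaling factor such that

$$\bar{C} = \frac{1}{(\sqrt{2\pi})^{|t_1| - 1}} \sqrt{\frac{\prod_i W_i}{\sum_i W_i}}\, \exp\left\{ -\frac{1}{2} \left( \sum_i W_i E_i^2 - \frac{\left(\sum_i W_i E_i\right)^2}{\sum_i W_i} \right) \right\},$$

where the product and sums again run over $i \in t_1$, and $2\pi$ here involves the mathematical constant rather than the prior probability $\pi$. The details of the derivation are in Text S1 in Supporting Information S1. As a result, we can calculate $P(T = t \mid E)$ for every $t$ and therefore obtain $m_i$ for each study $i$.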
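As a rough illustration of the exact calculation, the Python sketch below enumerates every configuration of present and absent effects and sums the unnormalized posterior weights. The names `exact_m_values`, `log_posterior_weight`, `effects`, `variances`, `sigma`, `alpha`, and `beta` are illustrative choices rather than anything defined in the paper, and the default `sigma=0.2` is simply one of the values mentioned in the text. The closed-form likelihood comes from completing the square over the studies assumed to carry the effect, so this should be read as a sketch under the binary effects assumption, not as the authors' implementation.

```python
import itertools
import math


def log_normal_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_posterior_weight(t, effects, variances, sigma, alpha, beta):
    """Unnormalized log posterior of one configuration t (a tuple of 0/1)."""
    n, k = len(t), sum(t)
    # Prior on the configuration after integrating out the Beta(alpha, beta) prior.
    log_w = log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    # Studies without an effect: observed effect size ~ N(0, V_i).
    for ti, e, v in zip(t, effects, variances):
        if ti == 0:
            log_w += log_normal_pdf(e, 0.0, v)
    if k > 0:
        # Studies with an effect share one true effect size mu ~ N(0, sigma^2),
        # which is integrated out analytically.
        w = [1.0 / v for ti, v in zip(t, variances) if ti == 1]
        e1 = [e for ti, e in zip(t, effects) if ti == 1]
        sw = sum(w)
        swe = sum(wi * ei for wi, ei in zip(w, e1))
        swee = sum(wi * ei * ei for wi, ei in zip(w, e1))
        log_c = (-0.5 * (k - 1) * math.log(2.0 * math.pi)
                 + 0.5 * (sum(math.log(wi) for wi in w) - math.log(sw))
                 - 0.5 * (swee - swe * swe / sw))
        log_w += log_c + log_normal_pdf(swe / sw, 0.0, 1.0 / sw + sigma ** 2)
    return log_w


def exact_m_values(effects, variances, sigma=0.2, alpha=1.0, beta=1.0):
    """Posterior probability that the effect exists in each study."""
    n = len(effects)
    configs = list(itertools.product((0, 1), repeat=n))
    logs = [log_posterior_weight(t, effects, variances, sigma, alpha, beta)
            for t in configs]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [sum(w for t, w in zip(configs, weights) if t[i] == 1) / total
            for i in range(n)]
```

For instance, `exact_m_values([0.30, 0.28, 0.31, 0.0], [0.002] * 4)` gives m-values close to 1 for the first three hypothetical studies and close to 0 for the fourth, whose precisely estimated effect is near zero.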

The drawback of the exact calculation of the m-value is that we need to iterate over all $2^n$ possible configurations of effects across the $n$ studies, which is exponential in $n$. This is not problematic in most current meta-analyses of GWASs, but it will be problematic in future studies if $n$ grows beyond several tens. Therefore, we propose a simple Markov chain Monte Carlo (MCMC) method to estimate the m-value.

We propose the following Metropolis-Hastings algorithm [37], where $T$ denotes the current configuration (the vector indicating in which studies the effect exists), $T'$ a candidate next configuration, and $P(T \mid E)$ the posterior probability of $T$ given the observed data $E$.

  1. Start from a random $T$.
  2. Choose a next $T'$.
  3. If $P(T' \mid E) > P(T \mid E)$, move to $T'$. Otherwise, move to $T'$ with probability $P(T' \mid E) / P(T \mid E)$.
  4. Repeat from step 2.

The set of moves we use for choosing $T'$ is $\{M_1, \ldots, M_n, M_{\text{shuffle}}\}$. $M_i$ is a simple flipping move of $T_i$ between 0 and 1. $M_{\text{shuffle}}$ is a move that shuffles the values of $T$. This move is introduced to avoid being stuck on one mode in the special case of two modes, which can happen when the observed direction of the effect is opposite in some studies. At each step, we randomly choose a move from this set assuming a uniform distribution. We allow a burn-in period and then draw samples. After sampling, the samples give us an approximation of the distribution over $T$, which subsequently gives the approximations of m-values by formula (1).
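The sampling steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: `mcmc_m_values` and its parameters are hypothetical names, `log_post` is assumed to be any caller-supplied function returning the unnormalized log posterior of a 0/1 configuration, and the m-value is approximated here simply by the fraction of retained samples in which a study's indicator equals 1. Both move types are symmetric proposals, so the acceptance ratio reduces to the ratio of posterior probabilities.

```python
import math
import random


def mcmc_m_values(log_post, n, n_burn=1000, n_sample=10000, seed=0):
    """Estimate m-values for n studies with a Metropolis-Hastings sampler.

    log_post: function mapping a list of n zeros/ones to an unnormalized
              log posterior (assumed to be supplied by the caller).
    """
    rng = random.Random(seed)
    current = [rng.randint(0, 1) for _ in range(n)]  # step 1: random start
    current_lp = log_post(current)
    counts = [0] * n

    for step in range(n_burn + n_sample):
        # Step 2: pick one of the n flip moves or the shuffle move uniformly.
        move = rng.randrange(n + 1)
        candidate = list(current)
        if move < n:
            candidate[move] = 1 - candidate[move]  # flip one indicator
        else:
            rng.shuffle(candidate)  # permute the indicator values
        candidate_lp = log_post(candidate)

        # Step 3: always accept an uphill move, otherwise accept with
        # probability equal to the posterior ratio.
        if (candidate_lp >= current_lp
                or rng.random() < math.exp(candidate_lp - current_lp)):
            current, current_lp = candidate, candidate_lp

        # Record the state only after the burn-in period.
        if step >= n_burn:
            for i, value in enumerate(current):
                counts[i] += value

    return [c / n_sample for c in counts]
```

Working on the log scale avoids underflow when the posterior weights of different configurations differ by many orders of magnitude, which is common when some studies have very small p-values.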

Interpretations and predictions.

The m-value has a valid probabilistic interpretation that it is the posterior probability that the effect exists in each study under our binary effects model. If we are to choose studies predicted to have an effect and studies predicted to not have an effect, a threshold is needed. In this paper, we use the threshold of m-value for the former and m-value for the latter. Although this thresholding is arbitrary, the actual level of threshold is often not of importance because outlier studies showing different characteristics from the other studies usually stand out in the plotting framework described below.

Relationship to PPA.

The m-value is closely related to the posterior probability of association (PPA) based on the Bayes factor (BF) [35], in the sense that the presence and absence of effects describe essentially the same things as the alternative and null models in association testing. There are two fundamental differences. First, in the usual PPA, the prior probability of association ($\pi$) is given by a point prior, which is usually a very small value in GWAS, reflecting the fact that true associations are few. In our framework, we focus on interpreting meta-analysis results after we find associations using meta-analysis. Thus, $\pi$ reflects our belief about the effect, conditional on the associations already being significant. For this reason, we need not use a very small value but instead choose to use a distribution prior. Second, the PPA is calculated for each study separately, whereas the m-value is calculated using all studies simultaneously, utilizing cross-study information. Thus, if the binary effects assumption approximates the truth, the m-value is more effective in predicting effects than the PPA or, equivalently, the BF, as we show by simulations in Results.

P-M Plot

We propose plotting the studies' p-values and m-values together in two dimensions. This plot, which we call the P-M plot, can help interpret the results of a meta-analysis. Figure 1 shows how to interpret such a plot. The right-most (pink) region is where the studies are predicted to have an effect. Often, a study can be in this region even if its p-value is not very significant. The left-most (light-blue) region is where the studies are predicted to not have an effect. A study falls in this region when its sample size is large but its observed effect size is close to zero, suggesting that no effect may exist in that study. The middle (green) region is where the prediction is ambiguous. A study can be in this region because it is underpowered due to a small sample size. If the sample size increases, the study will be drawn to either the left or the right side.
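A small Python sketch of how studies might be assigned to the three regions is given below. The function names, the choice of a $-\log_{10}$ scale for the p-value axis, and the threshold arguments `lower` and `upper` are all assumptions made for illustration; the cut-off values passed in the usage note are placeholders and are not taken from the text.

```python
import math


def pm_region(m_value, lower, upper):
    """Classify a study by its m-value into one of the three plot regions.

    lower, upper: caller-chosen m-value thresholds (hypothetical parameters).
    """
    if not 0.0 <= m_value <= 1.0:
        raise ValueError("m_value must lie in [0, 1]")
    if not lower < upper:
        raise ValueError("lower must be smaller than upper")
    if m_value >= upper:
        return "effect"      # right-most region
    if m_value <= lower:
        return "no effect"   # left-most region
    return "ambiguous"       # middle region


def pm_points(studies, lower, upper):
    """Return (m-value, -log10 p-value, region) for each (p_value, m_value) pair."""
    return [(m, -math.log10(p), pm_region(m, lower, upper)) for p, m in studies]
```

With placeholder thresholds, `pm_points([(1e-6, 0.99), (0.4, 0.05), (0.2, 0.5)], lower=0.1, upper=0.9)` labels the three hypothetical studies as having an effect, having no effect, and ambiguous, respectively.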

Genome-Wide Association Studies

Genome-wide association studies (GWAS) use high-throughput genomic technologies to scan entire genomes of large numbers of subjects quickly, in order to find genetic variants correlated with a trait or disease. Understanding the genetic architecture of complex diseases relies heavily on discovery and characterization of disease-associated variants such as single nucleotide polymorphisms (SNPs) and copy number variations (CNVs).

GWAS for Common Variant Discovery

Complex diseases are often characterized by common variants, while the contribution of rare or low-frequency variants remains largely unknown. Large-scale GWAS using microarrays are efficient and cost-effective for identifying loci and imputing common SNP variants associated with disease. However, arrays are limited in detecting low-frequency SNP variants. The base-by-base resolution of whole-genome sequencing allows for the identification of both common and rare variants that may be associated with disease.

Benefits of Genome-Wide Association Studies

  • Identification of novel variant-trait associations, with more than 50,000 trait- and disease-associations reported to date [1]
  • Genotype information that can be leveraged for clinical applications, including development of polygenic risk scores used for early detection, prevention, or treatment of disease as well as drug development, selection, and dosage
  • Generation of easily sharable data, facilitating analysis on increasingly large and diverse sample sets

Opportunities for GWAS and Genetic Disease

GWAS for many diseases and disorders have not yet been performed, and the large majority (79%) of participants in GWAS to date are of European ancestry. As the European population accounts for just 16% of the global population, there is a recognized need for more diverse GWAS datasets [2].

In addition to ethnic diversity, there is a need to perform GWAS on diverse disease indications for specific sub-groups. This will help provide clues about which genes and gene pathways could be involved in disease mechanisms and pathogenesis.

Successfully Identified Variants for Specific Complex Diseases

GWAS using the common case-control design, which compares two large groups of individuals (a case group affected by a disease and a healthy control group), have successfully identified variants for specific complex diseases, such as:

  • Type 2 diabetes
  • Parkinson’s disease
  • Crohn’s disease
  • Various types of heart disease including coronary artery, atrial fibrillation, cardiomyopathy, etc.
  • Multiple types of cancer including breast, colorectal, etc.

Understanding Variant to Function Research

Researchers study populations and groups to find connections that help us understand how variants relate to each other and various diseases. Genomics is essential in driving this research.

Prioritizing Functional Genetic Variants Through Advanced Sequencing Approaches

Genome-wide association studies have identified thousands of variants with putative roles in different diseases. However, going from statistical associations to true insight into disease mechanisms remains a challenge. Recent advances in sequencing technologies have facilitated the development of strategies for assaying GWAS SNPs for potential functional relevance.

  1. Tam V, Patel N, Turcotte M, et al. Benefits and limitations of genome-wide association studies. Nat Rev Genet. 2019;20:467–484.
  2. Martin AR, et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet. 2019;51:584–591.

Metabolite-based genome-wide association studies in plants

mGWAS in plants benefits from the huge diversity of the plant metabolome.

mGWAS is powerful in dissecting the genetic basis of the plant metabolome.

mGWAS provides a useful strategy for plant functional genomics.

mGWAS can be further applied to the dissection of complex traits in plants.

The plant metabolome is the readout of plant physiological status and is regarded as the bridge between the genome and the phenome of plants. Unraveling the natural variation and the underlying genetic basis of plant metabolism has received increasing interest from plant biologists. Enabled by the recent advances in high-throughput profiling and genotyping technologies, metabolite-based genome-wide association study (mGWAS) has emerged as a powerful alternative forward genetics strategy to dissect the genetic and biochemical bases of metabolism in model and crop plants. In this review, recent progress and applications of mGWAS in understanding the genetic control of plant metabolism and in interactive functional genomics and metabolomics are presented. Further directions and perspectives of mGWAS in plants are also discussed.

Quality Control

One disadvantage of a case–control study design compared with family-based association studies is the lack of an internal check on genotyping quality. Standard laboratory practice of assigning both cases and controls to each plate, checking for differences in genotype frequency across plates, and genotyping duplicate samples can help eliminate systematic errors. Testing for Hardy–Weinberg equilibrium (HWE) in controls can also identify problems with genotyping quality.

Hardy–Weinberg Equilibrium

Under HWE, alleles segregate randomly in the population, allowing expected genotype frequencies to be calculated from allele frequencies. A comparison of the expected and observed genotype frequencies provides a test of HWE (e.g., using a chi-square statistic). For alleles G and T, in which the frequency of allele G is $p$ and the frequency of allele T is $q = 1 - p$, the expected frequencies of genotypes GG, GT, and TT are $p^2$, $2pq$, and $q^2$. Allele frequencies ($p$, $q$) are usually estimated from the genotype sample under test, rather than obtained from external genotyping data.

Departure from HWE is generally tested for by using the Pearson chi-square test to assess goodness of fit (of the observed genotype counts to their expectation under HWE). Table 2 shows the step-by-step calculation with observed counts for genotypes GG, GT, and TT of a, b, c, and an application to a data set of 100 control genotypes (GG: 60, GT: 30, TT: 10). The estimated frequency of allele G is 0.75 (= [2 × 60 + 30]/200), noting the division by the number of alleles (2N) here, not genotypes (N). The expected genotype counts under HWE are therefore 56.25, 37.5, and 6.25. The chi-square goodness-of-fit test statistic is then calculated by summing $(O - E)^2/E$ across genotypes, giving chi-square = 4.0. Under the null hypothesis of no departure from HWE, the test statistic has one degree of freedom (not two degrees of freedom, as implied by the table dimensions), because the allele frequency $p$ has been estimated from the observed data. In this test data set, a p value of 0.046 is obtained, giving slight evidence of departure from HWE, with a deficit in the number of observed heterozygotes.
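The worked example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `hwe_chi_square` is an illustrative choice, and the p value relies on the fact that the upper tail of a chi-square distribution with one degree of freedom equals `erfc(sqrt(x / 2))`.

```python
import math


def hwe_chi_square(n_gg, n_gt, n_tt):
    """Pearson goodness-of-fit test for Hardy-Weinberg equilibrium.

    Returns the estimated frequency of allele G, the chi-square statistic,
    and the p value on one degree of freedom.
    """
    n = n_gg + n_gt + n_tt
    # Allele frequency: count alleles (2N), not genotypes (N).
    p = (2 * n_gg + n_gt) / (2 * n)
    q = 1.0 - p
    observed = (n_gg, n_gt, n_tt)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, chi2, p_value


freq_g, chi2, p_value = hwe_chi_square(60, 30, 10)
print(freq_g, round(chi2, 3), round(p_value, 3))  # 0.75 4.0 0.046
```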

Testing for departure from Hardy–Weinberg equilibrium

Departures from HWE in control samples may be caused by the following:

1. Genotyping error. In many genotyping platforms, calling heterozygous individuals is more challenging than calling homozygous individuals, and a higher rate of missing calls for this genotype can distort HWE.

2. Assortative mating. HWE requires random mating with respect to the SNP under test, which is reasonable for a random SNP across the genome, but may be violated for SNPs that affect traits influencing mate choice, such as height.

3. Selection. Any genotype increasing the risk of fetal loss or early death is likely to be underrepresented.

4. Population stratification. Control samples that arise from a combination of genetically distinct subpopulations may not be in HWE.

5. Chance. HWE p values for studies of more than one SNP should be corrected appropriately for multiple testing.

Departures from HWE may be caused by any of these factors, but also by the genotyped SNP playing a role in disease susceptibility. Case genotypes for a disease mutation will only be in HWE if the genetic model is multiplicative, with genotype relative risks of $1$, $r$, $r^2$. However, for modest effect sizes, the power to detect departures from HWE may be low in cases.
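The statement about the multiplicative model can be checked numerically. The sketch below assumes a population in HWE with risk-allele frequency `p`, weights each genotype frequency by its relative risk to obtain genotype frequencies among cases, and measures how far the case heterozygote frequency is from its HWE expectation. The function names and the example values (`p = 0.3`, `r = 2`) are hypothetical.

```python
def case_genotype_freqs(p, risks):
    """Genotype frequencies among cases for 0, 1, and 2 copies of the risk allele.

    p: population frequency of the risk allele (population assumed in HWE).
    risks: relative risks for genotypes with 0, 1, and 2 risk alleles.
    """
    q = 1.0 - p
    population = (q * q, 2.0 * p * q, p * p)
    weighted = [f * r for f, r in zip(population, risks)]
    total = sum(weighted)
    return [w / total for w in weighted]


def heterozygote_excess(freqs):
    """Observed heterozygote frequency minus its HWE expectation."""
    f0, f1, f2 = freqs
    p_case = f2 + f1 / 2.0  # allele frequency among cases
    return f1 - 2.0 * p_case * (1.0 - p_case)


r = 2.0
multiplicative = case_genotype_freqs(0.3, (1.0, r, r * r))
dominant = case_genotype_freqs(0.3, (1.0, r, r))
print(heterozygote_excess(multiplicative))  # essentially zero: cases stay in HWE
print(heterozygote_excess(dominant))        # positive: cases depart from HWE
```

Under the multiplicative model the case frequencies are proportional to $q^2$, $2q(pr)$, and $(pr)^2$, which is again a perfect square and therefore in HWE with a shifted allele frequency.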

No standard guidelines for rejecting SNPs that depart from HWE have been developed. In practice, all SNPs for which HWE p values decrease below a predetermined threshold should be checked manually for genotyping quality. Investigators should also be aware of SNPs showing significant association in which HWE p values are close to this threshold and unsupported by neighboring SNPs in LD.

Missing Genotypes

Another indication of poor genotyping quality is low call rates, with many missing genotypes for each SNP or each individual. This is a major issue in GWAS, but it is also applicable to candidate gene association studies. Genotypes that are missing at random will not bias a test, but poor genotype call rates may indicate nonrandom missingness, with one specific genotype (often heterozygotes) having a lower call rate. This may bias tests of association. Differential rates of missingness between cases and controls (for example, because of differences in DNA extraction and storage) may also be a problem (Clayton et al. 2005).
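Call rates are straightforward to compute once genotypes are arranged as a matrix. The sketch below assumes a hypothetical layout in which each row is an individual, each column is a SNP, and a missing call is stored as `None`; applying it separately to cases and controls gives a simple view of differential missingness.

```python
def call_rates(genotypes):
    """Return (per-individual, per-SNP) call rates.

    genotypes: list of rows, one per individual; each row lists one genotype
               per SNP, with None marking a missing call (assumed encoding).
    """
    n_individuals = len(genotypes)
    n_snps = len(genotypes[0])
    per_individual = [
        sum(g is not None for g in row) / n_snps for row in genotypes
    ]
    per_snp = [
        sum(row[j] is not None for row in genotypes) / n_individuals
        for j in range(n_snps)
    ]
    return per_individual, per_snp


cases = [["GG", "GT", None], ["GT", None, None]]
controls = [["GG", "GT", "TT"], ["GT", "GT", "TT"]]
_, case_snp_rates = call_rates(cases)
_, control_snp_rates = call_rates(controls)
# Per-SNP difference in call rate between controls and cases.
print([c - k for c, k in zip(control_snp_rates, case_snp_rates)])  # [0.0, 0.5, 1.0]
```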

Population Stratification

Population stratification arises in case–control studies when the two study groups are poorly matched for genetic ancestry. Confounding then occurs between disease state (case, control) and genetic ancestry, with a subsequent increase in false-positive associations. For population stratification to occur, the underlying populations must differ in SNP allele frequency and be represented at different frequencies in the case and control groups. Detecting and controlling for population stratification is important, particularly in GWAS, in which even subtle differences between cases and controls can have major effects on the analysis. Several methods are available to detect and correct for population stratification, including genomic control, the Cochran/Mantel–Haenszel test, and the transmission disequilibrium test.

Genomic control (GC) assumes that population stratification inflates the association test statistics by a constant factor λ, which can be estimated from the median or mean test statistic from a series of unlinked SNPs genotyped in both cases and controls (Devlin and Roeder 1999). Test statistics are then divided by λ and compared with a chi-square distribution (or an F distribution) to test for association (Devlin et al. 2004). Genotypes at SNPs uncorrelated with disease status can also be used to infer population ancestry, assigning the samples to distinct population groups, which can then be controlled for in the analysis (Pritchard et al. 2000). In GWAS, population substructure can be identified through a principal components analysis, which models ancestral genetic differences between cases and controls and then corrects for this in the analysis (Price et al. 2006).
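The genomic control correction can be sketched in a few lines of Python. This version uses the median-based estimate: the median of the observed one-degree-of-freedom chi-square statistics at unlinked SNPs is divided by the median of the chi-square distribution with one degree of freedom (approximately 0.4549). The function names are illustrative, and the example statistics are made up.

```python
import statistics

# Median of a chi-square distribution with one degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def estimate_lambda(null_chi2_stats):
    """Estimate the inflation factor from statistics at unlinked SNPs."""
    return statistics.median(null_chi2_stats) / CHI2_1DF_MEDIAN


def gc_correct(chi2_stat, inflation):
    """Divide an association test statistic by the inflation factor."""
    return chi2_stat / inflation


# Made-up statistics whose median is twice the theoretical median.
null_stats = [0.2, 2 * CHI2_1DF_MEDIAN, 3.0]
lam = estimate_lambda(null_stats)
print(round(lam, 3), round(gc_correct(8.0, lam), 3))  # 2.0 4.0
```

Some analyses leave statistics unchanged when the estimated factor is below 1; that choice is not made here and would be a separate design decision.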

Where individuals can be classified into known subgroups (e.g., by birthplace), analysis can be performed within each subgroup and combined using a Cochran/Mantel–Haenszel test (Clayton et al. 2005). The issue of population stratification can be avoided by using family-based studies. The most widely used method is the transmission disequilibrium test (TDT) (Spielman et al. 1993), which tests for non-Mendelian transmission of SNP alleles from heterozygous parents to affected offspring; overtransmission of an allele suggests that it increases the risk of disease.
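In its standard form, the TDT reduces to a McNemar-type statistic on transmission counts from heterozygous parents, which the Python sketch below computes. The function and argument names are illustrative, and the counts in the example are made up.

```python
import math


def tdt(transmitted, untransmitted):
    """Transmission disequilibrium test for one allele.

    transmitted: number of heterozygous parents who transmitted the allele
                 to an affected offspring.
    untransmitted: number of heterozygous parents who did not transmit it.
    Returns the chi-square statistic (1 degree of freedom) and its p value.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


chi2, p_value = tdt(60, 40)
print(round(chi2, 3), round(p_value, 3))  # 4.0 0.046
```

Under Mendelian transmission each heterozygous parent passes on the allele with probability one half, so the two counts are expected to be equal and a large statistic indicates overtransmission or undertransmission.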

Estimating genetic nurture with summary statistics of multi-generational genome-wide association studies

Marginal effect estimates in genome-wide association studies (GWAS) are mixtures of direct and indirect genetic effects. Existing methods to dissect these effects require family-based, individual-level genetic and phenotypic data with large samples, which are difficult to obtain in practice. Here, we propose a novel statistical framework to estimate direct and indirect genetic effects using summary statistics from GWAS conducted on own and offspring phenotypes. Applied to birth weight, our method showed results nearly identical to those obtained using individual-level data. We also decomposed direct and indirect genetic effects of educational attainment (EA), which showed distinct patterns of genetic correlations with 45 complex traits. The known genetic correlations between EA and higher height, lower BMI, less active smoking behavior, and better health outcomes were mostly explained by the indirect genetic component of EA. In contrast, the consistently identified genetic correlation of autism spectrum disorder (ASD) with higher EA resides in the direct genetic component. A polygenic transmission disequilibrium test showed a significant over-transmission of the direct component of EA from healthy parents to ASD probands. Taken together, we demonstrate that traditional GWAS approaches, in conjunction with offspring phenotypic data collection in existing cohorts, could greatly benefit studies on genetic nurture and shed important light on the interpretation of genetic associations for human complex traits.
