Mapping the genetic landscape across 14 psychiatric disorders

Quality control of summary statistics

A standard set of quality-control filters was applied to all univariate GWAS summary statistics before conducting cross-disorder analyses. Any additional quality-control filters applied by a method are noted in its corresponding section below. The standard filters removed strand-ambiguous SNPs and, when this information was available in the GWAS data, restricted to SNPs with an imputation score (INFO) > 0.6 and a minor allele frequency > 1%. We also restricted analyses to SNPs with an SNP-specific sum of effective sample sizes >50% of the total sum of effective sample sizes or, when this SNP-specific information was not available, to SNPs for which >50% of the cohorts contributed information, as indexed by the direction column in the GWAS summary statistics. The MHC region was excluded from all summary statistics before the analysis. Base pair locations are given in genome build GRCh37/hg19 throughout the Article and its Supplementary Information.
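
The filters described above can be sketched as a single per-SNP predicate. This is a minimal illustration, not the authors' pipeline: the `Snp` record, the `passes_qc` name, the MHC coordinates (the text gives no boundaries) and the convention that `?` in the direction column marks a non-contributing cohort are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Allele pairs that read the same on both strands.
AMBIGUOUS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}

# ASSUMPTION: hg19 bounds for the MHC region; the source does not state them.
MHC_CHROM, MHC_START, MHC_END = 6, 25_000_000, 35_000_000


@dataclass
class Snp:
    """Hypothetical summary-statistics row; optional fields may be absent."""
    rsid: str
    chrom: int
    pos: int
    a1: str
    a2: str
    info: Optional[float] = None
    maf: Optional[float] = None
    n_eff: Optional[float] = None      # SNP-specific sum of effective N
    direction: Optional[str] = None    # e.g. "+-?+", one symbol per cohort


def passes_qc(snp: Snp, total_n_eff: float) -> bool:
    """Return True when the SNP survives every filter that can be evaluated."""
    if (snp.a1.upper(), snp.a2.upper()) in AMBIGUOUS:
        return False
    if snp.info is not None and snp.info <= 0.6:
        return False
    if snp.maf is not None and snp.maf <= 0.01:
        return False
    if snp.n_eff is not None:
        # Preferred check: SNP-specific effective N above 50% of the total.
        if snp.n_eff <= 0.5 * total_n_eff:
            return False
    elif snp.direction:
        # Fallback: more than 50% of cohorts contributed (ASSUMES "?" = missing).
        contributing = sum(1 for symbol in snp.direction if symbol != "?")
        if contributing <= 0.5 * len(snp.direction):
            return False
    if snp.chrom == MHC_CHROM and MHC_START <= snp.pos <= MHC_END:
        return False
    return True
```

Filters on INFO and minor allele frequency are skipped when the corresponding field is missing, mirroring the "when this information was available" wording.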

Genomic SEM

Genome-wide models

All GWAS summary statistics were run through the munge function before running the multivariable version of LDSC, whose output is used as input to genomic SEM7. The munge function aligns GWAS effects to the same reference allele and restricts to HapMap3 SNPs and SNPs with INFO > 0.9. LDSC estimates were obtained from these munged summary statistics, applying a liability threshold model for all case–control psychiatric disorders (that is, all disorders except for the NIC outcome, which reflects a GWAS of the continuous Fagerström test for nicotine dependence24). For comparability, the population prevalence for each disorder was chosen to match the value used in the manuscript that introduced the corresponding GWAS. The ascertainment correction was performed using the sum of effective sample sizes across contributing cohorts for each disorder72. For CUD26, we instead used the recently described formula72 for estimating the sum of effective sample sizes directly from the GWAS data. We did so because, in this instance, the implied sum of effective sample sizes was much smaller than the value computed from the reported sample sizes, which is probably attributable to the complex familial structure in the included deCODE sample.
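
As a rough illustration of the two sample-size quantities contrasted above, the sketch below computes a sum of effective sample sizes from per-cohort case and control counts, and an implied effective sample size from a SNP's allele frequency and standard error. Both formulas are assumptions about what the cited method (ref. 72) uses, written in their commonly quoted form; the function names are hypothetical.

```python
def cohort_n_eff(n_cases: float, n_controls: float) -> float:
    """Effective N of one case-control cohort: 4 / (1/cases + 1/controls)."""
    return 4.0 * n_cases * n_controls / (n_cases + n_controls)


def sum_n_eff(cohorts) -> float:
    """Sum of effective N over (n_cases, n_controls) pairs, one per cohort."""
    return sum(cohort_n_eff(cases, controls) for cases, controls in cohorts)


def implied_n_eff(maf: float, se: float) -> float:
    """
    Effective N implied by one SNP's standard error on the log-odds scale.
    ASSUMED form: 4 / (2 * MAF * (1 - MAF) * SE^2).
    """
    return 4.0 / (2.0 * maf * (1.0 - maf) * se ** 2)
```

A large gap between `sum_n_eff` computed from reported counts and `implied_n_eff` averaged over SNPs would correspond to the situation described for CUD, where relatedness reduces the information actually carried by the sample.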

The two primary estimates from multivariable LDSC are the genetic covariance matrix and the corresponding sampling covariance matrix. The genetic covariance matrix contains SNP-based heritabilities on the diagonal and the co-heritabilities (genetic covariances) across every pairwise combination of included disorders on the off-diagonal. The sampling covariance matrix contains squared standard errors (sampling variances) on the diagonal, which allows genomic SEM to appropriately account for differences in the precision of GWAS estimates for disorders with unequal power. The off-diagonal contains sampling dependencies, which will arise in the presence of sample overlap across GWAS phenotypes. As these sampling dependencies are estimated directly from the data, summary statistics can be included with varying and unknown levels of sample overlap. We note that study overlap between disorders is not expected to affect the findings, as study overlap affects only the covariance of error terms of the GWASs, resulting in increased intercepts of cross-trait LDSC with no expected impact on r_g estimates4,43.

To guard against model overfitting, an exploratory factor analysis (EFA) was performed on even chromosomes and used to inform the fitting of a confirmatory factor analysis (CFA) in odd chromosomes. The EFA was performed using the factanal function in R for 2–5 factors using both promax (correlated) and varimax (orthogonal) rotations. Disorders were specified to load on a factor in the CFA when the standardized EFA loadings were >0.3, with disorders allowed to cross-load (for example, TS on the Compulsive and Neurodevelopmental factors) when loadings exceeded this threshold on multiple factors. Models specified based on varimax EFA results still allowed for interfactor correlations, as allowing only subsets of disorders to load on each factor will induce genetic overlap. A common-factor model, in which a single latent factor predicts all 14 disorders, was also fitted. We did not evaluate models with more than five factors as these caused issues with model convergence. Results revealed that a five-factor model specified based on the promax EFA results (Supplementary Table 3) fit the data best in odd chromosomes (CFI = 0.973, SRMR = 0.073; Supplementary Table 2). This model also fit the data well in all autosomes, and was subsequently carried forward for all analyses, along with the p-factor model described in the main text. Considering the high r_g across PTSD and MD, we also evaluated a model (in odd autosomes) that estimated the residual genetic covariance across these two disorders; however, we found that this did not significantly improve model fit (model χ² difference with 1 degree of freedom = 2.86, P = 0.094).
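
The rule for turning EFA loadings into a CFA specification can be sketched as follows. This is an illustrative sketch only: the function names, the placeholder disorder and factor labels, and the loading values in the usage example are hypothetical and are not taken from the study.

```python
def split_by_parity(chromosomes):
    """Split autosome numbers into (even, odd) lists for EFA and CFA."""
    even = [c for c in chromosomes if c % 2 == 0]
    odd = [c for c in chromosomes if c % 2 == 1]
    return even, odd


def cfa_specification(loadings, threshold=0.3):
    """
    Map each factor to the disorders whose standardized EFA loading
    exceeds the threshold. A disorder above the threshold on several
    factors appears under each of them (a cross-loading).
    """
    spec = {}
    for disorder, row in loadings.items():
        for factor, value in row.items():
            if value > threshold:
                spec.setdefault(factor, []).append(disorder)
    return spec


# Made-up loadings for three placeholder disorders on two factors.
example_loadings = {
    "A": {"F1": 0.8, "F2": 0.1},
    "B": {"F1": 0.4, "F2": 0.5},   # cross-loads on both factors
    "C": {"F1": 0.2, "F2": 0.7},
}
example_spec = cfa_specification(example_loadings)
```

Here placeholder disorder `B` exceeds 0.3 on both factors, so it is assigned to both, which is the cross-loading case described for TS.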

Stratified genomic SEM

Stratified genomic SEM proceeds in two stages27. In stage 1, the s_ldsc function in genomic SEM, a multivariable implementation of stratified LDSC (S-LDSC)58, was used to estimate the stratified genetic covariance and sampling covariance matrices within each functional annotation. We specifically used the zero-order estimates for these analyses. In stage 2, the enrich function was used to estimate the enrichment of the factor variances and residual genetic variances unique to the indicators. This is achieved by first estimating the model in the genome-wide annotation including all SNPs. The factor loadings from these genome-wide estimates are then fixed and the (residual) variances of the factors and disorders are freely estimated within each annotation. These reflect the within-annotation estimates for each variance component that are scaled to be comparable to the genome-wide estimates. This cumulative set of results is used to calculate the enrichment ratio of ratios. The numerator reflects the ratio of the estimate of the factor variance within an annotation over the genome-wide estimate. The denominator is the ratio of SNPs in the annotation over the total number of SNPs examined. Enrichment estimates greater than the null of 1 are therefore observed when an annotation explains a disproportionate level of genetic variance relative to the annotation’s size.
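
The enrichment ratio of ratios described above reduces to a short calculation. The sketch below assumes the within-annotation and genome-wide variance estimates are already available; the function name and arguments are hypothetical, and the numbers in the usage note are made up for illustration.

```python
def enrichment_ratio(var_annotation: float,
                     var_genome_wide: float,
                     snps_in_annotation: int,
                     snps_total: int) -> float:
    """
    Ratio of ratios: (share of genetic variance captured by the annotation)
    divided by (share of SNPs that fall in the annotation).
    Values above 1 indicate more variance than the annotation's size predicts.
    """
    variance_share = var_annotation / var_genome_wide
    snp_share = snps_in_annotation / snps_total
    return variance_share / snp_share
```

For example, an annotation holding 10% of SNPs whose factor variance estimate is 25% of the genome-wide estimate would give `enrichment_ratio(0.5, 2.0, 100, 1000)`, approximately 2.5.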

Functional annotations used to estimate the stratified matrices were obtained from a variety of data resources. These included: (1) the baseline annotations from the 1000 Genomes Phase 3 BaselineLD (v.2.2)73 from the S-LDSC developers58; (2) tissue-specific gene expression annotation files created using data from GTEx74 and DEPICT75; (3) tissue-specific histone marks from the Roadmap Epigenetics project76; (4) annotations that we created27 from data in GTEx74 and the Genome Aggregate Database (gnomAD)77 that index protein-truncating-variant-intolerant (PI) genes, genes expressed in different types of brain cells in the human hippocampus and prefrontal cortex, and their intersection; (5) 11 neuronal cell type annotations defined by peaks from single-cell assay for transposase accessibility by sequencing (scATAC–seq) in the human forebrain54; (6) an annotation defined by peaks from ATAC–seq data with greater accessibility in neural progenitor enriched regions encompassing the ventricular, subventricular and intermediate zones (GZ) over neuron-enriched regions within the subplate, marginal zone and cortical plate (CP; GZ > CP), and a second CP > GZ annotation reflecting the converse60; and (7) fetal and adult annotations defined by eQTLs identified using high-throughput RNA-seq45. We excluded 22 annotations that produced stratified genetic covariance matrices that were highly non-positive definite, leaving a total of 162 annotations. We corrected for multiple testing using a strict Bonferroni correction across the 162 annotations that passed quality control and the 11 factors examined (the five factors from the five-factor model, plus the p-factor and the residuals of the five factors from the p-factor model), giving a significance threshold of P < 2.81 × 10^−5.
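
The reported threshold can be checked with a short calculation. The sketch assumes a family-wise alpha of 0.05, which the text does not state explicitly but which reproduces the reported value; the function name is hypothetical.

```python
def bonferroni_threshold(n_annotations: int, n_factors: int,
                         alpha: float = 0.05) -> float:
    """Per-test significance threshold for annotations x factors tests."""
    n_tests = n_annotations * n_factors
    return alpha / n_tests


# 162 annotations x 11 factors = 1,782 tests (alpha = 0.05 is ASSUMED).
threshold = bonferroni_threshold(162, 11)
```

With 1,782 tests, the threshold evaluates to roughly 2.81 × 10^−5, in line with the value given above.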

Multivariate GWAS
