Distinct genetic architecture in the tails of complex traits

Data processing

UKB genotype data

The UKB is a prospective cohort study of approximately 500,000 participants recruited across the United Kingdom from 2006 to 2010 (ref. 18). Phenotype data of anthropometric, biological and lifestyle measures were collected at baseline and in follow-up surveys, with further linkage to health and disease record data. The genetic dataset consists of 488,377 samples genotyped at 805,426 SNPs. To define population ancestries, 4-means clustering analysis was conducted on the first two principal components of the genotype data. Ancestries were then defined according to the country of birth (field ID: 20115) of the majority of individuals in the cluster, resulting in 461,931 European, 11,074 South Asian, 7,935 African, 2,585 West Asian and 2,550 East Asian individuals, and 1,619 individuals in clusters for which there was no majority country of birth.
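The cluster-labelling step can be illustrated with a minimal sketch, assuming cluster assignments from the 4-means step are already available. The function name, the input layout and the strict-majority rule (more than half of the cluster) are assumptions made for illustration; the text states only that ancestry followed the country of birth of the majority of individuals in each cluster.

```python
from collections import Counter, defaultdict

def label_clusters(cluster_ids, birth_countries):
    """Map each cluster to its majority country of birth, or None if there is none."""
    by_cluster = defaultdict(list)
    for cid, country in zip(cluster_ids, birth_countries):
        by_cluster[cid].append(country)
    labels = {}
    for cid, countries in by_cluster.items():
        country, count = Counter(countries).most_common(1)[0]
        # Assumption: "majority" means strictly more than half of the cluster.
        labels[cid] = country if count > len(countries) / 2 else None
    return labels

# Toy example with hypothetical data: cluster 1 has no majority country.
toy_labels = label_clusters(
    [0, 0, 0, 1, 1, 1, 1],
    ["UK", "UK", "Ireland", "India", "Kenya", "China", "India"],
)
```

Clusters that map to None here correspond to the clusters with no majority country of birth described above.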

After clustering, standard QC procedures were applied independently to each ancestry cluster. SNPs with a minor allele frequency (MAF) < 0.01, genotype missingness > 0.02 or Hardy–Weinberg equilibrium test P < 10⁻⁸ were excluded. Samples exhibiting high levels of missingness or heterozygosity, inconsistencies between genetically inferred and self-reported sex, or aneuploidy of the sex chromosomes were removed in accordance with recommendations from the UKB data processing team (ref. 18). For analyses in the general population, a greedy algorithm was used to remove related individuals, maximizing sample retention while leaving no pairs related to the third degree or closer (kinship coefficient > 0.044; ref. 41). After these QC steps, 411,948 unrelated individuals remained, of whom 387,472 were of European ancestry and 24,476 were of multiple non-European ancestries. Within the European ancestry cluster, 18,340 individuals with repeated measures were set aside, as were the 24,476 individuals of multiple non-European ancestries, for replication analyses. This left up to 369,132 unrelated individuals of European ancestry for the primary POPout analyses.
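One common greedy heuristic for relatedness pruning is sketched below: repeatedly drop the sample with the most remaining relatives until no related pair is left. This is an illustrative assumption and not necessarily the algorithm of ref. 41; the function name and the (id1, id2, kinship) tuple layout are hypothetical, and only the 0.044 threshold comes from the text.

```python
from collections import defaultdict

def prune_related(samples, kinship_pairs, threshold=0.044):
    """Greedily drop samples until no pair exceeds the kinship threshold."""
    related = defaultdict(set)
    for a, b, kinship in kinship_pairs:
        if kinship > threshold:
            related[a].add(b)
            related[b].add(a)
    removed = set()
    while True:
        # Pick the sample with the most remaining relatives (ties broken by ID).
        worst = max(related, key=lambda s: (len(related[s]), str(s)), default=None)
        if worst is None or not related[worst]:
            break  # no related pairs remain
        for other in related.pop(worst):
            related[other].discard(worst)
        removed.add(worst)
    return [s for s in samples if s not in removed]

# Toy example: A is related to both B and C, so removing A alone suffices.
kept = prune_related(
    ["A", "B", "C", "D"],
    [("A", "B", 0.25), ("A", "C", 0.12), ("B", "D", 0.02)],
)
```

Removing the most-connected sample first tends to retain more individuals than removing one member of each pair arbitrarily, which is the motivation for a greedy approach.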

For the sibling analyses, sibling pairs were selected by first identifying first-degree relatives, defined as pairs with a kinship coefficient in the range 0.177–0.354 (the expected value for siblings is 0.25), which allows for the expected variation among siblings (ref. 42). Only individuals of European ancestry were included in the sibling analyses owing to the insufficient statistical power of the multiple ancestries sample (ref. 2). The proportion of SNPs with zero identity-by-state (IBS0) was used to distinguish sibling pairs from parent–offspring pairs, as parent–offspring pairs have IBS0 ≈ 0 across the autosomes. In the UKB, there is clear separation between pairs of first-degree relatives at IBS0 = 0.00112 (ref. 42); thus, sibling pairs were selected as those exceeding this threshold. This provided a total of 17,289 sibling pairs for analysis.
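The two-stage sibling selection can be sketched as a simple filter. The thresholds are those stated in the text; the function name, the (id1, id2, kinship, ibs0) tuple layout and the inclusive treatment of the kinship range endpoints are assumptions for illustration.

```python
def select_sibling_pairs(pairs, kin_lo=0.177, kin_hi=0.354, ibs0_min=0.00112):
    """Keep first-degree pairs whose IBS0 exceeds the parent-offspring cutoff."""
    siblings = []
    for id1, id2, kinship, ibs0 in pairs:
        first_degree = kin_lo <= kinship <= kin_hi
        # Parent-offspring pairs share an allele at every site, so IBS0 is near 0.
        if first_degree and ibs0 > ibs0_min:
            siblings.append((id1, id2))
    return siblings

# Toy example with hypothetical values: one sibling pair, one
# parent-offspring pair and one more distant pair.
toy_siblings = select_sibling_pairs([
    ("a", "b", 0.25, 0.004),
    ("c", "d", 0.25, 0.0001),
    ("e", "f", 0.10, 0.010),
])
```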

Quantitative trait QC

Initial trait selection

We considered an initial list of 408 non-procedural continuous traits from the UKB categories of biomarkers, physical measures, the touchscreen questionnaire and cognitive function to provide a broad group of well-studied traits. We removed three traits related to lifetime reproductive success used as outcomes in the selection analyses to avoid circularity, as well as three predicted traits that had directly measured alternatives. To minimize non-genetic causes of trait values, we then removed all individuals with a cancer diagnosis (34,698) and those on insulin treatment (1,311) or who were pregnant at baseline (109). We also removed 64,043 individuals on statins and other cholesterol-lowering drugs from the blood count and blood biochemistry subcategories and 29,395 individuals on heart rate medication from the arterial stiffness and blood pressure subcategories. Next, to ensure that extreme outliers did not affect results, we removed all samples that had trait values six standard deviations or more from the mean. After removal of these samples, we removed traits for which the top two modal values comprised more than half the samples and traits with fewer than 100,000 samples remaining in our primary dataset of unrelated individuals of European ancestry, resulting in a set of 141 quantitative traits. These traits were then residualized and subjected to further QC.
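The outlier and trait-level filters described above can be sketched as follows. The function names are hypothetical, and the use of the population standard deviation (rather than the sample estimate) is an assumption; the six-standard-deviation cutoff, the 100,000-sample minimum and the top-two-modal-values criterion are taken from the text.

```python
from collections import Counter
from statistics import mean, pstdev

def remove_outliers(values, n_sd=6.0):
    """Drop values lying n_sd or more standard deviations from the mean."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return list(values)  # constant trait: nothing to remove
    return [v for v in values if abs(v - mu) < n_sd * sd]

def passes_trait_filters(values, min_samples=100_000, max_top2_fraction=0.5):
    """Check the sample-size and top-two-modal-value criteria for one trait."""
    top2 = sum(count for _, count in Counter(values).most_common(2))
    return len(values) >= min_samples and top2 <= max_top2_fraction * len(values)

# Toy example: a single extreme value among 99 identical values is removed.
cleaned = remove_outliers([0.0] * 99 + [100.0])
```

In this sketch the outlier step runs first, and the trait-level filters are then evaluated on the cleaned values, matching the order given in the text.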

Trait residualization

To minimize the impact of environmental risk factors, we residualized trait values within each ancestry group (European and multiancestry) by applying a linear regression model to each trait, with covariates for age, age², age³, sex (as reported by the UK National Health Service at birth, unless since revised by the participant), menopause status (among females), type II diabetes status, coronary artery disease status, alcohol intake (field ID: 1558), smoking (categorical, field ID: 20116; continuous, field ID: 1239), batch, centre and the first 40 genetic principal components. At this stage, traits whose residuals had an absolute skew greater than 2 in the larger European ancestry group were removed to mitigate measurement bias, and the remaining trait values were standardized by applying a rank inverse normal transform to the residuals. The inverse normal transformation introduces a conservative bias in the POPout test, but it was performed to prevent inflation of the test statistic in the linear regression framework of the tests; the bias was expected to be small because highly skewed traits had been removed and the transform was applied to trait residuals. This provided traits adjusted for key environmental factors and genetic principal components, with relatively low skew, and transformed to have standard normal distributions for all subsequent analyses.
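A rank inverse normal transform of the kind described above can be sketched as follows. The text does not state which rank offset was used; the Blom offset of 3/8 and the averaging of ranks across ties are assumptions, and the function name is hypothetical.

```python
from statistics import NormalDist

def rank_inverse_normal(values, offset=0.375):
    """Map values to normal quantiles by rank, averaging ranks across ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied values starting at sorted position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # Assumption: Blom-type formula (rank - c) / (n - 2c + 1) with c = 3/8.
    return [std_normal.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]

# Toy example: the output preserves the ordering of the input residuals.
z = rank_inverse_normal([3.0, 1.0, 2.0])
```

Because the transform depends only on ranks, the output is symmetric about zero regardless of the skew of the input residuals.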
