Sample description and DNA sequencing
In this study, we generated new whole-genome sequencing (WGS) data from 128 Indigenous American individuals representing 45 populations and 28 language families across 8 Latin American countries (Extended Data Fig. 1 and Supplementary Note 1). Whole genomes were sequenced at the Beijing Genomics Institute (China) and Dasa Genômica (Brazil) using BGISEQ-500 and Illumina NovaSeq 6000, with an average sequencing depth of about 44×.
Ethical approval for sample collection was granted by local ethics committees in each country: Argentina (Puerto Madryn Zonal Hospital, resolution no. 009/2015; San Carlos de Bariloche Zonal Hospital, resolution no. 1510/2015), Brazil (CONEP, resolution nos. 763, 4599, 3828655, 7107656 and 8273857), Bolivia (Universidad Mayor de San Andrés), Ecuador (Universidad de Las Américas; Consejo Nacional de Ciencia y Tecnología—CONACyT, grant no. 69856; Instituto Nacional de Ciencias Médicas y de la Nutrición Salvador Zubirán, refs. 15,18), Mexico (CONACyT grant no. 69856; Instituto Nacional de Ciencias Médicas y de la Nutrición Salvador Zubirán, refs. 15,18; CNIC Salud 2013-01-201471; Committee of Ethics and Research, UADY, notice F-FENC-SAC-14/REV: 04, registry no. 09/17) and Peru (Universidad San Martín de Porres). Written informed consent was obtained from all participants before sample collection. Logistical support in Brazil was provided by the Fundação Nacional do Índio. All sampling adhered to the Declaration of Helsinki and the relevant national laws and regulations at the time (Supplementary Note 10).
Read mapping, variant calling and annotation
Whole-genome sequence data preprocessing, variant calling and annotation were performed using the Sarek v.3.5.0 pipeline69. Specifically, sequence data in FASTQ format were aligned to the GRCh38 reference genome and preprocessed according to the GATK Best Practices for germline variant discovery and joint variant calling. Variants in the joint cohort variant call format were then normalized and annotated using the Ensembl Variant Effect Predictor (VEP)70 v.113, incorporating several annotation sources. Annotations from dbSNP, ClinVar and more custom annotations were retrieved using SnpSift and VEP plugins. Ancestral alleles were filtered using the VEP Ancestral Allele plugin to improve the specificity of downstream population genetic analyses. New SNVs were identified by comparing their positions and alleles with those of the variants in public databases (1KGP18, HGDP13, gnomAD19 and dbSNP20). Variants absent from the dataset of more than 270,000 individuals were classified as new (Supplementary Note 2).
Site frequency spectra
We estimated the number of segregating single-nucleotide polymorphisms (SNPs) as a function of sample size using a rarefaction approach on the basis of the site frequency spectrum (SFS). Alternate allele counts were computed for biallelic sites and were used to construct the SFS using Scikit-Allel v.1.3.13. To account for different sample sizes and normalize the comparisons, the folded SFS was projected onto smaller sample sizes using a hypergeometric downsampling method.
Dataset assembly and quality control
Genomic coordinates of the newly sequenced individuals were mapped to the hg38 reference genome. The 128 newly generated genomes were merged with the following publicly available WGS databases (Supplementary Note 1): (1) 1KGP High Coverage, (2) HGDP and (3) SGDP. Sites and individuals with more than 5% missing data were eliminated, biallelic SNVs were selected, and positions with significant deviations (P < 10−8) from the Hardy–Weinberg equilibrium expectations were excluded. Ambiguous positions (A-T and C-G) were also removed. The resulting dataset comprised 199 Indigenous American individuals from 31 language families and 53 ethnic groups. It includes 5,308,880 biallelic SNPs and 3,710 individuals from 201 populations worldwide.
After the initial quality assessment, linkage disequilibrium pruning was performed with ‘SNPRelate’ v.1.28.0 R package71 to exclude markers exhibiting a pairwise correlation greater than 20% (r2 > 0.2) in a 50-kb sliding window, advancing in 10-kb steps. This procedure yielded an linkage-disequilibrium-pruned dataset for downstream analyses that required an independent set of markers (for example, PCA and ADMIXTURE).
... continue reading