Lethal plague outbreaks in Lake Baikal hunter-gatherers 5,500 years ago

Laboratory work

Ancient DNA was extracted from the dental cementum of molar or premolar teeth from archaeological skeletal remains studied by the Baikal Archaeology Project. Sampling for ancient DNA (aDNA) was undertaken in dedicated clean laboratory facilities at the Lundbeck Foundation Centre for GeoGenetics (Copenhagen) and at the Institute of Archaeology, University College London (London). Cementum was isolated specifically from the roots of teeth67 using a sterilized handheld rotary saw, and pulverized prior to demineralization and enzymatic digestion. Each sampled aliquot comprised approximately 50–100 mg of material. Extraction, purification and library preparation of aDNA for shotgun sequencing followed the approach described in Allentoft et al.68, using a double-stranded library protocol following Margaryan et al.69 in the first instance, and the ‘Santa Cruz Reaction’ single-stranded library protocol70 for samples with low template DNA content. Library concentrations were measured using an Agilent Fragment Analyzer, and libraries were pooled at equimolar concentration for sequencing on Illumina NovaSeq 6000 S4 flowcells (100 bp paired-end reads) at the GeoGenetics Sequencing Core (Copenhagen). All samples were initially screened without partial Uracil-DNA Glycosylase (UDG) treatment; in some cases, additional libraries were subsequently built with UDG treatment (following ref. 71).
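The arithmetic behind equimolar pooling can be sketched as follows. This is an illustrative calculation, not the protocol used in the study: it assumes the standard approximation of 660 g/mol per base pair for double-stranded DNA, and the function names, library names and input values are hypothetical.

```python
# Minimal sketch of equimolar pooling arithmetic.
# Assumption: double-stranded DNA averages about 660 g/mol per base pair.

def molarity_nM(conc_ng_per_ul, mean_fragment_bp):
    """Convert a mass concentration (ng/uL) and mean fragment length (bp) to nM."""
    return conc_ng_per_ul / (660.0 * mean_fragment_bp) * 1e6


def pooling_volumes_ul(libraries, fmol_per_library):
    """Return the volume (uL) of each library that contributes the same number of fmol.

    `libraries` maps a library name to (concentration in ng/uL, mean fragment length in bp).
    Because 1 nM equals 1 fmol/uL, volume = fmol / nM.
    """
    volumes = {}
    for name, (conc, bp) in libraries.items():
        nm = molarity_nM(conc, bp)
        if nm <= 0:
            raise ValueError(f"library {name!r} has no measurable DNA")
        volumes[name] = fmol_per_library / nm
    return volumes


# Hypothetical example: two libraries with the same fragment length.
example = pooling_volumes_ul({"lib_A": (10.0, 200), "lib_B": (5.0, 200)}, 10.0)
```

In this example the more dilute library contributes twice the volume, so both add the same number of molecules to the pool.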

For libraries where screening sequencing indicated the presence of Y. pestis DNA, in-solution capture enrichment was undertaken. Hybridization capture was performed using the Arbor Biosciences myBaits kit following Wagner et al.72, using the manufacturer’s High Sensitivity protocol, but with only a single round of enrichment. Pooled libraries from the capture reactions were then re-amplified for 16 cycles and sequenced on the same platform as above.

Preliminary bioinformatics

Following base-calling of Illumina data using CASAVA (v.1.8.2)73, adapter sequences and polyN tails were trimmed from demultiplexed fastq files using AdapterRemoval (v.2.0). Reads were aligned to the human reference genome GRCh38 using bwa aln (v.0.7.18)74 (reference genome hg19 was also used for hapRoH analysis, see below). Aligned reads were converted to BAM files, merged across libraries at sample level, sorted, filtered and indexed using Samtools (v.1.21)75, and duplicates were then identified using MarkDuplicates from Picard (v.2.18.7), with the following options in place: ‘OPTICAL_DUPLICATE_PIXEL_DISTANCE = 12000 REMOVE_DUPLICATES = false TAGGING_POLICY = All VALIDATION_STRINGENCY = LENIENT’. Duplicate reads, together with reads with a mapping quality of <30, were then filtered out using Samtools. Summary statistics for sequencing depth and coverage were generated using BEDtools (v.2.23.0)76 and pysam (https://github.com/pysam-developers/pysam). Estimation of human DNA contamination and damage patterns was performed at a library level, using contamMix77, ANGSD (v.0.940)78, and mapDamage2.079.
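The final filtering step (dropping marked duplicates and reads with mapping quality below 30) can be illustrated on plain SAM text. This is a minimal sketch of the logic only, not the command used in the study; the exact Samtools invocation is not given here. It relies on the SAM convention that duplicate-marking tools set flag bit 0x400, and the function names and example records are hypothetical.

```python
# Minimal sketch: filter SAM records by duplicate flag and mapping quality.
# Assumption: duplicates were marked with the standard SAM flag bit 0x400.

DUPLICATE_FLAG = 0x400
MIN_MAPQ = 30


def keep_read(sam_line, min_mapq=MIN_MAPQ):
    """Return True if an alignment line is not a marked duplicate and has MAPQ >= min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    if flag & DUPLICATE_FLAG:
        return False
    return mapq >= min_mapq


def filter_sam(lines, min_mapq=MIN_MAPQ):
    """Yield header lines unchanged, plus alignment lines that pass both filters."""
    for line in lines:
        if line.startswith("@") or keep_read(line, min_mapq):
            yield line


# Hypothetical records: a good read, a marked duplicate, and a low-MAPQ read.
records = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t1024\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t16\tchr1\t200\t12\t4M\t*\t0\t0\tACGT\tIIII",
]
kept = list(filter_sam(records))
```

Only the header and `read1` survive in this example; `read2` is removed as a duplicate and `read3` for low mapping quality.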

Human DNA analysis

Chromosomal sex was inferred based on the ratio of Y and X chromosome aligned reads, following existing confidence intervals80. Chromosomal aneuploidies were not detected. Mitochondrial haplogroups were assigned using haplogrep (v.2.4.0)81 following annotation of variants using mutserve (v.1.3.0)82. Y chromosomal haplogroups were assigned following the approach in ref. 11.
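A ratio-based sex assignment of this kind can be sketched as below. The statistic and the thresholds are assumptions: the sketch uses the proportion of Y-aligned reads among all reads aligned to the sex chromosomes with a normal-approximation 95% confidence interval, and cut-offs of 0.016 and 0.075 that are commonly used in the aDNA literature. Whether ref. 80 uses exactly these values is not stated in the text, and the function name is hypothetical.

```python
import math

# Minimal sketch of sex inference from X- and Y-aligned read counts.
# Assumptions (not confirmed by the text): R_y = n_y / (n_x + n_y),
# a normal-approximation 95% CI, and thresholds of 0.016 and 0.075.


def infer_sex(n_x, n_y, female_max=0.016, male_min=0.075):
    """Return (call, r_y, (lower, upper)) from counts of X- and Y-aligned reads."""
    total = n_x + n_y
    if total == 0:
        raise ValueError("no reads aligned to the sex chromosomes")
    r_y = n_y / total
    se = math.sqrt(r_y * (1.0 - r_y) / total)
    lower, upper = r_y - 1.96 * se, r_y + 1.96 * se
    if upper < female_max:
        call = "XX"
    elif lower > male_min:
        call = "XY"
    else:
        call = "undetermined"  # interval overlaps the ambiguous zone
    return call, r_y, (lower, upper)
```

With few reads the interval is wide, so low-coverage samples tend to fall into the undetermined category rather than receiving a forced call.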

For exploratory analysis of ancestry through principal component analysis (PCA), pseudohaploid genotypes were called by randomly selecting a variant from a pileup generated with Samtools. Samples were then projected into the principal component space obtained by running smartpca83 on 2,086,279 SNPs (filtered for transversions only and with minor allele frequency >0.1%) from a reference panel of ancient Eurasian populations68. The reference panel was lifted over from hg19 using hgLiftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver), and the effects of liftover were evaluated (Extended Data Fig. 2). All reported samples were included in the PCA by projection.
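Pseudohaploid calling and the transversion-only filter can be illustrated with a short sketch. It assumes biallelic SNPs with known reference and alternative alleles and a pileup given as a plain string of observed bases; the function names and this input format are hypothetical simplifications, not the pipeline used in the study.

```python
import random

# Minimal sketch: transversion filter and pseudohaploid genotype calling.
# Assumption: each site is a biallelic SNP with known ref and alt alleles.

# Transversions swap a purine (A, G) with a pyrimidine (C, T).
TRANSVERSIONS = {frozenset("AC"), frozenset("AT"), frozenset("GC"), frozenset("GT")}


def is_transversion(ref, alt):
    """Return True if the ref/alt pair is a transversion (not A<->G or C<->T)."""
    return frozenset((ref.upper(), alt.upper())) in TRANSVERSIONS


def pseudohaploid_call(pileup_bases, ref, alt, rng):
    """Sample one observed base at random.

    Returns 0 if the sampled base is the reference allele, 1 if it is the
    alternative allele, and None if no base matching either allele was observed.
    """
    usable = [b for b in pileup_bases.upper() if b in (ref, alt)]
    if not usable:
        return None
    return 0 if rng.choice(usable) == ref else 1


# Hypothetical usage with a seeded generator for reproducibility.
rng = random.Random(42)
call = pseudohaploid_call("AACA", "A", "C", rng)
```

Sampling a single base per site avoids a bias towards the reference allele at low coverage, at the cost of discarding heterozygosity information.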

Diploid genotypes were called using bcftools (v.1.21). For the analysis of IBD segment sharing, missing diploid genotypes were imputed using GLIMPSE84 (for samples with a minimum autosomal genome coverage of 0.1×) following the approach in ref. 68. IBD segments were then called using IBDseq (v.r1206)85, followed by genetic clustering by IBD68.

Runs of homozygosity were detected in two ways: from homozygous-by-descent segments obtained from IBDseq, and, using hapRoH (v.1)86, from pseudohaploid data restricted to the 1240k SNP positions. Biological kinship was inferred using KIN35, initially running KINgaroo on filtered BAM files targeting the 2,086,279 SNPs described above, and was then validated based on IBD sharing inferred from IBDseq. Pedigrees were then reconstructed taking into account the resulting log-likelihood estimates for kinship scenarios, uniparental haplotypes, sex and age-at-death (Supplementary Note 2).
