Ethics and inclusion statement
Permission to access the ancient DNA in the human remains in this study was granted by the archaeological teams that led the excavations, from the Shaanxi Academy of Archaeology, the Institute of Archaeology (Chinese Academy of Social Sciences) and the Archaeology Institute of the National Museum of China. The Institutional Review Board of the Institute of Vertebrate Paleontology and Paleoanthropology, Chinese Academy of Sciences, provided further monitoring and permission for the sampling of ancient humans in this research. All the work was done in collaboration with local archaeologists, who were named co-authors for their contributions to the collection of material and archaeological information, such as on-site photographs, classification of high-level tombs and/or discussions that contributed to the associations derived from the archaeological research cited in this study. All wet laboratory work and data analysis were performed with equipment from the Molecular Paleontology Laboratory, Institute of Vertebrate Paleontology and Paleoanthropology, Chinese Academy of Sciences.
Ancient DNA experiments and sequencing
We sampled and sequenced 207 human remains from Shaanxi and Shanxi provinces, China, among which 169 individuals were analysed in this study (Supplementary Table 1). All extraction, sequencing and data processing of ancient human samples were carried out in dedicated laboratories at the Institute of Vertebrate Paleontology and Paleoanthropology, Chinese Academy of Sciences, in Beijing. Following standard protocols47, DNA was extracted from each sample from less than 100 mg of bone powder, obtained through drilling. We prepared double-stranded libraries (denoted ‘DS’ in Supplementary Table 1) for 134 samples using a partial uracil-DNA glycosylase treatment protocol (denoted ‘half UDG’)48,49 (Supplementary Table 1). For 35 samples, we prepared single-stranded libraries (denoted ‘SS’, Supplementary Table 1) with full UDG treatment (4 samples; denoted ‘UDG’, Supplementary Table 1) or no UDG treatment (31 samples; denoted ‘No’ in Supplementary Table 1). To collect enough DNA for capture, libraries were amplified for 35 cycles using the AccuPrime Pfx polymerase. We then evaluated the amount of DNA extracted per sample using a Thermo Scientific NanoDrop 2000 spectrophotometer. We applied a capture strategy to both mitochondrial and nuclear DNA. For mitochondrial DNA (mtDNA), we used oligonucleotide probes synthesized from the complete human mitochondrial genome50; for nuclear DNA, we applied oligonucleotide probes targeting 1.2 million SNPs (the ‘1,240k’ SNP panel). After enrichment, sequencing was performed on an Illumina MiSeq sequencing platform to generate 2 × 76 base pairs (bp) paired-end reads for the mtDNA, and on an Illumina HiSeq 4000 sequencing platform to generate 2 × 100 bp and 2 × 150 bp paired-end reads.
Read alignment and variant calling
We used leeHom51 to trim adaptors and merge paired-end reads into a single sequence (minimum overlap of 11 bp), keeping only merged reads with a length of at least 30 bp. Reads were aligned with BWA (v.0.5.10)52 using the bam2bam command with default parameters, except for samples with no UDG treatment, for which we used the parameters -n 0.01, -l 16500 and -o 2. We aligned the mtDNA reads to the revised Cambridge Reference Sequence53 and the nuclear DNA reads to the human reference genome hg19 (ref. 54). Duplicate reads with the same orientation, start and end positions were removed, and reads with a minimum mapping quality score of 30 were kept for analysis. The frequency of terminal C-to-T misincorporations was used to validate ancient DNA sequences, and contamination rates were estimated on the basis of two approaches. For all the individuals, we applied ContamMix55 to compare the mtDNA fragments of each individual against the new consensus mitochondrial genome and a panel of present-day sequences50. To minimize the impact of damaged bases, we ignored the first and last five positions of the fragments during estimation. We treated the libraries as contaminated if the estimated contamination rate was greater than 5%. Contamination rates for men were also estimated using ANGSD56, leveraging the fact that men carry a single copy of the X chromosome, and verified using HapCon57, which improves performance on low-coverage data. To retain enough individuals for further analysis, for 12 individuals with contamination above 5% (‘b’ annotated in the column of SNP number, Supplementary Table 1), we restricted our analysis to damaged fragments with characteristic ancient DNA substitutions. The damaged fragments were obtained with pmdtools v.0.60 (ref. 58) using the --customterminus parameter, keeping fragments with at least one C → T substitution in the first three positions at either end.
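The damage-restriction step applied to the 12 contaminated individuals can be sketched as follows. This is a hypothetical reimplementation of the filtering criterion, not the actual pipeline code (the study used pmdtools v.0.60 with --customterminus); `has_terminal_ct` is an illustrative helper name.

```python
def has_terminal_ct(read_seq, ref_seq, window=3):
    """Return True if the aligned read carries at least one C->T mismatch
    relative to the reference within `window` bases of either end,
    the criterion used here to retain putatively ancient fragments."""
    n = len(read_seq)
    # Positions in the first and last `window` bases of the fragment.
    terminal = set(range(min(window, n))) | set(range(max(n - window, 0), n))
    return any(ref_seq[i] == "C" and read_seq[i] == "T" for i in terminal)


# Example: a C->T change at the 5' terminus passes the filter,
# whereas an undamaged fragment is discarded.
assert has_terminal_ct("TACGATTC", "CACGATTC") is True
assert has_terminal_ct("CACGATTC", "CACGATTC") is False
```

In practice the filter operates on aligned BAM records rather than bare strings, but the retained information is the same: only fragments bearing terminal deamination-like substitutions are kept for the downstream analyses.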
To eliminate the potential bias caused by terminal deaminated cytosines, we masked 2 bp at each end of the mapped reads for all double-stranded libraries with half-UDG treatment, and 5 bp at each end for all single-stranded libraries with no UDG treatment. To generate pseudo-haploid genotypes, a single allele was randomly sampled at each SNP to represent the individual. During genotyping, the first and last 5 or 2 positions of the fragments were ignored for non-UDG-treated and UDG-treated libraries, respectively, and 13 poorly covered samples (with fewer than 27,000 SNPs) were removed.
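The pseudo-haploid calling described above can be illustrated with a minimal sketch. The function name and interface are hypothetical; the actual pipeline samples a random read from the BAM at each 1,240k site, but the logic is equivalent.

```python
import random


def pseudo_haploid_call(alleles, seed=None):
    """Draw a single allele at random from the read-supported alleles
    covering one SNP, producing a pseudo-haploid genotype.
    Returns None when the site is not covered in this individual."""
    if not alleles:
        return None
    rng = random.Random(seed)
    return rng.choice(alleles)


# A heterozygous pileup yields one of the two observed alleles at random;
# a homozygous pileup always yields that allele.
assert pseudo_haploid_call(["A", "G"], seed=0) in ("A", "G")
assert pseudo_haploid_call(["C", "C", "C"]) == "C"
assert pseudo_haploid_call([]) is None
```

Pseudo-haploid calls avoid spurious heterozygosity at the low coverages typical of ancient samples, at the cost of discarding within-individual allele-frequency information.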
Uniparental haplogroup identification
Mitochondrial sequences for each individual were mapped to the revised Cambridge Reference Sequence59. We only kept reads of a minimum of 30 bp in length and with a minimum mapping quality of 30. Haplogroups for each individual were called using HaploGrep2 (ref. 60) based on PhyloTree Build v.17 (ref. 59). We also confirmed all the haplogroups using the phylogenetic tree constructed with mtphyl v.5.003 and found that two individuals with an R# haplogroup (R + 16189)27 were assigned to the subclade of haplogroup B4c1a owing to the 9-bp deletion (8281–8289). Similarly, four individuals with B haplogroups were assigned to the ancestral haplogroup with the best fit (that is, B4a, B4b1 and B4c1b). For the male individuals, Y-chromosome haplogroups were determined by identifying the assigned position in the phylogenetic tree on the basis of the International Society of Genetic Genealogy dataset version 9.77 (www.isogg.org/tree). In cases in which the most derived allele defining the Y-chromosome haplogroup was a C to T or G to A substitution, indicative of possible deamination, at least two derived alleles were required to assign the Y-chromosome haplogroup. Otherwise, the haplogroup of the tested individual was assigned to the ancestral haplogroup. When the subclade of the haplogroup assignment could not be determined, the haplogroup of the individual was assigned to the most recent ancestral haplogroup it best fit (for example, NO).
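The deamination safeguard for Y-chromosome haplogroup assignment can be sketched as a simple rule. This is an illustrative simplification, not the study's code: each observed defining SNP is represented as an (ancestral, derived) base pair, and assignments supported only by damage-like changes require corroboration.

```python
def accept_haplogroup(derived_calls):
    """Apply the deamination safeguard described above: if every observed
    derived allele at the defining SNPs is a C->T or G->A change (which
    could reflect post-mortem damage rather than a true mutation), require
    at least two such derived alleles; otherwise a single call suffices.
    `derived_calls` is a list of (ancestral_base, derived_base) tuples."""
    if not derived_calls:
        return False
    damage_like = all(pair in (("C", "T"), ("G", "A")) for pair in derived_calls)
    return len(derived_calls) >= 2 if damage_like else True


# A lone C->T defining allele is insufficient; two damage-like calls,
# or one transversion-backed call, allow the assignment.
assert accept_haplogroup([("C", "T")]) is False
assert accept_haplogroup([("C", "T"), ("G", "A")]) is True
assert accept_haplogroup([("A", "G")]) is True
```

When the rule rejects an assignment, the individual falls back to the ancestral haplogroup, as described above.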
Population structure analysis
We conducted a principal component analysis (PCA) with smartpca in the EIGENSOFT package61. To calculate the principal components, we used 82 present-day populations from the Affymetrix Human Origins dataset62. We merged newly sequenced and published ancient individuals with the Human Origins dataset and projected them using the following program settings: ‘lsqproject: YES’, ‘numoutlieriter: 0’ and ‘shrinkmode: YES’. Newly sequenced or previously published ancient individuals were projected onto the principal components calculated from present-day Eurasians (Fig. 1c) or only the East Asians (Fig. 1d). We estimated individual ancestries by model-based maximum likelihood clustering using ADMIXTURE63. We used 44 of the 82 populations used in the PCA, along with 10 present-day Han and Tibetan populations from ref. 64. Before the admixture analysis, we pruned genotypes with high linkage disequilibrium (r2 > 0.4) using PLINK (v.1.90)65 with the parameters ‘--indep-pairwise 200 25 0.4’ for SNP filtering, leaving 597,573 SNPs. ADMIXTURE analysis was conducted with K from 2 to 10. For each K, we ran the analyses ten times with different seeds to estimate the cross-validation error, and the best K was determined according to the lowest cross-validation error.
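The model-selection step over K can be sketched as follows. The aggregation of replicate runs by their mean cross-validation error is an assumption for illustration (the text above specifies only that the lowest cross-validation error determined the best K), and the numeric values are invented, not study data.

```python
def best_k(cv_errors):
    """Given a dict mapping each ADMIXTURE K to the cross-validation
    errors from its replicate runs (ten seeds per K in this study),
    return the K with the lowest mean cross-validation error."""
    return min(cv_errors, key=lambda k: sum(cv_errors[k]) / len(cv_errors[k]))


# Illustrative (invented) CV errors for K = 2..4, two replicates each:
cv = {2: [0.510, 0.508], 3: [0.476, 0.478], 4: [0.492, 0.495]}
assert best_k(cv) == 3
```

In practice the per-run errors are parsed from the ADMIXTURE log output (lines of the form ‘CV error (K=3): …’) before this comparison is made.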