A sorghum pangenome reference improves global crop trait discovery

Plant material preparation and nucleic acid extractions

Young seedlings were grown in flat pans under healthy, pest-free and disease-free conditions until the first fully developed leaves emerged. To optimize DNA yield and quality, seedlings were dark-treated for 24–30 h under moist conditions. Tissue was collected by hand in small batches, cutting half an inch above the soil surface and immediately flash-freezing in liquid nitrogen within 1 min of excision. Approximately 50 g of tissue was collected in this manner, stored in pre-labelled freezer-quality ziplock bags at –80 °C and kept frozen until extraction. DNA was extracted from young tissue using a previously published protocol54 with minor modifications. Flash-frozen young leaves were ground to a fine powder in a frozen mortar with liquid nitrogen followed by gentle extraction in 2% CTAB buffer (including proteinase K, PVP-40 and β-mercaptoethanol) for 30 min to 1 h at 50 °C. After centrifugation, the supernatant was gently extracted twice with 24:1 chloroform and isoamyl alcohol. The aqueous phase was transferred to a new tube and one-tenth volume 3 M sodium acetate was added, gently mixed and DNA was precipitated with isopropanol. DNA precipitate was collected by centrifugation, washed with 70% ethanol, air dried for 5–10 min and dissolved thoroughly in an elution buffer at room temperature followed by RNAse treatment. DNA purity was measured with Nanodrop, and the DNA concentration was measured using a Qubit HS kit (Invitrogen). DNA size was validated using a Femto Pulse system (Agilent).

We also grew three biological replicates of the reference genotypes (see Supplementary Data 12 for phenotype data and Supplementary Data 13 for original photographs without the backgrounds removed) and sampled them at key developmental stages and organ types for transcriptome assembly and genome annotation (sample and data collection were performed by the team at Donald Danforth Plant Science Center in 2020). For the 29 diverse pangenome reference members and BTx623 (excluding, BTx642, Wray and RTx430), tissues were collected from greenhouse-grown sorghum plants maintained under controlled environmental conditions. For each replicate, tissues from multiple plants were pooled when individual tissue quantity was insufficient. Six tissue types were represented in the samples. (1) Approximately half-inch sections of stem tissue were excised from each internode and nodal region of the same stem at 60 days after planting (DAP), immediately frozen in liquid nitrogen and pooled to generate composite stem samples. (2) At 28 DAP, the youngest fully expanded leaf was collected and subdivided into tip, base, midrib and sheath segments, which were pooled to represent the composite leaf sample. (3) Entire primary and crown root systems were harvested at 28 DAP, gently rinsed with water to remove adhering soil and pooled per replicate. (4) Developing inflorescences were collected at both heading and anthesis to capture pre-pollination and post-pollination reproductive stages. Whole inflorescences were excised, flash-frozen and pooled across plants. (5) Developing grain was sampled at 5 DAP, 20 DAP and at the black layer stage to represent early, mid and late seed development, respectively. (6) Whole seedlings were collected at 12 DAP under well-watered (100% field capacity (FC)) and water-stressed (45% FC) conditions. Water stress was initiated 5 DAP, and both shoot and root tissues were harvested and pooled at 12 DAP. In addition to these developmental-stage collections, time-course sampling was conducted for two representative genotypes, PI660565 and PI276816, to capture early vegetative dynamics. For each line, three biological replicates were collected at 10, 17, 21, 25 and 31 DAP. At each time point, leaf and stem tissues were collected following the same procedures described above. Immediately after collection, all tissues were flash-frozen in liquid nitrogen and stored at –80 °C until RNA extraction. Tissue collection for Wray and for BTx642 and RTx430 (as previously detailed26) generally followed the above protocol, but with some differences in the types of tissues.

Genome assembly

Long-read sequencing technology, as used in the construction of our pangenome reference assemblies, has improved the resolution of repetitive regions and closed gaps in other complex genomes55,56. Although the quality and contiguity of our pangenome assemblies are consistent, the input data depth, sequencing technology, bioinformatic methods available and sequencing read characteristics necessitated some differences in computational genome assembly strategy. Here we provide a summary of the basic methods (see Supplementary Note 2 for a full description of genome assembly methods for each reference genome).

The majority of the pangenome sequences were generated by assembling PACBIO CLR reads with CANU (v.1.8)57, whereas five accessions (PI156178, RTx430, BTx642 and PI565121) were assembled with MECAT (v.1.4)58, and one accession (Wray) was generated using PACBIO HiFi data assembled with HiFiAsm+HIC (v.16.1)59. All sequencing was conducted at the HudsonAlpha Institute for Biotechnology and the Department of Energy (DOE) Joint Genome Institute. All assemblies were subsequently polished using ARROW (v.2.1)60, except BTx623 and Wray, which were polished using RACON (v.1.14)61. A combination of 32,400 syntenic markers (unique, non-repetitive, nonoverlapping 1.0-kb sequences from BTx642) and 12,641 annotated genes from BTx623 were used to identify misjoins in the assembly. Contigs were then ordered, oriented and assembled into ten chromosomes after examining syntenic marker and gene alignment positions on the polished assemblies. Each chromosome join was padded with 10,000 Ns. Contigs terminating in significant telomeric sequence were identified using the (TTTAGGG)n repeat, and care was taken to make sure that they were properly oriented in the production assembly. Chromosomes were numbered using BTx623 V3 naming convention. The remaining scaffolds were screened against bacterial proteins, organelle sequences, and GenBank non-redundant datasets and removed if found to be a contaminant. Chloroplast and mitochondrial genomes were generated using the OatK pipeline (v.1.0)62. Homozygous SNPs and indels were corrected to match the consensus call from Illumina fragment reads (2 × 150, 400 bp insert) by aligning the reads using bwa-mem (v.0.7.17-r1188)63 and identifying homozygous SNPs and indels with the UnifiedGenotyper tool in GATK (v.3.6-0-g89b7209)64.

Genome annotation

Genome annotation was performed using the pipeline developed by the DOE Joint Genome Institute and Phytozome65. As methodological considerations can substantially affect estimates of protein-coding gene PAVs30, each pangenome member underwent two rounds of gene prediction. In the first round, gene models were independently predicted for each genome followed by the propagation of these predictions across the entire pangenome in the second round.

In the initial round, transcript assemblies were generated from 2 × 150 bp stranded paired-end Illumina RNA sequencing reads (Supplementary Table 2) using PERTRAN (as previously described in detail66) for genome-guided assembly via GSNAP (v.2013-09-30)67, followed by splice graph construction, alignment validation and correction. PacBio Iso-Seq CCS reads, when available, were corrected and collapsed using a GMAP-based pipeline67 to refine alignments, correct splice junctions and cluster alignments when their intron (or introns) matches if spliced or 95% similar if it was a single exon. Final transcript assemblies were produced using PASA (v.2.0.2)68, integrating RNA sequencing assemblies, corrected CCS reads and Sanger ESTs. A species-specific repeat library was built from BTx623 V3 using RepeatModeler (v.open1.0.11)69. Repeats were functionally analysed with InterProScan (v.5.47-82.0)70, incorporating Pfam71 and PANTHER72 databases, and those with significant hits to protein-coding domains were excluded. Genomes were soft-masked using RepeatMasker (v.open1.0.11)69 with the curated repeat library. Gene loci were identified using transcript assembly and EXONERATE (v.2.4.0)73 alignments of proteins from Arabidopsis, soybean, poplar, Aquilegia, grape, rice, Setaria viridis, Brachypodium, Panicum hallii, pineapple, Acorus americanus and Swiss-Prot (2020_06) to the repeat soft-masked genomes, with up to 2,000-bp extension on both ends unless extending into another locus on the same strand. Gene models were predicted using FGENESH+ (v.3.1.0)74, FGENESH_EST, EXONERATE, PASA assembly ORFs and AUGUSTUS (v.3.1.0)75 trained on high-confidence PASA assembly open reading frames with intron hints from RNA sequencing alignments. Candidate models were selected based on EST and protein support and penalized for repeat overlaps. PASA refinement added UTRs, corrected splicing and incorporated alternative isoforms. Cscore (BLASTP score ratio and protein coverage) was computed for PASA-refined proteins. Transcripts were retained if Cscore and coverage was ≥0.5 or supported by ESTs; those with >20% CDS-repeat overlap required a Cscore ≥ 0.9 and ≥ 70% coverage. Models with >30% TE domain overlap (Pfam) were excluded.

In the second round, each genome was hard-masked with its high-confidence gene models (supported by transcriptome and homology evidence). BLASTX and EXONERATE were used to predict new gene models by projecting high-confidence models from other pangenome members onto the hard-masked assemblies. Predicted models were retained if they showed stronger homology support than existing models and were not contradicted by transcript evidence, or if no first-round model existed at that locus. Incomplete gene models, which had low homology support without full transcriptome support, or short single exon genes (<300 bp CDS) without protein domain or good expression were manually excluded.

... continue reading