Genomes
Genome assemblies for each of the 138 Saccharomycetaceae species, 4 outgroup species (W. anomalus, C. albicans, P. kudriavzevii and Y. lipolytica), Saccharomycodaceae and Mucoromycota were downloaded from the National Center for Biotechnology Information (NCBI) (Supplementary Data 5). Two additional Mucoromycota genomes were downloaded from the JGI (Supplementary Data 5). To explore S. cerevisiae intraspecies centromere variation, 2,737 assemblies were downloaded from the NCBI and from the supplemental data of 4 studies48,49,50,51 (Supplementary Data 6). As some strain backgrounds (for example, S288c) are overrepresented in this dataset, we limited our selection to 1,493 unique strains with phylogenetic information for Fig. 3. Regardless, the overall frequency of variant centromeres was very similar between the reduced and full datasets (Extended Data Fig. 5a).
Point centromere annotation pipeline
PCAn pipeline
FIMO52 from the MEME suite (v.4.11.2)53 was used to search each genome assembly with a chosen 26-bp-long CDEIII motif (see below), using a threshold ranging from 1.0 × 10⁻³ to 1.0 × 10⁻⁷. The coordinates of the hits were used to compile a list of 250-bp-long sequences that contain the CDEIII motif plus 224 bp upstream. Subsequently, this output was searched again using FIMO and a chosen 8-bp-long CDEI motif with a less-restrictive threshold of 1.0 × 10⁻². The coordinates of these hits were used to compile a list of sequences that contain a CDEI motif on one end and a CDEIII motif at the other end. For each of these sequences, the length and AT percentage of the intermediary CDEII motif were calculated. Every sequence was assigned a combined score on the basis of the two FIMO hit scores and the CDEII AT percentage. Sequences were sorted on the basis of the combined score, and only the top 50 sequences were retained. Next, we determined the median CDEII length of the top 5 sequences and removed sequences with lengths differing by more than 30 nucleotides from the median. Finally, after removing sequences with a low CDEII AT content (<70%) and removing duplicates (sometimes found on small contigs of lower-quality assemblies), we removed sequences in which both CDEII length and AT content differed too much from the median (≷median length ± 10 nucleotides and
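The window arithmetic and the first filtering steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PCAn implementation: the names `Candidate`, `upstream_window`, `at_percent` and `filter_candidates` are hypothetical, FIMO hit coordinates are assumed to be 1-based and inclusive, and the combined score is treated as a precomputed input because its formula is not given in this section. The final joint filter on CDEII length and AT content is left out.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one CDEI-CDEII-CDEIII candidate."""
    score: float      # combined score, assumed to be computed elsewhere
    sequence: str     # full candidate sequence, used to spot duplicates
    cdeii: str        # intermediary CDEII sequence


def upstream_window(hit_start, hit_end, strand, contig_len, flank=224):
    """Return 1-based inclusive coordinates of the CDEIII hit plus `flank` bp upstream.

    On the minus strand, "upstream" of the motif lies at higher contig
    coordinates. The window is clamped to the contig boundaries.
    """
    if strand == "+":
        return max(1, hit_start - flank), hit_end
    return hit_start, min(contig_len, hit_end + flank)


def at_percent(seq):
    """Percentage of A and T bases in `seq` (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("A") + seq.count("T")) / len(seq)


def filter_candidates(candidates, top_n=50, ref_n=5, max_len_diff=30, min_at=70.0):
    """Keep the top-scoring candidates, then drop length outliers,
    low-AT CDEII sequences and duplicate sequences."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)[:top_n]
    if not ranked:
        return []
    # Median CDEII length of the best few candidates is the reference length.
    ref_len = median(len(c.cdeii) for c in ranked[:ref_n])
    kept, seen = [], set()
    for c in ranked:
        if abs(len(c.cdeii) - ref_len) > max_len_diff:
            continue  # CDEII length too far from the reference
        if at_percent(c.cdeii) < min_at:
            continue  # CDEII not AT-rich enough
        if c.sequence in seen:
            continue  # duplicate, e.g. from a small redundant contig
        seen.add(c.sequence)
        kept.append(c)
    return kept
```

With a 26-bp hit at positions 1,000 to 1,025 on the plus strand, `upstream_window` yields a 250-bp window, matching the 26 bp + 224 bp described in the text.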
Motif construction and selection
CDEI and CDEIII sequences from ref. 9 were used to produce the initial search motifs, using the sites2meme command from the MEME suite (v.4.11.2)53. Newly discovered sequences were each verified using synteny and used to make new motifs and perform more sensitive searches in specific clades or species. The final PCAn pipeline uses the best-performing custom motifs and thresholds for each clade or species.
Synteny checks
Synteny checks were performed by identifying proteins encoded 10 kb upstream and downstream of centromere hits. After using contig and coordinate information of each centromere hit to extract the approximately 20-kb region around each potential centromere, open reading frames (ORFs) were identified using the getORFProteins function from ORFFinder Python (v.1.8) (minimum_length = 525, remove_nested = True, return_loci = True)54. ORFs were then BLASTed against the S. cerevisiae proteome using a local version of NCBI blastp (v.2.13.0+, default parameters)55. Finally, S. cerevisiae protein identifiers were matched to the ancestral Saccharomycetaceae gene order identifiers from the Yeast Gene Order Browser56, which were used for visualization.
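As a rough illustration of the region-extraction step, the sketch below computes the approximately 20-kb window around a centromere hit and slices it from the contig. The function names `flanking_region` and `extract_region` are hypothetical, coordinates are assumed to be 1-based and inclusive, and clamping at contig ends is an assumption about how hits near a contig boundary might be handled. ORF calling and the blastp search are not shown.

```python
def flanking_region(hit_start, hit_end, contig_len, flank=10_000):
    """Return 1-based inclusive coordinates spanning `flank` bp on each
    side of the hit, clamped to the contig boundaries."""
    start = max(1, hit_start - flank)
    end = min(contig_len, hit_end + flank)
    return start, end


def extract_region(contig_seq, start, end):
    """Slice a 1-based inclusive interval out of a contig sequence string."""
    return contig_seq[start - 1:end]
```

On short contigs the returned window is simply the whole contig, so the number of flanking ORFs available for the synteny comparison may be smaller than for hits on long contigs.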
Adapting PCAn for Mucoromycota