Expanding the human proteome with microproteins and peptideins

PeptideAtlas database construction and searching

The human non-HLA PeptideAtlas 2023-06 build contains 295 ProteomeXchange datasets (PXDs)50 split into 1,172 different experiments that comprise a total of 3.5 billion MS/MS spectra. Sequence database searching was performed with MSFragger51 v.3.7, using search parameters appropriate for each dataset, depending on alkylation, labelling, fragmentation type, instrument, enrichment strategy and more. All datasets were searched with semi-enzymatic settings (typically semi-tryptic). The search database was the 2023-02 THISP level 4 database52 (https://peptideatlas.org/thisp/), which included the 7,264 Ribo-seq ORFs from ref. 4 as well as other contributed sequences that might be translated. All datasets were searched with the following generic artifactual variable modifications: methionine oxidation, protein N-terminal acetylation, peptide N-terminal pyro-glutamic acid from glutamic acid or glutamine, and asparagine and glutamine deamidation. The alkylation modification was set as a fixed modification (typically carbamidomethylated cysteine).
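As an illustration only, the modification settings listed above can be written down as a small data structure. The sketch below is not the MSFragger parameter format and is not taken from the PeptideAtlas pipeline; the class name, field names and the helper `candidate_sites` are hypothetical, and the mass shifts are the standard monoisotopic values for these modifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modification:
    name: str
    residues: str      # residues the modification can sit on ("" = any residue)
    position: str      # "anywhere", "peptide_nterm" or "protein_nterm"
    mass_delta: float  # monoisotopic mass shift in Da


# Variable modifications named in the text (hypothetical encoding).
VARIABLE_MODS = [
    Modification("Oxidation", "M", "anywhere", 15.994915),
    Modification("Acetyl", "", "protein_nterm", 42.010565),
    Modification("Gln->pyro-Glu", "Q", "peptide_nterm", -17.026549),
    Modification("Glu->pyro-Glu", "E", "peptide_nterm", -18.010565),
    Modification("Deamidated", "NQ", "anywhere", 0.984016),
]

# Typical fixed alkylation modification (dataset dependent).
FIXED_MODS = [Modification("Carbamidomethyl", "C", "anywhere", 57.021464)]


def candidate_sites(peptide: str, at_protein_nterm: bool = False):
    """List (0-based position, modification name) pairs a search could consider."""
    sites = []
    for mod in VARIABLE_MODS:
        if mod.position == "anywhere":
            sites += [(i, mod.name) for i, aa in enumerate(peptide) if aa in mod.residues]
        elif mod.position == "peptide_nterm":
            if peptide and peptide[0] in mod.residues:
                sites.append((0, mod.name))
        elif mod.position == "protein_nterm" and at_protein_nterm and peptide:
            sites.append((0, mod.name))
    return sites
```

For a peptide starting with glutamine, for example, `candidate_sites` reports both the N-terminal pyro-glutamic acid conversion and deamidation as possible variable modifications at position 0.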

Statistical validation of the results for each experiment was performed using the Trans-Proteomic Pipeline (TPP)53,54 v.7.0 tools PeptideProphet55, iProphet56 and PTMProphet57, and the results were mapped to the human proteome using ProteoMapper58, taking known variants into account, and to the genome using the ENSEMBL59 toolkit as previously described60. A complete list of datasets used and a summary of the search results in each build are available online (https://peptideatlas.org/builds/human/non-hla/). Supplementary Table 15 provides FDR metrics at the PSM, peptide and protein levels, as well as for certain subsets of proteins, including the neXtProt core proteome, the 7,264 Ribo-seq ncORFs, as well as all CONTRIB sequences, many of which are putative ncORFs.

The Human HLA PeptideAtlas 2023-11 build comprises a set of 118 HLA immunopeptide-enriched publicly available PXDs, which we split into 592 separate experiments, containing 240 million MS/MS spectra from 9,776 MS runs (Supplementary Table 16). Sequence database searching was performed with MSFragger v.3.7 using search parameters appropriate for each dataset, depending on sample handling. All datasets were searched in no-enzyme mode. While some HLA peptides have a lysine or arginine on the C terminus, and therefore exhibit fragmentation patterns typical of tryptic peptides, many HLA peptides do not have such characteristics and their spectra may therefore have strong b ions and internal fragmentation ions, rather than strong y ions, which are customary in tryptic peptide spectra.

To estimate the FDR of ncORF detections, we used the target–decoy entrapment approach61,62. We generated fake protein sequences at a 1:1 ratio with target sequences by scrambling the sequence of each target protein and adding it to the sequence database. This gives the search engine and post-processing software the opportunity to assign some spectra to decoy sequences (presumed to be wrong), and the number of mistaken assignments to target sequences can then be estimated as equal to the number of assignments to decoy sequences. The entrapment part of the approach is to ensure that the software pipeline is not given knowledge of the decoys. The approach is not perfect, as decoys do not model target sequences perfectly: in one direction, the similarity of some real protein sequences to one another may cause a small bias toward making errors to target sequences, while, in the other direction, scrambled sequences may occasionally yield an observable sequence (not already in the target database) by pure chance. Despite small imperfections, the technique is widely regarded as sufficiently accurate and is the standard in the community. Supplementary Table 16 provides FDR metrics at the PSM, peptide and protein levels, as well as for certain subsets of proteins, including the neXtProt core proteome, the 7,264 Ribo-seq ncORFs, as well as all CONTRIB sequences, many of which are putative ncORFs.
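The logic of the scrambled-decoy estimate can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PeptideAtlas implementation: the function names, the `DECOY_` prefix and the fixed random seed are hypothetical, and real pipelines apply the estimate separately at the PSM, peptide and protein levels.

```python
import random


def scramble(sequence: str, rng: random.Random) -> str:
    """Return a shuffled copy of a protein sequence to serve as a decoy."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def build_target_decoy_db(targets: dict, seed: int = 0, prefix: str = "DECOY_") -> dict:
    """Add one scrambled decoy per target, giving a 1:1 target:decoy database."""
    rng = random.Random(seed)
    database = dict(targets)
    for name, sequence in targets.items():
        database[prefix + name] = scramble(sequence, rng)
    return database


def estimate_fdr(accepted_ids: list, prefix: str = "DECOY_") -> float:
    """Estimate the FDR among accepted target hits.

    With a 1:1 database, the number of wrong target assignments is assumed
    to equal the number of accepted decoy assignments.
    """
    decoys = sum(1 for identifier in accepted_ids if identifier.startswith(prefix))
    targets = len(accepted_ids) - decoys
    if targets == 0:
        return 0.0
    return decoys / targets
```

For instance, if 1,000 target identifications and 10 decoy identifications pass a score threshold, the estimated FDR among the targets is 10/1,000 = 1%.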

The search database was the 2023-07 THISP level 4 database52 (https://peptideatlas.org/thisp/), which included the 7,264 Ribo-seq ORFs from ref. 4 as well as other contributed sequences that might be translated. A total of 299 common contaminants, based on the list from ref. 63 minus the human proteins, were included in the search database (available at https://peptideatlas.org/thisp/). All datasets were searched with the following generic artifactual variable modifications: methionine oxidation, cysteine cysteinylation, protein N-terminal acetylation, peptide N-terminal pyro-glutamic acid from glutamic acid or glutamine, and asparagine and glutamine deamidation. Static carbamidomethylation of cysteine was set for experiments that used iodoacetamide. For samples that were treated with tandem mass tag or SILAC, or enriched for phosphorylated peptides, appropriate mass modifications were applied. Statistical validation was performed by the TPP as described above. A complete list of datasets used and a summary of the search results in each build are available online (https://peptideatlas.org/builds/human/hla/).

Protein identifications and categories

Peptides are preferentially mapped using ProteoMapper58 (in TPP v.7.0) to the 20,389 entries (core proteome) and their isoforms of the 2023 version of neXtProt64, taking into account all single amino acid variants encoded in neXtProt. Proteins that have two or more uniquely mapping, non-nested (that is, neither contained completely within the other) peptides of length 9 or more amino acids, together covering at least 18 amino acids, are categorized as canonical by PeptideAtlas. If a protein entry meets the above two-peptide criteria with peptides that cannot be mapped to the core proteome, it is termed non-core canonical. There are nine additional categories, including indistinguishable representative, indistinguishable, representative, marginally distinguished, subsumed, weak and insufficient evidence, for various scenarios of ambiguous and redundant evidence. Finally, the category ‘identical’ is assigned to entries that are sequence-identical to another entry, and proteins that have no peptide evidence whatsoever are categorized as not detected. See ref. 65 for an extensive description of the PeptideAtlas protein categories. For reasons of integration with the HPP annual metrics66,67,68, only sequence entries that belong to the core set of around 20,389 neXtProt64 and UniProtKB/Swiss-Prot69 protein-coding genes can achieve canonical status.
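The two-peptide criterion can be expressed as a short check. The sketch below is one possible reading of the rule, not PeptideAtlas code: it assumes that the 18-residue coverage must be reached by a single pair of qualifying peptides, and the `MappedPeptide` class and function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class MappedPeptide:
    start: int    # 0-based start position on the protein (inclusive)
    end: int      # end position on the protein (exclusive)
    unique: bool  # True if the peptide maps to this protein only

    @property
    def length(self) -> int:
        return self.end - self.start


def is_nested(a: MappedPeptide, b: MappedPeptide) -> bool:
    """True if either peptide lies completely within the other."""
    a_in_b = a.start >= b.start and a.end <= b.end
    b_in_a = b.start >= a.start and b.end <= a.end
    return a_in_b or b_in_a


def meets_two_peptide_rule(peptides, min_length: int = 9, min_coverage: int = 18) -> bool:
    """Check for two unique, non-nested peptides of >= 9 residues covering >= 18 residues."""
    qualifying = [p for p in peptides if p.unique and p.length >= min_length]
    for a, b in combinations(qualifying, 2):
        if is_nested(a, b):
            continue
        covered = set(range(a.start, a.end)) | set(range(b.start, b.end))
        if len(covered) >= min_coverage:
            return True
    return False
```

Under this reading, two adjacent unique 9-residue peptides satisfy the rule, whereas two overlapping 10-residue peptides that together span only 15 residues do not.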

Manual inspection of ORF MS spectra

Despite extraordinary efforts to minimize false positives, both builds do contain some false positives, and they are most easily found mapping to proteins that are unlikely to be detected. For gene annotation purposes, manual inspection is therefore crucial to ensure that few false positives are reported for extraordinary detections, as described extensively previously15. We manually inspected each of the peptides corresponding to ncORFs and provided a manual categorization as well as a commentary. The manual categories are as follows: excellent (highly compelling evidence that the peptide identification is completely correct); good (the PSM is likely correct but lacks sufficient quality and coverage of the residues to provide highly compelling evidence); false positive; close but false positive (the PSM has many matching ions and is likely to be almost the correct peptidoform, but slight discrepancies indicate that the true identification is very close but not quite the listed sequence); low information (the ions that are detected are compatible with the identification, but coverage is too low to be compelling). The best peptide-spectrum match is also listed in Supplementary Tables 2 and 6 as a USI that can be resolved and viewed online (https://proteomecentral.proteomexchange.org/usi/) or, in cases in which a USI cannot be provided, as a direct URL for the spectrum in the PeptideAtlas web interface. In any case, all protein entries, peptides and spectra may be browsed using the PeptideAtlas web interface starting at the URLs provided above.
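For readers who wish to process the listed USIs programmatically, the sketch below splits a USI into its colon-separated fields. It is a simplified illustration, not a full implementation of the USI specification: it assumes that the run name contains no colon, the function and class names are hypothetical, and the example identifier in the usage note is made up rather than taken from the Supplementary Tables.

```python
from typing import NamedTuple, Optional


class USI(NamedTuple):
    collection: str                # dataset identifier, e.g. a PXD accession
    run: str                       # MS run name
    index_type: str                # "scan", "index" or "nativeId"
    index: str                     # spectrum index within the run
    interpretation: Optional[str]  # optional peptidoform and charge


def parse_usi(usi: str) -> USI:
    """Split a USI into fields (simplified: run names with colons are not handled)."""
    # Split at most five times so that colons inside the interpretation are kept.
    parts = usi.split(":", 5)
    if len(parts) < 5 or parts[0] != "mzspec":
        raise ValueError(f"not a recognizable USI: {usi!r}")
    collection, run, index_type, index = parts[1:5]
    if index_type not in {"scan", "index", "nativeId"}:
        raise ValueError(f"unknown index type: {index_type!r}")
    interpretation = parts[5] if len(parts) == 6 else None
    return USI(collection, run, index_type, index, interpretation)
```

Calling `parse_usi("mzspec:PXD000000:example_run:scan:1234:PEPTIDEK/2")` on this hypothetical identifier yields the collection `PXD000000`, the run `example_run`, scan `1234` and the interpretation `PEPTIDEK/2`.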
