An audit of the PeptideAtlas database uncovers evidence for repurposed pseudogenes and co-opted retroviral ORFs
BMC Genomics. 2025 Nov 21. doi: 10.1186/s12864-025-12238-w. Online ahead of print.
ABSTRACT
BACKGROUND: The human genome has been under scrutiny for more than two decades, yet new protein-coding genes are still being uncovered, and ribosome profiling experiments have recently provided evidence for the translation of thousands of novel open reading frames (ORFs). To determine how many of these novel ORFs have peptide support, we carried out an in-depth investigation of an entire mass spectrometry proteomics database.
RESULTS: We analysed the peptides housed in the human build of the PeptideAtlas database and identified reliable evidence for 35 potential coding genes not annotated in the Ensembl/GENCODE reference gene set. Evidence from complementary sources confirmed that 16 were almost certainly coding genes, but we believe that at least 14 are most likely to be undergoing aberrant translation. These 14 genes had reading frames that were not preserved beyond human and their peptides were restricted to cancers or cell lines. Remarkably, three of the 16 likely coding genes were derived from endogenous retroviral gag ORFs and were expressed only in placenta. All three had evidence of purifying selection. Retroviral env ORFs (syncytins) with distinct origins are expressed in almost all mammalian placentae, and these results suggest that co-opted gag ORFs may also play an important role in placental development.
CONCLUSIONS: Our analysis shows that proteomics data can be used in conjunction with evolutionary evidence to confirm the existence of new coding genes. The evidence suggests that testis and placenta are the tissues most likely to express coding genes that have yet to be identified, and that there may be other transposon-derived ORFs that have been co-opted as coding genes. The strong evidence for the translation of regions under dysregulated conditions has important implications for the annotation of coding genes and for the analysis of cancer and other degenerative diseases.
PMID:41272456 | DOI:10.1186/s12864-025-12238-w