PaxDb v6.0: reprocessed, LLM-selected, curated protein abundance data across organisms
Nucleic Acids Res. 2026 Jan 6;54(D1):D427-D439. doi: 10.1093/nar/gkaf1066.
ABSTRACT
Proteomics captures the biological and functional state of cells, condensing complex molecular information into quantitative measurements. Yet the reuse of public mass spectrometry (MS) data is impeded by heterogeneous processing and incomplete metadata, which restrict reproducibility and cross-study integration and limit the potential of these data to generate new biological insights. This underscores the need for standardized, high-coverage reference resources. PaxDb addresses this need by providing a protein abundance reference at the organism and tissue level for the healthy, wild-type state. The v6.0 release integrates 1639 datasets from 392 species, nearly doubling coverage since v5.0, with expanded representation across all kingdoms of life. A new end-to-end MS data processing pipeline enables consistent re-analysis from raw files using the FragPipe framework, integrating standardized metadata, orthology mappings, and protein-protein interaction-based quality scoring. To our knowledge, this is the first large-scale, unbiased, automated reprocessing of public MS data, including links to metadata. We further developed large language model ensemble classifiers to semi-automate curator selection of relevant ProteomeXchange (PX) projects, as well as a user-facing tool for peptide-level abundance calculation, dataset scoring, and direct comparison with PaxDb reference data. The updated database is available at https://www.pax-db.org.
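The abstract does not say how the ensemble classifiers combine the outputs of individual models. As a minimal sketch of what semi-automated selection could look like, the Python below assumes simple majority voting with an agreement threshold: projects the models agree on are labelled automatically, and the rest are deferred to a curator. The function name `ensemble_decision`, the label strings, and the 0.75 threshold are hypothetical illustrations, not taken from PaxDb.

```python
from collections import Counter
from typing import Sequence


def ensemble_decision(votes: Sequence[str], min_agreement: float = 0.75) -> str:
    """Combine per-model labels for one project into a single decision.

    Hypothetical helper: returns the most common label if its share of the
    votes reaches min_agreement, otherwise "needs_curator" so that a human
    curator makes the final call.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    # most_common(1) yields the label with the highest vote count
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label
    return "needs_curator"


# Illustrative votes from four hypothetical models for two projects
clear_case = ensemble_decision(["relevant", "relevant", "relevant", "irrelevant"])
split_case = ensemble_decision(["relevant", "irrelevant", "relevant", "irrelevant"])
print(clear_case, split_case)
```

Under these assumptions, a threshold below 1.0 lets the ensemble settle clear cases on its own while routing split votes to manual review, which is one way to read "semi-automate" in the abstract.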
PMID:41182819 | PMC:PMC12807614 | DOI:10.1093/nar/gkaf1066