The impact of curation errors in the PDBBind Database on machine learning predictions of protein-protein binding affinity
Database (Oxford) 2025-11-25
Database (Oxford). 2025 Jan 18;2025:baaf061. doi: 10.1093/database/baaf061.
ABSTRACT
The PDBBind database has been widely used for the computational prediction of protein-protein binding affinities. While the accuracy of the PDBBind-curated equilibrium dissociation constants (KD) has been reported for the protein-ligand subset of the database, it has not been reported for the protein-protein subset. Here, we present a detailed manual analysis of the subset of PDBBind records whose primary publications are available in PubMed Central Open Access and find that ~19% of these records had KD values that were not supported by their primary publications. We then evaluated the impact of these putative curation errors on the machine learning-based prediction of KD from experimental protein-protein 3D structures. Correcting the curation errors improved the Pearson correlation coefficient between measured and random forest-predicted log10(KD) values by ~8 percentage points. This finding underscores the importance of dataset accuracy for computational modelling and highlights the need for more stringent curation processes when extracting information from the scientific literature.
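The evaluation metric named in the abstract, the Pearson correlation coefficient between measured and predicted log10(KD), can be illustrated with a minimal sketch. It covers only the scoring step and does not reproduce the study's random forest model or its features. All numeric values are invented for illustration and are not taken from PDBBind or from the paper. The function names `log10_kd` and `pearson_r` are hypothetical, and the single 1000-fold error stands in for an assumed unit mix-up during curation.

```python
import math


def log10_kd(kd_molar):
    """Convert dissociation constants in mol/L to log10(KD)."""
    return [math.log10(kd) for kd in kd_molar]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented KD values (mol/L) as they would read after checking each
# record against its primary publication.
corrected_kd = [1e-9, 5e-8, 2e-7, 1e-6, 3e-5]

# The same records with one assumed curation error: the second value is
# recorded 1000-fold too high, as if nM had been entered as uM.
curated_kd = [1e-9, 5e-5, 2e-7, 1e-6, 3e-5]

# Invented model outputs, already on the log10(KD) scale.
predicted_log_kd = [-8.5, -7.6, -6.9, -6.2, -5.0]

r_curated = pearson_r(log10_kd(curated_kd), predicted_log_kd)
r_corrected = pearson_r(log10_kd(corrected_kd), predicted_log_kd)

print(f"Pearson r against curated labels:   {r_curated:.3f}")
print(f"Pearson r against corrected labels: {r_corrected:.3f}")
```

In this toy setup a single mislabelled record is enough to lower the correlation, because a 1000-fold error shifts the label by three units on the log10 scale. In the study itself the labels are also used for training, so the reported change reflects more than this scoring step alone.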
PMID:40996705 | PMC:PMC12462375 | DOI:10.1093/database/baaf061