HapScoreDB: a database of protein language model functional scores for haplotype-resolved protein sequences
Nucleic Acids Res. 2025 Nov 20:gkaf1184. doi: 10.1093/nar/gkaf1184. Online ahead of print.
ABSTRACT
Deciphering the functional effects of genetic variants, especially those inherited together on the same haplotype, remains a major challenge in human genetics, where epistasis among co-occurring variants can further complicate interpretation. To address this, we present HapScoreDB, a database offering protein language model-derived scores for haplotype-resolved protein-coding sequences across all human transcript isoforms. Leveraging GENCODE and Ensembl annotations with phased variant data from the 1000 Genomes Project, HapScoreDB includes over 130 000 distinct protein haplotypes from >18 000 genes and 78 000 transcripts, encompassing over 94 000 coding variants. Fitness scores for each haplotype were computed using state-of-the-art protein language models. Preliminary analyses show that haplotypes harboring cancer GWAS variants tend to have significantly reduced predicted fitness. Moreover, variability in scores across haplotypes of the same transcript highlights known cancer genes, suggesting that dispersion in predicted fitness may capture functionally important variation. HapScoreDB features a user-friendly web interface for interactive exploration, visualization, and download of both full and customized datasets. As a dynamic and expandable platform, it connects real-world human genetic variation with advanced protein modeling, enabling novel approaches in variant interpretation, isoform prioritization, and population-scale functional genomics. Access HapScoreDB at https://bcglab.cibio.unitn.it/hapscoredb.
PMID:41261743 | DOI:10.1093/nar/gkaf1184
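
The abstract's idea that dispersion in predicted fitness across haplotypes of the same transcript may flag functionally important genes can be illustrated with a minimal sketch. The abstract does not state which dispersion measure is used, so the population standard deviation here is an assumption. The record layout, the identifiers (`TX_A`, `hap1`, ...), the function names, and the score values are likewise hypothetical and are not taken from HapScoreDB.

```python
from collections import defaultdict
from statistics import pstdev

# Toy records: (transcript_id, haplotype_id, score). All identifiers and
# values are invented for illustration; they are not HapScoreDB data.
records = [
    ("TX_A", "hap1", -1.0),
    ("TX_A", "hap2", -3.0),
    ("TX_A", "hap3", -5.0),
    ("TX_B", "hap1", -2.0),
    ("TX_B", "hap2", -2.0),
    ("TX_C", "hap1", -4.0),
]


def score_dispersion(rows, min_haplotypes=2):
    """Map each transcript to the population std dev of its haplotype scores.

    Transcripts with fewer than `min_haplotypes` distinct rows are skipped,
    since dispersion is not informative for a single haplotype.
    """
    scores_by_transcript = defaultdict(list)
    for transcript_id, _haplotype_id, score in rows:
        scores_by_transcript[transcript_id].append(score)
    return {
        transcript_id: pstdev(scores)
        for transcript_id, scores in scores_by_transcript.items()
        if len(scores) >= min_haplotypes
    }


def rank_by_dispersion(rows, min_haplotypes=2):
    """Return (transcript_id, dispersion) pairs, most variable first."""
    dispersion = score_dispersion(rows, min_haplotypes)
    return sorted(dispersion.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_dispersion(records)
for transcript_id, dispersion in ranking:
    print(f"{transcript_id}\t{dispersion:.3f}")
```

In this toy input, `TX_A` ranks first because its three haplotype scores differ, `TX_B` has zero dispersion, and `TX_C` is excluded for having a single haplotype. With real data, a robust measure such as the interquartile range might be preferable when a transcript has few haplotypes or outlying scores.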