Efficient discovery of frequently co-occurring mutations in a sequence database with matrix factorization

database[Title] 2025-04-25

PLoS Comput Biol. 2025 Apr 24;21(4):e1012391. doi: 10.1371/journal.pcbi.1012391. Online ahead of print.

ABSTRACT

We have developed a robust method for efficiently tracking multiple co-occurring mutations in a sequence database. Evolution often hinges on the interaction of several mutations to produce significant phenotypic changes that lead to the proliferation of a variant. However, identifying numerous simultaneous mutations across a vast database of sequences poses a significant computational challenge. Our approach leverages a matrix factorization technique to automatically and efficiently pinpoint subsets of positions where co-mutations occur in a substantial number of sequences within the database. We validated our method using SARS-CoV-2 receptor-binding domains, comprising approximately seven hundred thousand sequences of the Spike protein, demonstrating superior performance compared to a reasonably exhaustive brute-force method. Furthermore, we explore the biological significance of the identified co-mutational positions (CMPs) and their potential impact on the virus's evolution and functionality, identifying key mutations in Delta and Omicron variants. This analysis underscores the significant role of the identified CMPs in understanding the virus's evolutionary trajectory. By tracking the "birth" and "death" of CMPs, we can elucidate the persistence and impact of specific groups of mutations across different viral strains, providing valuable insights into the virus's adaptability and thus possibly aiding vaccine design strategies.
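The abstract does not say which matrix factorization is used, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes aligned, equal-length sequences, encodes them as a binary "differs from reference" matrix, and uses a plain truncated SVD as a stand-in factorization. The function names, the toy sequences, and the parameters `n_components`, `loading_cutoff`, and `min_support` are all hypothetical.

```python
import numpy as np


def mutation_matrix(reference, sequences):
    """Binary matrix: entry (i, j) is 1 when sequence i differs from the reference at position j."""
    ref = np.array(list(reference))
    seqs = np.array([list(s) for s in sequences])
    return (seqs != ref).astype(float)


def candidate_cmps(X, n_components=2, loading_cutoff=0.3, min_support=2):
    """Propose co-mutated position sets from the leading right singular vectors of X.

    Assumption: SVD stands in for whatever factorization the paper uses.
    Each retained component contributes the positions with a large absolute loading;
    a set is kept only if enough sequences are mutated at all of its positions.
    """
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    found = []
    for component in vt[:n_components]:
        positions = [int(j) for j in np.flatnonzero(np.abs(component) > loading_cutoff)]
        if len(positions) < 2:
            continue  # a single position is not a co-mutation
        # Support: number of sequences mutated at every position in the set.
        support = int(np.all(X[:, positions] == 1, axis=1).sum())
        if support >= min_support:
            found.append((tuple(positions), support))
    return found


# Toy data (made up for illustration): positions are 0-indexed.
reference = "ACDEFGHK"
sequences = (
    ["AYDELGHK"] * 4    # mutated at positions 1 and 4
    + ["ACNEFSQK"] * 3  # mutated at positions 2, 5 and 6
    + ["ACDEFGHK"] * 2  # identical to the reference
    + ["ACDEFGHR"]      # isolated mutation at position 7
)

X = mutation_matrix(reference, sequences)
cmps = candidate_cmps(X)
for positions, support in cmps:
    print(positions, support)
```

In this toy case each leading component loads on one group of positions that always mutate together, while the isolated mutation does not enter any retained component. Real data would be noisier, and both the choice of factorization and the cutoffs would need tuning against a baseline such as the brute-force search mentioned above.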

PMID:40273414 | DOI:10.1371/journal.pcbi.1012391