Biomedical literature-based clinical phenotype definition discovery using large language models

Database (Oxford) 2025-11-26

Database (Oxford). 2025 Jan 18;2025:baaf047. doi: 10.1093/database/baaf047.

ABSTRACT

Electronic health record (EHR) phenotyping is in high demand because most phenotypes lack readily available definitions. The objective of this study is to develop an effective text-mining approach that automatically extracts clinical phenotype definition-related sentences from biomedical literature. Abstract-level and full-text sentence-level classifiers were developed for clinical phenotype discovery from PubMed. For the abstract-level classifier, we compared four machine learning algorithms: support vector machine (SVM), logistic regression (LR), naïve Bayes, and decision tree. The SVM classifier showed the best performance (F-measure = 98%) in identifying clinical phenotype-relevant abstracts, and it predicted 459 406 clinical phenotype-related abstracts. For the full-text sentence-level classifier, we compared the performance of SVM, LR, naïve Bayes, decision tree, convolutional neural network, Bidirectional Encoder Representations from Transformers (BERT), and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) models. The BioBERT model was the best performer among the full-text sentence-level classifiers (F-measure = 91%). We used these two optimal classifiers for large-scale screening of the PubMed database, starting with abstract retrieval and followed by prediction of clinical phenotype-related sentences from full texts. The large-scale screening predicted over two million clinical phenotype-related sentences. Lastly, we built the Clinical Phenotype Knowledgebase (CliPheKB) from the positively predicted sentences; it lets users search for a clinical phenotype term of interest and retrieve the sentences related to it (https://cliphekb.shinyapps.io/phenotype-main/).
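
As a reading aid for the F-measure figures reported above, the following minimal sketch computes precision, recall, and F-measure for a binary classifier. It assumes that F-measure here denotes the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function name `f_measure` and the toy labels are illustrative and do not come from the study.

```python
def f_measure(gold, predicted):
    """Return (precision, recall, F1) for binary labels.

    A label of 1 marks a phenotype-related item, 0 marks an unrelated one.
    Assumption: F-measure is taken to be the balanced F1 score.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels invented for illustration only; they are not study data.
gold = [1, 1, 1, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 1]
precision, recall, f1 = f_measure(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```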
Building upon prior methods, we developed a text-mining pipeline that automatically extracts clinical phenotype definition-related sentences from the literature. This high-throughput phenotyping approach is generalizable and scalable, and it is complementary to existing EHR phenotyping methods.
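
The two-stage screening and the term query described above can be sketched as follows. This is not the authors' implementation: the two keyword rules are deliberately trivial placeholders standing in for the trained abstract-level (SVM) and sentence-level (BioBERT) classifiers, and the function names, record layout, and toy articles are all hypothetical.

```python
from typing import Callable, Dict, List


def screen_literature(
    articles: List[Dict],
    is_relevant_abstract: Callable[[str], bool],
    is_phenotype_sentence: Callable[[str], bool],
) -> List[Dict[str, str]]:
    """Stage 1 filters articles by abstract; stage 2 classifies full-text sentences."""
    hits = []
    for article in articles:
        # Stage 1: skip articles whose abstract is not predicted as relevant.
        if not is_relevant_abstract(article["abstract"]):
            continue
        # Stage 2: keep only sentences predicted as phenotype-related.
        for sentence in article["sentences"]:
            if is_phenotype_sentence(sentence):
                hits.append({"id": article["id"], "sentence": sentence})
    return hits


def query(hits: List[Dict[str, str]], term: str) -> List[Dict[str, str]]:
    """Return stored sentences that mention the phenotype term (case-insensitive)."""
    needle = term.lower()
    return [hit for hit in hits if needle in hit["sentence"].lower()]


# Placeholder rules; a real pipeline would call trained models here.
def abstract_rule(text: str) -> bool:
    return "phenotype" in text.lower()


def sentence_rule(text: str) -> bool:
    return "defined as" in text.lower()


# Toy articles invented for illustration only.
articles = [
    {
        "id": "toy-1",
        "abstract": "We define a type 2 diabetes phenotype from EHR data.",
        "sentences": [
            "Type 2 diabetes was defined as two or more diagnosis codes.",
            "Patients were recruited from a single center.",
        ],
    },
    {
        "id": "toy-2",
        "abstract": "A study of protein folding kinetics.",
        "sentences": ["Diabetes was defined as any glucose-lowering prescription."],
    },
]

hits = screen_literature(articles, abstract_rule, sentence_rule)
diabetes_hits = query(hits, "diabetes")
print(diabetes_hits)
```

Running the cheaper abstract-level filter first means the sentence-level model only sees full texts from articles that passed stage 1, which is what keeps a screen of this kind tractable at PubMed scale.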

PMID:40996710 | PMC:PMC12462612 | DOI:10.1093/database/baaf047