Development and validation of machine learning models for predicting synchronous lung metastasis in United States colorectal cancer patients: a SEER database analysis
Transl Cancer Res. 2026 Mar 31;15(3):144. doi: 10.21037/tcr-2025-aw-2203. Epub 2026 Feb 27.
ABSTRACT
BACKGROUND: Lung metastases from colorectal cancer (CRC), hereafter CRCLM, significantly influence treatment planning and prognosis. This study aimed to develop and validate machine learning models that predict synchronous CRCLM at diagnosis, in order to support individualized risk stratification for chest computed tomography (CT) use during baseline evaluation.
METHODS: Patients with primary CRC diagnosed between 2010 and 2015 were identified from the Surveillance, Epidemiology, and End Results (SEER) database using International Classification of Diseases for Oncology, 3rd edition (ICD-O-3) codes. Synchronous CRCLM was defined by the variable "CS Mets at DX-Lung". Predictors included age, sex, race, primary tumor site, grade, histologic type, tumor stage (T stage), node stage (N stage), tumor size, carcinoembryonic antigen (CEA) level, tumor deposits, and perineural invasion. The cohort was randomly divided into training (70%) and validation (30%) sets. eXtreme gradient boosting (XGB), random forest (RF), decision tree (DT), and logistic regression (LR) models were developed and evaluated mainly by receiver operating characteristic (ROC) curves, calibration curves, and decision curve analysis (DCA). Model interpretability was assessed using SHapley Additive exPlanation (SHAP).
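As an illustration of the evaluation design described in the methods, the following minimal sketch shows a random 70/30 split and a rank-based computation of the area under the ROC curve. It is not the authors' code: the function names `random_split` and `roc_auc`, the seed, and the toy labels and scores are hypothetical, and the abstract does not state which software was used.

```python
import random


def random_split(n, train_frac=0.7, seed=0):
    """Randomly divide indices 0..n-1 into training and validation sets.

    train_frac=0.7 mirrors the 70%/30% split in the abstract; the seed is
    an arbitrary choice for reproducibility (an assumption of this sketch).
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(train_frac * n)
    return sorted(idx[:cut]), sorted(idx[cut:])


def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case; ties count as 0.5.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the end of the group of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative case")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy data only: 1 = synchronous lung metastasis, 0 = none.
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.10, 0.40, 0.35, 0.80]
    train_idx, valid_idx = random_split(len(toy_labels))
    print(len(train_idx), len(valid_idx))
    print(roc_auc(toy_labels, toy_scores))
```

With an outcome as rare as the one studied here, a purely random split can leave the validation set with slightly different prevalence from the training set; a stratified split is a common alternative, although the abstract does not say one was used.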
RESULTS: Among 51,553 patients, 1,329 (2.6%) had synchronous CRCLM. In the validation cohort, after hyperparameter optimization, the area under the ROC curve (AUC) was 0.81 for XGB, 0.81 for RF, 0.79 for DT, and 0.73 for LR. Calibration curves indicated high consistency between predicted and observed risks. DCA revealed substantial clinical utility for all models. SHAP analysis highlighted CEA and N stage as the strongest predictors in the RF model, while CEA and T stage were most influential in the XGB model.
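To make the decision curve analysis concrete, the sketch below computes the standard net benefit quantity, true positives per patient minus false positives per patient weighted by the odds of the threshold probability, together with the "scan everyone" reference strategy. Only the counts 1,329 and 51,553 come from the abstract; the function names, the toy predictions, and the example threshold of 2% are hypothetical and are not taken from the study.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of acting on patients whose predicted risk >= threshold.

    NB = TP/n - (FP/n) * threshold / (1 - threshold)
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def treat_all_net_benefit(prevalence, threshold):
    """Net benefit of the reference strategy that acts on every patient."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


if __name__ == "__main__":
    # Prevalence from the counts reported in the abstract.
    prevalence = 1329 / 51553
    print(f"prevalence: {100 * prevalence:.1f}%")

    # Hypothetical threshold: act if predicted risk is at least 2%.
    print(treat_all_net_benefit(prevalence, 0.02))

    # Toy predictions only, not study data.
    toy_labels = [1, 0, 0, 1]
    toy_probs = [0.9, 0.2, 0.6, 0.4]
    print(net_benefit(toy_labels, toy_probs, 0.5))
```

A model adds clinical value at a given threshold when its net benefit exceeds both the treat-all value and zero, which is the net benefit of acting on no one.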
CONCLUSIONS: Machine learning models, particularly XGB and RF, demonstrated robust performance in predicting synchronous CRCLM. CEA was consistently identified as the most important risk factor, supporting personalized chest CT utilization during initial CRC staging.
PMID:41969444 | PMC:PMC13067032 | DOI:10.21037/tcr-2025-aw-2203