Predicting distant metastasis in early-onset kidney cancer using machine learning: a SEER database study with external validation


Clin Exp Med. 2025 Nov 18;26(1):3. doi: 10.1007/s10238-025-01886-7.

ABSTRACT

Patients with early-onset kidney cancer (EOKC) have a markedly worse prognosis once distant metastasis occurs, yet the accuracy of current predictive methods remains limited. This study aimed to develop a precise tool for predicting distant metastasis in EOKC using multiple machine learning algorithms. Patients diagnosed with EOKC from 2004 to 2015 in the Surveillance, Epidemiology, and End Results (SEER) database were included. After rigorous screening, 8868 patients were selected for analysis and randomly divided into a training cohort and an internal validation cohort at a 7:3 ratio. Additionally, 229 patients from the First Affiliated Hospital of Nanchang University and the First Hospital of Putian City served as an external validation cohort. Least absolute shrinkage and selection operator (LASSO) regression and logistic regression were used to screen key variables. Based on these variables, five machine learning models were constructed: support vector machine (SVM), K-nearest neighbors (KNN), gradient boosting decision tree (GBDT), linear discriminant analysis (LDA), and logistic regression (LR). The discriminative ability, calibration, and clinical utility of the models were evaluated using accuracy, precision, F1 score, area under the curve (AUC), calibration curves, and decision curve analysis (DCA). SHapley Additive exPlanations (SHAP) was applied to interpret the best-performing model and to rank the importance of its predictive variables. The proportion of patients with distant metastasis in the SEER population was 5.20% (n = 438), while in the external validation cohort it was 6.11% (n = 14). LASSO regression, together with univariate and multivariate logistic regression, indicated that tumor T stage, N stage, pathological grade, and tumor size were independent risk factors for distant metastasis in EOKC.
The GBDT model achieved the highest AUC values across cohorts: training (AUC = 0.940, 95% CI 0.926-0.951), internal validation (AUC = 0.913, 95% CI 0.885-0.938), and external validation (AUC = 0.920, 95% CI 0.754-0.994). Calibration curves and DCA indicated that the GBDT model was robust and clinically useful. The model's accuracy, precision, and F1 score further supported its superior predictive performance. Tumor size and tumor grade were identified as the most important features, and SHAP analysis confirmed their contributions to the model. We developed and validated a machine learning model for predicting distant metastasis in EOKC. Its strong predictive performance and interpretability offer a reliable tool to support individualized clinical decision-making.
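
The workflow described in the abstract (7:3 split, LASSO-style variable screening, GBDT fitting, AUC with a 95% CI) can be illustrated with a minimal Python sketch. This is not the authors' code: it assumes scikit-learn, uses synthetic data in place of the SEER cohort, substitutes an L1-penalized logistic regression for the LASSO screening step, and estimates the confidence interval with a simple percentile bootstrap. The feature count, penalty strength, and number of bootstrap resamples are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cohort (assumption): 8868 rows, roughly 5% positives.
X, y = make_classification(
    n_samples=8868, n_features=10, n_informative=4, n_redundant=0,
    weights=[0.95], random_state=0,
)

# 7:3 split into training and internal validation cohorts, stratified on outcome.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Variable screening with an L1 (LASSO-type) penalty; C=0.05 is an arbitrary choice.
scaler = StandardScaler().fit(X_train)
screen = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
screen.fit(scaler.transform(X_train), y_train)
selected = np.flatnonzero(screen.coef_[0])
if selected.size == 0:  # fall back to all features if the penalty removes everything
    selected = np.arange(X.shape[1])

# Fit the gradient boosting model on the screened variables only.
gbdt = GradientBoostingClassifier(random_state=0)
gbdt.fit(X_train[:, selected], y_train)
valid_prob = gbdt.predict_proba(X_valid[:, selected])[:, 1]
valid_auc = roc_auc_score(y_valid, valid_prob)

# Percentile bootstrap for a 95% CI on the validation AUC (200 resamples).
rng = np.random.default_rng(0)
boot_aucs = []
for _ in range(200):
    idx = rng.integers(0, len(y_valid), size=len(y_valid))
    if np.unique(y_valid[idx]).size < 2:  # AUC is undefined with a single class
        continue
    boot_aucs.append(roc_auc_score(y_valid[idx], valid_prob[idx]))
ci_low, ci_high = np.percentile(boot_aucs, [2.5, 97.5])

print(f"selected feature indices: {selected.tolist()}")
print(f"validation AUC = {valid_auc:.3f} (95% CI {ci_low:.3f}-{ci_high:.3f})")
```

Because the data are synthetic, the printed AUC says nothing about EOKC. The sketch only shows how the reported quantities relate to one another; the same fitted model could then be passed to a SHAP explainer to rank variables.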

PMID:41249564 | PMC:PMC12628394 | DOI:10.1007/s10238-025-01886-7