Background and Aim: Despite the reported superiority of deep learning (DL) in several medical fields, its performance in fracture prediction remains unclear. This study compares DL algorithms with the Cox proportional hazards (CoxPH) model for predicting individual fracture risk.
Methods: We used data from the Study of Osteoporotic Fractures (SOF: n=7960 women) and the Osteoporotic Fractures in Men Study (MrOS: n=5990 men). Fractures were ascertained after the baseline assessment. We conducted two types of experiments on each dataset: one using 11 risk factors from the FRAX model, including age, height, weight, bone mineral density, fracture history, and corticosteroid use. For fracture prediction, we employed two DL methods (DeepSurv and DeepHit) and the CoxPH model. Discrimination was evaluated using the concordance index (c-index) and calibration using the Brier score.
Results: During a median follow-up of 14.2 years (IQR: 5.7-17.1 years), 3363 women and 1084 men experienced a fragility fracture. The CoxPH model demonstrated discriminative and calibration performance comparable to that of both DL algorithms. The c-index for CoxPH was 0.67 in women and 0.70 in men, slightly higher than for DeepSurv (0.66 in women, 0.66 in men) and DeepHit (0.66 in women, 0.65 in men). The Brier scores for CoxPH (0.153 in women, 0.144 in men) were also comparable to those for DeepSurv (0.143, 0.147) and DeepHit (0.156, 0.165).
Conclusions: These results indicate that the CoxPH model performs as well as or better than the DL algorithms for predicting fracture risk.
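
Illustrative note: the two evaluation metrics named in the Methods can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the study's implementation: the function names and the toy data are hypothetical, tied event times are skipped in the c-index, and the Brier score simply drops subjects censored before the horizon instead of applying the inverse-probability-of-censoring weights that survival packages commonly use.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell-style c-index (simplified sketch; tied times are skipped).

    times  : follow-up time per subject
    events : 1 if fracture observed, 0 if censored
    risks  : predicted risk score (higher = earlier expected fracture)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # assumption: ignore tied follow-up times
        # a is the subject with the shorter follow-up time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # earlier subject was censored: pair is not comparable
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def brier_score(times, events, surv_probs, horizon):
    """Unweighted Brier score at a fixed horizon (simplified sketch).

    surv_probs : predicted probability of being fracture-free at `horizon`
    Subjects censored before the horizon are dropped (no IPCW weighting).
    """
    squared_errors = []
    for t, e, s in zip(times, events, surv_probs):
        if t > horizon:
            observed = 1.0  # still fracture-free at the horizon
        elif e:
            observed = 0.0  # fractured on or before the horizon
        else:
            continue  # censored before the horizon: status unknown
        squared_errors.append((observed - s) ** 2)
    if not squared_errors:
        raise ValueError("no subjects with known status at the horizon")
    return sum(squared_errors) / len(squared_errors)


# Hypothetical toy data, for illustration only.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.4, 0.2]
toy_surv_at_5 = [0.1, 0.2, 0.8, 0.9]

toy_c_index = concordance_index(toy_times, toy_events, toy_risks)
toy_brier = brier_score(toy_times, toy_events, toy_surv_at_5, horizon=5)
```

A c-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, while a lower Brier score indicates better agreement between predicted and observed outcomes.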