Reliability and Clinical Utility of Machine Learning to Predict Stroke Prognosis: Comparison with Logistic Regression

Article information

J Stroke. 2020;22(3):403-406

Publication date (electronic): 2020 September 29

doi: https://doi.org/10.5853/jos.2020.02537

Su-Kyeong Jang¹,*, Jun Young Chang¹,*, Ji Sung Lee²,³, Eun-Jae Lee¹, Yong-Hwan Kim⁴, Jung Hoon Han¹, Dae-Il Chang⁵, Han Jin Cho⁶, Jae-Kwan Cha⁷, Kyung Ho Yu⁸, Jin-Man Jung⁹, Seong Hwan Ahn¹⁰, Dong-Eog Kim¹¹, Sung-Il Sohn¹², Ju Hun Lee¹³, Kyung-Pil Park¹⁴, Sun U. Kwon¹, Jong S. Kim¹, Dong-Wha Kang¹, KOSNI Investigators

¹Department of Neurology, Asan Medical Center, Seoul, Korea

²Clinical Research Center, Asan Medical Center, Seoul, Korea

³Department of Clinical Epidemiology and Biostatistics, Asan Medical Center, Seoul, Korea

⁴Asan Institute for Life Sciences, Asan Medical Center, Seoul, Korea

⁵Department of Neurology, Kyung Hee University Medical Center, Seoul, Korea

⁶Department of Neurology, Pusan National University Hospital, Busan, Korea

⁷Department of Neurology, Dong-A University Hospital, Busan, Korea

⁸Department of Neurology, Hallym University Sacred Heart Hospital, Anyang, Korea

⁹Department of Neurology, Korea University Ansan Hospital, Ansan, Korea

¹⁰Department of Neurology, Chosun University Hospital, Gwangju, Korea

¹¹Department of Neurology, Dongguk University Ilsan Hospital, Goyang, Korea

¹²Department of Neurology, Keimyung University Medical Center, Daegu, Korea

¹³Department of Neurology, Hallym University Kangdong Sacred Heart Hospital, Seoul, Korea

¹⁴Department of Neurology, Pusan National University Yangsan Hospital, Yangsan, Korea

Correspondence: Dong-Wha Kang Department of Neurology, Asan Medical Center, University of Ulsan College of Medicine, 88 Olympic-ro 43-gil, Songpa-gu, Seoul 05505, Korea Tel: +82-2-3010-3440 Fax: +82-2-474-4691 E-mail: dwkang@amc.seoul.kr

*These authors contributed equally to the manuscript as first author.

Received 2020 June 25; Revised 2020 July 29; Accepted 2020 August 13.

Dear Sir:

The accurate prediction of functional recovery after a stroke is essential for post-discharge treatment planning and resource utilization. Recently, machine learning (ML) algorithms with baseline clinical variables have demonstrated better performance for predicting the functional outcome of ischemic stroke compared with preexisting scoring systems developed by conventional statistics [1,2]. However, most studies compared model performance by area under the curve (AUC) only, and ML and conventional statistical approaches have not been sufficiently evaluated in terms of reliability and clinical utility [3]. We aimed to compare the performance of ML models with that of the conventional logistic regression (LR) model in predicting stroke outcome, evaluating accuracy, reliability, and clinical utility by AUC comparison, calibration, and decision curve analysis, respectively, using the KOrean Stroke Neuroimaging Initiative (KOSNI) database.

Using clinical variables measurable at admission (Supplementary methods 1), we developed various ML models, including deep learning (DL), support vector machine (SVM), random forest (RF), and XGBoost (XGB), and a conventional LR model to predict a 3-month modified Rankin Scale (mRS) score >2 or >1 (Supplementary methods 2). Receiver operating characteristic (ROC) curve analysis was performed to evaluate the sensitivity and specificity of each model across decision thresholds. Calibration was evaluated using a reliability diagram and the expected calibration error (ECE) to assess the agreement between the predicted and actual outcomes [4]. Decision curve analysis was performed to assess the clinical utility of the developed models (Supplementary methods 3) [5].
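To make the calibration metric concrete, the following is a minimal sketch of an ECE computation. It assumes equal-width probability bins and 10 bins by default; the binning actually used in this study is described in Supplementary methods 3 and may differ. The function name and parameters are illustrative only.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean gap between observed event rate and mean predicted
    probability, taken over equal-width bins of the predicted probability.

    Assumption: equal-width bins; not necessarily the study's binning.
    """
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("y_true and y_prob must be non-empty and equal in length")

    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))

    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / n) * abs(observed - predicted)
    return ece
```

A lower value indicates closer agreement between predicted risk and observed outcome frequency; a value of 0 corresponds to every occupied bin lying on the diagonal of the reliability diagram.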

A total of 6,731 patients from 10 tertiary stroke centers in South Korea were included. This study was approved by the Institutional Review Boards of all participating institutions, and comprehensive written informed consent was obtained from patients enrolled in the prospective study. The first 4,709 datasets (70%) in order of admission date were used for training, whereas the remaining 2,019 (30%) were used as a test set for evaluating the final performance. The baseline characteristics stratified by outcome are summarized in Supplementary Table 1.
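A split by admission date, as opposed to a random split, can be sketched as follows. This is only an illustration: the record layout and the field name admission_date are hypothetical, and the exact cut-off rule used in the study is not stated here.

```python
from datetime import date


def chronological_split(records, train_fraction=0.7):
    """Sort records by admission date and split them into an earlier
    training part and a later test part.

    Assumption: each record is a dict with an 'admission_date' key
    (a hypothetical field name used for illustration).
    """
    ordered = sorted(records, key=lambda r: r["admission_date"])
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Toy example with 10 synthetic admissions (not study data).
toy = [{"id": i, "admission_date": date(2015, 1, 1 + (i * 7) % 28)} for i in range(10)]
train, test = chronological_split(toy)
```

Because every test patient was admitted no earlier than every training patient, the test performance reflects how a model trained on past admissions behaves on later ones.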

Compared with the LR model (AUC of the ROC curve: 0.860 for predicting mRS >2; 0.831 for predicting mRS >1), DL achieved an AUC of 0.864 for predicting mRS >2 (P=0.11) and 0.834 for predicting mRS >1 (P=0.06); neither difference was statistically significant. The AUCs of SVM, RF, and XGB were 0.871 (P<0.001), 0.870 (P=0.01), and 0.871 (P<0.01), respectively, for mRS >2 and 0.838 (P<0.001), 0.844 (P<0.001), and 0.843 (P<0.001), respectively, for mRS >1, demonstrating better discrimination than the LR model (Figure 1). The detailed confusion matrix and accuracy are described in Supplementary Table 2. In the reliability diagram, the ECE value of SVM was the lowest for predicting both mRS >2 (0.020) and mRS >1 (0.037), suggesting that the SVM model was the best calibrated (Figure 2). The decision curve analysis indicated that the level of clinical benefit across the risk thresholds was similar for the various ML and LR models (Figure 2).
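For readers less familiar with the AUC, it can be read as the probability that a randomly chosen patient with the poor outcome receives a higher predicted risk than a randomly chosen patient without it. The sketch below computes it by direct pairwise comparison; it is an illustration only, does not reproduce the study's software, and does not implement DeLong's test for comparing two AUCs.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as half.

    Quadratic in sample size, so intended for illustration only.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

With this reading, an AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two outcome groups.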

Figure 1.

Receiver operating characteristic curve of classifiers to predict modified Rankin Scale (mRS) >2 (A) and mRS >1 (B). The P-value was calculated using DeLong’s test for the curve of logistic regression (LR) and the machine learning model. AUC, area under curve; DL, deep learning; SVM, support vector machine; RF, random forest; XGB, XGBoost.

Figure 2.

Comparison of calibration and clinical utility in different algorithms. (A, B) Reliability diagrams. (C, D) Decision curves. Models predict 3-month (A, C) modified Rankin Scale (mRS) >2 and (B, D) mRS >1. LR, logistic regression; ECE, expected calibration error; DL, deep learning; SVM, support vector machine; RF, random forest; XGB, XGBoost.

Our study shows that ML models had better discriminative power, as evaluated by AUC, and better reliability in predicting clinical outcome after a stroke than the conventional LR model. It should be noted, however, that both ML and LR models demonstrated moderate-to-good performance, and the ML models did not outperform the LR model in terms of clinical utility.

An advantage of this study is that we evaluated the reliability and clinical utility of the models in addition to comparing their discriminative power. Assessing the agreement between the predicted and actual outcomes on a calibration plot is a requisite for model validation [6]. Clinical net benefit also needs to be evaluated using decision curve analysis [7]. The results indicate that ML was comparable or superior to LR in terms of reliability and clinical net benefit.
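The quantity plotted in a decision curve is the net benefit at each risk threshold. A minimal sketch using the standard definition (true positives minus false positives weighted by the threshold odds, per patient) is given below; the function names are illustrative, and the study's own implementation is described in Supplementary methods 3.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on patients whose predicted risk is at or
    above the threshold: TP/n - FP/n * (pt / (1 - pt))."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: every patient is treated as high risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * (threshold / (1.0 - threshold))
```

Evaluating net_benefit over a grid of thresholds for each model, alongside the treat-all reference and the treat-none line at zero, yields the decision curves; models whose curves overlap offer similar clinical utility regardless of differences in AUC.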

ML is effective in dealing with wide data, in which the number of variables per study subject is relatively large and interactions between variables exist [8]. Introducing mixed-media data, including images (computed tomography, magnetic resonance imaging) and biosignal data acquired from continuous monitoring (blood pressure, heart rate, electrocardiography, and electroencephalography), in addition to clinical variables with numeric or symbolic features may enable the development of more accurate predictive ML models [9]. Training to predict an outcome with a strong signal-to-noise ratio, rather than one with a poor signal-to-noise ratio such as clinical outcome, may also improve the performance of ML [3].

A limitation of our study is that we used only baseline clinical variables; treatment-related factors were not included in model construction. Variables associated with acute stroke management to prevent stroke progression or recurrence, and the patient’s willingness to pursue active rehabilitation, could have a significant impact on functional recovery.

In conclusion, our study revealed that ML algorithms using baseline clinical parameters had better accuracy and reliability than, and similar clinical net benefit to, the traditional LR model in predicting functional recovery after an acute ischemic stroke.

Supplementary materials

Supplementary materials related to this article can be found online at https://doi.org/10.5853/jos.2020.02537.

Supplementary Table 1.

Characteristics of patients based on outcomes

jos-2020-02537-suppl1.pdf

Supplementary Table 2.

Confusion matrix and accuracy

jos-2020-02537-suppl2.pdf

Supplementary methods 1.

Selection of variables which were used as input for model

jos-2020-02537-suppl3.pdf

Supplementary methods 2.

Development of models

jos-2020-02537-suppl3.pdf

Supplementary methods 3.

Evaluation of reliability and clinical benefit

jos-2020-02537-suppl3.pdf

Acknowledgements

This research was supported by grants from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare, Republic of Korea (grant numbers: HI18C2383 and HI12C1847).

Notes

The authors have no financial conflicts of interest.

References

1. Nishi H, Oishi N, Ishii A, Ono I, Ogura T, Sunohara T, et al. Predicting clinical outcomes of large vessel occlusion before mechanical thrombectomy using machine learning. Stroke 2019;50:2379–2388.

2. Heo J, Yoon JG, Park H, Kim YD, Nam HS, Heo JH. Machine learning-based model for prediction of outcomes in acute stroke. Stroke 2019;50:1263–1265.

3. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol 2019;110:12–22.

4. Naeini MP, Cooper GF, Hauskrecht M. Obtaining well calibrated probabilities using Bayesian binning. Proc Conf AAAI Artif Intell 2015;2015:2901–2907.

5. Kerr KF, Brown MD, Zhu K, Janes H. Assessing the clinical impact of risk prediction models with decision curves: guidance for correct interpretation and appropriate use. J Clin Oncol 2016;34:2534–2540.

6. Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol 2016;74:167–176.

7. Van Calster B, Wynants L, Verbeek JFM, Verbakel JY, Christodoulou E, Vickers AJ, et al. Reporting and interpreting decision curve analysis: a guide for investigators. Eur Urol 2018;74:796–804.

8. Bzdok D, Altman N, Krzywinski M. Statistics versus machine learning. Nat Methods 2018;15:233–234.

9. Mitchell TM. Does machine learning really work? AI Mag 1997;18:11.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.