nature.com

Deep neural networks excel in Covid-19 disease severity prediction—a meta-regression analysis

Abstract

COVID-19 is a disease in which early prognosis of severity is critical for desired patient outcomes and for the management of limited resources like intensive care unit beds and ventilation equipment. Many prognostic statistical tools have been developed for the prediction of disease severity, but it is still unclear which ones should be used in practice. We aim to guide clinicians in choosing the best available tools to make optimal decisions, to assess the role of these tools in resource management, and to assess what can be learned from the COVID-19 scenario for the development of prediction models in similar medical applications. Using five major medical databases, MEDLINE (via PubMed), Embase, Cochrane Library (CENTRAL), Cochrane COVID-19 Study Register, and Scopus, we conducted a comprehensive systematic review of prediction tools for hospitalized COVID-19 patients published between January 2020 and April 2023. We identified both the relevant confounding factors of tool performance, using the MetaForest algorithm, and the best tools, comparing linear, machine learning, and deep learning methods with mixed-effects meta-regression models. The risk of bias was evaluated using the PROBAST tool. Our systematic search identified 27,312 studies, out of which 290 were eligible for data extraction, reporting on 430 independent evaluations of severity prediction tools with ~2.8 million patients. Neural network-based tools have the highest performance, with a pooled AUC of 0.893 (0.748–1.000), a sensitivity of 0.752 (0.614–0.853), and a specificity of 0.914 (0.849–0.952), using clinical, laboratory, and imaging data. The relevant confounders of performance are the geographic region of patients, the rate of severe cases, and the use of C-reactive protein as input data. 88% of studies have a high risk of bias, mostly because of deficiencies in the data analysis.
All investigated tools in use aid decision-making for COVID-19 severity prediction, but machine learning tools, specifically neural networks, clearly outperform other methods, especially when the basic characteristics of severe and non-severe patient groups are similar, and they do so without the need for more data. When highly specific biomarkers are not available (as in the case of COVID-19), practitioners should abandon general clinical severity scores and turn to disease-specific machine learning tools.

Introduction

As of late 2024, over 700 M cases of coronavirus disease (COVID-19) have been registered, with around 7 M mortalities1 in connection with severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) infections. Respiratory symptoms are most commonly observed, but the disease may also affect the heart, kidneys, liver, pancreas, GI tract, brain, and blood vessels2. Infection-induced microangiopathy and microthrombi throughout the body also lead to multiorgan damage3.

Making a correct decision on the adequate level of care is critical for patients, as the time from disease onset is a strong determinant of outcomes4, and is also crucial for the management of limited resources like intensive care unit (ICU) beds and ventilation equipment5. Many prognostic tools are in use for early prediction of disease severity. Some of these are general clinical scores (e.g., MuLBSTA and CURB-65)6, while others were developed specifically for COVID-19, ranging in methods from nomograms to artificial intelligence (AI) driven online applications7. The widespread use of vaccines and milder virus variants8 could only decrease, not eliminate, the ratio of severe to total cases; thus, effective severity prediction remains essential.

There is still no consensus on which prognostic tools should be used in real-world situations. Our systematic review and meta-analysis quantifies the performance differences between currently available severity prediction tools to guide clinicians in making optimal decisions.
We also aim to provide generalizable conclusions for developing severity prediction tools for other diseases, especially in resource-constrained scenarios.

As pulmonary complications are distinctive of COVID-19 and deep learning (DL) has been most successful in medical imaging9, we hypothesized that neural networks would outperform other methods in this application, especially the ones that incorporate imaging data.

Another important question related to this research concerns the possible implications for model development in other diseases for similar severity prediction applications; thus, we investigate the relationship between the characteristics of patient groups and tool complexity to ascertain the requirements of different methods.

Results

An overview of key results and corresponding methods is shown in Table 1.

Table 1 Summary of key results and methods.

Search and selection

We identified 27,312 studies by systematically searching the five medical databases above (MEDLINE: 13,096; Embase: 5,385; Cochrane and Cochrane COVID-19: 3,411; Scopus: 5,420). 16,399 records were screened, first based on the title and abstract, then on full texts. Ultimately, out of the 683 studies retrieved, 280 from database search and 10 from citations were eligible for data extraction. The selection process is summarised in Fig. 1. These 290 studies reported 430 independent evaluations of severity prediction tools.

Fig. 1 PRISMA 2020 flow diagram.

Descriptive characteristics of included studies

The 430 prediction tool evaluations used data from a total of 2,817,359 patients, with a mean sample size of 6,552 and a median of 320.
Summary statistics for the three main tool types are given in Table 2; individual study characteristics are detailed in Supplementary Data S1, and references are listed in Supplementary References S1.

Table 2 Descriptive statistics for tool types.

Of the 430 evaluations, 221 (51.4%) use mortality (either 30 days after hospital admission or at the final data collection date) to define severe cases, with survivors as non-severe cases of COVID-19. Most of the other 209 evaluations use one of the following definitions: (1) moved to ICU, (2) needed some type of mechanical ventilation, or (3) the composite definition of the National Health Commission of the People's Republic of China10: having respiratory distress, respiratory rate ≥ 30 times/min, oxygen saturation ≤ 93%, PaO2/FiO2 ≤ 300 mm Hg. We grouped these non-mortality severe outcome definitions, as none of the analyses showed substantial differences between them.

Performance of severity prediction tools

From the 290 studies that provided enough data to assess the Area Under the receiver operating characteristic Curve (AUC), 430 independent evaluations could be extracted, as many studies used different patient groups to develop and validate tools. 61.2% of tools are either preexisting clinical severity scores (used as is or modified for COVID-19) or based on simple logistic regression. Thus, most tools are linear classifiers that are easy to use and interpret, achieving a pooled AUC of 0.855 (SE 0.009) based on a random-effects meta-analysis. Machine learning tools other than neural networks have a pooled AUC of 0.891 (SE 0.008), while for neural network-based tools, the pooled AUC is 0.900 (SE 0.015). Tools in the latter two groups significantly (p < 0.001) outperform linear methods on average. However, substantial residual heterogeneity (I² = 95.59% and R² = 5.84%) shows that other important factors affect performance besides the mathematical structure of the tools.
(See detailed regression output in Supplementary Table S1 and results in Supplementary Table S2.)

Identification of confounders of tool performance

In total, 24 variables were identified as possible confounders for measuring the effect of tool type on severity prediction performance. (The complete list of variables can be found in Supplementary Table S3.) Out of all candidates, 13 proved to have positive explained variance in out-of-bag samples in 99% of the 200 replications of the MetaForest model with 5000 trees: region of patient population, total number of patients included in the study, rate of severe patients, how severity was measured in the study (as composite severity or as mortality), the time of the study, and the inclusion of age, C-reactive protein (CRP), respiratory rate, white blood cell count (WBC), measurement of blood gases, lactate dehydrogenase (LDH), blood urea nitrogen (BUN), and albumin as input variables of the prediction tool. Partial dependence plots suggested a quadratic relationship between AUC and the rate of severe patients in the study. (See replication importance values in Supplementary Fig. S1.)

Area under the curve for severity prediction

After the preselection of confounders by the MetaForest algorithm, we used a mixed-effects meta-regression to estimate effect sizes and select a more parsimonious model by the permutation of the preselected variables. We opted for a model with 3 variables that retains an R² of 35.12% compared to the full model’s 38.31%. As Table 3 shows, the 3 variables that have a significant effect are the region of the patient population, the rate of severe patients (and its squared value because of the quadratic dependence), and the inclusion of C-reactive protein (CRP) as an input variable of the prediction tool.
(The results of the univariate analyses for the three confounders of AUC can be found in Supplementary Table S4.)

Table 3 Mixed-effects meta-regression results for AUC.

For the reference setup (a logistic regression prediction tool used on European patients, without CRP as input data, at a 25% severity rate, which is the mean in the collected data), the expected AUC is 0.857 (SE 0.010). A simple clinical score would have a significantly lower pooled AUC by 0.029 (SE 0.012), while neural network-based tools have a higher pooled AUC by 0.039 (SE 0.020). Thus, the difference in AUC between the best and worst tool types is 0.068 on average. For patients from China, prediction performance is better than for Europeans by 0.069 (SE 0.009), and using CRP increases the expected pooled AUC by 0.022 (SE 0.007). To illustrate the magnitude of differences in performance, the expected AUC of a simple clinical score used in a European setting with a 40% severity rate, not considering CRP, is 0.780 (SE 0.012), while that of a neural network in a Chinese setting with a 5% severity rate and using CRP values is 0.998 (SE 0.020). An AUC like the first is usually labelled fair, while the second is excellent11.

Sensitivity and specificity of severity prediction

Studies are heterogeneous in their method of choosing sensitivity and specificity, but this is independent of tool type; thus, a comparison is valid. Values in Table 4 are adjusted for the region of the patient population, the rate of severe patients, and the inclusion of CRP as an input for the prediction. Because of insufficient reporting by the original studies, only 124 could be included in this analysis.

Table 4 Mean sensitivity and specificity by tool type, adjusted for confounders.

Supplementary Table S5 shows that the three tool types have significantly different mean sensitivity and specificity values. Linear classifiers are the least performant, with close to 72% sensitivity and 80% specificity.
Meanwhile, neural network-based tools are the best, with a mean sensitivity and specificity of more than 75% and 91%, respectively. Study-level adjusted mean values are displayed in Fig. 2.

Fig. 2 Study-level adjusted mean sensitivity and false positive rate. The figure shows the study-level adjusted mean sensitivity and false positive rate by tool type with 95% confidence interval for the mean and estimated hierarchical summary receiver operating characteristic curves (HSROC). (Note that the false positive rate goes up only to 50% on the figure.)

The same ordering of tool types can be observed: linear tools are the worst, machine learning tools are better, and neural networks are the best. Table 5 shows a hypothetical patient population of 700 million with 10% severe cases. Linear classifier-based tools give a positive test result for 271 M patients and correctly identify 44 M, while machine learning tools produce 227 M positives, out of which 49 M are true positives. In contrast, neural networks give a positive result for 144 M, out of which 47 M are true positive cases.

Table 5 Estimated and adjusted performance metrics. The table shows the estimated and adjusted performance metrics for different tool types at a 10% patient severity rate in a patient population of size 700 M (equal to the total number of COVID-19 cases worldwide until late 2024).

Comparative analysis for tool development

Input modalities

Investigating the different types of input data modalities, we found that 78.4% of tools rely only on tabular data. Basic demographic data and laboratory measurements are each used in about three-quarters of cases (74.6% and 74.0%, respectively). Imaging data is used in 21.6% of cases, and none of the predictive tools incorporated text data.
Using the mixed-effects meta-regression separately for the three main tool types, Table 6 shows that for linear methods, laboratory data and imaging data each increase AUC, but only when one of them is used and not both, as indicated by the significance and effect size of the interaction term for lab and imaging. For machine learning or neural network-based tools, adding imaging or laboratory data to demographic and other clinical input data does not change the estimated pooled AUC significantly, suggesting that machine learning and neural network-based tools gain their classification performance benefit more from the non-linear mathematical structure than from specific input data.

Table 6 Meta-regression estimates of the effects of input data types on AUC for the three tool types. The constant term is the estimated pooled AUC for a given tool type without imaging or laboratory data, and the interaction term is the additional effect of the two data types on AUC when both are used. Demographic and other clinical input data may be used in all cases.

We analysed AUC as a function of the mean age difference between severe and non-severe patient groups, where the mean age difference serves as a proxy for the difficulty of the classification problem; the model also included the number of patients, region, and rate of severe patients. As mean age difference is an aggregate value, ecological bias may be present, but our separate patient-level analysis suggested this is not an issue. If the mean age difference is large, the two patient groups are easily separated by demographics, making the mathematical structure of the prediction tool or other input data less important. In such a setting, simple linear methods are expected to work well. As age is highly correlated with both disease severity and many clinical and laboratory measurements12, if the mean age difference is small, the two patient groups are not expected to be easily separable.
Figure 3 shows that for linear tools, AUC is expected to significantly increase with an increase in mean age difference (slope of 0.0045 AUC/year), while machine learning or neural network-based tools do not display such a relationship. The number of patients in the dataset has a small but significant negative association with AUC for linear methods, which means prediction tools from larger studies have a lower performance. The dependence of AUC on the number of patients is not significant for machine learning and neural network-based tools, confirming that the increase in performance comes from the mathematical structure and not additional data. 131 studies had data for linear classifiers and 67 for machine learning tools. The estimated regression coefficients are in Table 7.

Fig. 3 AUC by mean age difference for linear classifiers and machine learning/neural network tools separately. Blue lines show the estimated linear relationship, circles are studies.

Table 7 Parameter estimates for mixed-effects meta-regression on AUC.

Risk of bias assessment

The risk of bias and applicability assessment is shown in Fig. 4, as proposed by Moons et al.13. There is a high risk of bias overall for close to 88% of studies, predominantly because of the high risk of bias in the analysis phase. The two most often encountered errors in the analysis were (a) using univariate association measures between potential explanatory factors and the severity outcome to select variables to be included in a multivariate prediction tool and (b) p-value-driven variable selection. Handling missing data is also an issue, as most studies opt to exclude patients who have “too many” missing values, and even if missing values are imputed, detailed statistics comparing original and imputed values are not provided.
Overfitting is primarily avoided by separating training and validation sets of patients, but performance metrics are only reported on separate test sets if data is gathered from multiple sources.

Supplementary Fig. S2 shows funnel plots for subgroups by main tool types and geographic region of the patient population, and corresponding Egger’s test results can be found in Supplementary Table S6.

Fig. 4 PROBAST results for Risk of Bias (left panel) and for Applicability (right panel).

Discussion

Many prognostic tools have been developed for early prediction of disease severity in COVID-19, but which should be used in practice is an open question. We have conducted a systematic review and meta-analysis to quantify performance differences between severity prediction tools to aid clinicians in choosing the best options, and to assess what lessons can be learned for future prediction tool development for similar applications.

The MetaForest algorithm showed that after the geographic region of the study population, the second most important factor in predicting disease outcomes is the type of prediction tool itself. Machine learning tools outperform simple clinical scores and classical statistical models in all cases, even when sample size, study population characteristics, and other methodological choices are controlled for. Neural networks, in particular, have the highest pooled AUC of 0.893 (0.748–1.000), a sensitivity of 0.752 (0.614–0.853), and a specificity of 0.914 (0.849–0.952). Thus, clinicians’ default choice should not be the most often used simple scores and logistic regressions.
The absence of a clear dependence of AUC on sample size suggests that if there is enough data for developing simple prediction tools, an ML-based version is also feasible and has higher performance potential.

Using our original search key of “(Covid) AND (severity) AND (prediction OR prognosis)”, we identified 36 systematic reviews and meta-analyses on MEDLINE (via PubMed) up to 1 July 2023 that review the association of demographic, clinical, laboratory measurements, medical imaging-based values, or comorbidities with disease severity. (For the list of review studies, see Supplementary References S2.) Out of these, five only consider comorbidities, while two cover only medical imaging. The top five measurements found to be associated with severity are D-dimer, C-reactive protein (CRP), white blood cells (WBC), lactate dehydrogenase (LDH), and lymphocytes. Supplementary Table S7 shows the most important factors from the reviews with the ratio of prognostic tools using those factors. C-reactive protein, the only laboratory measurement in our final parsimonious meta-regression model, was the second most common factor associated with disease severity, being a statistically significant prognostic feature in 51.7% of reviews.

We have also found four systematic reviews and meta-analyses focusing on severity prediction. Wynants et al.14 is an early review of available tools without a meta-analysis and highlights the fundamental problem of not following guidelines (e.g. TRIPOD, the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis15) in reporting, which makes performance evaluations very difficult. de Jong et al.16 conducted an individual-level meta-analysis to validate previously developed prediction tools.
They also found considerable heterogeneity in the performance of different tools but only considered simple logistic regression-based methods, omitting ML or DL tools altogether because of limited reproducibility. Wang et al.17 focus on deep learning tools for diagnosis and severity prediction using imaging data. Pooled sensitivity and specificity are 0.76 (95% CI 0.74–0.79) and 0.82 (95% CI 0.78–0.86) for prediction, which aligns with our findings. Chen et al.18 review 33 studies using machine learning tools (according to their definition of ML, which differs from the one used in our review). The pooled sensitivity for all studies is 0.86 (95% CI, 0.79–0.90), the specificity is 0.87 (95% CI, 0.80–0.92), and the AUC is 0.93 (95% CI, 0.90–0.95). These suggest better performance than our results, which can be attributed to their strict selection criteria (e.g., omitting studies with small sample sizes or without ML) that systematically favour tools with higher performance. Their list of the most important input variables for the prediction tools is highly similar to our list of variables found using the MetaForest selection algorithm, the top five being age, LDH, lymphocytes, respiratory rate, and CRP.

While our meta-analysis agrees with previous studies about the most important laboratory measurements, we could show that other, more influential factors affect a prediction tool’s performance. The patient population plays a key role, at least through a regional effect and through the rate of severe cases. Prediction tools developed and used on European and North American patients were less precise than those on Chinese patients, and a lower rate of severe cases naturally meant that it was harder to accurately identify those patients.
Additional analysis concerning the development of prediction tools suggested that machine learning models can be used even without a large dataset and can handle cases with a complex relationship between measurements and severity categorization.

Our systematic review and meta-analysis is the most comprehensive to date. As we included an extensive range of COVID-19 disease severity prediction tools, the empirical foundation of our conclusions is more robust than that of previous reviews, which are all narrower in scope. Using the MetaForest algorithm, we could account for the relevant prognostic performance confounders, clearly showing ML tools’ superiority. While most previous reviews dealt with finding the best prediction factors, we could show that prognostic tools’ other features, notably the mathematical structure, have an even more critical role in determining performance.

A limitation for the generalizability of our results stems from the nature of COVID-19, as we have found that for such a systemic disease, there are many equally good sets of input measurements for prediction tools. Different tools based on information from clinical data, laboratory values, or medical imaging could all perform similarly well. For other diseases, finding a single most important biomarker out of many candidates or high reliance on medical imaging may prove the need for more specialized prediction tools.

A general implication for future research is that better adherence to reporting guidelines like TRIPOD15 is essential. Otherwise, it is difficult to compare different methods and tools rigorously. Specifically, for COVID-19 severity prediction, as detailed imaging data does not have a predominant role here, the real strength of deep learning methods could not be utilized.
However, when investigating the impact of patient demographics on performance, we found that simple linear methods worked well when the non-severe and the severe patient groups were clearly separable according to these, while machine learning tools performed well even if the two groups have similar basic characteristics.

It is well known that tree-based methods outperform neural networks on tabular data unless the network architecture is specifically adapted19. Both for new diseases and for new severity prediction tools in known ones, incorporating existing knowledge about the relationship between different measurements may be beneficial9. Graph-based deep learning methods20 could be viable options, as the input graph structure can encode prior information about the features.

A potential drawback of using deep learning prediction tools is their limited interpretability compared to simple linear methods, but pairing them with auxiliary aids such as SHAP values21 and interactive dashboards22 can mitigate this issue.

For practitioners, it is essential to understand that simple clinical scores may be useful if no distinct data is available for developing a disease-specific prognostic tool; however, general prognostic tools should be replaced with disease-specific ones as quickly as possible because performance differences may be substantial. This highlights the importance of interdisciplinary cooperation, as put forth in the Cycle Model for Translational Medicine23,24, which argues that supporting a biostatistics team working alongside clinicians is essential in implementing state-of-the-art decision support tools.

All tools in use today aid decision-making for COVID-19 severity prediction, but machine learning tools, and specifically neural networks, clearly outperform other methods, most notably by decreasing the number of incorrectly identified severe cases and consequently supporting a more efficient utilization of healthcare resources (e.g. ICU beds), which is critical in high-load scenarios like a pandemic. However, deep learning radiomics tools are not required for this application, as image data seems redundant once clinical and laboratory measurements are available. When highly specific biomarkers are not available (as in the case of COVID-19), practitioners should abandon simple cutoff-based linear models (e.g. logistic regression) and general clinical severity scores (e.g. CURB-65 or NEWS2) as soon as possible and strive to develop task-specific machine learning tools.

Methods

Study design

This systematic review and meta-analysis is reported according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020 statement25 and follows Debray et al.’s A Guide to Systematic Review and Meta-analysis of Prediction Model Performance by the Cochrane Prognosis Methods Group26,27. When evaluating the included studies, we also considered the recommendations for the development, validation, and impact of statistical models that predict individual risk proposed by the PROGRESS (PROGnosis RESearch Strategy) partnership28, and the CHARMS checklist (CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies), which guides the reporting of prediction models29. The review protocol was registered on PROSPERO (registration number CRD42022377599, see www.crd.york.ac.uk/prospero).

Information sources and eligibility criteria

The systematic literature search was conducted on 2 April 2023 in five major medical databases: MEDLINE (via PubMed), Embase, Cochrane Library (CENTRAL), Cochrane COVID-19 Study Register, and Scopus. Studies published from 1 January 2020 up to the search date that reported on hospitalized adult patients with Reverse Transcription Polymerase Chain Reaction (RT-PCR) confirmed SARS-CoV-2 infection were included. The basis of comparison was the use of different severity prediction tools.
One of the outcomes measured needed to be dichotomous disease severity, and, to assess prediction tool performance, at least the Area Under the receiver operating characteristic Curve (AUC), sensitivity and specificity, or a confusion matrix was required.

Studies reporting on paediatric, geriatric, or other sub-populations (e.g., patients with certain chronic diseases) were excluded, as well as those that only measured severity as a non-dichotomous quantity or exclusively investigated the role of a single factor of disease severity (e.g., the association of obesity with severity). Non-English articles, case reports/series, reviews, commentary articles, and letters to the editor were not considered. Only the original study was included whenever multiple studies analysed the same data.

Search strategy

In all databases listed above, we used the simple search key of “(Covid) AND (severity) AND (prediction OR prognosis)”, as its components are Medical Subject Headings (MeSH terms), which include relevant variations.

Selection process

The selection was performed by two independent reviewers from the team (MR and GAR; Cohen’s Kappa = 0.92) for all studies after the duplicates were removed, first by title and abstract, then by assessing the full text. Disagreements were resolved by a third author (PF).

Data collection process and data items

Based on a pilot data collection and the consensus of methodological and clinical experts, we created a standardized data collection sheet in Microsoft Excel. Data from all eligible articles were extracted independently by two authors.
The data items collected were study period and geographical area, study design, outcome definitions, patient demographics (group sizes, age, sex, country of origin), prediction tool type, statistical methods of variable selection, variables used in the prediction model, and performance metrics (primary: AUC; secondary: sensitivity, specificity, positive predictive value, negative predictive value, confusion matrix entries).

Study risk of bias and quality assessment

The risk of bias in the results and the applicability of the prognostic model were assessed using the PROBAST tool (Prediction model Risk of Bias Assessment Tool)30 by MR. The four domains for risk of bias assessment are participants, predictors, outcomes, and analysis; for applicability, participants, predictors, and outcomes. Each domain was classified as high, low, or unclear.

Synthesis methods

As the primary performance metric for comparing severity prediction tools, the Area Under the Receiver Operating Characteristic Curve (AUC) was pooled for tool types based on mathematical-statistical structure using a random-effects model31. The list of 8 types, from simpler toward more sophisticated models, is as follows: (a) clinical scores, (b) logistic and Cox regressions, (c) support vector machines (SVM), (d) random forests, (e) boosting models, (f) simple neural networks (single hidden-layer perceptron), (g) deep learning models (with at least two hidden layers in the network).
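The random-effects pooling step mentioned above can be illustrated with a minimal sketch. It assumes the DerSimonian-Laird estimator of between-study variance on the untransformed AUC scale; the article does not state which estimator or transformation was used, and the function name `pool_random_effects` and the input values are hypothetical.

```python
import math


def pool_random_effects(estimates, std_errors):
    """Pool study-level estimates with a DerSimonian-Laird random-effects model.

    Assumption: this estimator is chosen for illustration only; the article
    does not specify the one actually used.
    """
    variances = [se ** 2 for se in std_errors]
    weights = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    total_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, estimates)) / total_w

    # Cochran's Q measures the dispersion of studies around the fixed-effect mean.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    c = total_w - sum(w ** 2 for w in weights) / total_w

    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau2 to every within-study variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / sum(re_weights)
    se_pooled = math.sqrt(1.0 / sum(re_weights))

    # I^2: share of total variability attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {"pooled": pooled, "se": se_pooled, "tau2": tau2, "i2": i2}


if __name__ == "__main__":
    # Hypothetical AUCs and standard errors, not taken from the included studies.
    aucs = [0.82, 0.90, 0.86, 0.78]
    ses = [0.020, 0.030, 0.025, 0.040]
    print(pool_random_effects(aucs, ses))
```

When studies disagree strongly, `tau2` grows and the random-effects weights become more equal across studies, so the pooled standard error widens relative to a fixed-effect analysis.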
For some analyses, we also grouped the types into 3 categories: types a–b as linear classifiers, types c–e as machine learning methods, and types f–g as neural networks.

As secondary performance metrics, we analysed sensitivity and specificity values using a single point on the Receiver Operating Characteristic (ROC) curve as given in the original articles.

There are many possible confounding factors influencing tool performance (geographic region of study participants, number and types of demographic, clinical, laboratory, or other measurements used for prediction, etc.), which required a method that could perform an exploratory search and selection of relevant variables while also accounting for between-study heterogeneity. The MetaForest method was selected, as it was specifically developed for such problems32. It is a modification of the random forest algorithm by Breiman33, retaining its robustness against overfitting and its capacity to model non-linear relationships, including higher-order interactions between the explanatory factors and the outcome performance metric. We used 200 replications with 5000 trees and random-effects weights, choosing the variables with positive estimates of explained variance in out-of-bag samples in 99% of cases.

First, we used MetaForest for AUC to find the important subset of control variables. Then we included these variables in a multivariate mixed-effects meta-regression model34, ensuring that no influential non-linear or interaction effects were excluded. Model checking was done using a QQ-plot for the residuals.

Although AUC is an important measure to compare tools and the one most commonly reported, the pair of sensitivity and specificity values is more characteristic of practical use, as these refer to the actual operating point on the ROC curve while not being influenced by the true rate of severe patients. The same control variables were used when pooling sensitivity and specificity as for AUC.
The bivariate model of Reitsma et al.35,36 was fitted, as this approach accounts for the dependency between sensitivity and specificity. An advantage of analysing sensitivity and specificity is that they do not depend on the disease’s prevalence.

Positive predictive value (PPV) and negative predictive value (NPV) are more easily interpretable but depend on the rate of severe patients in the population. Nevertheless, we estimated PPV and NPV as a function of severity rate for different tool types. We compared different tools in hypothetical application scenarios to show their influence on resource management.

The statistical analysis of the data was conducted using the R software (R Core Team, 2024, Vienna, Austria).
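The dependence of PPV and NPV on the severity rate follows directly from Bayes' theorem. The short Python sketch below illustrates it; it is not the R code used in the study. The sensitivity and specificity are the pooled neural network values reported in the abstract (0.752 and 0.914), while the severity rates are hypothetical scenario values chosen for illustration.

```python
def predictive_values(sensitivity, specificity, severity_rate):
    """Return (PPV, NPV) for a tool applied at a given rate of severe cases.

    Assumes 0 < severity_rate < 1 and a tool better than a constant guess.
    """
    true_pos = sensitivity * severity_rate
    false_pos = (1.0 - specificity) * (1.0 - severity_rate)
    false_neg = (1.0 - sensitivity) * severity_rate
    true_neg = specificity * (1.0 - severity_rate)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Pooled neural network operating point from the abstract
SENSITIVITY, SPECIFICITY = 0.752, 0.914

# Hypothetical severity rates for three application scenarios
table = [(rate, *predictive_values(SENSITIVITY, SPECIFICITY, rate))
         for rate in (0.05, 0.20, 0.50)]

for rate, ppv, npv in table:
    print(f"severity rate {rate:.2f}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

With a fixed operating point, PPV rises and NPV falls as severe cases become more common, which is why the same tool can have different consequences for resource management in different settings.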

Data availability

All data used in this study can be found in the full-text articles included in the systematic review and meta-analysis. Data is provided within the supplementary information files.

References

1. Worldometers.info. Worldometer Covid-19 data. (2024).
2. Peiris, S. et al. Pathological findings in organs and tissues of patients with COVID-19: A systematic review. PLoS ONE 16, e0250708 (2021).
3. Raviraj, K. G. & Shobhana, S. S. Findings and inferences from full autopsies, minimally invasive autopsies and biopsy studies in patients who died as a result of COVID19—A systematic review. Forens. Sci. Med. Pathol. 18, 369–381 (2022).
4. Chen, T. et al. Clinical characteristics and outcomes of older patients with coronavirus disease 2019 (COVID-19) in Wuhan, China: A single-centered, retrospective study. J. Gerontol. Ser. A 75, 1788–1795 (2020).
5. Anderegg, N., Panczak, R., Egger, M., Low, N. & Riou, J. Survival among people hospitalized with COVID-19 in Switzerland: A nationwide population-based analysis. BMC Med. 20, 164 (2022).
6. Cheng, P. et al. Pneumonia scoring systems for severe COVID-19: Which one is better. Virol. J. 18, 33 (2021).
7. Liang, W. et al. Early triage of critically ill COVID-19 patients using deep learning. Nat. Commun. 11, 3543 (2020).
8. World Health Organization. Initial Risk Assessment. (2023). https://www.who.int/docs/default-source/coronaviruse/21042023xbb.1.16ra-v2.pdf
9. Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022).
10. National Health Commission (NHC) of the People’s Republic of China. Guidelines for the Prevention, Diagnosis, and Treatment of Novel Coronavirus-Induced Pneumonia. (2020).
11. de Hond, A. A. H., Steyerberg, E. W. & van Calster, B. Interpreting area under the receiver operating characteristic curve. Lancet Digit. Health 4, e853–e855 (2022).
12. Katzenschlager, R. et al. Long-term safety and efficacy of apomorphine infusion in Parkinson’s disease patients with persistent motor fluctuations: Results of the open-label phase of the TOLEDO study. Parkinsonism Relat. Disord. 83, 79–85 (2021).
13. Moons, K. G. M. et al. PROBAST: A tool to assess risk of bias and applicability of prediction model studies: Explanation and elaboration. Ann. Intern. Med. 170, W1–W33 (2019).
14. Wynants, L. et al. Prediction models for diagnosis and prognosis of covid-19: Systematic review and critical appraisal. BMJ 369, m1328 (2020).
15. Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. G. M. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. BMJ 350, g7594 (2015).
16. de Jong, V. M. T. et al. Clinical prediction models for mortality in patients with covid-19: External validation and individual participant data meta-analysis. BMJ 378, e069881 (2022).
17. Wang, C., Liu, S., Tang, Y., Yang, H. & Liu, J. Diagnostic test accuracy of deep learning prediction models on COVID-19 severity: Systematic review and meta-analysis. J. Med. Internet Res. 25, e46340 (2023).
18. Chen, R. et al. Prediction of prognosis in COVID-19 patients using machine learning: A systematic review and meta-analysis. Int. J. Med. Informat. 177, 105151 (2023).
19. Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? Preprint at https://doi.org/10.48550/arXiv.2207.08815 (2022).
20. Veličković, P. et al. Graph Attention Networks. (2018). https://doi.org/10.48550/arXiv.1710.10903
21. Lundberg, S. & Lee, S. I. A unified approach to interpreting model predictions. (2017). https://doi.org/10.48550/arXiv.1705.07874
22. Kui, B. et al. EASY-APP: An artificial intelligence model and application for early and easy prediction of severity in acute pancreatitis. Clin. Transl. Med. 12, e842 (2022).
23. Hegyi, P. et al. Academia Europaea position paper on translational medicine: The cycle model for translating scientific results into community benefits. J. Clin. Med. 9, 1532 (2020).
24. Hegyi, P., Erőss, B., Izbéki, F., Párniczky, A. & Szentesi, A. Accelerating the translational medicine cycle: The Academia Europaea pilot. Nat. Med. 27, 1317–1319 (2021).
25. Page, M. J. et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ (Clinical Res. ed.) 372, (2021).
26. Debray, T. P. A. et al. A guide to systematic review and meta-analysis of prediction model performance. BMJ 356, i6460 (2017).
27. Cochrane Handbook for Systematic Reviews of Interventions, 2nd Edition (Wiley). https://www.wiley.com/en-ie/Cochrane+Handbook+for+Systematic+Reviews+of+Interventions%2C+2nd+Edition-p-9781119536628
28. Hemingway, H. et al. Prognosis research strategy (PROGRESS) 1: A framework for researching clinical outcomes. BMJ 346, e5595 (2013).
29. Moons, K. G. M. et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: The CHARMS checklist. PLoS Med. 11, e1001744 (2014).
30. Wolff, R. F. et al. PROBAST: A tool to assess the risk of bias and applicability of prediction model studies. Ann. Intern. Med. 170, 51–58 (2019).
31. Harrer, M., Cuijpers, P., Furukawa, T. A. & Ebert, D. D. Doing Meta-Analysis in R. (2021). https://doi.org/10.1201/9781003107347
32. van Lissa, C. J. MetaForest: Exploring heterogeneity in meta-analysis using random forests. Preprint at https://doi.org/10.31234/osf.io/myg6s (2017).
33. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
34. Xu, C. & Doi, S. A. R. Meta-regression. In Meta-Analysis: Methods for Health and Experimental Studies (ed. Khan, S.) 243–254 (Springer, 2020). https://doi.org/10.1007/978-981-15-5032-4_11
35. Reitsma, J. B. et al. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J. Clin. Epidemiol. 58, 982–990 (2005).
36. Chu, H. & Cole, S. R. Bivariate meta-analysis of sensitivity and specificity with sparse data: A generalized linear mixed model approach. J. Clin. Epidemiol. 59, 1331–1332 (2006).
37. Harbord, R. M., Deeks, J. J., Egger, M., Whiting, P. & Sterne, J. A. C. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 8, 239–251 (2007).
38. Egger, M., Smith, G. D., Schneider, M. & Minder, C. Bias in meta-analysis detected by a simple, graphical test. BMJ 315, 629–634 (1997).
39. Deeks, J. J., Macaskill, P. & Irwig, L. The performance of tests of publication bias and other sample size effects in systematic reviews of diagnostic test accuracy was assessed. J. Clin. Epidemiol. 58, 882–893 (2005).
40. Debray, T. P. A. et al. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J. Clin. Epidemiol. 68, 279–289 (2015).
41. Dhiman, P. et al. Risk of bias of prognostic models developed using machine learning: A systematic review in oncology. Diagn. Prognostic Res. 6, 13 (2022).
42. Chen, N. et al. Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: A descriptive study. Lancet 395, 507–513 (2020).
43. Bellman, R. E. Dynamic Programming (Dover Publications, Inc., 2003).
44. Vergouwe, Y., Moons, K. G. M. & Steyerberg, E. W. External validity of risk models: Use of benchmark values to disentangle a case-mix effect from incorrect coefficients. Am. J. Epidemiol. 172, 971–980 (2010).
45. Vapnik, V. N. The Nature of Statistical Learning Theory (Springer, 2000). https://doi.org/10.1007/978-1-4757-3264-1
46. Small Sample Size Solutions: A Guide for Applied Researchers and Practitioners. (Routledge, 2020). https://doi.org/10.4324/9780429273872
47. Menezes, R. G. et al. Postmortem findings in COVID-19 fatalities: A systematic review of current evidence. Leg. Med. 54, 102001 (2022).
48. Simmonds, M. Quantifying the risk of error when interpreting funnel plots. Syst. Rev. 4, (2015).
49. Leeflang, M. M. G., Deeks, J. J., Rutjes, A. W. S., Reitsma, J. B. & Bossuyt, P. M. M. Bivariate meta-analysis of predictive values of diagnostic tests can be an alternative to bivariate meta-analysis of sensitivity and specificity. J. Clin. Epidemiol. 65, 1088–1097 (2012).
50. Kuo, K. M., Talley, P. C. & Chang, C. S. The accuracy of machine learning approaches using non-image data for the prediction of COVID-19: A meta-analysis. Int. J. Med. Informat. 164, 104791 (2022).

Funding

Open access funding provided by Semmelweis University. Funding was provided by the Centre for Translational Medicine, Semmelweis University. Sponsors had no role in the design, data collection, analysis, interpretation, and manuscript preparation.

Author information

Authors and Affiliations

Centre for Translational Medicine, Semmelweis University, Budapest, Hungary
Márton Rakovics, Fanni Adél Meznerics, Péter Fehérvári, Tamás Kói, Dezső Csupor, András Bánvölgyi, Gabriella Anna Rapszky, Marie Anne Engh, Péter Hegyi & Andrea Harnos

Faculty of Social Sciences, Department of Statistics, ELTE Eötvös Loránd University, Budapest, Hungary
Márton Rakovics

Department of Dermatology, Venereology and Dermatooncology, Semmelweis University, Budapest, Hungary
Fanni Adél Meznerics & András Bánvölgyi

Biostatistics Department, University of Veterinary Medicine, Budapest, Hungary
Péter Fehérvári & Andrea Harnos

Department of Stochastics, Budapest University of Technology and Economics, Budapest, Hungary
Tamás Kói

Institute of Clinical Pharmacy, University of Szeged, Szeged, Hungary
Dezső Csupor

Institute for Translational Medicine, Medical School, University of Pécs, Pécs, Hungary
Dezső Csupor & Péter Hegyi

Institute of Pancreatic Diseases, Semmelweis University, Budapest, Hungary
Péter Hegyi

Contributions

Conceptualization: MR, FAM, AH, PF, TK, DCs, AB, PH. Project administration: MR, FAM. Data curation: GAR. Methodology: MR, PF, TK. Formal analysis: MR, TK. Visualization: MR. Writing—original draft: MR. Writing—review & editing: FAM, AH. Review: PF, TK, GAR, MAE. Supervision: AH, DCs, AB, MAE, PH. Funding acquisition: PH. All authors certify that they have participated sufficiently in the work to take public responsibility for the content, including participation in the concept, design, analysis, writing, or revision of the manuscript.

Corresponding author

Correspondence to Márton Rakovics.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Supplementary Material 1, Supplementary Material 2, Supplementary Material 3

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Rakovics, M., Meznerics, F.A., Fehérvári, P. et al. Deep neural networks excel in COVID-19 disease severity prediction—a meta-regression analysis. Sci Rep 15, 10350 (2025). https://doi.org/10.1038/s41598-025-95282-6

Received: 03 October 2024
Accepted: 20 March 2025
Published: 26 March 2025
DOI: https://doi.org/10.1038/s41598-025-95282-6

Keywords

COVID-19, Severity prediction, Machine learning, Deep learning, Artificial intelligence
