nature.com

Identifying and ranking non-traditional risk factors for cardiovascular disease prediction in people with type 2…

Abstract

Background

Cardiovascular disease (CVD) prediction models perform poorly in people with type 2 diabetes (T2DM). We aimed to identify potentially non-traditional CVD predictors for six facets of CVD (including coronary heart disease, ischemic stroke, heart failure, and atrial fibrillation) in people with T2DM.

Methods

We analysed data on 600+ features from the UK Biobank, stratified by history of CVD and T2DM: 459,142 participants without diabetes or CVD, 14,610 with diabetes but without CVD, and 4,432 with diabetes and CVD. A penalised generalized linear model with a binomial distribution was used to identify CVD-related features. Subsequently, a 20% hold-out set was used to replicate identified features and provide an importance-based ranking.

Results

Here we show that non-traditional risk factors are of particular importance in people with diabetes. Classical CVD risk factors (e.g. family history, high blood pressure) rank highly in people without diabetes. For individuals with T2DM but no CVD, top predictors include cystatin C, self-reported health satisfaction, and biochemical measures of ill health. In people with diabetes and CVD, key predictors are self-reported ill health and blood cell counts. Unique diabetes-related risk factors include dietary patterns, mental health and biochemistry measures (e.g. oestradiol, rheumatoid factor). Adding these features improves risk stratification; per 1000 people with diabetes, 133 CVD cases and 165 HF cases are assigned a higher risk.

Conclusions

This study identifies numerous replicated non-traditional CVD risk factors for people with T2DM, providing insight to improve guideline-recommended risk prediction models which currently overlook these features.

Plain language summary

People living with type-2 diabetes have a higher chance of developing problems with their heart or blood circulation, known as cardiovascular disease. Cardiovascular disease is a major cause of illness and death in people with diabetes. Commonly used risk factors for cardiovascular disease are unable to identify who will develop disease in future. This study aims to uncover less well-known cardiovascular risk factors of particular importance to identify people with diabetes who are likely to develop heart disease in the next 10 years. Our findings reveal the importance of self-reported health issues, non-cholesterol blood test results, dietary patterns, and mental health in identifying people with diabetes who will develop cardiovascular disease. Overall, these factors can improve cardiovascular disease care and self-management in people with type-2 diabetes.

Introduction

We, and others [1,2], have shown that cardiovascular disease (CVD) risk prediction models do not perform well in people with type 2 diabetes (T2DM). Importantly, performance did not differ meaningfully between 22 CVD risk prediction models, with c-statistics (a measure of discrimination) close to 0.70 [3], whereas for the general population this is 0.88 in women and 0.86 in men (based on QRISK3 [4]). This near-constant lower performance of most CVD prediction rules in people with T2DM likely reflects the considerable overlap in the predictor variables considered, such as age, sex, blood pressure, and cholesterol, reflecting a focus on features with a proven CVD association. The relatively poor performance in people with T2DM identifies a need to consider less classical features for CVD prediction.

The need to include nonstandard predictors for CVD has previously been shown by Wang et al. [5], where up to 20% of participants with coronary disease did not possess conventional CVD risk factors, and 40% presented with only a single risk factor. This is especially important because individuals with T2DM exhibit an elevated risk of cardiovascular morbidity and mortality and are disproportionately affected by CVD compared to people without T2DM [6,7]. The rising prevalence of T2DM, combined with advancements in post-CVD event care, contributes to a growing population at risk for CVD events [8,9]. Given that T2DM itself is a risk factor for CVD, a substantial number of people with diabetes are living with established CVD. This highlights the need to determine to what extent CVD risk factors in people with diabetes differ by history of CVD.

The UK Biobank (UKB) was initiated to further understanding of health in all its facets, and therefore collects measurements irrespective of clinical indication. For example, in clinical settings glucose and glycated haemoglobin (HbA1c) are typically only measured in people with, or at risk for, diabetes.
In the UKB these features have been measured for nearly all enrolled participants: during the initial assessment, information was collected on basic lifestyle and health, anthropometric measurements, blood and urine samples, and body composition, as well as a wealth of additional features. The large number of available measurements, taken independently of clinical indication, makes the UKB particularly suited for a “hypothesis-free” data-driven approach to identify predictors that are especially important for people with diabetes.

The current study aimed to uncover features for the 10-year risk of CVD in people with T2DM (w T2DM) and with T2DM and CVD (w T2DM&CVD), which can be used to improve early identification of high-risk individuals and help with the management of CVD. Our objective was to assess a larger number of features than typically considered in CVD risk prediction models, providing a comprehensive overview of key features that can, in turn, be used to inform the development of de novo models or to enrich existing CVD prediction models, tailoring them for people with type 2 diabetes with or without established CVD. To achieve this, we built an integrated data engineering and feature selection pipeline to identify the subset of the 600+ UKB-measured features that are predictive of the onset of CVD during a 10-year follow-up period.

Analyses were conducted in three distinct groups of participants based on their clinical risk of CVD: without a history of diabetes or CVD at enrolment (wo T2DM/CVD), with T2DM at enrolment (w T2DM), and with T2DM and CVD at enrolment (w T2DM&CVD).

In this study, we show that classical risk factors are less important for predicting CVD in people with T2DM.
We uncover and replicate numerous predictors which are relatively more important in people with diabetes, including mental health, familial CVD history, markers of ill health, kidney disease, and diabetes control, with some (e.g., glycated haemoglobin, cystatin C) predicting CVD regardless of diabetes status. Diabetes-specific features include dietary patterns (e.g. fruit consumption, dietary variability, or poultry or oily fish consumption) and biochemistry measures (e.g. alanine aminotransferase, rheumatoid factor, haematocrit percentage, or monocyte counts). Consideration of these non-traditional CVD risk factors in guideline-recommended prediction models may improve their performance in people with diabetes.

Methods

Data source

Data were sourced from the UKB, a cohort of ~500,000 men and women aged 40–69 years between 2006 and 2010 enrolled from primary care registers across the UK [10]. The UK Biobank has ethical approval from the North West Multi-centre Research Ethics Committee to handle human participant data; no additional ethical approval was required because the study involved the secondary use of data. Written informed consent was obtained from all participants and all data are deidentified for analysis. Eligible researchers may access UK Biobank data on www.ukbiobank.ac.uk upon registration. This study was approved under UK Biobank Resource application numbers 12113, 24711 and 44972.

The UKB data were stratified into three groups: people without a diagnosis of CVD or T2DM at enrolment (wo T2DM/CVD), participants with a T2DM diagnosis but no history of CVD at enrolment (w T2DM), and individuals with diabetes and a history of CVD at enrolment (w T2DM&CVD). Follow-up considered the time from enrolment until the first CVD event, death or end of the study (10 years after enrolment), whichever came first; see Supplementary Table 1.
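The follow-up rule stated above can be illustrated with a minimal sketch. The function name `follow_up`, its arguments, and the conversion of 365.25 days per year are illustrative assumptions and are not taken from the UKB data fields or the authors' codebase.

```python
from datetime import date, timedelta

FOLLOW_UP_YEARS = 10  # study horizon stated in the text


def follow_up(enrolment, first_cvd_event=None, death=None):
    """Return (follow-up time in years, 1 if a CVD event was observed, else 0).

    Follow-up stops at the first CVD event, death, or the end of the study,
    whichever comes first. Dates that are unknown are passed as None.
    """
    # Assumption: a year is approximated as 365.25 days.
    end_of_study = enrolment + timedelta(days=round(365.25 * FOLLOW_UP_YEARS))
    candidates = [d for d in (first_cvd_event, death, end_of_study) if d is not None]
    stop = min(candidates)
    # The outcome is 1 only when the CVD event itself falls within follow-up.
    had_event = int(first_cvd_event is not None and first_cvd_event <= stop)
    years = (stop - enrolment).days / 365.25
    return years, had_event


if __name__ == "__main__":
    # Hypothetical participant enrolled in 2008 with a CVD event in 2012.
    print(follow_up(date(2008, 1, 1), first_cvd_event=date(2012, 1, 1)))
```

Participants who die or reach the end of the study without a CVD event contribute follow-up time with an outcome of 0, which matches the binary 10-year outcome modelled in the remainder of this section.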
The candidate predictors were measured at the time of enrolment.

To identify features associated with the 10-year risk of CVD, we extracted variables (data fields) from 31 distinct UKB categories. The selected fields covered a range of information including anthropometry, blood chemistry, questionnaire data, and sociodemographic characteristics, jointly consisting of 603 unique variables; see Supplementary Table 2. Please see Supplementary Methods, Supplementary Figs. 1–3, Supplementary Tables 3–6, and Supplementary Data 1 for an overview of the sourced data and applied data engineering strategy.

Statistics and reproducibility

After randomly splitting the UK Biobank data into 80% for training and 20% for testing, the training data were used to prune features showing multicollinearity (absolute Spearman’s correlation ≥ 0.70) or no evidence of an association with the outcome (univariable p-value ≥ 0.80).

To identify CVD-related features we used a generalized linear model with a binomial distribution and an elastic net penalty [11] (combining L1 and L2 regularisation), which shrinks the coefficients of less important features to zero and thereby removes them. Ten-fold cross-validation, stratified by case (people who developed CVD) and control status (people who did not develop CVD), was used to optimize model hyper-parameters. Given the substantial differences in subgroup sizes, models were separately trained for each participant subgroup and outcome, enabling the identification of features distinct to individuals with or without diabetes.

The feature importance of each selected variable was evaluated by applying a permutation feature importance algorithm (10 permutations) to the test data. This method quantified the change in the c-statistic and allowed us to identify and remove features with a zero or negative feature importance, which indicates a failure to replicate the original association. Features were subsequently ranked by their c-statistic change, stratified by outcome type (CVD + AF + HF, CVD, CHD, HF, AF, Isch.
Stroke) and participant group (“wo T2DM/CVD”, “w T2DM”, “w T2DM&CVD”). By estimating the feature importance in the test data, we were able to identify replicated findings (features with positive importance), which were unaffected by any potential overfitting. As age and sex are well-known and dominant CVD risk factors, the main text focusses on the remaining features, noting that these remaining features are conditionally independent of age and sex; see full results in Supplementary Data 2.

To assess the extent of variation in feature importance rankings, we compared the rankings of each diabetes group with those of the non-diabetes group using the Wilcoxon test. Next, we examined the importance of traditional CVD risk factors in relation to our identified non-traditional risk factors. For this purpose we determined the rank of features used in any of the following three clinically used prediction models: ASCVD [12], QRISK3 [4], and the Framingham 1998 [13] score. Additionally, to examine the contribution of lipid-lowering medication to the overall study results, particularly to the low-density lipoprotein cholesterol (LDL-C) rank, we retrained the elastic net models using the training dataset, stratified by baseline use of lipid-lowering medication.

Furthermore, to investigate non-linear associations, we conducted a random forest analysis for each participant subgroup-outcome combination.
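The elastic net selection and hold-out replication steps described earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the shared study codebase: the data are synthetic (generated with `make_classification`), and the choice of `LogisticRegressionCV`, the hyper-parameter grid, the feature scaling step, and the random seeds are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one participant subgroup and one outcome (not UKB data).
X, y = make_classification(n_samples=1500, n_features=20, n_informative=5,
                           weights=[0.85, 0.15], random_state=0)

# 80% training and 20% hold-out test data, stratified by case/control status.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Binomial GLM (logistic regression) with an elastic net penalty; the penalty
# strength and L1/L2 mix are tuned by stratified ten-fold cross-validation.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(penalty="elasticnet", solver="saga",
                         l1_ratios=[0.2, 0.5, 0.8], Cs=5, cv=cv,
                         scoring="roc_auc", max_iter=2000, random_state=0),
)
model.fit(X_train, y_train)

# Features with a non-zero coefficient are the ones the penalty retained.
coef = model[-1].coef_.ravel()
selected = np.flatnonzero(coef)

# Permutation importance on the hold-out data: mean drop in the c-statistic
# (ROC AUC) over 10 permutations of each feature.
result = permutation_importance(model, X_test, y_test, scoring="roc_auc",
                                n_repeats=10, random_state=0)
importance = result.importances_mean

# A selected feature counts as replicated only if its importance is positive.
replicated = [int(j) for j in selected if importance[j] > 0]
ranked = sorted(replicated, key=lambda j: importance[j], reverse=True)

for j in ranked:
    print(f"feature {j}: change in c-statistic = {importance[j]:.4f}")
```

In the study this procedure would be repeated for every participant subgroup and outcome combination, so each combination yields its own ranked list of replicated features.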
The ranked random forest-based feature importance, which quantifies the change in the c-statistic, was compared to the ranked feature importance from the elastic net model using Spearman’s correlation coefficients.

To ensure the results are reproducible, the manuscript and its Supplementary Information provide a comprehensive description of the data engineering and statistical analysis employed, with reproducibility further enhanced by sharing the codebase used for the described analyses [14].

Analyses were carried out in Python v3.6 using scikit-learn [15], statsmodels [16], pandas [17], and numpy [18]; plots were generated using matplotlib [19] and seaborn [20]; imputation was performed in R v4.1 using mice [21].

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results

Our findings reveal that classical CVD risk factors, including LDL-C, are less important in people with diabetes. The most important predictors differed by history of CVD. For people with diabetes but without a history of CVD they included cystatin C, self-reported health satisfaction, and biochemical measures of ill health (e.g. plasma albumin). For people with diabetes and a history of CVD, top features included self-reported ill health and blood cell count measurements (e.g. red cell distribution width). CVD risk factors specific to people with diabetes included information on dietary patterns, mental health, and biochemistry measures such as oestradiol, rheumatoid factor, and monocyte count.

Participant characteristics

Data were available on 459,142 participants without T2DM and CVD at baseline (“wo T2DM/CVD”), 14,610 individuals with T2DM but without a history of CVD at the time of diagnosis (“w T2DM”), and 4,432 participants who had a history of CVD at the time of T2DM diagnosis (“w T2DM&CVD”). Participants with CVD were on average older, more often male, and had a higher BMI and higher HbA1c concentrations; see Table 1.
During a median follow-up time of 10 years, 40,350 (8.8%) of the “wo T2DM/CVD” participants experienced a CVD + HF + AF event, compared with 2,671 (18.3%) of the “w T2DM” group and 3,453 (77.9%) of the “w T2DM&CVD” group; Supplementary Table 7.

Table 1 Clinical characteristics of UK biobank participants stratified by history of CVD and T2DM at enrolment

Prioritized features for CVD prediction

Out of the 603 initially available UKB data fields, 382 were retained after the data engineering steps. Depending on the type of CVD and participant subgroup, between 229 and 258 features remained after filtering on univariable association and multicollinearity; see Supplementary Table 8 and Supplementary Fig. 4. An elastic net algorithm was applied to identify a subset of variables associated with CVD outcomes, which were subsequently replicated in the independent test set, resulting in between 32 (for CHD in “w T2DM”) and 200 (for Isch. Stroke in “w T2DM&CVD”) replicated features; Supplementary Fig. 4. Generally, our pipeline identified the most features for the wo T2DM/CVD subgroup (on average 156 features were replicated across the six considered CVD outcomes), followed by w T2DM&CVD (an average of 126 features), and w T2DM (an average of 63 features); Supplementary Fig. 4.

Ranking features on their summed c-statistic change, aggregated across CVD outcomes (Fig. 1, Supplementary Data 2), highlighted the importance of plasma biomarkers such as cystatin C, red blood cell distribution width (RDW), HbA1c, plasma albumin, plasma urate, glucose, testosterone, and urine microalbumin, as well as clinical characteristics such as diastolic and systolic blood pressure (DBP/SBP), and estimated trunk mass.
Additionally, many of the top-ranking features included indicators of a poor health status (e.g., “self-reported: health satisfaction”, “quit smoking due to illness”, “self-reported: recent tiredness”, “disability parking permit (blue badge)”) and family (father, mother, sibling) history of heart disease. While SBP, DBP, and high-density lipoprotein cholesterol (HDL-C) were ranked 12th, 28th, and 24th, respectively, LDL-C conveyed relatively limited discriminative ability, ranking 112th. The results of the retrained elastic net models, stratified by the use of lipid-lowering medication, confirmed the limited importance of LDL-C for CVD prediction in diabetes subgroups, with LDL-C only being selected for the “w T2DM&CVD” group without initial lipid-lowering medication, where it was ranked 70th. LDL-C demonstrated better discriminative ability in the general population, ranking 32nd and 13th for individuals with and without lipid-lowering medication, respectively.

Fig. 1: Top 40 most predictive features for the 10-year risk of cardiovascular disease. The y-axis presents the top 40 features (excluding age and sex) based on the summed feature importance aggregated across the six types of CVD considered (CVD + AF + HF, CVD, CHD, Ischemic Stroke, HF, AF) stratified by participant subgroup: people without diabetes or a history of CVD at enrolment (“wo T2DM/CVD”), people with diabetes but without a history of CVD at enrolment (“w T2DM”), and people with a history of diabetes and CVD at enrolment (“w T2DM&CVD”). Features were colour-coded by the UK Biobank defined category. The middle heatmap represents the number of CVD outcomes for which a specific feature was identified (at most 6), while the stacked bar chart encodes the summed feature importance, stratified by participant subgroup. Feature importance was calculated using a permuted feature importance algorithm recording the change in c-statistic.
The algorithm was applied to the hold-out test data and hence represents an unbiased estimate of the feature importance and reflects features which were independently replicated. The complete list of identified features is provided in Supplementary Data 2. T2DM type 2 diabetes, CVD cardiovascular disease, RDW red blood cell distribution width, HbA1c glycated haemoglobin, SBP systolic blood pressure, HDL-C high-density lipoprotein cholesterol, DBP diastolic blood pressure, HLR high light scatter reticulocyte count.

The comparison of the coefficient and c-statistic feature rankings for CVD showed relatively strong agreement, with correlation coefficients of 0.90 (p-value ≈ 7.23 × 10^−33) for “wo T2DM/CVD”, 0.76 (p-value ≈ 5.80 × 10^−5) for “w T2DM”, and 0.60 (p-value ≈ 7.19 × 10^−6) for “w T2DM&CVD”; see Supplementary Table 9 and Supplementary Data 3.

The comparison between features identified by the elastic net models and those identified by non-linear random forest models (Supplementary Tables 10–13) revealed a relatively high overlap between the top 40 features identified across all subgroups and outcomes. These features mainly included blood and urine assays, blood pressure, and physical measures; see Supplementary Fig. 5. The Spearman’s correlation between feature importance ranks from elastic net models and random forest models ranged between 0.59 and 0.69 for the “wo T2DM/CVD” group, 0.41 and 0.69 for the “w T2DM” group, and 0.29 and 0.82 for the “w T2DM&CVD” group; Supplementary Table 14. Please see Supplementary Data 4 for the random forest based feature importance for each participant subgroup and outcome.

The top 5 most relevant features per CVD type and participant subgroup

The top 5 most important features per CVD type and participant subgroup are presented in Fig. 2, with the full list of features presented in Supplementary Data 2 and Supplementary Figs. 6–8.
For people without T2DM or CVD at enrolment (wo T2DM/CVD), SBP and family history of heart disease were important features for CVD + AF + HF, CVD, and CHD, while HDL-C and HbA1c were particularly important for CVD and CHD. Cystatin C and urine microalbumin were among the top 5 most important predictors for ischaemic stroke as well as HF, while estimated trunk mass was ranked highest for CVD + AF + HF and HF; Fig. 2.

Fig. 2: Top five most predictive features for the 10-year risk of cardiovascular disease, stratified by disease type. The y-axis presents the top five features (excluding age and sex), ranked by their feature importance, as shown on the x-axis. Results are stratified by type of CVD (a. CVD + AF + HF; b. CVD; c. CHD; d. Ischemic Stroke; e. AF; f. HF) and participant subgroup: people without diabetes or a history of CVD at enrolment (“wo T2DM/CVD”), people with diabetes but without a history of CVD at enrolment (“w T2DM”), and people with a history of diabetes and CVD at enrolment (“w T2DM&CVD”). Feature importance was calculated using a permuted feature importance algorithm recording the change in c-statistic. The algorithm was applied to the hold-out test data and hence represents an unbiased estimate of the feature importance and reflects features which were independently replicated. The source data for this figure is provided in Supplementary Data 2. T2DM type 2 diabetes, CVD cardiovascular disease, CHD coronary heart disease, AF atrial fibrillation, HF heart failure, Isch.
Stroke ischemic stroke, SBP systolic blood pressure, HDL-C high-density lipoprotein cholesterol, HbA1c glycated haemoglobin, RDW red blood cell distribution width, DBP diastolic blood pressure, disability parking permit (Receives: blue badge), AST aspartate aminotransferase.

For people with T2DM at the time of enrolment (w T2DM), cystatin C was particularly important to predict all 6 CVD outcomes, with a feature importance of 0.028 for CHD in the “w T2DM” group, compared to only 0.001 in people without diabetes. Self-reported health satisfaction and self-reported insomnia were important for CVD + AF + HF, CVD, and CHD, where self-reported insomnia was also included as a top 5 predictor for CVD and CHD. HbA1c was a strong predictor for ischaemic stroke and HF, while fat mass and plasma albumin were important for AF and HF; Fig. 2 and Supplementary Data 2.

For people with T2DM and CVD at the time of enrolment (w T2DM&CVD), indicators of, sometimes recent, adversity in (perceived) health or stress were important risk factors for CVD. Examples include owning a disability parking permit (CVD + HF + AF, HF), self-reported nervousness (CVD), self-reported worrier/anxious feelings (AF), or receiving a disability allowance (HF). Furthermore, familial history of heart disease was often included in the top 5 features for CVD + AF + HF, CVD, and CHD. More traditional biomarkers/clinical measurements were also retained in the top 5: urine albumin (CVD + AF + HF, CVD, CHD), DBP (CHD), haematocrit (CVD + AF + HF), RDW (AF, HF), plasma urate (AF, HF), glucose (ischaemic stroke) and HbA1c (ischaemic stroke, HF); see Fig. 2 and Supplementary Data 2.

Finally, as detailed in Supplementary Data 2, we note that while age and sex were often the most important predictors irrespective of a participant’s diabetes status, in people with diabetes the corresponding change in c-statistic was severely attenuated compared to people without diabetes.
For example, for CVD + AF + HF the change in c-statistic for age was 0.083 in the wo T2DM/CVD group, compared to 0.041 in the w T2DM group.

CVD features selected in all three participant groups

We next identified features that were selected for all three participant groups, stratifying by CVD outcome type; Supplementary Data 5 and Supplementary Results. The number of common features ranged from 14 for CHD to 20 for AF. Briefly, we observed that HbA1c was an important predictor irrespective of diabetes status, particularly for CVD, HF, AF, and Isch. Stroke. Interestingly, HbA1c was the 6th most important predictor for CVD and CHD in people without diabetes; for people with diabetes, HbA1c was selected for CVD, HF, and Isch. Stroke prediction. Cystatin C and RDW were predictive of HF, AF and Isch. Stroke (cystatin C only) irrespective of the participant subgroup. Aside from these biochemistry measures, we observed that information on familial disease history, self-reported health (satisfaction), mental health and socio-economic factors were often predictive of CVD irrespective of diabetes status.

CVD features unique for people with T2DM

Given the poor performance of CVD prediction models in people with T2DM, we next identified the union of features which were uniquely selected for the “w T2DM” or “w T2DM&CVD” participant subgroups; Fig. 3, Supplementary Table 15, Supplementary Data 6−8. On average, 5 lifestyle factors were unique to CVD prediction in people with diabetes, where diet-related information such as fruit consumption, dietary variability, or poultry or oily fish consumption was particularly prominent. Information on mental health (on average 2 features) and blood assays (on average 3 features) was also important for CVD prediction in people with diabetes. For example, self-reported nervousness, severe depression, guilty feelings, and recent divorce or separation were all relevant and unique to CVD prediction in people with diabetes.
Similarly, alanine aminotransferase, rheumatoid factor, haematocrit percentage, monocyte counts, and oestradiol were important and unique predictors for CVD in people with diabetes (Fig. 3). The difference in ranked feature importance between individuals with and without diabetes, determined for each combination of participant subgroup and type of CVD, confirmed significant differences for most combinations, except for AF and ischaemic stroke in “w T2DM”, and HF in “w T2DM&CVD”; see Supplementary Fig. 9.

Fig. 3: Diabetes specific features for the 10-year risk of cardiovascular disease. The figure focuses on features which were predictive for at least three of the six considered CVD outcomes; the complete list of features is presented in Supplementary Data 4−6. The large coloured dots in the middle panel indicate which diabetes-specific feature was selected for each outcome, with the dot colours representing the categories assigned to the features by the UK Biobank. The small grey dots in the middle panel indicate that the feature was not selected for the specific outcome. The stacked bar chart represents the number of selected features grouped by UK Biobank assigned categories. The top panel reflects the number of times a feature belonging to a specific UK Biobank category was selected. T2DM type 2 diabetes, CVD cardiovascular disease, CHD coronary heart disease, AF atrial fibrillation, HF heart failure, Isch. Stroke ischemic stroke, MET metabolic equivalent task, ALT alanine aminotransferase.

Ranking features used by three clinical CVD prediction models

We determined the importance and rank of 18 features included in at least one of the clinically used prediction models: ASCVD [12], QRISK3 [4], and Framingham [13]; Supplementary Tables 16−18.
In people without diabetes (“wo T2DM/CVD”) 9 features were selected to predict CVD, with 7 in the top 10% (in order of importance: age, sex, SBP, paternal history of CVD, HDL-C, maternal history of CVD, and sibling history of CVD); Supplementary Table 16. For the “w T2DM” participants, 6 features were selected for CVD, and only age and sex were retained in the top 10 most important features; Supplementary Table 17. For “w T2DM&CVD” participants 8 features were selected for CVD, where age, sex, and maternal and sibling history of CVD were among the top 10 most important features; Supplementary Table 17.

The following features were not selected for either diabetes subgroup (“w T2DM” or “w T2DM&CVD”): LDL cholesterol, total cholesterol, smoking status, BMI, severe mental illness, and ethnicity; Supplementary Tables 17–18. Finally, LDL-C was ranked at 26.03% for people without diabetes, and was not selected for people with diabetes irrespective of CVD history at enrolment or baseline use of lipid-lowering medication, except for the “w T2DM&CVD” group without initial intake of lipid-lowering medication, where LDL-C was ranked 70th in the stratified analysis using retrained models.

Benefit of considering the identified non-traditional CVD risk factors

We additionally estimated how many cases were appropriately assigned a higher risk by also considering the non-classical risk factors identified here. For this we compared risk group assignment (using the canonical risk groups <10%, between 10% and 20%, and ≥20%) based on the 18 classical risk factors used in ASCVD [12], QRISK3 [4], or Framingham [13] against risk group assignment combining classical and non-classical risk factors. Per 1,000 people without diabetes (the “wo T2DM/CVD” group), 253 participants who went on to develop CVD appropriately received a higher risk; for HF this was 36.
Per 1,000 people with diabetes (the “w T2DM” group) these numbers were 133 for CVD and 165 for HF.

Discussion

In this study, we leveraged data from the richly phenotyped UKB to identify potential predictors for CVD, specifically focusing on people with diabetes. Combining a bespoke data-engineering strategy with supervised feature selection using an elastic net algorithm, we found a prioritized list of 32 to 200 features, depending on the type of CVD considered. The classical risk factors (e.g. paternal or maternal history of heart disease, and blood pressure) were relatively highly ranked for people without diabetes (“wo T2DM/CVD”); however, truncal mass was selected instead of BMI, and HDL-C rather than total cholesterol or LDL-C.

These traditional predictors were much less important in people with diabetes. Instead, the following features were important to predict CVD in people with diabetes but without a history of CVD (“w T2DM”): HbA1c, cystatin C, self-reported health satisfaction, and biochemical measures of ill health, e.g. plasma albumin (representing liver and kidney damage). In people with diabetes and a history of CVD, important features to predict CVD consisted of self-reported ill health and biochemical measures of ill health (RDW and haematocrit, representing anaemia). We additionally identified features common between people with and without diabetes. For example, HbA1c was important for people with and without diabetes to predict CVD and HF. Interestingly, HbA1c was important for predicting CVD and CHD in people without diabetes, ranking 6th. Additionally, features related to mental health (e.g., self-reported health satisfaction), disability status, urine microalbumin, and family disease history were common predictors for HF, CHD, and CVD. Focusing on features unique to people with diabetes highlighted the importance of information on diet variation (e.g.
fruit or poultry consumption), physical activity, mental health (e.g., guilty feelings, nervousness), socioeconomic status (e.g., full/part-time student), family disease history (e.g., mother’s cancer), as well as blood assay measurements such as monocyte count, ALT, haematocrit, oestradiol, and rheumatoid factor.

The presented results suggest there is a need to consider additional, non-traditional, risk factors to identify people with diabetes at high risk of CVD for early detection and treatment of CVD. Importantly, a substantial number of these risk factors may already be registered by clinicians (e.g. general practitioners) during their standard family and clinical history examination (e.g. on family disease, life events, employment). Other features may be readily obtained through questions focussing on diet and living environment, or by applying more bespoke anthropometrics, for example focussing on truncal mass instead of BMI. Some features may require further consideration of costs and logistics, such as the need for additional lab measurements like cystatin C. The features identified in this study can be used to guide the development of new models or enhance existing CVD prediction models, specifically tailoring them for individuals with type 2 diabetes.

Due to the unique design of the UK Biobank, where measurements are obtained from all participants, irrespective of potential clinical diagnosis, we were able to highlight the relevance of kidney and diabetes markers for CVD prediction in people without such an indication. Importantly, these features were typically more discriminative than lipid measurements, suggesting that currently available risk prediction tools for CVD might be further optimized by adding early markers for kidney disease and diabetes. This held true for people with and people without diabetes; for example, HbA1c was in the top 5% of most important CHD predictors for people without diabetes.
The observations from this study, indicating the limited predictive ability of LDL-C, align with previous reports (ref. 22). Notably, popular risk scores such as QRISK3 (ref. 4) and Framingham (ref. 9) do not consider LDL-C as a required predictor, and most CVD models include HDL-C or TG instead; for instance, out of 22 CVD risk prediction models considered by Dziopa et al. 2021, only a single model did not incorporate HDL-C (ref. 3). This of course does not imply that LDL-C does not have a causal effect on CHD or CVD; instead, it highlights that good predictors of disease risk need not have a causal effect on disease manifestation. This adage is also exemplified by the selection algorithm including features such as sex, maternal and paternal disease history, educational attainment or socioeconomic status.

While some of the features reported here have previously been associated with CVD and its individual components such as CHD, ischemic stroke, AF and HF, the current study is able to uniquely account for their pairwise correlation, thereby ensuring the identified features provide independent information. Additionally, by training a multivariable model we were able to estimate relative feature importance, and thereby rank features on their relevance for disease classification. Such a rank provides directly actionable information for clinicians and researchers wishing to enrich the currently available prediction algorithms. Furthermore, by identifying features for six types of CVD, we observe that many of the identified features are shared by distinct diseases such as CHD, HF, and AF. The commonality between features suggests that the same or similar information can be used to predict multiple types of CVD and hence better inform participant risk and optimize care, further supporting our previous study which generalized CVD risk prediction tools to predict the 10-year risk of HF and AF (ref. 3).
Importantly, we wish to clarify that the commonality is not the result of applying a single machine learning algorithm to model all diseases across the three participant groups. Our analyses were specifically designed to maximise flexibility by employing independent models and feature selection processes for each participant group and outcome combination.

This study has some limitations which warrant discussion. Firstly, the UKB predominantly consists of white European participants with, on average, a higher socioeconomic status, and hence generalizability to other ethnicities should be explored. Secondly, some of the identified features might be specific to the UK; for example, a Blue Badge is the UK’s version of a disabled parking permit. We do expect that many of these UK-specific features can be implemented in different countries given sufficiently careful mapping, similar to how laboratory measurements might need to be recalibrated between distinct labs. Thirdly, we emphasize that the identified features, feature importance, and feature rank were based on the testing data; they are therefore unaffected by any potential model overfitting and, at the same time, represent independently replicated findings. Nevertheless, with increasing sample size additional features will likely be identified, particularly for relatively infrequent outcomes such as ischemic stroke. Aside from the influence of sample size, due to the multivariate nature of the data (where features are correlated among themselves), similar but slightly different features might be identified in subsequent studies (e.g., SBP may stand in for DBP due to the correlation between both measurements). While non-linear associations are to be expected in health and healthcare, simplified models such as the one applied here are often sufficient to detect the presence of an association.
A follow-up analysis exploring potential non-linear relationships by applying a random forest algorithm revealed a relatively high overlap among the top 40 most important features identified by the random forest and elastic net models (e.g. above 0.60). However, we observed reduced correlations for the “w T2DM&CVD” group, which may indicate an increased importance of non-linear associations, or could also reflect the smaller sample size, as well as the fact that, unlike elastic net algorithms, random forest does not actively remove features through penalisation.

In this large-scale analysis of the UK Biobank, we showed that the classical risk factors were less important in people with diabetes. We have identified numerous independent variables which predict CVD in people with diabetes, covering information on mental health, familial CVD and non-CVD disease histories, as well as markers for general ill-health, early kidney disease and diabetes (control). We note that some of the identified features, such as HbA1c and cystatin C, predict CVD irrespective of diabetes status. Furthermore, we identified diabetes-specific features predicting CVD, typically covering dietary patterns, mental health and biochemistry measures. The identified features are typically overlooked by currently available CVD prediction models and provide actionable leads to improve these models.
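
The comparison between the elastic net and random forest rankings described in the limitations can be illustrated with a small sketch. The article does not state exactly how the overlap was computed, so this assumes the simplest reading: the fraction of the top-k features that appear in both rankings. The function names `top_k` and `top_k_overlap`, the feature names, and the importance values are all hypothetical and invented for illustration; they are not study results.

```python
def top_k(importance, k):
    """Return the names of the k features with the largest importance."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    return set(ranked[:k])


def top_k_overlap(importance_a, importance_b, k):
    """Fraction of the top-k features shared by two importance rankings.

    Assumption: overlap is defined as |top_k(A) & top_k(B)| / k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    shared = top_k(importance_a, k) & top_k(importance_b, k)
    return len(shared) / k


# Toy importances, invented purely for illustration (not study results).
# For an elastic net these could be absolute coefficients; for a random
# forest, impurity-based importances.
elastic_net_importance = {
    "feature_a": 0.9, "feature_b": 0.7, "feature_c": 0.4,
    "feature_d": 0.0, "feature_e": 0.2,
}
random_forest_importance = {
    "feature_a": 0.30, "feature_b": 0.05, "feature_c": 0.25,
    "feature_d": 0.28, "feature_e": 0.12,
}

overlap = top_k_overlap(elastic_net_importance, random_forest_importance, k=3)
print(f"top-3 overlap: {overlap:.2f}")  # prints "top-3 overlap: 0.67"
```

In this toy case the elastic net drops `feature_d` entirely (importance 0.0) while the random forest ranks it second, which mirrors the point made above that a random forest does not actively remove features through penalisation.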

Data availability

The UK Biobank dataset analysed during the current study is available via the UK Biobank data access process (see http://www.ukbiobank.ac.uk/register-apply/). Detailed information about UK Biobank categories and data fields is available at https://biobank.ctsu.ox.ac.uk/crystal/browse.cgi and https://biobank.ctsu.ox.ac.uk/ukb/help.cgi?cd=data_field. The source data for Figs. 1–2 are provided in Supplementary Data 2, and for Fig. 3 in Supplementary Data 8.

Code availability

See https://doi.org/10.5281/zenodo.14779981 (ref. 14) for the codebase underpinning this work.

Abbreviations

AF: Atrial fibrillation
BMI: Body mass index
CHD: Coronary heart disease
CVD: Cardiovascular disease
CVD + HF + AF: Cardiovascular disease including heart failure and/or atrial fibrillation
DBP: Diastolic blood pressure
HDL-C: High-density lipoprotein cholesterol
HF: Heart failure
LDL-C: Low-density lipoprotein cholesterol
MI: Myocardial infarction
SBP: Systolic blood pressure
T2DM: Type 2 diabetes
UKB: UK Biobank
w T2DM: Individuals with a T2DM diagnosis but no history of CVD
w T2DM&CVD: Individuals with T2DM and a history of CVD
wo T2DM/CVD: Individuals without T2DM and without a history of CVD

References

1. Read, S. H. et al. Performance of Cardiovascular Disease Risk Scores in People Diagnosed With Type 2 Diabetes: External Validation Using Data From the National Scottish Diabetes Register. Diabetes Care 41, 2010–2018 (2018).
2. van der Leeuw, J. et al. The validation of cardiovascular risk scores for patients with type 2 diabetes mellitus. Heart Br. Card. Soc. 101, 222–229 (2015).
3. Dziopa, K., Asselbergs, F. W., Gratton, J., Chaturvedi, N. & Schmidt, A. F. Cardiovascular risk prediction in type 2 diabetes: a comparison of 22 risk scores in primary care settings. Diabetologia 65, 644–656 (2022).
4. Hippisley-Cox, J., Coupland, C. & Brindle, P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ 357, j2099 (2017).
5. Wang, J. et al. Novel biomarkers for cardiovascular risk prediction. J. Geriatr. Cardiol. JGC 14, 135–150 (2017).
6. Martín-Timón, I., Sevillano-Collantes, C., Segura-Galindo, A. & del Cañizo-Gómez, F. J. Type 2 diabetes and cardiovascular disease: Have all risk factors the same strength? World J. Diabetes 5, 444–470 (2014).
7. Einarson, T. R., Acs, A., Ludwig, C. & Panton, U. H. Prevalence of cardiovascular disease in type 2 diabetes: a systematic literature review of scientific evidence from across the world in 2007–2017. Cardiovasc. Diabetol. 17, 83 (2018).
8. Giorda, C. B. et al. Recurrence of Cardiovascular Events in Patients With Type 2 Diabetes. Diabetes Care 31, 2154–2159 (2008).
9. van der Heijden, A. A. W. A. et al. Risk of a Recurrent Cardiovascular Event in Individuals With Type 2 Diabetes or Intermediate Hyperglycemia. Diabetes Care 36, 3498–3502 (2013).
10. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
11. Zou, H. & Hastie, T. Regularization and Variable Selection via the Elastic Net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67, 301–320 (2005).
12. Arnett, D. K. et al. 2019 ACC/AHA Guideline on the Primary Prevention of Cardiovascular Disease: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. Circulation 140, e596–e646 (2019).
13. Prediction of Coronary Heart Disease Using Risk Factor Categories | Circulation. https://doi.org/10.1161/01.CIR.97.18.1837.
14. KasiaSmietanka. KasiaSmietanka/novel_features_cvd_prediction: novel_features_cvd_prediction. Zenodo https://doi.org/10.5281/zenodo.14779981 (2025).
15. scikit-learn: machine learning in Python - scikit-learn 1.0.2 documentation. https://scikit-learn.org/stable/.
16. Introduction - statsmodels. https://www.statsmodels.org/stable/index.html.
17. pandas - Python Data Analysis Library. https://pandas.pydata.org/.
18. Harris et al. Array programming with NumPy. Nature 585, 357–362 (2020).
19. Matplotlib: Python plotting - Matplotlib 3.4.2 documentation. https://matplotlib.org/.
20. seaborn: statistical data visualization - seaborn 0.11.1 documentation. https://seaborn.pydata.org/.
21. Buuren, S. van & Groothuis-Oudshoorn, K. mice: Multivariate Imputation by Chained Equations in R. J. Stat. Softw. 45, 1–67 (2011).
22. Ravnskov, U. et al. LDL-C does not cause cardiovascular disease: a comprehensive review of the current literature. Expert Rev. Clin. Pharmacol. 11, 959–970 (2018).

Acknowledgements

This research has been conducted using the UK Biobank Resource under application numbers 12113, 24711 and 44972. We are grateful to the UK Biobank participants. UK Biobank was established by the Wellcome Trust medical charity, Medical Research Council, Department of Health, Scottish Government, and the Northwest Regional Development Agency. It has also had funding from the Welsh Assembly Government and the British Heart Foundation. K.D. is supported by a PhD studentship from the National Productivity Investment Fund – MRC Doctoral Training Programme (grant no. MR/S502522/1). AFS is supported by BHF grants PG/18/5033837, PG/22/10989, the UCL BHF Research Accelerator AA/18/6/34223. AFS received additional support from the National Institute for Health Research University College London Hospitals Biomedical Research Centre. This work was funded by UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee EP/Z000211/1, by grant [R01 LM010098] from the National Institutes of Health (USA), by EU/EFPIA Innovative Medicines Initiative 2 Joint Undertaking BigData@Heart grant n° 116074, by the UKRI/NIHR Multimorbidity fund Mechanism and Therapeutics Research Collaborative MR/V033867/1, and by the Rosetrees Trust. This publication is part of the project “Computational medicine for cardiac disease” with file number 2023.022 of the research programme “Computing Time on National Computer Facilities” which is (partly) financed by the Dutch Research Council (NWO). This work is partially supported by Dutch Research Council (628.011.213). This publication is part of the project MyDigiTwin with project number 628.011.213 of the research programme “COMMIT2DATA – Big Data & Health” which is partly financed by the Dutch NWO. This study and its results have not been published previously. A preprint version has been deposited on medRxiv.

Author information

Authors and Affiliations

Institute of Health Informatics, University College London, London, UK
Katarzyna Dziopa & Folkert W. Asselbergs

Institute of Cardiovascular Science, Faculty of Population Health Sciences, University College London, London, UK
Katarzyna Dziopa & Amand F. Schmidt

Department of Cardiology, Amsterdam Cardiovascular Science, Amsterdam University Medical Centers, University of Amsterdam, Amsterdam, The Netherlands
Katarzyna Dziopa, Folkert W. Asselbergs & Amand F. Schmidt

Department of Population Science and Experimental Medicine, University College London, London, UK
Nishi Chaturvedi

The National Institute for Health Research UCL Hospitals Biomedical Research Centre, University College London, London, UK
Folkert W. Asselbergs

UCL BHF Research Accelerator Centre, London, UK
Amand F. Schmidt

Contributions

A.F.S., F.W.A., N.C. contributed to the idea and design of the study. K.D. prepared the dataset for analysis and implemented feature selection methods. K.D. and A.F.S. conducted the data analysis and created the figures. K.D. wrote the manuscript with support from F.W.A., A.F.S., and N.C. F.W.A., N.C. and A.F.S. provided critical feedback on the analysis and its interpretation and commented on the drafted manuscript. K.D. is responsible for the integrity of the work as a whole.

Corresponding author

Correspondence to Katarzyna Dziopa.

Ethics declarations

Competing interests

N.C. serves on data safety and monitoring committees of clinical trials sponsored by AstraZeneca. A.F.S. has received funding from New Amsterdam Pharma for unrelated work. A.F.S. is an Editorial Board Member for Communications Medicine but was not involved in the editorial review or peer review, nor in the decision to publish this article. K.D. and F.W.A. declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information
Description of Additional Supplementary Files
Supplementary Data 1–8
Supplementary Software
Supplementary Software Information
Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Dziopa, K., Chaturvedi, N., Asselbergs, F.W. et al. Identifying and ranking non-traditional risk factors for cardiovascular disease prediction in people with type 2 diabetes. Commun Med 5, 77 (2025). https://doi.org/10.1038/s43856-025-00785-y

Received: 27 January 2024
Accepted: 25 February 2025
Published: 14 March 2025