Journal of Behavioral Data Science, 2026, 6 (1), 1-16.
DOI:https://doi.org/10.35566/jbds/thakur

Computational Approaches to Diabetes Risk Assessment: A Review of Data-Driven Techniques

Agrimaa Singh Thakur\(^{1}\) and Amit Verma\(^{2}\)
\(^{1}\) \(^{}\)Research Scholar, Department of Computer Science and Engineering,
Maharaja Agrasen University, Himachal Pradesh, India
agrimaa26@gmail.com
\(^{2}\) \(^{}\)Associate Professor, Department of Computer Science and Engineering,
Maharaja Agrasen University, Himachal Pradesh, India
verma0152@gmail.com

\(\star \)Corresponding author: agrimaa26@gmail.com
Abstract. Over 540 million people worldwide suffer from diabetes mellitus, making it a serious global health concern. The advancement of robust predictive models that surpass traditional risk assessment approaches has demonstrated significant potential due to machine learning techniques. This thorough analysis summarizes the state of the art in machine learning-based diabetes prediction systems by examining algorithmic approaches, dataset properties, and performance indicators. The analysis shows how advanced ensemble and deep learning techniques have replaced more conventional statistical methods in order to achieve better results. Critical drawbacks still exist, nonetheless, such as an excessive dependence on datasets with a restricted demographic, a lack of real-world validation, and inadequate model interpretability for clinical acceptability. Regulatory obstacles, population-specific dataset variability, and discrepancies between algorithmic performance and therapeutic impact are some of the main obstacles. In order to convert advancements into clinically useful systems, future priorities include creating representative datasets, putting explainable artificial intelligence (AI) into practice, and carrying out prospective clinical studies.

Keywords: Diabetes Prediction 🞄 Machine Learning 🞄 Healthcare Analytics 🞄 Predictive Modeling 🞄 Clinical Decision Support

1 Introduction

Diabetes mellitus is a chronic metabolic condition marked by persistent hyperglycemia brought on by insufficient insulin secretion, defective insulin action, or a combination of the two. It has developed into one of the 21st century’s most urgent worldwide health issues. The International Diabetes Federation’s latest projections paint an alarming picture: from 540 million affected individuals in 2023, the global burden is expected to escalate to 783 million by 2045, with potential costs exceeding $1 trillion annually (International Diabetes Federation20232025c). This exponential growth trajectory positions diabetes as not merely a medical condition but a socioeconomic crisis demanding immediate and innovative interventions.

The pathophysiological complexity of diabetes encompasses multiple types, each presenting distinct challenges. Type 1 diabetes, an autoimmune condition primarily affecting younger populations, results from pancreatic beta-cell destruction and necessitates ongoing insulin treatment. (American Diabetes Associationn.d.-bAtkinson, Eisenbarth, & Michels2014Knip & Simell2012). Type 2 diabetes, which makes up 90–95% of all cases, is primarily linked to obesity, a sedentary lifestyle, and genetic predisposition. It is caused by growing insulin resistance and relative insulin shortage (American Diabetes Associationn.d.-cDefronzo, Ferrannini, Zimmet, & Alberti2015). Gestational diabetes mellitus affects pregnant women previously undiagnosed with diabetes, presenting risks to both maternal and fetal health while predicting future Type 2 diabetes development (American Diabetes Associationn.d.-a).

The clinical manifestations of diabetes extend far beyond elevated blood glucose levels. The disease precipitates a cascade of complications that significantly impact quality of life and mortality rates. The primary cause of death for diabetics is still cardiovascular disease, and their risk of heart attacks and strokes is significantly higher than that of non-diabetics. When diabetic nephropathy develops into end-stage renal disease, dialysis or kidney transplants are frequently required. Diabetic neuropathy, or damage to the nerves, raises the risk of foot ulcers, which in extreme circumstances can lead to amputation. Similarly, the main cause of vision impairment and blindness in adults is still diabetic eye problems (retinopathy). These complications not only devastate individual lives but impose enormous economic burdens on healthcare systems worldwide.

India’s rise to prominence as the ”diabetes capital of the world” is a prime example of the scope of the worldwide problem. Nearly 17% of the world’s diabetes cases are in India, where there are over 90 million cases among people aged 20 to 79 (International Diabetes Federation2025a2025b). According to the World Health Organization, India will account for 58% of the rise in Type 2 diabetes incidence worldwide and will see a 90% increase in diabetes-related mortality by 2030 compared to 2017 levels (World Health Organization2016). These figures demonstrate the critical need for economically feasible, culturally appropriate, and population-specific prediction systems.

The combination of artificial intelligence and machine learning into diabetes prediction represents a paradigm shift toward personalized medicine and precision healthcare. By analyzing diverse datasets encompassing genetic profiles, lifestyle factors, medical histories, and real-time physiological data, these technologies enable the development of individualized treatment plans and risk assessments (Rajkomar, Dean, & Kohane2019). The potential extends beyond prediction to encompass continuous monitoring, treatment optimization, and complication prevention, fundamentally transforming diabetes care from reactive to proactive. However, the translation of machine learning advances into clinical practice faces significant challenges. Issues of model interpretability, regulatory approval, healthcare professional acceptance, and integration with existing clinical workflows present substantial barriers. Additionally, concerns regarding data privacy, algorithmic bias, and generalizability across diverse populations require careful consideration. The over-reliance on limited datasets, particularly the PIMA Indian Diabetes Dataset, raises questions about model applicability across different ethnic groups, geographic regions, and healthcare systems.

The use of machine learning techniques to predict diabetes has advanced significantly in recent years. From utilizing simple algorithm comparisons, researchers have progressed to applying sophisticated ensemble models and deep learning methods. This section provides a thorough analysis of important studies across time, with a focus on new research trends, performance improvements, and methodological advancements.

2.1 Foundational Studies

The early period of ML-based diabetes prediction was characterized by establishing baseline performance metrics and exploring the potential of traditional machine learning algorithms. Uloko et al. (2018) conducted one of the first comprehensive meta-analyses focusing on diabetes risk factors in Nigeria, examining 23 independent studies comprising 14,650 participants. Their work identified urbanization, lack of physical activity, aging, and poor dietary habits as primary risk factors, establishing the epidemiological foundation for subsequent predictive modeling efforts. Joshi, Pramila, and Chawan (2018) represented pioneering efforts in algorithmic comparison, incorporating Logistic Regression and Support Vector Machines for early diabetes prediction. Their study achieved 79% accuracy with SVM, demonstrating the potential of machine learning approaches while highlighting the need for performance improvements. This work established the benchmark for subsequent comparative studies and emphasized the importance of feature selection in predictive accuracy. Sneha and Gangil (2019) contributed to the field by thoroughly evaluating different algorithms on the PIMA Indian Diabetes Dataset using the WEKA software platform. Their comparison of Naive Bayes, Random Forest, and Decision Trees yielded significant insights: Random Forest achieved 98.20% specificity, Decision Tree reached 98.00% specificity, while Naive Bayes demonstrated 82.30% overall accuracy. The study’s emphasis on feature selection techniques for early identification established important methodological precedents for optimal classification performance enhancement. Sonar and JayaMalini (2019) expanded the algorithmic landscape by incorporating Artificial Neural Networks (ANN) alongside traditional approaches including Gaussian Naive Bayes, SVM, and Decision Trees. Their work demonstrated ANN’s superiority over conventional algorithms when applied to the PIMA dataset, achieving improved precision, recall, accuracy, and F1-score metrics. This study marked the beginning of neural network applications in diabetes prediction, setting the stage for subsequent deep learning developments.

2.2 Methodological Innovations

The intermediate period witnessed significant methodological sophistication, with researchers focusing on ensemble methods, advanced preprocessing techniques, and novel feature engineering approaches. Mujumdar and Vaidehi (2019) achieved a breakthrough with their integrated approach combining genetic susceptibility, lifestyle habits, and clinical measures. Their AdaBoost pipeline achieved exceptional 98.8% accuracy, while the Logistic Regression model maintained 96% classification accuracy, demonstrating the power of ensemble techniques in diabetes prediction. Saru and Subashree (2019) contributed to the methodological foundation through medical bioinformatics analysis, comparing Naive Bayes, Decision Trees, and K-Nearest Neighbor algorithms using UCI repository data. Their work established KNN’s superior accuracy performance while emphasizing the critical role of classifier selection for precise and timely diagnosis. The study highlighted machine learning’s potential for enhancing diabetes prediction algorithms beyond traditional statistical approaches. Kopitar, Kocbek, Cilar Budler, Sheikh, and Stiglic (2020) conducted one of the most comprehensive algorithmic comparisons of the period, evaluating LightGBM, Glmnet, XGBoost, and Random Forest against traditional regression analysis. Using 100 bootstrap resamples to simulate recurring information arrival, their study revealed algorithm ranking: XGBoost \(>\) Random Forest \(>\) LightGBM \(>\) Glmnet \(>\) Simple Regression. Significantly, LightGBM demonstrated remarkable stability in variable selection over time, establishing its value for longitudinal predictive modeling. Syed and Khan (2020) advanced the clinical applicability through their cross-sectional survey approach, employing binary logistic regression and Chi-Squared tests for Type 2 diabetes prediction. Their Decision Forest model achieved superior performance with mean F1 score of 0.8453 ± 0.0268, validated across NHANES and PIDD datasets. The deployment of their calibrated model as an API web service represented significant progress toward clinical translation.

2.3 Advanced Ensemble and Deep Learning Approaches

In the past few years, diabetes prediction models have become more refined, as researchers have combined advanced computational techniques to enhance accuracy and ensure stronger clinical applicability. Zhang, Wang, Niu, et al. (2020) utilized data from the Henan rural cohort study to examine machine learning performance in rural Chinese populations. Their evaluation of six algorithms (SVM, Random Forest, ANN, Gradient Boosting Machine, Classification and Regression Trees, and Logistic Regression) achieved moderate predictive performance with AUC values ranging 0.767-0.872, with GBM achieving the highest AUC. Importantly, their work identified novel risk factors including sweet taste preference and urinary symptoms that traditional models overlooked. Ahmed et al. (2022) demonstrated the power of comprehensive methodological integration, applying ensemble methods, deep learning, and feature engineering to health parameters and medical records. Their approach achieved remarkable 94.87% accuracy, representing substantial improvement over traditional methods through systematic integration of advanced techniques. Zhou, Xin, and Li (2023) achieved exceptional performance through their innovative combination of Boruta feature selection and ensemble learning techniques. Their systematic approach utilizing PIMA dataset with Boruta’s statistical significance-based feature selection and K-Means++ clustering achieved 98% accuracy. The integration of stacking ensemble methods for classification demonstrated superior performance compared to related methods, indicating significant potential for practical diabetes prevention and management. Doğru, Buyrukoglu, and Ari (2023) introduced hybrid super ensemble learning, combining meta-learning models with SVM across multiple datasets. Their approach achieved outstanding accuracy rates: 99.6% for early-stage diabetes prediction, 92% for PIMA dataset, and 98% for hospital datasets. The Chi-square test emerged as the optimal feature selection method, with GridSearch optimization of hyperparameters contributing to exceptional performance across diverse datasets.

2.4 Clinical Integration and Real-World Applications

Recently, studies have shifted their attention toward bridging the gap between research and practice, emphasizing the implementation of predictive models in real clinical settings and real-world healthcare environments. Su, Huang, Zhu, Lyu, and Ji (2023) employed federated learning techniques to overcome important privacy concerns, enabling multi-institutional collaboration without compromising patient data privacy. Their secure protocols for regression and tree-based models demonstrated effectiveness across XGBoost, LightGBM, Neural Networks, and Logistic Regression, validated on both PIMA and local datasets. Hennebelle, Materwala, and Ismail (2023) introduced HealthEdge, representing comprehensive IoT edge and cloud computing-based predictive modeling. Using data from Sylhet Diabetes Hospital in Bangladesh and PIDD, their Random Forest algorithm achieved 97% accuracy with 6% average predictive improvement, demonstrating the promising future of integrated predictive healthcare frameworks. Qi, Song, Liu, Zhang, and Wong (2023) presented the sophisticated KFPredict Ensemble Model, incorporating Recursive Feature Elimination and correlation coefficient analysis with multi-input neural networks. Their final stacking approach combining KF_NN with SVM, Random Forest, and KNN achieved 93.5% accuracy, 85% sensitivity, and 98% specificity, representing up to 18.18% and 14.93% improvement over single prediction methods and previous models respectively.

Current research is increasingly directed toward specialized domains and the integration of advanced technologies to improve diabetes prediction. These efforts emphasize addressing specific clinical challenges, enhancing diagnostic accuracy, and promoting more personalized and effective healthcare solutions. Aslan and Sabanci (2023) proposed novel deep learning approaches by converting numerical attributes in PIMA dataset to image data, enabling CNN models like ResNet18 and ResNet50 for diabetes prediction. Their investigation of fusion strategies combining deep features with SVM classification demonstrated the efficiency of image-based representations for early detection enhancement. Butunoi, Stolojescu-Crisan, and Negru (2024) developed sophisticated algorithms for Type 1 diabetes management, focusing on macrovascular complications and severe hyperglycemia/hypoglycemia episode prediction. Their analysis of GRU, LSTM, RNN architectures, and regression models using Dexcom G6 continuous glucose monitoring data highlighted the importance of accurate forecast models for daily diabetes management and long-term outcome improvement. Kokkorakis et al. (2023) addressed critical generalizability challenges through predictive model development across various ethnicities using UK Biobank data. The researchers developed logistic regression classifiers using training data exclusively from White participants, subsequently validating model performance across five additional ethnic populations and the Lifelines cohort. The models demonstrated robust discriminative capacity, yielding area under the receiver operating characteristic curve (AUROC) values of 0.901 for cross-sectional prevalence prediction and 0.873 for prospective eight-year incidence forecasting. These metrics indicate strong generalizability of the predictive framework across ethnically heterogeneous populations.

2.6 Comparative Analysis

A thorough comparison of the examined methods spanning datasets, algorithms, performance metrics, significant advancements, and constraints is given in Table 1. The development of diabetes prediction research in recent years is shown by this thorough analysis.

Table 1: Comprehensive Comparison of Diabetes Prediction Studies

Authors & Year

Dataset(s)

Algorithms

Best Performance

Key Innovation / Contribution

Joshi et al. (2018)

PIMA

Logistic Regression, SVM

79% accuracy (SVM)

Early algorithmic comparison

Sneha and Gangil (2019)

PIMA (WEKA)

Naive Bayes, Random Forest, Decision Tree

RF: 98.20% Specificity; NB: 82.30% Accuracy

Feature selection for early identification

Sonar and JayaMalini (2019)

PIMA

ANN, Gaussian NB, SVM, Decision Trees

ANN superior (Precision, Recall, F1)

First neural network application

Mujumdar and Vaidehi (2019)

PIMA

AdaBoost, Logistic Regression

AdaBoost: 98.8%; LR: 96%

Integrated genetic, lifestyle, clinical measures

Saru and Subashree (2019)

UCI Repository

Naive Bayes, Decision Trees, KNN

KNN superior

Medical bioinformatics approach

Kopitar et al. (2020)

Clinical (100 bootstrap resamples)

LightGBM, Glmnet, XGBoost, RF

XGBoost \(>\) RF \(>\) LightGBM \(>\) Glmnet

LightGBM stability in variable selection

Syed and Khan (2020)

NHANES, PIDD

Decision Forest, Binary LR, Chi-Square

F1: 0.8453 \(\pm \) 0.0268

Deployed as API web service

Zhang et al. (2020)

Henan Rural Cohort (China)

SVM, RF, ANN, GBM, CART, LR

GBM: AUC 0.872

Novel risk factors (sweet taste, urinary symptoms)

Alanazi and Mezher (2020)

Saudi Arabia Healthcare

Random Forest

AUROC 0.99

Population-specific risk factors

Ahmed et al. (2022)

Health parameters and records

Ensemble, Deep Learning, Feature Eng.

94.87% accuracy

Comprehensive methodological integration

Bhat, Selvam, Ansari, Ansari, and Rahman (2022)

North Kashmir (\(>\)1,000 records)

Random Forest

98% Accuracy

Local demographic characteristics

Zhou et al. (2023)

PIMA

Boruta, K-Means++, Stacking

98% Accuracy

Statistical significance-based features

Doğru et al. (2023)

Multiple datasets

Hybrid super ensemble, SVM, GridSearch

99.6% (early-stage); 92% (PIMA)

Hybrid ensemble across diverse datasets

Su et al. (2023)

PIMA, Local dataset

Federated Learning: XGBoost, LightGBM, NN, LR

Effective across algorithms

Privacy-preserving multi-institutional

Hennebelle et al. (2023)

Sylhet Hospital (Bangladesh), PIDD

RF (IoT edge + cloud)

97% Accuracy (6% improvement)

HealthEdge: IoT-edge-cloud framework

Qi et al. (2023)

PIMA

KFPredict: RFE, multi-input NN, stacking

93.5% Accuracy; 85% Sensitivity; 98% Specificity

18.18% improvement over single methods

Aslan and Sabanci (2023)

PIMA (as images)

CNN (ResNet18/50), Deep features + SVM

Efficient early detection

Numerical-to-image conversion for CNN

Kokkorakis et al. (2023)

UK Biobank (631,748 prev; 67,083 inc)

Logistic Regression (cross-ethnic)

AUC 0.901 (prev); 0.873 (inc)

Multi-ethnic validation (5 groups)

Butunoi et al. (2024)

Dexcom G6 CGM (Type 1)

GRU, LSTM, RNN, Regression

Varies by episode

Type 1 focus; complications prediction

With accuracy rising from 79% in early research to 99.6% in more current hybrid ensemble techniques, the comparison shows notable performance gains. In terms of methodology, the discipline has advanced from basic classifiers to deep learning, federated approaches, and complex ensemble techniques. Even while the PIMA dataset is still the most popular, contemporary research is using multi-ethnic and population-specific datasets more and more to improve generalizability. Notwithstanding these developments, significant obstacles still exist, including deployment complexity, model interpretability issues, inadequate clinical validation, and demographic restrictions in training data. To convert computational advancements into therapeutic impact, these gaps must yet be filled.

3 Datasets in Diabetes Prediction: A Comprehensive Analysis

The foundation of any successful machine learning application lies in the quality, diversity, and representativeness of the underlying datasets. In diabetes prediction research, dataset characteristics significantly influence model performance, generalizability, and clinical applicability. This section provides an exhaustive analysis of datasets commonly employed in diabetes prediction studies, examining their strengths, limitations, and impact on model development.

3.1 Primary Datasets in Diabetes Prediction Research

 

PIMA Indian Diabetes Dataset (PIDD)

The PIMA Indian Diabetes Dataset stands as the most frequently utilized resource in diabetes prediction research, appearing in over 60% of published studies (UCI Machine Learning Repositoryn.d.). These data were originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases and include the medical records of 768 female patients of Pima Indian ethnicity who were 21 years of age or older. The dataset comprises eight numerical attributes across 768 instances with binary classification outcomes for diabetic versus non-diabetic status. The class distribution shows approximately 35% positive cases representing diabetic patients, while the population consists exclusively of Pima Indian women ranging in age from 21 to 81 years. The feature characteristics include pregnancies ranging from zero to seventeen occurrences, plasma glucose concentration measured at two hours during oral glucose tolerance testing with values spanning zero to 199 mg/dL, diastolic blood pressure measurements in mmHg ranging from zero to 122, and triceps skinfold thickness measurements in millimeters with values from zero to 99. The dataset provides several advantages including its establishment as a well-recognized benchmark enables direct comparison across studies, clean and preprocessed data with minimal missing values, balanced feature representation covering key diabetes risk factors, extensive validation across multiple machine learning algorithms, and open-source availability facilitating reproducible research. However, significant limitations affect the dataset’s applicability. The demographic diversity remains severely restricted through single ethnic group representation and gender-specific sampling, while the small sample size may limit model robustness and generalizability. Potential genetic homogeneity reduces applicability across broader populations, temporal constraints from data collection within specific time periods affect relevance, and geographic specificity limits global applicability of developed models.

Framingham Heart Study Dataset

The Framingham Heart Study represents one of the most comprehensive longitudinal cardiovascular research initiatives, initiated in 1948 and continuing today. For diabetes prediction applications, researchers utilize subsets of this extensive database containing approximately 4,240 records with multiple clinical, demographic, and lifestyle variables collected through prospective longitudinal cohort design (Framingham Heart Studyn.d.).

The population consists predominantly of Caucasian residents of Framingham, Massachusetts, with data collection spanning multiple decades enabling temporal relationship analysis. Demographic characteristics like age and gender, cardiovascular risk factors like blood pressure and cholesterol, lifestyle factors like smoking and physical activity, anthropometric measurements like body mass index (BMI) and waist circumference, biochemical parameters like glucose and lipid profiles, and comprehensive family history data are all important factors in predicting diabetes.

Limitations include predominantly Caucasian population composition limiting ethnic diversity and generalizability, geographic specificity from single US location restricting broader applicability, potential cohort effects from long-term study design affecting temporal validity, complex data structure requiring sophisticated preprocessing techniques, and limited representation of contemporary lifestyle factors affecting current relevance.

CDC BRFSS Diabetes Health Indicators Dataset

The Behavioral Risk Factor Surveillance System represents the world’s largest continuously conducted health survey system, providing population-level diabetes risk factor data through annual cross-sectional telephone surveys. The dataset contains over 253,680 records with 21 or more health behavior and demographic variables, covering all 50 US states, District of Columbia, and territories with representative adult US population sampling (Centers for Disease Control and Preventionn.d.-a).

Key features encompass general health status indicators, BMI and physical activity measures, healthcare access and utilization patterns, demographics including age, education, and income levels, behavioral risk factors such as smoking and alcohol consumption, chronic disease indicators, and preventive health behaviors. The massive sample size enables population-level analysis and subgroup investigations, while geographic diversity across US states provides regional variation analysis capabilities. Standardized data collection protocols ensure consistency and comparability, contemporary data reflects current health trends and behaviors, and complex survey design accounts for population representation through appropriate weighting procedures. However, limitations include self-reported data with potential recall and social desirability bias, limited clinical laboratory values reducing diagnostic precision, US- specific population characteristics limiting global generalizability, complex survey weights requiring specialized statistical analysis techniques, and potential temporal inconsistencies across different survey years.

NHANES Dataset (National Health and Nutrition Examination Survey)

NHANES provides comprehensive health and nutritional status information for the US population through cross-sectional surveys with continuous data collection. The dataset varies by cycle with approximately 5,000 participants per year, incorporating interview, examination, and laboratory components representing the US civilian population (Centers for Disease Control and Preventionn.d.-b).

Key features for diabetes prediction include laboratory glucose and insulin measurements, HbA1c levels and other biomarkers, anthropometric measurements such as BMI and waist circumference, blood pressure and cardiovascular indicators, dietary assessment data, socioeconomic and demographic variables, and physical activity and lifestyle factors. The comprehensive clinical and laboratory data provides high-quality biomarker measurements, while standardized examination protocols ensure data consistency and reliability. Representative population sampling enables generalization to broader US population, and continuous data collection enables trend analysis over time. However, limitations include complex survey design requiring weighted analysis techniques, limited sample size compared to BRFSS reducing power for subgroup analyses, US-specific population characteristics limiting international applicability, costly data collection procedures limiting global replication, and potential selection bias in examination participation affecting representativeness. Figure 1 reveals a significant imbalance in dataset usage across diabetes prediction research. The analysis shows that studies overwhelmingly favor the PIMA Indian Diabetes Dataset, despite the availability of larger datasets with more diverse ethnic representation and longitudinal follow-up data. This narrow focus on a single dataset raises important questions about whether predictive models can reliably generalize to broader populations and clinical settings.

PIC

Figure 1: Dataset distribution in diabetes prediction studies.

3.2 Emerging and Specialized Datasets

Local clinical datasets have gained increasing prominence in recent studies to improve regional applicability and clinical relevance. The North Kashmir District Dataset utilized by Bhat et al. (2022) contains over 1,000 patient records from Bandipora district, incorporating local demographic and clinical characteristics that achieved 98% accuracy with Random Forest algorithms, demonstrating community-level prediction effectiveness and regional specificity advantages. The Saudi Arabia Healthcare Dataset employed by Alanazi and Mezher (2020) draws from Security Force Primary Health Care Centre data, incorporating population-specific risk factors, regional genetic and lifestyle considerations, and achieved superior performance with Random Forest algorithms showing AUROC values of 0.99. Multicultural and international datasets address generalizability challenges across diverse populations. The UK Biobank Diabetes Data utilized by Kokkorakis et al. (2023) encompasses 631,748 participants for prevalence prediction and 67,083 for incidence prediction, providing multiple ethnic group representation through questionnaire-based feature collection and cross-ethnic validation capabilities. The Chinese Rural Cohort employed by Zhang et al. (2020) draws from the Henan Rural Cohort Study with 252,176 follow-up records, representing rural Chinese population characteristics while enabling novel risk factor identification and cultural and dietary factor integration in predictive modeling.

3.3 Data Quality and Preprocessing Challenges

Diabetes prediction datasets consistently exhibit missing data patterns requiring sophisticated handling strategies. The PIMA Dataset demonstrates characteristic missing value patterns with glucose showing 0.6% true zeros that are physiologically implausible, blood pressure exhibiting 4.6% zero values, skin thickness displaying 29.6% zero values, insulin measurements containing 48.7% zero values, and BMI showing 1.4% zero values.

Common imputation strategies include mean or median imputation providing simple but potentially distribution-distorting solutions, K-Nearest Neighbors imputation preserving local data structure, multiple imputations accounting for uncertainty in missing values, and domain-specific rules applying clinical knowledge-based value assignment. Each approach presents trade-offs between computational complexity, statistical validity, and clinical interpretability.

Class imbalance issues affect most diabetes datasets with diabetic cases typically representing 20-35% of samples, requiring specialized handling techniques. Synthetic Minority Oversampling Technique generates synthetic examples of minority class instances, Adaptive Synthetic Sampling focuses on difficult-to-learn minority examples, random under sampling reduces majority class representation, cost- sensitive learning approaches adjust misclassification penalties, and ensemble methods with balanced sampling combine multiple models with different class distributions.

Feature standardization and normalization require various preprocessing approaches depending on dataset characteristics and algorithmic requirements. Numerical feature scaling includes Min-Max normalization scaling features to zero-one range, Z-score standardization creating zero mean and unit variance distributions, robust scaling using median and interquartile range to handle outliers, and quantile transformation mapping to uniform or normal distributions.

Categorical variable encoding encompasses multiple strategies including one-hot encoding, which transforms nominal variables into binary indicator variables; label encoding, which assigns numerical values to ordinal variables while preserving ranking structure; target encoding, which replaces high-cardinality categories with corresponding target variable statistics; and embedding techniques for deep learning applications, which create dense vector representations capturing complex categorical relationships while reducing dimensionality.

3.4 Dataset Limitations and Generalizability Challenges

Extensive missing values (especially in the PIMA dataset, which has 48.7% missing insulin values and 29.6% missing skin thickness), class imbalance (diabetic cases typically make up only 20–35% of samples), and the requirement for suitable feature scaling and categorical encoding strategies are some of the major preprocessing challenges faced by diabetes prediction datasets. Demographic bias plagues current diabetes prediction research, with gaps in socioeconomic variety and ethnic representation resulting from a preponderance of Western datasets. Model generalizability across various populations and healthcare contexts is further limited by temporal considerations, such as changing treatment procedures and lifestyles, as well as regional differences in healthcare systems and cultural views.

4 Dataset Design and Real-World Clinical Utility

The nature and composition of training data fundamentally shape how well predictive models perform in actual healthcare settings. Studies relying on limited or narrow datasets—such as the frequently cited PIMA Indian Diabetes Database—often demonstrate performance levels that don’t translate to broader populations. These smaller datasets, while useful for testing algorithmic approaches, suffer from insufficient variation in ethnicity, sex, and geographical origin, which undermines their practical value in diverse clinical environments.

More promising results emerge from models trained on expansive, heterogeneous data sources like NHANES, BRFSS, UK Biobank, and multi-site hospital registries. These platforms offer several advantages: they capture wider demographic ranges, incorporate key diagnostic markers including HbA1c and glucose measurements, and in some cases provide longitudinal tracking that supports early risk identification rather than after-the-fact classification.

The evidence points clearly toward a design imperative: models destined for clinical implementation must be built on datasets that reflect actual patient diversity, include medically relevant biomarkers, and ideally capture health trajectories over time rather than single snapshots.

5 Evaluation Framework for Diabetes Risk Prediction: Methodological Rigor and Clinical Relevance

Despite frequently impressive performance claims, comparing diabetes prediction models across different studies proves difficult due to varied assessment approaches and incomplete metric reporting. Studies often highlight singular measures like accuracy or area under the ROC curve while neglecting critical aspects such as validation methodology, outcome distribution imbalances, and the relative costs of different error types—factors that substantially affect practical utility.

Ensuring model reliability demands validation approaches that guard against overoptimistic estimates and information leakage. Proper temporal or random data partitioning, population-representative cross-validation techniques, and testing on entirely separate datasets from different institutions or regions serve as fundamental safeguards. Testing performance on external populations—particularly those from distinct demographic or healthcare contexts—remains surprisingly rare despite being crucial for deployment readiness.

Given the typical scarcity of diabetes cases in screening populations and the differential consequences of false predictions, single-metric assessments prove inadequate. Comprehensive evaluation should encompass multiple dimensions: the ability to correctly identify true cases (sensitivity), positive prediction accuracy (precision), harmonic performance balance (F1-score), and probability calibration. The alignment between predicted probabilities and actual outcomes becomes especially vital for screening programs and clinical decision systems, where miscalibrated estimates may trigger inappropriate treatment decisions.

Performance metrics alone cannot determine clinical value or fairness. Decision curve methodology quantifies net benefit across various probability thresholds for intervention, while disaggregated performance analysis across patient subgroups reveals potential inequities tied to age, gender, or racial background. These assessments prove indispensable for ensuring models serve diverse populations equitably.

Synthesizing these requirements yields a recommended evaluation framework for future research:

Widespread adoption of these evaluation principles would enhance methodological transparency and accelerate the development of diabetes prediction systems suitable for real-world healthcare implementation.

6 Conclusion

With ensemble methods, deep learning models, and standard algorithms routinely surpassing established risk assessment tools, machine learning technologies have shown great potential for diabetes prediction. Opportunities for improved early diagnosis and individualized risk classification are presented by these developments. Critical obstacles, such as a lack of uniform evaluation frameworks, a lack of dataset diversity across global populations, and limited model interpretability that prevents clinical acceptance, restrict clinical translation.

Future research priorities must address these gaps through development of globally representative datasets, implementation of explainable AI methodologies, and rigorous prospective clinical validation studies. Success will ultimately depend on interdisciplinary collaboration to ensure algorithmic innovations translate into meaningful improvements in patient care and population health outcomes.

References

   Ahmed, U., Issa, G., Aftab, S., Farhan Khan, M., Said, R., Ghazal, T., … Khan, M. (2022). Prediction of diabetes empowered with fused machine learning. IEEE Access. doi: https://doi.org/10.1109/ACCESS.2022.3142097

   Alanazi, A., & Mezher, M. (2020). Using machine learning algorithms for prediction of diabetes mellitus. In Proceedings of iccit (pp. 1–3). doi: https://doi.org/10.1109/ICCIT-144147971.2020.9213708

   American Diabetes Association.  (n.d.-a). Gestational diabetes. https://www.diabetes.org/diabetes/gestational-diabetes.

   American Diabetes Association.  (n.d.-b). Type 1 diabetes. https://www.diabetes.org/diabetes/type-1.

   American Diabetes Association.  (n.d.-c). Type 2 diabetes. https://www.diabetes.org/diabetes/type-2.

   Aslan, M. F., & Sabanci, K. (2023). A novel proposal for deep learning-based diabetes prediction: Converting clinical data to image data. Diagnostics, 13(4), 796. doi: https://doi.org/10.3390/diagnostics13040796

   Atkinson, M. A., Eisenbarth, G. S., & Michels, A. W. (2014). Type 1 diabetes. Lancet, 383(9911), 69–82. doi: https://doi.org/10.1016/S0140-6736(13)60591-7

   Bhat, S. S., Selvam, V., Ansari, G. A., Ansari, M. D., & Rahman, M. H.  (2022). Prevalence and early prediction of diabetes using machine learning in north kashmir: A case study of district bandipora. Computational Intelligence and Neuroscience, 2022, 2789760. doi: https://doi.org/10.1155/2022/2789760

   Butunoi, B.-P., Stolojescu-Crisan, C., & Negru, V.  (2024). Blood glucose prediction in type 1 diabetes based on long short-term memory. In Recent advances in artificial intelligence (pp. 1–10). doi: https://doi.org/10.1007/978-3-031-70259-4_35

   Centers for Disease Control and Prevention. (n.d.-a). Behavioral risk factor surveillance system (brfss). https://www.cdc.gov/brfss/annual_data/annual_data.htm.

   Centers for Disease Control and Prevention. (n.d.-b). National health and nutrition examination survey (nhanes). https://wwwn.cdc.gov/nchs/nhanes/.

   Defronzo, R. A., Ferrannini, E., Zimmet, P., & Alberti, G. (2015). International textbook of diabetes mellitus. John Wiley & Sons.

   Doğru, A., Buyrukoglu, S., & Ari, M. (2023). A hybrid super ensemble learning model for the early-stage prediction of diabetes risk. Medical & Biological Engineering & Computing, 61. doi: https://doi.org/10.1007/s11517-022-02749-z

   Framingham Heart Study.  (n.d.). Framingham heart study dataset. https://www.kaggle.com/datasets/aasheesh200/framingham-heart-study-dataset.

   Hennebelle, A., Materwala, H., & Ismail, L. (2023). Healthedge: A machine learning-based smart healthcare framework for prediction of type 2 diabetes in an integrated iot, edge, and cloud computing system. Procedia Computer Science, 220, 331–338. doi: https://doi.org/10.1016/j.procs.2023.03.043

   International Diabetes Federation. (2023). Diabetes facts and figures. Retrieved from https://idf.org/ (Accessed 2025)

   International Diabetes Federation. (2025a). Country data: India. Retrieved from https://diabetesatlas.org/ (Accessed 2025)

   International Diabetes Federation. (2025b). Data by location. Retrieved from https://diabetesatlas.org/ (Accessed 2025)

   International Diabetes Federation. (2025c). Idf diabetes atlas reports. Retrieved from https://diabetesatlas.org/ (Accessed 2025)

   Joshi, T. N., Pramila, M., & Chawan, P. (2018). Logistic regression and svm based diabetes prediction system. International Journal of Computer Applications, 180(20), 1–5.

   Knip, M., & Simell, O. (2012). Environmental triggers of type 1 diabetes. Cold Spring Harbor Perspectives in Medicine, 2(7), a007690. doi: https://doi.org/10.1101/cshperspect.a007690

   Kokkorakis, M., et al. (2023). Effective questionnaire-based prediction models for type 2 diabetes across several ethnicities: a model development and validation study. EClinicalMedicine, 64, 102235. doi: https://doi.org/10.1016/j.eclinm.2023.102235

   Kopitar, L., Kocbek, P., Cilar Budler, L., Sheikh, A., & Stiglic, G. (2020). Early detection of type 2 diabetes mellitus using machine learning-based prediction models. Scientific Reports, 10. doi: https://doi.org/10.1038/s41598-020-68771-z

   Mujumdar, A., & Vaidehi, V. (2019). Diabetes prediction using machine learning algorithms. Procedia Computer Science, 165, 292–299. doi: https://doi.org/10.1016/j.procs.2020.01.047

   Qi, H., Song, X., Liu, S., Zhang, Y., & Wong, K. K. L. (2023). Kfpredict: An ensemble learning prediction framework for diabetes based on fusion of key features. Computer Methods and Programs in Biomedicine, 231, 107378. doi: https://doi.org/10.1016/j.cmpb.2023.107378

   Rajkomar, A., Dean, J., & Kohane, I.  (2019). Machine learning in medicine. New England Journal of Medicine, 380(14), 1347–1358. doi: https://doi.org/10.1056/NEJMra1814259

   Saru, S., & Subashree, S. (2019). Analysis and prediction of diabetes using machine learning. International Journal of Emerging Technology and Innovative Engineering, 5(4).

   Sneha, N., & Gangil, T. (2019). Analysis of diabetes mellitus for early prediction using optimal features selection. Journal of Big Data, 6, 13. doi: https://doi.org/10.1186/s40537-019-0175-6

   Sonar, P., & JayaMalini, K. (2019). Diabetes prediction using different machine learning approaches. In Proceedings of iccmc (pp. 367–371). doi: https://doi.org/10.1109/ICCMC.2019.8819841

   Su, Y., Huang, C., Zhu, W., Lyu, X., & Ji, F. (2023). Multi-party diabetes mellitus risk prediction based on secure federated learning. Biomedical Signal Processing and Control, 85, 104881. doi: https://doi.org/10.1016/j.bspc.2023.104881

   Syed, A. H., & Khan, T. (2020). Machine learning-based application for predicting risk of type 2 diabetes mellitus (t2dm) in saudi arabia: A retrospective cross-sectional study. IEEE Access, 8, 199539–199561. doi: https://doi.org/10.1109/ACCESS.2020.3035026

   UCI Machine Learning Repository. (n.d.). Pima indians diabetes database. https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database.

   Uloko, A. E., Musa, B. M., Ramalan, M. A., Gezawa, I. D., Puepet, F. H., Uloko, A. T., … Sada, K. B. (2018). Prevalence and risk factors for diabetes mellitus in nigeria: A systematic review and meta-analysis. Diabetes Therapy, 9(3), 1307–1316. doi: https://doi.org/10.1007/s13300-018-0441-1

   World Health Organization. (2016). Global report on diabetes.

   Zhang, L., Wang, Y., Niu, M., et al.  (2020). Machine learning for characterizing risk of type 2 diabetes mellitus in a rural chinese population: the henan rural cohort study. Scientific Reports, 10, 4406. doi: https://doi.org/10.1038/s41598-020-61123-x

   Zhou, H., Xin, Y., & Li, S. (2023). A diabetes prediction model based on boruta feature selection and ensemble learning. BMC Bioinformatics, 24, 224. doi: https://doi.org/10.1186/s12859-023-05300-5