CDC Diabetes-Risk Analytics POC
Academic POC
An academic ML proof-of-concept classifying three-class diabetes risk on a large CDC public-health dataset.
Stack
- R (nnet, class, rpart)
- Python (pandas, matplotlib, python-pptx)
A solo, end-to-end academic analytics project on the CDC Diabetes Health Indicators dataset (BRFSS 2015, 253,680 records): three classification models — multinomial logistic regression, K-nearest-neighbors, and a tuned/pruned decision tree — trained and compared on the same held-out 30% test split with a fixed seed.
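The shared held-out split described above can be sketched as follows. This is a minimal illustration in plain Python, not the project's code: the function name `train_test_split_indices` is hypothetical, and a Python seed will not reproduce the row assignment of a split made in R with the same seed value. The point is only that one fixed seed yields one split, which every model is then trained and scored on.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.30, seed=401):
    """Return (train_idx, test_idx) for a reproducible random split.

    Hypothetical helper: the same seed always yields the same split,
    so every model can be evaluated on identical held-out rows.
    """
    rng = random.Random(seed)          # local RNG, does not touch global state
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_fraction)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


# 253,680 records with a 30% test split
train_idx, test_idx = train_test_split_indices(253_680)
print(len(train_idx), len(test_idx))
```

Reusing the returned index lists for all three models is what makes their test accuracies directly comparable.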
The honest headline is the point: the best model (a tuned decision tree) reached 84.75% test accuracy — only marginally above the dataset's 84.2% majority-class base rate. The writeup says so plainly and foregrounds per-class confusion-matrix evidence instead of the headline number, because accuracy alone is misleading on a dataset this imbalanced.
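To make the argument concrete, here is a small sketch of how accuracy, the majority-class baseline, and per-class recall can be read off one confusion matrix. The counts below are toy values chosen to mimic the imbalance pattern, not the project's measured results, and the class order (no diabetes, prediabetes, diabetes) is an assumption about how the three classes are labeled.

```python
def accuracy(confusion):
    """Share of all cases on the diagonal (rows = true, cols = predicted)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def majority_baseline(confusion):
    """Accuracy of a model that always predicts the most common true class."""
    total = sum(sum(row) for row in confusion)
    return max(sum(row) for row in confusion) / total


def per_class_recall(confusion):
    """Fraction of each true class that the model labels correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]


# Toy counts for illustration only (NOT the project's results).
toy = [
    [830, 0, 12],   # true: no diabetes
    [17, 0, 1],     # true: prediabetes
    [120, 0, 20],   # true: diabetes
]

print(f"accuracy          = {accuracy(toy):.3f}")
print(f"majority baseline = {majority_baseline(toy):.3f}")
print("per-class recall  =", [round(r, 3) for r in per_class_recall(toy)])
```

In this toy case accuracy sits just above the baseline while recall on the two minority classes is poor, which is the pattern a single headline number hides.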
The useful finding is operational, not the accuracy figure: the tree isolates a small, clean 'rule-out' group and a small high-risk leaf (BMI ≥ 35, high cholesterol, poor self-rated health, high blood pressure) with a ~60% diabetes rate — roughly five times the population base rate — which is exactly the kind of tractable population a screening program could prioritize.
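The "times the base rate" comparison is a lift calculation: the diabetes rate inside the leaf divided by the rate in the whole population. The sketch below shows that arithmetic on a handful of made-up records. The field names (`BMI`, `HighChol`, `HighBP`, `GenHlth`, `Diabetes_012`), the coding of diabetes as class 2, and the `GenHlth >= 4` cutoff for poor self-rated health are all assumptions for illustration; only the BMI ≥ 35 threshold comes from the text above.

```python
def in_high_risk_leaf(r):
    """Hypothetical rule mirroring the leaf described in the text."""
    return (r["BMI"] >= 35 and r["HighChol"] == 1
            and r["HighBP"] == 1 and r["GenHlth"] >= 4)  # GenHlth cutoff assumed


def diabetes_rate(records):
    """Share of records in the (assumed) diabetes class, coded 2."""
    return sum(r["Diabetes_012"] == 2 for r in records) / len(records)


def leaf_lift(records):
    """Leaf diabetes rate divided by the overall base rate."""
    leaf = [r for r in records if in_high_risk_leaf(r)]
    return diabetes_rate(leaf) / diabetes_rate(records)


def rec(bmi, chol, bp, hlth, label):
    return {"BMI": bmi, "HighChol": chol, "HighBP": bp,
            "GenHlth": hlth, "Diabetes_012": label}


# Ten made-up records (not drawn from the dataset).
records = [
    rec(38, 1, 1, 5, 2),   # in leaf, diabetes
    rec(36, 1, 1, 4, 0),   # in leaf, no diabetes
    rec(41, 0, 1, 4, 2),   # outside leaf (no high cholesterol), diabetes
] + [rec(24, 0, 0, 2, 0) for _ in range(7)]  # outside leaf, no diabetes

print(f"base rate = {diabetes_rate(records):.2f}")
print(f"lift      = {leaf_lift(records):.2f}")
```

A screening program would care about both numbers together: the lift says how concentrated the risk is, and the leaf's size says how many people the rule actually reaches.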
Measured results
Best test accuracy: 84.75% (tuned decision tree). Barely above the 84.2% majority-class base rate; the per-class evidence, not this number, is the real result.
Source: Project's model-comparison output and final report — measured on a held-out 30% test split (seed 401).
Documents