CDC Diabetes-Risk Analytics POC

Academic POC

An academic ML proof-of-concept classifying three-class diabetes risk on a large CDC public-health dataset.

Stack

R (nnet, class, rpart)
Python (pandas, matplotlib, python-pptx)

A solo, end-to-end academic analytics project on the CDC Diabetes Health Indicators dataset (BRFSS 2015, 253,680 records): multinomial logistic regression, K-nearest-neighbors, and a tuned/pruned decision tree compared on the same held-out 30% test split with a fixed seed.
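A fair comparison depends on every model seeing the same train and test rows. The sketch below shows one way to get that in Python: a single seeded shuffle of row indices, reused by all models. It illustrates the idea and is not the project's R code. The helper name split_indices is hypothetical, the split is a plain unstratified shuffle (whether the project stratified is not stated), and the same seed would not reproduce the rows R selects.

```python
import random

def split_indices(n_rows, test_fraction=0.30, seed=401):
    """Return (train_idx, test_idx) from one seeded shuffle of row indices.

    Hypothetical helper: a plain, unstratified 70/30 split.
    """
    rng = random.Random(seed)            # fixed seed makes the split repeatable
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_fraction)
    test_idx = sorted(indices[:n_test])  # held-out rows, shared by every model
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx

# 253,680 is the record count quoted above.
train_idx, test_idx = split_indices(253_680)
```

Computing the indices once and passing them to each model keeps the comparison about the models rather than about which rows each one happened to draw.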

The honest headline is the point: the best model (a tuned decision tree) reached 84.75% test accuracy, only marginally above the dataset's 84.2% majority-class base rate. The writeup says so plainly and foregrounds per-class confusion-matrix evidence instead of the headline number, because accuracy alone is misleading on a dataset this imbalanced.
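To make the base-rate argument concrete, here is a minimal Python sketch that puts accuracy next to the majority-class baseline and per-class recall. The labels are invented toy data, not the project's predictions, and the class coding (0 = no diabetes, 1 = prediabetes, 2 = diabetes) is an assumption about how the three classes are encoded.

```python
from collections import Counter

def majority_baseline(y_true):
    """Accuracy of always predicting the most common class."""
    counts = Counter(y_true)
    return max(counts.values()) / len(y_true)

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Toy labels only (assumed coding: 0 = none, 1 = prediabetes, 2 = diabetes).
y_true = [0] * 84 + [1] * 2 + [2] * 14
# A model that catches just 2 of the 14 diabetes cases and no prediabetes.
y_pred = [0] * 84 + [0] * 2 + [0] * 12 + [2] * 2

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
baseline = majority_baseline(y_true)
recall = per_class_recall(y_true, y_pred)
```

In this toy case accuracy sits two points above the baseline while recall on the minority classes is poor, which is the pattern a single accuracy figure hides.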

The useful finding is operational, not the accuracy figure. The tree isolates a small, clean 'rule-out' group and a small high-risk leaf (BMI ≥ 35, high cholesterol, poor self-rated health, high blood pressure) with a ~60% diabetes rate, roughly five times the population base rate. That is exactly the kind of tractable population a screening program could prioritize.
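A leaf like this is just a conjunction of conditions, so its rate and its lift over the base rate can be checked directly. The Python sketch below does that on a handful of invented records. The field names (BMI, HighChol, GenHlth, HighBP, Diabetes_012) and the cutoff used for "poor self-rated health" (GenHlth >= 4) are assumptions made for illustration, not the tree's actual split values.

```python
def rate(records, is_positive):
    """Share of records for which is_positive is true."""
    return sum(is_positive(r) for r in records) / len(records)

def high_risk_leaf(r):
    """Hypothetical encoding of the high-risk leaf described above."""
    return (r["BMI"] >= 35 and r["HighChol"] == 1
            and r["GenHlth"] >= 4       # assumed cutoff for poor health
            and r["HighBP"] == 1)

def has_diabetes(r):
    return r["Diabetes_012"] == 2       # assumed class coding

fields = ("BMI", "HighChol", "GenHlth", "HighBP", "Diabetes_012")
# Invented toy rows, only to exercise the functions.
rows = [
    (38, 1, 5, 1, 2), (36, 1, 4, 1, 2), (41, 1, 4, 1, 0),
    (24, 0, 1, 0, 0), (27, 1, 2, 0, 0), (31, 0, 3, 1, 2),
    (22, 0, 2, 0, 0), (29, 1, 3, 1, 0), (35, 0, 4, 1, 0),
    (26, 0, 1, 0, 0),
]
records = [dict(zip(fields, row)) for row in rows]

leaf = [r for r in records if high_risk_leaf(r)]
leaf_rate = rate(leaf, has_diabetes)
base_rate = rate(records, has_diabetes)
lift = leaf_rate / base_rate            # how many times the base rate
```

Reporting the leaf's size alongside its lift matters in practice, since a very high rate in a tiny group may be too small to act on.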

Measured results

Best-model test accuracy (tuned decision tree): 0.8475

Barely above the 84.2% majority-class base rate, so the per-class evidence, not this number, is the real result.

Source: Project's model-comparison output and final report, measured on a held-out 30% test split (seed 401).

Documents

Read the full report (PDF)
Presentation slide deck (PPTX)

Academic coursework. Happy to walk through the code and the analysis on request.