AW Dev Rethought

⚖️ There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. - C.A.R. Hoare

🧠 AI with Python – 📉 Predict Diabetes Progression


Description:

Predicting disease progression is a valuable use case in medical data science.

In this project, we use Linear Regression to estimate diabetes progression based on physiological measurements such as BMI, blood pressure, and serum test values.

This example uses the Diabetes dataset from scikit-learn, which is widely used for demonstrating regression models.


Understanding the Problem

Medical datasets often contain continuous numerical measurements.

Unlike classification tasks, here the goal is to predict a continuous outcome: diabetes progression one year after baseline.

Linear Regression helps us understand:

  • how each feature influences disease progression
  • the relationship between physiological indicators and future health
  • how well a simple linear model can fit real-world medical data

This project shows how to build, evaluate, and visualize a regression model end-to-end.


1. Load and Explore the Dataset

The Diabetes dataset contains:

  • 442 samples
  • 10 medical predictor features
  • 1 continuous target: a disease progression score measured one year after baseline

import pandas as pd
from sklearn.datasets import load_diabetes

# Load the dataset and place it in a DataFrame for easy inspection
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

df = pd.DataFrame(X, columns=diabetes.feature_names)
df["target"] = y
df.head()

Each row represents one patient’s measurements and progression outcome.


2. Train/Test Split

We reserve 20% of the data for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

3. Train a Linear Regression Model

Linear Regression finds the best-fit line (hyperplane) that minimizes prediction error.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
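
Under the hood, the fitted model is simply a weighted sum of the inputs plus an intercept. As a quick sanity check (not part of the original walkthrough), you can reproduce the predictions by hand:

import numpy as np

# y_hat = intercept + X @ coefficients, which is exactly what predict() computes
manual_pred = model.intercept_ + X_test @ model.coef_

# Matches model.predict(X_test) up to floating-point noise
print(np.allclose(manual_pred, y_pred))  # True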

4. Evaluate the Model

Two key regression metrics:

  • Mean Squared Error (MSE) → average squared difference between predictions & actual values
  • R² Score → how much variance in progression the model can explain

from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R² Score:", r2)

5. Visualize Predictions

A scatter plot helps compare:

  • real progression values
  • predicted values

import matplotlib.pyplot as plt

plt.figure(figsize=(7, 5))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.xlabel("Actual Progression")
plt.ylabel("Predicted Progression")
plt.title("Diabetes Progression: Actual vs Predicted")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "r--")
plt.show()

The closer the points lie to the dashed line, the better the model.
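
As an optional extra (not part of the original walkthrough), a residual plot often reveals patterns the scatter plot hides, such as the model consistently under- or over-predicting in a certain range:

# Residuals: actual minus predicted; ideally scattered randomly around zero
residuals = y_test - y_pred

plt.figure(figsize=(7, 5))
plt.scatter(y_pred, residuals, alpha=0.7)
plt.axhline(0, color="r", linestyle="--")
plt.xlabel("Predicted Progression")
plt.ylabel("Residual (Actual - Predicted)")
plt.title("Residuals vs Predicted")
plt.show()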


6. Inspect Feature Importance

The fitted coefficients show how strongly each medical attribute is associated with disease progression. Because scikit-learn's version of the dataset comes pre-standardized, the coefficient magnitudes are directly comparable:

coef_df = pd.DataFrame({
    "feature": diabetes.feature_names,
    "coefficient": model.coef_
}).sort_values(by="coefficient", ascending=False)

coef_df

This helps interpret the model and validate medical relevance.
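
One refinement worth noting: the sign of a coefficient indicates the direction of the association, while its absolute value indicates strength. If you want to rank features purely by strength, a small extra step (not in the original code) does it:

# Rank by magnitude; the sign only tells you the direction of the effect
coef_df["abs_coefficient"] = coef_df["coefficient"].abs()
coef_df.sort_values(by="abs_coefficient", ascending=False)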


Key Takeaways

  1. Linear Regression is a foundational ML tool for predicting continuous values, useful in healthcare, finance, forecasting, and more.
  2. R² Score shows how much variance the model explains; higher means a better fit.
  3. Visualization of predictions makes it easy to diagnose underfitting or overfitting.
  4. Coefficient analysis reveals which features have the strongest relationship with disease progression.
  5. The Diabetes dataset is ideal for learning regression because it is real-world, numeric, and interpretable.

Conclusion

This project demonstrates how Linear Regression can be used to analyze medical datasets and predict health outcomes.

With just a few steps, we built a complete regression workflow, from loading data to evaluating and interpreting model performance.

This foundation prepares you for more advanced regression techniques like Ridge, Lasso, ElasticNet, and tree-based models.
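
As a teaser for those next steps, swapping in a regularized model requires only one change to the workflow shown above (alpha=1.0 below is an arbitrary starting value, not a tuned choice):

from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Same fit/predict/evaluate workflow, different estimator
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

ridge_pred = ridge.predict(X_test)
print("Ridge R²:", r2_score(y_test, ridge_pred))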


Code Snippet:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


# 1. Load the dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

print("Features shape:", X.shape)
print("Target shape:", y.shape)

# Combine features and target into a single DataFrame for inspection
df = pd.DataFrame(X, columns=diabetes.feature_names)
df["target"] = y
print(df.head())


# 2. Train/test split (20% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


# 3. Train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)


# 4. Evaluate with MSE and R²
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R² Score:", r2)


# 5. Visualize actual vs predicted progression
plt.figure(figsize=(7, 5))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.xlabel("Actual Progression")
plt.ylabel("Predicted Progression")
plt.title("Diabetes Progression: Actual vs Predicted")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "r--")
plt.show()


# 6. Inspect the fitted coefficients
coef_df = pd.DataFrame({
    "feature": diabetes.feature_names,
    "coefficient": model.coef_
}).sort_values(by="coefficient", ascending=False)

print(coef_df)
