AW Dev Rethought

⚖️ There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. - C.A.R. Hoare

🧠 AI with Python – 📉 Predict Diabetes Progression


Description:

Predicting disease progression is a valuable use case in medical data science.

In this project, we use Linear Regression to estimate diabetes progression based on physiological measurements such as BMI, blood pressure, and serum test values.

This example uses the Diabetes dataset from scikit-learn, which is widely used for demonstrating regression models.


Understanding the Problem

Medical datasets often contain continuous numerical measurements.

Unlike classification tasks, here the goal is to predict a continuous outcome: diabetes progression one year after baseline.

Linear Regression helps us understand:

  • how each feature influences disease progression
  • the relationship between physiological indicators and future health
  • how well a simple linear model can fit real-world medical data

This project shows how to build, evaluate, and visualize a regression model end-to-end.


1. Load and Explore the Dataset

The Diabetes dataset contains:

  • 442 samples
  • 10 medical predictor features
  • 1 continuous target: a disease progression score measured one year after baseline

import pandas as pd
from sklearn.datasets import load_diabetes

# Load the dataset and place it in a DataFrame for easy inspection
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

df = pd.DataFrame(X, columns=diabetes.feature_names)
df["target"] = y
df.head()

Each row represents one patient’s measurements and progression outcome.


2. Train/Test Split

We reserve 20% of the data for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

3. Train a Linear Regression Model

Linear Regression finds the best-fit line (hyperplane) that minimizes prediction error.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
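
Under the hood, the fitted model is simply a weighted sum of the inputs plus an intercept. As a quick sanity check (not part of the original walkthrough), you can reproduce the predictions by hand:

import numpy as np

# y_hat = intercept + X @ coefficients, which is exactly what predict() computes
manual_pred = model.intercept_ + X_test @ model.coef_

# Matches model.predict(X_test) up to floating-point noise
print(np.allclose(manual_pred, y_pred))  # True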

4. Evaluate the Model

Two key regression metrics:

  • Mean Squared Error (MSE) → average squared difference between predictions & actual values
  • R² Score → how much variance in progression the model can explain

from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R² Score:", r2)

5. Visualize Predictions

A scatter plot helps compare:

  • real progression values
  • predicted values

import matplotlib.pyplot as plt

plt.figure(figsize=(7, 5))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.xlabel("Actual Progression")
plt.ylabel("Predicted Progression")
plt.title("Diabetes Progression: Actual vs Predicted")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "r--")
plt.show()

The closer the points lie to the dashed line, the better the model.
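
As an optional extra (not part of the original walkthrough), a residual plot often reveals patterns the scatter plot hides, such as the model consistently under- or over-predicting in a certain range:

# Residuals: actual minus predicted; ideally scattered randomly around zero
residuals = y_test - y_pred

plt.figure(figsize=(7, 5))
plt.scatter(y_pred, residuals, alpha=0.7)
plt.axhline(0, color="r", linestyle="--")
plt.xlabel("Predicted Progression")
plt.ylabel("Residual (Actual - Predicted)")
plt.title("Residuals vs Predicted")
plt.show()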


6. Inspect Feature Importance

The fitted coefficients show how strongly each medical attribute is associated with disease progression. Because scikit-learn's version of the dataset comes pre-standardized, the coefficient magnitudes are directly comparable:

coef_df = pd.DataFrame({
    "feature": diabetes.feature_names,
    "coefficient": model.coef_
}).sort_values(by="coefficient", ascending=False)

coef_df

This helps interpret the model and validate medical relevance.
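
One refinement worth noting: the sign of a coefficient indicates the direction of the association, while its absolute value indicates strength. If you want to rank features purely by strength, a small extra step (not in the original code) does it:

# Rank by magnitude; the sign only tells you the direction of the effect
coef_df["abs_coefficient"] = coef_df["coefficient"].abs()
coef_df.sort_values(by="abs_coefficient", ascending=False)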


Key Takeaways

  1. Linear Regression is a foundational ML tool for predicting continuous values, useful in healthcare, finance, forecasting, and more.
  2. R² Score shows how much variance the model explains; higher means a better fit.
  3. Visualization of predictions makes it easy to diagnose underfitting or overfitting.
  4. Coefficient analysis reveals which features have the strongest relationship with disease progression.
  5. The Diabetes dataset is ideal for learning regression because it is real-world, numeric, and interpretable.

Conclusion

This project demonstrates how Linear Regression can be used to analyze medical datasets and predict health outcomes.

With just a few steps, we built a complete regression workflow, from loading data to evaluating and interpreting model performance.

This foundation prepares you for more advanced regression techniques like Ridge, Lasso, ElasticNet, and tree-based models.
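
As a teaser for those next steps, swapping in a regularized model requires only one change to the workflow shown above (alpha=1.0 below is an arbitrary starting value, not a tuned choice):

from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Same fit/predict/evaluate workflow, different estimator
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

ridge_pred = ridge.predict(X_test)
print("Ridge R²:", r2_score(y_test, ridge_pred))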


Code Snippet:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


# 1. Load the dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

print("Features shape:", X.shape)
print("Target shape:", y.shape)

# Combine features and target into a single DataFrame for inspection
df = pd.DataFrame(X, columns=diabetes.feature_names)
df["target"] = y
print(df.head())


# 2. Train/test split (20% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


# 3. Train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)


# 4. Evaluate with MSE and R²
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R² Score:", r2)


# 5. Visualize actual vs predicted progression
plt.figure(figsize=(7, 5))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.xlabel("Actual Progression")
plt.ylabel("Predicted Progression")
plt.title("Diabetes Progression: Actual vs Predicted")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "r--")
plt.show()


# 6. Inspect the fitted coefficients
coef_df = pd.DataFrame({
    "feature": diabetes.feature_names,
    "coefficient": model.coef_
}).sort_values(by="coefficient", ascending=False)

print(coef_df)
