🧠 AI with Python – 🫀 Heart Disease Prediction (UCI Dataset)
Posted on: January 13, 2026
Description:
Heart disease remains one of the leading causes of mortality worldwide.
Early detection and risk assessment play a crucial role in improving patient outcomes, and this is where machine learning can assist healthcare professionals.
In this project, we build a heart disease prediction model using the UCI Heart Disease dataset, demonstrating how supervised learning can be applied to real-world medical data.
Understanding the Problem
Medical datasets present unique challenges:
- features have different scales (age, cholesterol, blood pressure)
- false negatives are costly (missing a high-risk patient)
- interpretability is critical
- accuracy alone is not sufficient
The goal is not to replace clinicians, but to support decision-making with data-driven risk signals.
1. Loading the Heart Disease Dataset
We begin by loading a UCI-style heart disease dataset stored in CSV format.
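import pandas as pd
# Load the CSV into a DataFrame (assumes heart.csv sits in the working directory)
df = pd.read_csv("heart.csv")
# print() makes the preview visible outside a notebook as well
print(df.head())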
The dataset includes clinical features such as age, sex, chest pain type, cholesterol, maximum heart rate, and a binary target indicating disease presence.
2. Inspecting Class Balance and Features
Before modeling, it’s important to understand the dataset structure.
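# df.info() prints its summary directly and returns None, so it is not wrapped in print()
df.info()
# Count how many rows fall into each target class
print(df["target"].value_counts())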
This helps confirm that the dataset is suitable for binary classification and reveals whether class imbalance is present.
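Raw counts can be hard to compare at a glance, so it may help to look at class shares as well. The sketch below uses a small hand-made series as a stand-in for `df["target"]`; the values and the `imbalance_ratio` name are illustrative assumptions, not figures from the dataset.

```python
import pandas as pd

# Toy stand-in for df["target"]; the real labels come from heart.csv.
target = pd.Series([1, 0, 1, 1, 0, 1, 0, 1], name="target")

# normalize=True returns class shares instead of raw counts.
class_share = target.value_counts(normalize=True)
print(class_share)

# Ratio of the largest class to the smallest; values near 1 indicate balance.
imbalance_ratio = class_share.max() / class_share.min()
print(f"imbalance ratio: {imbalance_ratio:.2f}")

# Missing labels are worth checking at the same stage.
print("missing labels:", int(target.isna().sum()))
```

On the real data, the same calls would be applied to `df["target"]`, and `df.isna().sum()` would report missing values per column.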
3. Train/Test Split with Stratification
We separate features and labels while preserving class distribution.
from sklearn.model_selection import train_test_split
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,
    random_state=42
)
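Stratifying on y keeps the share of positive and negative cases roughly the same in the training and test sets, so the test metrics are not skewed by an unlucky split. This matters most when the dataset is small or imbalanced, as medical datasets often are.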
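A quick way to see what stratification does is to compare the positive rate before and after the split. This is a minimal sketch on synthetic data (200 rows, 30% positive); the `_demo` arrays are assumptions standing in for the real `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows, 60 positive and 140 negative labels.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([1] * 60 + [0] * 140)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,
    stratify=y_demo,
    random_state=42,
)

# With stratify=y_demo, all three positive rates should be nearly identical.
print("full :", y_demo.mean())
print("train:", y_tr.mean())
print("test :", y_te.mean())
```

Dropping `stratify=y_demo` and rerunning with a few different seeds shows how much the test-set positive rate can drift without it.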
4. Feature Scaling
Clinical features often exist on different numeric scales.
Scaling ensures that no single feature dominates model learning.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
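The scaler is fitted on the training data only and then applied to the test set, so no test-set statistics leak into training. Scaling matters for Logistic Regression because scikit-learn applies L2 regularization by default, which penalizes coefficients unevenly when features sit on very different scales, and because the solver tends to converge faster on standardized inputs.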
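To make the effect of standardization concrete, the sketch below scales two synthetic columns whose ranges loosely resemble age and cholesterol. The column meanings and the numbers used to generate them are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical columns: age in years, cholesterol in mg/dl.
X_train_demo = np.column_stack([rng.normal(54, 9, 100), rng.normal(246, 50, 100)])
X_test_demo = np.column_stack([rng.normal(54, 9, 40), rng.normal(246, 50, 40)])

scaler_demo = StandardScaler()
X_train_z = scaler_demo.fit_transform(X_train_demo)  # learns mean/std from train
X_test_z = scaler_demo.transform(X_test_demo)        # reuses the train statistics

# Training columns end up with mean 0 and standard deviation 1.
print("train mean:", X_train_z.mean(axis=0).round(3))
print("train std :", X_train_z.std(axis=0).round(3))

# Test columns are close to, but not exactly, mean 0, since they were
# scaled with the training statistics rather than their own.
print("test mean :", X_test_z.mean(axis=0).round(3))
```

After scaling, both columns are expressed in standard deviations from the training mean, so a one-unit change means the same thing for each feature.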
5. Training a Baseline Medical Model
We use Logistic Regression, a widely accepted baseline in medical ML due to its interpretability.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
Logistic Regression provides clear probability estimates that clinicians can reason about.
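One common way to read a fitted Logistic Regression is through odds ratios: exponentiating a coefficient gives the multiplicative change in the odds of the positive class for a one-unit increase in that feature, holding the others fixed. With standardized inputs, one unit is one standard deviation. The sketch below uses synthetic data from `make_classification` and placeholder feature names, both of which are assumptions rather than the article's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the clinical features and target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]  # placeholder names

X_z = StandardScaler().fit_transform(X_demo)
clf = LogisticRegression(max_iter=1000).fit(X_z, y_demo)

# exp(coefficient) = odds ratio per one standard deviation increase.
odds_ratios = pd.Series(np.exp(clf.coef_[0]), index=feature_names)
print(odds_ratios.sort_values(ascending=False))
```

An odds ratio above 1 points toward higher predicted risk and below 1 toward lower predicted risk; these describe associations in the training data, not causal effects.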
6. Evaluating Medical Predictions
Model evaluation focuses on metrics relevant to healthcare decision-making.
from sklearn.metrics import classification_report, confusion_matrix
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
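Because a false negative means missing a high-risk patient, recall for the positive class deserves more attention than overall accuracy. Precision complements it by showing how many of the flagged patients actually have the disease.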
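When recall matters most, one option is to lower the decision threshold below the 0.5 cutoff that `predict()` effectively applies for binary Logistic Regression. The sketch below illustrates the trade-off on hand-made labels and probabilities; the arrays, the 0.3 cutoff, and the `metrics_at` helper are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hand-made labels and predicted probabilities, for illustration only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
proba = np.array([0.90, 0.70, 0.45, 0.35, 0.60, 0.40, 0.32, 0.20, 0.10, 0.05])

def metrics_at(threshold):
    """Return (recall, precision) when proba >= threshold counts as positive."""
    y_hat = (proba >= threshold).astype(int)
    return recall_score(y_true, y_hat), precision_score(y_true, y_hat)

recall_default, precision_default = metrics_at(0.5)  # default-style cutoff
recall_low, precision_low = metrics_at(0.3)          # hypothetical lower cutoff

print(f"threshold 0.5: recall={recall_default:.2f}, precision={precision_default:.2f}")
print(f"threshold 0.3: recall={recall_low:.2f}, precision={precision_low:.2f}")
```

On this toy data, lowering the threshold raises recall from 0.50 to 1.00 while precision falls from 0.67 to 0.57. In practice a threshold would be chosen on validation data, not on the test set.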
7. ROC–AUC for Risk Separation
ROC–AUC measures how well the model separates patients with and without heart disease.
from sklearn.metrics import roc_auc_score
y_proba = model.predict_proba(X_test_scaled)[:, 1]
auc = roc_auc_score(y_test, y_proba)
print("ROC–AUC:", auc)
This metric is threshold-independent and widely used in healthcare ML studies.
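ROC–AUC can also be read as a ranking probability: the chance that a randomly chosen patient with the disease receives a higher score than a randomly chosen patient without it. The sketch below checks that interpretation against `roc_auc_score` on hand-made data; the labels and probabilities are illustrative assumptions.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Hand-made labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
proba = [0.90, 0.70, 0.45, 0.35, 0.60, 0.40, 0.32, 0.20, 0.10, 0.05]

pos = [p for p, t in zip(proba, y_true) if t == 1]
neg = [p for p, t in zip(proba, y_true) if t == 0]

# Count positive/negative pairs where the positive scores higher;
# a tie counts as half a win.
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(pos, neg)
)
auc_manual = wins / (len(pos) * len(neg))
auc_sklearn = roc_auc_score(y_true, proba)

print("pairwise AUC:", auc_manual)
print("sklearn AUC :", auc_sklearn)
```

Here 21 of the 24 positive/negative pairs are ranked correctly, so both values come out to 0.875.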
Key Takeaways
- Heart disease prediction is a high-impact medical ML application.
- Logistic Regression provides an interpretable and reliable baseline.
- Feature scaling matters for regularized linear models when clinical features sit on different scales.
- ROC–AUC and recall are more meaningful than accuracy in healthcare.
- ML models should assist, not replace, clinical judgment.
Conclusion
Heart disease prediction highlights how machine learning can deliver real-world value when applied responsibly.
By combining careful preprocessing, interpretable models, and appropriate evaluation metrics, ML systems can support early detection and informed medical decisions.
This project demonstrates a complete end-to-end healthcare ML workflow, making it a strong addition to the AI with Python – Real-World Mini Projects (Advanced) series.