⚡️ Saturday ML Spark – 🧠 Feature Selection with Mutual Information
Posted on: May 16, 2026
Description:
Programmer: python_scripts (Abhijith Warrier)
PYTHON SCRIPT TO SELECT IMPORTANT FEATURES USING MUTUAL INFORMATION FOR MACHINE LEARNING. ⚡️🧠📊
This script demonstrates how to use Mutual Information (MI) to measure the relationship between features and the target variable, helping us identify the most informative features for model training.
📦 Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
🧩 Load Dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
✂️ Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)
🚨 Baseline Model (All Features)
baseline_model = RandomForestClassifier(random_state=42)
baseline_model.fit(X_train, y_train)
baseline_pred = baseline_model.predict(X_test)
print("Baseline Accuracy:", accuracy_score(y_test, baseline_pred))
🧠 Compute Mutual Information Scores
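Mutual Information measures how much knowing a feature's value reduces uncertainty about the target. Scores are non-negative, a score of 0 means the feature and the target are independent, and unlike linear correlation, MI also picks up non-linear relationships. For continuous features, mutual_info_classif estimates MI with a nearest-neighbour method, so the scores are estimates; random_state is fixed here to make them reproducible.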
mi_scores = mutual_info_classif(X_train, y_train, random_state=42)
mi_df = pd.DataFrame({
    "Feature": X.columns,
    "MI Score": mi_scores
})
mi_df = mi_df.sort_values(by="MI Score", ascending=False)
print(mi_df)
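To illustrate what the MI score captures, here is a minimal sketch on synthetic data (not part of the original script). The variable names x_signal, x_noise and X_demo are made up for this example. The label depends on x_signal only through its absolute value, so the linear correlation is close to zero while the MI estimate is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
n = 2000
x_signal = rng.uniform(-1, 1, n)   # determines the label, but non-linearly
x_noise = rng.uniform(-1, 1, n)    # unrelated to the label

# Label is 1 when x_signal is far from zero (a symmetric, non-linear rule)
y_demo = (np.abs(x_signal) > 0.5).astype(int)

X_demo = np.column_stack([x_signal, x_noise])

# MI estimate per feature (in nats)
mi_demo = mutual_info_classif(X_demo, y_demo, random_state=0)

# Absolute Pearson correlation per feature, for comparison
corr_demo = [abs(np.corrcoef(X_demo[:, j], y_demo)[0, 1]) for j in range(2)]

for name, mi, corr in zip(["x_signal", "x_noise"], mi_demo, corr_demo):
    print(f"{name}: MI={mi:.3f}, |corr|={corr:.3f}")
```

A correlation-based filter would likely discard x_signal here, whereas an MI-based filter keeps it.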
📊 Visualise Feature Importance
plt.figure(figsize=(10, 6))
plt.barh(mi_df["Feature"], mi_df["MI Score"])
plt.xlabel("Mutual Information Score")
plt.title("Feature Importance using Mutual Information")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
🔍 Select Top Features
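We keep the 10 features with the highest MI scores. The cut-off of 10 is arbitrary here; in practice, treat the number of features as a hyperparameter to tune.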
top_features = mi_df["Feature"].head(10).tolist()
X_train_selected = X_train[top_features]
X_test_selected = X_test[top_features]
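As an alternative to slicing the DataFrame by hand, the same top-k selection can be sketched with scikit-learn's SelectKBest. This is one possible variant, not the script's own approach. Wrapping mutual_info_classif in functools.partial to pin random_state is a choice made for this sketch, and the names selector, selected_names, X_train_sel and X_test_sel are illustrative.

```python
from functools import partial

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

# Same dataset and split settings as the script above
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Pin random_state so the MI estimates are reproducible
score_func = partial(mutual_info_classif, random_state=42)

# Fit the selector on the training split only, then reuse it on the test split
selector = SelectKBest(score_func=score_func, k=10)
selector.fit(X_train, y_train)

# Boolean mask -> names of the retained columns
selected_names = X_train.columns[selector.get_support()].tolist()

X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

print("Selected features:", selected_names)
print("Shapes:", X_train_sel.shape, X_test_sel.shape)
```

The advantage of a fitted selector object is that it can be placed inside a Pipeline and applied consistently to new data.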
🤖 Train Model with Selected Features
selected_model = RandomForestClassifier(random_state=42)
selected_model.fit(X_train_selected, y_train)
selected_pred = selected_model.predict(X_test_selected)
print("Selected Features Accuracy:", accuracy_score(y_test, selected_pred))
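A single train/test split gives only one accuracy figure per model, so small differences between the two can be down to chance. The sketch below is an optional extension, not part of the original script. It compares both setups with 5-fold cross-validation and puts the selection step inside a Pipeline so that MI scores are recomputed on each training fold. The fold count and the variable names are assumptions for illustration.

```python
from functools import partial

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

# Shared folds so both models are evaluated on identical splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Model trained on all 30 features
baseline = RandomForestClassifier(random_state=42)

# Selection + model in one estimator: the selector is re-fit inside every fold
selected = make_pipeline(
    SelectKBest(partial(mutual_info_classif, random_state=42), k=10),
    RandomForestClassifier(random_state=42),
)

baseline_scores = cross_val_score(baseline, X, y, cv=cv, scoring="accuracy")
selected_scores = cross_val_score(selected, X, y, cv=cv, scoring="accuracy")

print(f"All features : {baseline_scores.mean():.3f} +/- {baseline_scores.std():.3f}")
print(f"Top 10 by MI : {selected_scores.mean():.3f} +/- {selected_scores.std():.3f}")
```

If the two means are within about one standard deviation of each other, the fair reading is that the reduced feature set performs comparably, not that it is better or worse.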
🔍 Why Feature Selection Matters
Feature selection can help by:
- reducing noise
- improving training speed
- simplifying models
- improving interpretability
- reducing overfitting
Not all features contribute equally to prediction quality.
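To make the "reducing noise" point concrete, here is a small sketch that appends purely random columns to the dataset and checks where they land in the MI ranking. The noise columns (noise_0 to noise_4) and the variable names are made up for this illustration and do not appear in the original script.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Five columns of Gaussian noise that carry no information about the target
rng = np.random.default_rng(42)
noise = pd.DataFrame(
    rng.normal(size=(len(X), 5)),
    columns=[f"noise_{i}" for i in range(5)],
    index=X.index,
)
X_noisy = pd.concat([X, noise], axis=1)

# Score real and noise features together and rank them
scores = pd.Series(
    mutual_info_classif(X_noisy, y, random_state=42),
    index=X_noisy.columns,
).sort_values(ascending=False)

top10 = scores.head(10).index.tolist()

print("Top 10 features:", top10)
print("\nScores of the noise columns:")
print(scores[noise.columns])
```

The noise columns should score close to zero and fall outside the top 10, which is what lets an MI-based filter drop them before training.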
🧠 Key Takeaways
- Mutual Information measures the dependency between each feature and the target, one feature at a time, so it does not account for redundancy between features.
- Higher MI score → more informative feature.
- Feature selection can simplify models significantly.
- Removing weak features may improve generalisation.
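- MI scoring is independent of the downstream model and captures non-linear relationships, which makes it a practical filter method for tabular machine learning.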
Conclusion
Feature selection is an important step in building efficient and reliable machine learning systems. By using Mutual Information, we can identify the most useful features and focus the model on the strongest predictive signals.
This continues the Feature Engineering track in Saturday ML Spark ⚡️, helping you move beyond raw data into smarter feature design and selection.
Code Snippet:
# 📦 Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# 🧩 Load Dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# ✂️ Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)
# =========================================================
# 🚨 Baseline Model – All Features
# =========================================================
baseline_model = RandomForestClassifier(random_state=42)
baseline_model.fit(X_train, y_train)
baseline_pred = baseline_model.predict(X_test)
print("Baseline Accuracy:", accuracy_score(y_test, baseline_pred))
# =========================================================
# 🧠 Compute Mutual Information Scores
# =========================================================
mi_scores = mutual_info_classif(
    X_train,
    y_train,
    random_state=42
)
mi_df = pd.DataFrame({
    "Feature": X.columns,
    "MI Score": mi_scores
}).sort_values(by="MI Score", ascending=False)
print("\nMutual Information Scores:")
print(mi_df)
# =========================================================
# 📊 Visualise Feature Importance
# =========================================================
plt.figure(figsize=(10, 6))
plt.barh(
    mi_df["Feature"],
    mi_df["MI Score"]
)
plt.xlabel("Mutual Information Score")
plt.title("Feature Importance using Mutual Information")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
# =========================================================
# 🔍 Select Top Features
# =========================================================
top_features = mi_df["Feature"].head(10).tolist()
X_train_selected = X_train[top_features]
X_test_selected = X_test[top_features]
print("\nTop Selected Features:")
print(top_features)
# =========================================================
# 🤖 Train Model with Selected Features
# =========================================================
selected_model = RandomForestClassifier(random_state=42)
selected_model.fit(X_train_selected, y_train)
selected_pred = selected_model.predict(X_test_selected)
print("\nSelected Features Accuracy:", accuracy_score(y_test, selected_pred))