AW Dev Rethought

Truth can only be found in one place: the code - Robert C. Martin

⚡️ Saturday ML Spark – 🧠 Feature Selection with Mutual Information


Description:

Programmer: python_scripts (Abhijith Warrier)

PYTHON SCRIPT TO SELECT IMPORTANT FEATURES USING MUTUAL INFORMATION FOR MACHINE LEARNING. ⚡️🧠📊

This script demonstrates how to use Mutual Information (MI) to measure the relationship between features and the target variable, helping us identify the most informative features for model training.


📦 Import Required Libraries

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

🧩 Load Dataset

data = load_breast_cancer()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

✂️ Split Data

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)

🚨 Baseline Model (All Features)

baseline_model = RandomForestClassifier(random_state=42)

baseline_model.fit(X_train, y_train)

baseline_pred = baseline_model.predict(X_test)

print("Baseline Accuracy:", accuracy_score(y_test, baseline_pred))

🧠 Compute Mutual Information Scores

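Mutual Information measures how much knowing a feature's value reduces uncertainty about the target. A score of 0 means the feature and the target are independent; larger scores indicate a stronger dependency, whether linear or not. The scores are computed on the training split only, so the test set plays no part in choosing features.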

mi_scores = mutual_info_classif(X_train, y_train, random_state=42)

mi_df = pd.DataFrame({
    "Feature": X.columns,
    "MI Score": mi_scores
})

mi_df = mi_df.sort_values(by="MI Score", ascending=False)

print(mi_df)
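To make the definition concrete, here is a minimal sketch that computes MI by hand for two small discrete variables and checks it against scikit-learn's mutual_info_score. The toy arrays and the helper name discrete_mutual_information are made up for illustration. Note that mutual_info_classif estimates MI for continuous features with a nearest-neighbour method, so this shows the quantity being estimated rather than what that function does internally.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Toy data (hypothetical): a binary feature that agrees with the target 6 times out of 8.
feature = np.array([0, 0, 0, 0, 1, 1, 1, 1])
target = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# A second toy feature that is statistically independent of `feature`.
unrelated = np.array([0, 1, 0, 1, 0, 1, 0, 1])


def discrete_mutual_information(x, y):
    """MI in nats: sum of p(a, b) * log(p(a, b) / (p(a) * p(b))) over observed pairs."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_ab = np.mean((x == a) & (y == b))  # joint probability
            if p_ab > 0:
                p_a = np.mean(x == a)  # marginal of x
                p_b = np.mean(y == b)  # marginal of y
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi


manual_mi = discrete_mutual_information(feature, target)
library_mi = mutual_info_score(feature, target)  # also reported in nats
independent_mi = discrete_mutual_information(feature, unrelated)

print("Manual MI:     ", manual_mi)
print("scikit-learn:  ", library_mi)
print("Independent MI:", independent_mi)
```

The manual and library values should agree, and the independent pair should score zero because its joint distribution equals the product of its marginals.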

📊 Visualise Feature Importance

plt.figure(figsize=(10, 6))

plt.barh(mi_df["Feature"], mi_df["MI Score"])

plt.xlabel("Mutual Information Score")
plt.title("Feature Importance using Mutual Information")

plt.gca().invert_yaxis()

plt.tight_layout()
plt.show()

🔍 Select Top Features

We keep only the 10 features with the highest MI scores, out of the 30 in the dataset.

top_features = mi_df["Feature"].head(10).tolist()

X_train_selected = X_train[top_features]
X_test_selected = X_test[top_features]
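The same top-10 selection can also be expressed with scikit-learn's SelectKBest transformer, which may be convenient when the selection step needs to live inside a pipeline. This is a sketch of an alternative, not the approach used in the script above; functools.partial is used here only to pin random_state on the scoring function.

```python
from functools import partial

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fix random_state so the MI estimates are reproducible between runs.
score_func = partial(mutual_info_classif, random_state=42)

# Fit the selector on the training split only, then reuse it for the test split.
selector = SelectKBest(score_func=score_func, k=10)
selector.fit(X_train, y_train)

kept_columns = X_train.columns[selector.get_support()].tolist()
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

print("Kept columns:", kept_columns)
print("Selected shapes:", X_train_selected.shape, X_test_selected.shape)
```

By default, transform returns NumPy arrays rather than DataFrames, so kept_columns is recovered separately from the boolean mask returned by get_support().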

🤖 Train Model with Selected Features

selected_model = RandomForestClassifier(random_state=42)

selected_model.fit(X_train_selected, y_train)

selected_pred = selected_model.predict(X_test_selected)

print("Selected Features Accuracy:", accuracy_score(y_test, selected_pred))
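A single train/test split can make two accuracies look more or less different than they really are. As a hedged sketch, the comparison can be repeated with 5-fold cross-validation, placing the selection step inside a Pipeline so that MI scores are recomputed on each training fold. The helper name cv_accuracy and the values of k tried here are illustrative choices, not part of the original script.

```python
from functools import partial

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True, as_frame=True)


def cv_accuracy(k):
    """Mean 5-fold accuracy of a random forest trained on the top-k MI features."""
    pipeline = Pipeline([
        # Selection is refit inside every fold, so held-out rows never influence it.
        ("select", SelectKBest(partial(mutual_info_classif, random_state=42), k=k)),
        ("model", RandomForestClassifier(random_state=42)),
    ])
    return cross_val_score(pipeline, X, y, cv=5, scoring="accuracy").mean()


# k="all" keeps every feature and acts as the baseline.
cv_results = {k: cv_accuracy(k) for k in (5, 10, "all")}

for k, score in cv_results.items():
    print(f"k={k}: mean CV accuracy = {score:.3f}")
```

If the scores for k=10 and k="all" are close, the smaller model is achieving similar accuracy with a third of the inputs.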

🔍 Why Feature Selection Matters

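Feature selection can help by:

  • removing noisy or irrelevant features
  • speeding up training
  • simplifying models
  • improving interpretability
  • lowering the risk of overfitting

Not all features contribute equally to prediction quality. Keep in mind that MI scores each feature on its own, so highly correlated features (such as the radius, perimeter, and area measurements in this dataset) can all rank near the top while carrying largely the same information.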
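As a rough check of the noise point, the sketch below appends a purely random column to the dataset and scores it alongside the real features. The column name random_noise and the seed are arbitrary choices made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Add a synthetic column drawn independently of the target.
rng = np.random.default_rng(0)
X_noisy = X.copy()
X_noisy["random_noise"] = rng.normal(size=len(X_noisy))

# Score every column, including the synthetic one.
scores = pd.Series(
    mutual_info_classif(X_noisy, y, random_state=42),
    index=X_noisy.columns,
).sort_values(ascending=False)

noise_score = scores["random_noise"]
best_score = scores.iloc[0]

print(scores.head(3))
print("random_noise:", round(noise_score, 4))
```

The random column should score close to zero, far below the strongest real features, which is why a top-k cut-off tends to discard it.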


🧠 Key Takeaways

  1. Mutual Information measures the dependency between each feature and the target, including non-linear relationships.
  2. Higher MI score → more informative feature.
  3. Feature selection can simplify models significantly.
  4. Removing weak features may improve generalisation.
  5. MI-based selection is a practical technique for tabular machine learning.
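The first takeaway can be illustrated with a synthetic example in which the target depends on a feature in a non-linear way. In this sketch (all data is generated, not taken from the breast cancer dataset), the label is 1 when the feature is far from zero in either direction, so a linear measure such as Pearson correlation sees almost nothing while MI picks up the dependency.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic feature, symmetric around zero.
x = rng.uniform(-1, 1, size=1000)

# Label depends only on the magnitude of x, not on its sign.
labels = (np.abs(x) > 0.5).astype(int)

# Linear correlation between the feature and the label.
pearson_r = np.corrcoef(x, labels)[0, 1]

# MI estimate for the same pair (features must be passed as a 2-D array).
mi = mutual_info_classif(x.reshape(-1, 1), labels, random_state=42)[0]

print("Pearson correlation:", round(pearson_r, 3))
print("Mutual information: ", round(mi, 3))
```

Because the label is fully determined by the feature here, the MI estimate should be well above zero even though the correlation is close to zero.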

Conclusion

Feature selection is an important step in building efficient and reliable machine learning systems. By using Mutual Information, we can identify the most useful features and focus the model on the strongest predictive signals.

This continues the Feature Engineering track in Saturday ML Spark ⚡️, helping you move beyond raw data into smarter feature design and selection.


Code Snippet:

# 📦 Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


# 🧩 Load Dataset
data = load_breast_cancer()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target


# ✂️ Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)


# =========================================================
# 🚨 Baseline Model – All Features
# =========================================================

baseline_model = RandomForestClassifier(random_state=42)
baseline_model.fit(X_train, y_train)

baseline_pred = baseline_model.predict(X_test)

print("Baseline Accuracy:", accuracy_score(y_test, baseline_pred))


# =========================================================
# 🧠 Compute Mutual Information Scores
# =========================================================

mi_scores = mutual_info_classif(
    X_train,
    y_train,
    random_state=42
)

mi_df = pd.DataFrame({
    "Feature": X.columns,
    "MI Score": mi_scores
}).sort_values(by="MI Score", ascending=False)

print("\nMutual Information Scores:")
print(mi_df)


# =========================================================
# 📊 Visualise Feature Importance
# =========================================================

plt.figure(figsize=(10, 6))

plt.barh(
    mi_df["Feature"],
    mi_df["MI Score"]
)

plt.xlabel("Mutual Information Score")
plt.title("Feature Importance using Mutual Information")
plt.gca().invert_yaxis()

plt.tight_layout()
plt.show()


# =========================================================
# 🔍 Select Top Features
# =========================================================

top_features = mi_df["Feature"].head(10).tolist()

X_train_selected = X_train[top_features]
X_test_selected = X_test[top_features]

print("\nTop Selected Features:")
print(top_features)


# =========================================================
# 🤖 Train Model with Selected Features
# =========================================================

selected_model = RandomForestClassifier(random_state=42)
selected_model.fit(X_train_selected, y_train)

selected_pred = selected_model.predict(X_test_selected)

print("\nSelected Features Accuracy:", accuracy_score(y_test, selected_pred))
