🧠 AI with Python – 🔎 Feature Selection using SelectKBest
Posted On: October 2, 2025
Introduction
When working with machine learning models, not all features in your dataset are equally useful.
Some features may be redundant, irrelevant, or even harmful to model performance.
Feature selection helps reduce the dataset to only the most informative features. This improves:
- Model interpretability.
- Training speed.
- Generalization (reducing overfitting).
In this post, we’ll use scikit-learn’s SelectKBest with the ANOVA F-test to keep the top 10 features from the Breast Cancer Wisconsin dataset, and train a Logistic Regression classifier on them.
Loading the Dataset
We’ll use the Breast Cancer dataset, which has 30 numerical features describing tumor characteristics.
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
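Before going further, an optional quick look at the shapes and class labels confirms what we loaded:
# 569 samples, 30 numeric features, two classes
print(X.shape)                   # (569, 30)
print(list(data.target_names))   # ['malignant', 'benign']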
Splitting the Data
We split into training and testing sets while preserving class balance.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
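As an optional check that stratify=y really preserved the class balance, you can compare the class proportions in the two splits:
import numpy as np
# Proportions of the two classes should be nearly identical in both splits
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))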
Applying SelectKBest
SelectKBest ranks each feature using a statistical test (here, ANOVA F-test) and keeps only the top-k.
We’ll select the 10 best features.
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(X_train, y_train)
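After fitting, the transformed matrix contains only the 10 highest-scoring columns. A quick shape check makes the reduction visible:
# The feature count drops from 30 to k=10; the row count is unchanged
print(X_train.shape)  # e.g. (426, 30)
print(X_new.shape)    # e.g. (426, 10)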
Building a Pipeline
To avoid data leakage, feature selection should be part of a Pipeline.
That way, the transformation is learned on the training set and applied consistently to the test set.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=200, random_state=42))
])
pipe.fit(X_train, y_train)
Evaluating the Model
Once trained, we evaluate with accuracy and a classification report.
from sklearn.metrics import accuracy_score, classification_report
y_pred = pipe.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Report:\n", classification_report(y_test, y_pred, target_names=data.target_names))
Inspecting the Selected Features
We can also check which features were chosen and their F-scores.
import pandas as pd
selector = pipe.named_steps["select"]
mask = selector.get_support()
scores = selector.scores_
report_df = pd.DataFrame({
    "feature": feature_names,
    "f_score": scores,
    "selected": mask
}).sort_values("f_score", ascending=False)
report_df.head(10)
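If you only want the names of the surviving features rather than the full table, the boolean mask can index feature_names directly (selection preserves the original column order):
# Names of the 10 features kept by SelectKBest
selected_names = feature_names[mask]
print(list(selected_names))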
Visualizing Feature Importance
To better understand the results, we plot the F-scores of the top 10 selected features.
import matplotlib.pyplot as plt
top_features = report_df[report_df["selected"]]
top_features.plot(kind="barh", x="feature", y="f_score", legend=False, figsize=(8, 5))
plt.show()
Key Takeaways
- SelectKBest allows us to keep only the top-k features, improving model simplicity and reducing overfitting.
- ANOVA F-test is suitable for classification tasks with numerical features.
- Always use feature selection inside a Pipeline to prevent leakage.
- Try different values of k and compare performance with cross-validation (see the sketch below).
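As a sketch of that last point: because the selection step lives inside the pipeline, its k parameter can be tuned with GridSearchCV. The grid of k values below is just an illustrative choice:
from sklearn.model_selection import GridSearchCV
# Tune the number of selected features via 5-fold cross-validation
param_grid = {"select__k": [5, 10, 15, 20, 30]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print("Best k:", search.best_params_["select__k"])
print("Best CV accuracy:", round(search.best_score_, 4))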
Code Snippet:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = np.array(data.feature_names) # numpy array for easy indexing
print(f"X shape: {X.shape} | y shape: {y.shape}")
print("Classes:", list(data.target_names))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
k = 10 # choose how many features to keep
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),                         # scale numeric features (helps LR)
    ("select", SelectKBest(score_func=f_classif, k=k)),   # select top-k by ANOVA F
    ("clf", LogisticRegression(max_iter=200, random_state=42))
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {acc:.4f}\n")
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=data.target_names))
# Access the fitted SelectKBest step
selector = pipe.named_steps["select"]
# Boolean mask of selected features in ORIGINAL order
mask = selector.get_support()
# Scores for all features (align with feature_names)
scores = selector.scores_
# Build a table of (feature, score, selected) and sort by score
report_df = pd.DataFrame({
    "feature": feature_names,
    "f_score": scores,
    "selected": mask
}).sort_values("f_score", ascending=False)
print("Top features by ANOVA F-score:")
print(report_df.head(15))
topk_df = report_df[report_df["selected"]].copy()
plt.figure(figsize=(9, 5))
plt.barh(topk_df["feature"], topk_df["f_score"])
plt.gca().invert_yaxis()
plt.xlabel("ANOVA F-score")
plt.title(f"Top {k} Features Selected by SelectKBest (f_classif)")
plt.tight_layout()
plt.show()