🧠 AI with Python – 🔎 Feature Selection using SelectKBest
Posted On: October 2, 2025
Introduction
When working with machine learning models, not all features in your dataset are equally useful.
Some features may be redundant, irrelevant, or even harmful to model performance.
Feature selection helps reduce the dataset to only the most informative features. This improves:
- Model interpretability.
- Training speed.
- Generalization (reducing overfitting).
In this post, we’ll use scikit-learn’s SelectKBest with the ANOVA F-test to keep the top 10 features from the Breast Cancer Wisconsin dataset, and train a Logistic Regression classifier on them.
Loading the Dataset
We’ll use the Breast Cancer dataset, which has 30 numerical features describing tumor characteristics.
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
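Before going further, an optional quick look at the shapes and class labels confirms what we loaded:
# 569 samples, 30 numeric features, two classes
print(X.shape)                   # (569, 30)
print(list(data.target_names))   # ['malignant', 'benign']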
Splitting the Data
We split into training and testing sets while preserving class balance.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
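As an optional check that stratify=y really preserved the class balance, you can compare the class proportions in the two splits:
import numpy as np
# Proportions of the two classes should be nearly identical in both splits
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))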
Applying SelectKBest
SelectKBest ranks each feature using a statistical test (here, ANOVA F-test) and keeps only the top-k.
We’ll select the 10 best features.
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(X_train, y_train)
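After fitting, the transformed matrix contains only the 10 highest-scoring columns. A quick shape check makes the reduction visible:
# The feature count drops from 30 to k=10; the row count is unchanged
print(X_train.shape)  # e.g. (426, 30)
print(X_new.shape)    # e.g. (426, 10)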
Building a Pipeline
To avoid data leakage, feature selection should be part of a Pipeline.
That way, the transformation is learned on the training set and applied consistently to the test set.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=200, random_state=42))
])
pipe.fit(X_train, y_train)
Evaluating the Model
Once trained, we evaluate with accuracy and a classification report.
from sklearn.metrics import accuracy_score, classification_report
y_pred = pipe.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Report:\n", classification_report(y_test, y_pred, target_names=data.target_names))
Inspecting the Selected Features
We can also check which features were chosen and their F-scores.
import pandas as pd
selector = pipe.named_steps["select"]
mask = selector.get_support()
scores = selector.scores_
report_df = pd.DataFrame({
    "feature": feature_names,
    "f_score": scores,
    "selected": mask
}).sort_values("f_score", ascending=False)
report_df.head(10)
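If you only want the names of the surviving features rather than the full table, the boolean mask can index feature_names directly (selection preserves the original column order):
# Names of the 10 features kept by SelectKBest
selected_names = feature_names[mask]
print(list(selected_names))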
Visualizing Feature Importance
To better understand the results, we plot the F-scores of the top 10 selected features.
import matplotlib.pyplot as plt
top_features = report_df[report_df["selected"]]
top_features.plot(kind="barh", x="feature", y="f_score", legend=False, figsize=(8, 5))
plt.show()
Key Takeaways
- SelectKBest allows us to keep only the top-k features, improving model simplicity and reducing overfitting.
- ANOVA F-test is suitable for classification tasks with numerical features.
- Always use feature selection inside a Pipeline to prevent leakage.
- Try different values of k and compare performance with cross-validation (see the sketch below).
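As a sketch of that last point: because the selection step lives inside the pipeline, its k parameter can be tuned with GridSearchCV. The grid of k values below is just an illustrative choice:
from sklearn.model_selection import GridSearchCV
# Tune the number of selected features via 5-fold cross-validation
param_grid = {"select__k": [5, 10, 15, 20, 30]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print("Best k:", search.best_params_["select__k"])
print("Best CV accuracy:", round(search.best_score_, 4))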
Code Snippet:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = np.array(data.feature_names) # numpy array for easy indexing
print(f"X shape: {X.shape} | y shape: {y.shape}")
print("Classes:", list(data.target_names))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
k = 10 # choose how many features to keep
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),                         # scale numeric features (helps LR)
    ("select", SelectKBest(score_func=f_classif, k=k)),   # select top-k by ANOVA F
    ("clf", LogisticRegression(max_iter=200, random_state=42))
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {acc:.4f}\n")
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=data.target_names))
# Access the fitted SelectKBest step
selector = pipe.named_steps["select"]
# Boolean mask of selected features in ORIGINAL order
mask = selector.get_support()
# Scores for all features (align with feature_names)
scores = selector.scores_
# Build a table of (feature, score, selected) and sort by score
report_df = pd.DataFrame({
    "feature": feature_names,
    "f_score": scores,
    "selected": mask
}).sort_values("f_score", ascending=False)
print("Top features by ANOVA F-score:")
print(report_df.head(15))
topk_df = report_df[report_df["selected"]].copy()
plt.figure(figsize=(9, 5))
plt.barh(topk_df["feature"], topk_df["f_score"])
plt.gca().invert_yaxis()
plt.xlabel("ANOVA F-score")
plt.title(f"Top {k} Features Selected by SelectKBest (f_classif)")
plt.tight_layout()
plt.show()