⚡ Saturday ML Sparks – Random Forest Feature Importance 🌲🧠
Posted on: November 8, 2025
Description:
In machine learning, not all features contribute equally to predictions. Some have a significant impact, while others add little to no value.
Random Forests, being ensembles of decision trees, allow us to quantify the importance of each feature — helping us understand which inputs truly drive model performance.
In this post, we’ll use the Wine dataset from scikit-learn to explore feature importance using a Random Forest Classifier.
About the Dataset
The Wine dataset consists of results from chemical analyses of different wine cultivars.
It includes 13 numerical features such as alcohol, ash, flavanoids, and color intensity, used to classify wines into three categories.
We begin by loading the dataset and splitting it into training and testing subsets.
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load the Wine dataset into a DataFrame with named feature columns
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
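Before training, a quick check (added here for illustration, assuming the standard scikit-learn Wine dataset) confirms the 13 features and three classes mentioned above:
# Sanity check: 178 samples, 13 features, 3 target classes
print(X.shape)
print(data.target_names)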
Training the Random Forest Model
A Random Forest Classifier combines many decision trees, each trained on a bootstrap sample of the data and considering a random subset of features at each split.
This ensemble approach typically improves accuracy and reduces overfitting compared to a single decision tree.
from sklearn.ensemble import RandomForestClassifier

# Train a forest of 100 trees with a fixed seed for reproducibility
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
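Before moving on, we can put the "better than a single tree" claim to a quick test. The sketch below (not part of the original walkthrough) trains a lone DecisionTreeClassifier on the same split and compares held-out accuracy; exact numbers will depend on the split and seed.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Single-tree baseline versus the forest on the held-out test set
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
print('Single tree accuracy:', accuracy_score(y_test, tree.predict(X_test)))
print('Random forest accuracy:', accuracy_score(y_test, model.predict(X_test)))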
Once trained, the model can estimate how much each feature contributes to reducing impurity (or error) across trees.
Extracting Feature Importances
Scikit-learn provides a convenient attribute, feature_importances_, which gives the relative importance of each input feature.
We’ll organize and sort these scores to visualize which features had the strongest influence.
# Relative importance of each feature (impurity-based)
importances = model.feature_importances_
df_importances = pd.DataFrame({
'Feature': X.columns,
'Importance': importances
}).sort_values(by='Importance', ascending=False)
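These impurity-based scores are normalized, so they sum to 1. A quick preview (an optional check added here) shows the total and the leading features:
# Importances sum to 1; preview the top-ranked features
print(df_importances['Importance'].sum())
print(df_importances.head())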
Visualizing the Results
A horizontal bar plot helps us quickly identify the most influential features in the dataset.
Features with higher scores are those the model relies on most when making predictions.
import matplotlib.pyplot as plt

# Horizontal bar chart, most important feature at the top
plt.figure(figsize=(8, 5))
plt.barh(df_importances['Feature'], df_importances['Importance'], color='green')
plt.gca().invert_yaxis()
plt.title('Feature Importance (Random Forest)')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.tight_layout()
plt.show()
Sample Output (top five features)
| Feature | Importance |
|---|---|
| proline | 0.16 |
| flavanoids | 0.14 |
| color_intensity | 0.12 |
| hue | 0.09 |
| alcohol | 0.08 |
The results show that certain chemical properties (like proline and flavanoids) play a dominant role in differentiating wine types.
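As an optional cross-check (not part of the original walkthrough), scikit-learn's permutation_importance can be run on the held-out test set; features that rank highly under both methods are more likely to be genuinely informative.
from sklearn.inspection import permutation_importance

# Permutation importance on the test set as a complementary view
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
perm_df = pd.DataFrame({'Feature': X.columns, 'Importance': perm.importances_mean})
print(perm_df.sort_values(by='Importance', ascending=False).head())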
Key Takeaways
- Feature importance offers a simple yet effective way to interpret Random Forest models.
- Helps with dimensionality reduction by focusing on the most useful variables (see the sketch after this list).
- Provides transparency and explainability, essential for trust in ML applications.
- A crucial step before applying advanced optimization or model deployment.
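As a rough sketch of the dimensionality-reduction point above (the 0.05 threshold is an arbitrary choice for illustration), SelectFromModel can keep only the features the trained forest considers important:
from sklearn.feature_selection import SelectFromModel

# Keep only features whose importance exceeds the illustrative threshold
selector = SelectFromModel(model, threshold=0.05, prefit=True)
X_train_reduced = selector.transform(X_train)
print('Selected features:', list(X.columns[selector.get_support()]))
print('Reduced training shape:', X_train_reduced.shape)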
Conclusion
Understanding feature importance isn’t just about numbers — it’s about insights.
By identifying which features influence predictions most, we can refine datasets, simplify models, and make data-driven decisions that align with business or scientific objectives.
Code Snippet:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Load dataset
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Get feature importances
importances = model.feature_importances_
features = X.columns
# Combine into DataFrame and sort
df_importances = pd.DataFrame({'Feature': features, 'Importance': importances})
df_importances = df_importances.sort_values(by='Importance', ascending=False)
print(df_importances.head())
# Plot feature importances
plt.figure(figsize=(8, 5))
plt.barh(df_importances['Feature'], df_importances['Importance'], color='green')
plt.gca().invert_yaxis()
plt.title('Feature Importance (Random Forest)')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.tight_layout()
plt.show()