AW Dev Rethought

⚖️ There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. - C.A.R. Hoare

⚡ Saturday ML Sparks – Random Forest Feature Importance 🌲🧠


Description:

In machine learning, not all features contribute equally to predictions. Some have a significant impact, while others add little to no value.

Random Forests, being ensembles of decision trees, allow us to quantify the importance of each feature — helping us understand which inputs truly drive model performance.

In this post, we’ll use the Wine dataset from scikit-learn to explore feature importance using a Random Forest Classifier.


About the Dataset

The Wine dataset consists of results from chemical analyses of different wine cultivars.

It includes 13 numerical features such as alcohol, ash, flavanoids, and color intensity, used to classify wines into three categories.

We begin by loading the dataset and splitting it into training and testing subsets.

import pandas as pd

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Training the Random Forest Model

A Random Forest Classifier combines multiple decision trees, each trained on random subsets of data and features.

This ensemble approach improves accuracy and reduces overfitting compared to a single tree.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

Once trained, the model can estimate how much each feature contributes to reducing impurity (or error) across trees.
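To see that ensemble benefit in practice, we can compare the forest against a single decision tree on the same held-out test set. A minimal sketch, reusing the split and model from the steps above (exact scores will vary slightly with the random seed):

from sklearn.tree import DecisionTreeClassifier

# Baseline: one decision tree vs. the 100-tree forest, scored on the test set
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print("Single tree accuracy:", tree.score(X_test, y_test))
print("Random forest accuracy:", model.score(X_test, y_test))

On a small, clean dataset like Wine both models tend to score well, but the forest is usually the more stable of the two across different splits.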


Extracting Feature Importances

Scikit-learn provides a convenient attribute, feature_importances_, which gives the relative importance of each input feature.

We’ll organize and sort these scores to visualize which features had the strongest influence.

importances = model.feature_importances_
df_importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)
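Impurity-based scores are convenient, but they can be biased toward features with many distinct values. As a cross-check, scikit-learn's permutation_importance measures how much the test-set score drops when each feature is shuffled. A minimal sketch, reusing the fitted model and test split from above (n_repeats=10 is an arbitrary choice here):

from sklearn.inspection import permutation_importance

# Shuffle each feature on the test set and record the average drop in score
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

perm_df = pd.DataFrame({
    'Feature': X.columns,
    'Permutation importance': perm.importances_mean
}).sort_values(by='Permutation importance', ascending=False)

print(perm_df.head())

If the two rankings broadly agree, that adds confidence in the impurity-based picture; large disagreements are worth investigating.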

Visualizing the Results

A horizontal bar plot helps us quickly identify the most influential features in the dataset.

Features with higher scores are those the model relies on most when making predictions.

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.barh(df_importances['Feature'], df_importances['Importance'], color='green')
plt.gca().invert_yaxis()
plt.title('Feature Importance (Random Forest)')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.tight_layout()
plt.show()

Sample Output

Feature              Importance
proline              0.16
flavanoids           0.14
color_intensity      0.12
hue                  0.09
alcohol              0.08
The results show that certain chemical properties (like proline and flavanoids) play a dominant role in differentiating wine types.


Key Takeaways

  • Feature importance offers a simple yet effective way to interpret Random Forest models.
  • Helps with dimensionality reduction by focusing on the most useful variables (see the sketch after this list).
  • Provides transparency and explainability, essential for trust in ML applications.
  • A crucial step before applying advanced optimization or model deployment.
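As an example of the dimensionality-reduction point above, scikit-learn's SelectFromModel can keep only the features whose importance clears a threshold. A minimal sketch, assuming the fitted model from earlier; the 'mean' threshold is just one common choice, not a recommendation:

from sklearn.feature_selection import SelectFromModel

# Keep features whose importance is above the mean importance
selector = SelectFromModel(model, threshold='mean', prefit=True)
X_train_reduced = selector.transform(X_train)

selected = X.columns[selector.get_support()]
print(f"Kept {len(selected)} of {X.shape[1]} features:", list(selected))

A model retrained on the reduced feature set is often nearly as accurate while being simpler to maintain and explain.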

Conclusion

Understanding feature importance isn’t just about numbers — it’s about insights.

By identifying which features influence predictions most, we can refine datasets, simplify models, and make data-driven decisions that align with business or scientific objectives.


Code Snippet:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


# Load dataset
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)


# Get feature importances
importances = model.feature_importances_
features = X.columns

# Combine into DataFrame and sort
df_importances = pd.DataFrame({'Feature': features, 'Importance': importances})
df_importances = df_importances.sort_values(by='Importance', ascending=False)
print(df_importances.head())


# Plot feature importances
plt.figure(figsize=(8, 5))
plt.barh(df_importances['Feature'], df_importances['Importance'], color='green')
plt.gca().invert_yaxis()
plt.title('Feature Importance (Random Forest)')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.tight_layout()
plt.show()
