🧠 AI with Python – 🌲 RandomForestClassifier for Tabular Data


Description:

Introduction

Random Forest is one of the most widely used machine learning algorithms for tabular data.

It’s an ensemble of decision trees, combining their outputs to produce robust predictions.

In this post, we’ll train a RandomForestClassifier using scikit-learn on the Breast Cancer dataset, evaluate it with accuracy and classification metrics, and interpret which features were most important.


Loading the Dataset

We’ll use the Breast Cancer Wisconsin dataset built into scikit-learn.

It contains 30 numerical features describing tumor characteristics and a binary target (malignant or benign).

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
target_names = data.target_names
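
Before modeling, it can help to confirm the dataset's size and class balance. The snippet below is a minimal sketch that counts the samples per class with `np.bincount`; the variable name `class_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# Count samples per class; label 0 is "malignant" and label 1 is "benign".
class_counts = np.bincount(y)
for name, count in zip(data.target_names, class_counts):
    print(f"{name}: {count} samples ({count / len(y):.1%})")

print("Feature matrix shape:", X.shape)
```

The classes are moderately imbalanced (more benign than malignant cases), which is one reason to stratify the split in the next step.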

Train/Test Split

To evaluate model generalization, we split the dataset into training and testing sets.

We stratify on the target so that the training and test sets keep the same class proportions as the full dataset.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
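
One way to check the effect of `stratify=y` is to compare the class fractions in each subset. This is a small sketch; `full_frac`, `train_frac`, and `test_frac` are names chosen for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Labels are 0/1, so the mean is the fraction of class 1 (benign).
full_frac = y.mean()
train_frac = y_train.mean()
test_frac = y_test.mean()

print(f"Train size: {len(y_train)}, test size: {len(y_test)}")
print(f"Benign fraction: full={full_frac:.3f}, train={train_frac:.3f}, test={test_frac:.3f}")
```

The three fractions should be nearly identical; without stratification they can drift apart, especially on small datasets.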

Training the RandomForestClassifier

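A Random Forest trains many decision trees, each on a bootstrap sample of the training data and with a random subset of features considered at each split, then combines their predictions. In scikit-learn, the forest averages the trees' predicted class probabilities and predicts the class with the highest average.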

Key benefits:

  • Handles non-linear relationships.
  • Requires little preprocessing (no scaling needed).
  • Reduces overfitting compared to single trees.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    n_jobs=-1,
    random_state=42
)
rf.fit(X_train, y_train)
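
To make the "ensemble of trees" idea concrete, the sketch below averages the class probabilities of the individual trees by hand and compares the result with the forest's own `predict_proba`. It uses a smaller forest (`n_estimators=50`) only to keep the example fast; that value and the names `tree_probas` and `manual_proba` are assumptions for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=50, random_state=42)
rf.fit(X_train, y_train)

# Collect each tree's class probabilities: shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([tree.predict_proba(X_test) for tree in rf.estimators_])

# Average over the trees to get one probability vector per sample.
manual_proba = tree_probas.mean(axis=0)

# Compare with the probabilities reported by the forest itself.
forest_proba = rf.predict_proba(X_test)
matches = np.allclose(manual_proba, forest_proba)
print("Manual average matches rf.predict_proba:", matches)
```

Averaging many trees that each saw slightly different data is what smooths out the errors of any single tree.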

Evaluating the Model

We assess performance using accuracy and a classification report.

from sklearn.metrics import accuracy_score, classification_report

y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Report:\n", classification_report(y_test, y_pred, target_names=target_names))
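
If you want the per-class numbers as values rather than printed text, `classification_report` can return a dictionary via `output_dict=True`. The sketch below pulls out the recall for the malignant class, which matters in this setting because a missed malignant tumor is the costlier error. The variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target
target_names = data.target_names.tolist()  # plain Python strings

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Request the report as a nested dict keyed by class name.
report = classification_report(
    y_test, y_pred, target_names=target_names, output_dict=True
)

malignant_recall = report["malignant"]["recall"]
malignant_precision = report["malignant"]["precision"]
print(f"Malignant recall:    {malignant_recall:.3f}")
print(f"Malignant precision: {malignant_precision:.3f}")
print(f"Overall accuracy:    {report['accuracy']:.3f}")
```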

Confusion Matrix

A confusion matrix helps visualize where the model performs well and where it misclassifies.

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=target_names, cmap="Blues")
plt.show()
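
The plot is built from a plain 2×2 array of counts, which you can also inspect directly. In the sketch below, rows are true labels and columns are predicted labels, in label order (0 = malignant, 1 = benign); `missed_malignant` and `false_alarms` are names chosen for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# cm[i, j] = number of samples with true label i predicted as label j.
cm = confusion_matrix(y_test, y_pred)
print(cm)

missed_malignant = cm[0, 1]  # malignant tumors predicted as benign
false_alarms = cm[1, 0]      # benign tumors predicted as malignant
print("Malignant predicted as benign:", missed_malignant)
print("Benign predicted as malignant:", false_alarms)
```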

Feature Importance

Random Forests allow us to see which features drive predictions the most.

We’ll plot the top 10 most important features.

import pandas as pd

importances = rf.feature_importances_
imp_df = pd.DataFrame({"feature": feature_names, "importance": importances})
imp_top10 = imp_df.sort_values("importance", ascending=False).head(10)
imp_top10
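
The scores in `feature_importances_` are averages over all trees, so it can be useful to see how much they vary from tree to tree. The sketch below computes the standard deviation across `rf.estimators_`; it uses 100 trees to keep the run short, and the names `std` and `top_idx` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Forest-level importances (normalized to sum to 1).
importances = rf.feature_importances_

# Spread of each feature's importance across the individual trees.
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)

# Indices of the five most important features, highest first.
top_idx = np.argsort(importances)[::-1][:5]
for i in top_idx:
    print(f"{feature_names[i]:<25} {importances[i]:.3f} ± {std[i]:.3f}")
```

A large spread relative to the mean suggests the ranking between neighboring features is not very stable.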

Sample Output

Accuracy: ~97% (the fixed random_state makes runs reproducible, but the exact value can differ slightly across scikit-learn versions)

Top Features (example):

  • worst perimeter
  • mean concave points
  • worst radius
  • mean radius
  • mean perimeter

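These features account for the largest share of the model's impurity-based importance scores. Several of them (radius and perimeter, for example) are strongly correlated, so importance is split among them and the exact ranking should be read as a rough guide.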


Key Takeaways

  • Random Forests are robust, accurate, and easy to use.
  • They provide feature importance scores for interpretability.
  • They work well out of the box on most tabular datasets.
  • They are useful for both classification and regression tasks.

Code Snippet:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import pandas as pd

data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
target_names = data.target_names

print(f"Features shape: {X.shape} | Target shape: {y.shape}")
print("Classes:", list(target_names))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

rf = RandomForestClassifier(
    n_estimators=300,  # more trees than the default of 100, for more stable predictions
    max_depth=None,  # grow trees fully; tune if overfitting
    n_jobs=-1,  # use all CPU cores
    random_state=42
)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {acc:.4f}\n")
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=target_names))

ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=target_names, cmap="Blues"
)
plt.title("RandomForestClassifier — Confusion Matrix")
plt.tight_layout()
plt.show()

importances = rf.feature_importances_
imp_df = pd.DataFrame({"feature": feature_names, "importance": importances})
imp_top10 = imp_df.sort_values("importance", ascending=False).head(10)

plt.figure(figsize=(8, 5))
plt.barh(imp_top10["feature"], imp_top10["importance"])
plt.gca().invert_yaxis()
plt.xlabel("Importance")
plt.title("Top 10 Feature Importances — Random Forest")
plt.tight_layout()
plt.show()

print(imp_top10)
