🧠 AI with Python – 📉 Customer Churn Prediction (RF + SMOTE)


Description:

Customer churn is one of the most impactful problems businesses face today. Retaining an existing customer is often far cheaper than acquiring a new one, which makes early churn prediction a critical business capability.

In this project, we build a customer churn prediction model using Random Forest, while addressing a common real-world challenge: class imbalance. To handle this, we use SMOTE (Synthetic Minority Over-sampling Technique) so the model can learn churn patterns effectively.


Understanding the Problem

In churn datasets, customers who do not churn usually far outnumber those who do. This imbalance creates two major issues:

  • Models become biased toward the majority (non-churn) class
  • Accuracy appears high even when churned customers are poorly detected (the baseline sketch below makes this concrete)

To solve this, we must:

  1. Balance the dataset
  2. Use evaluation metrics that focus on minority class performance
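
To see why accuracy alone is misleading, here is a tiny self-contained sketch. With the same 75/25 split we simulate below, a "model" that never predicts churn still scores 75% accuracy while missing every single churner:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 75 + [1] * 25)   # 75% stay, 25% churn
y_pred = np.zeros_like(y_true)           # a "model" that never predicts churn

print(accuracy_score(y_true, y_pred))    # 0.75 -- looks respectable
print(recall_score(y_true, y_pred))      # 0.0  -- every churner is missed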

1. Preparing a Churn-Like Dataset

For demonstration, we simulate a churn-style dataset with imbalance.

from sklearn.datasets import make_classification
import pandas as pd

X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,        # features that actually drive the label
    n_redundant=2,          # linear combinations of informative features
    weights=[0.75, 0.25],   # ~75% non-churn vs ~25% churn
    random_state=42
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
df["churn"] = y

This structure closely mirrors real churn datasets where churned users form a minority.
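
A quick look at the label distribution confirms the imbalance (make_classification treats the weights as targets, so exact counts can vary slightly around 75/25):

print(df["churn"].value_counts(normalize=True))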


2. Train/Test Split with Stratification

We preserve class proportions while splitting the data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("churn", axis=1),
    df["churn"],
    test_size=0.3,
    stratify=df["churn"],
    random_state=42
)

Stratification prevents accidental class skew in the test set.
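
A one-line sanity check confirms the proportions match:

# Both splits should show the same ~25% churn rate
print(f"train churn rate: {y_train.mean():.3f}")
print(f"test churn rate:  {y_test.mean():.3f}")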


3. Handling Class Imbalance with SMOTE

SMOTE generates synthetic minority-class examples by interpolating between existing churn samples and their nearest neighbors, rather than simply duplicating rows.

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

This helps the model learn decision boundaries for churned customers. Note that SMOTE is applied only to the training split; the test set keeps its natural distribution so evaluation stays honest.
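
A quick before/after count confirms the training set is now balanced:

from collections import Counter

print("before:", Counter(y_train))
print("after: ", Counter(y_resampled))  # both classes now equal in size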


4. Training a Random Forest Model

Random Forest is well-suited for tabular business data with non-linear relationships.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    random_state=42
)

model.fit(X_resampled, y_resampled)

Ensemble methods like Random Forest are robust and handle feature interactions naturally.
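
One caveat worth noting: if you later cross-validate this setup, SMOTE must be refit inside each training fold, otherwise synthetic samples leak into the validation folds and inflate scores. imblearn's Pipeline handles this automatically; a minimal sketch:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)),
])

# The sampler runs only on each fold's training portion, never on validation data
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="f1")
print(scores.mean())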


5. Evaluating the Churn Model

Evaluation focuses on minority class performance.

from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Metrics like recall and F1-score are more important than raw accuracy for churn prediction.
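
It is also worth reporting threshold-independent metrics; average precision (the area under the precision-recall curve) is often more informative than ROC-AUC under heavy imbalance. A short sketch:

from sklearn.metrics import roc_auc_score, average_precision_score

# Threshold-independent metrics use churn probabilities, not hard labels
y_proba = model.predict_proba(X_test)[:, 1]

print("ROC-AUC:          ", roc_auc_score(y_test, y_proba))
print("Average precision:", average_precision_score(y_test, y_proba))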


6. Interpreting Feature Importance

Understanding why customers churn is as important as predicting it.

import pandas as pd

importances = pd.Series(
    model.feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)

print(importances.head())

This insight helps teams design targeted retention strategies.
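
Impurity-based importances can overstate high-cardinality or correlated features, so permutation importance on the held-out test set is a useful cross-check; a sketch using scikit-learn's inspection module:

from sklearn.inspection import permutation_importance

# Shuffle each feature on the test set and measure the resulting score drop
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

perm = pd.Series(result.importances_mean, index=X_test.columns)
print(perm.sort_values(ascending=False).head())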


Key Takeaways

  1. Customer churn prediction is inherently an imbalanced classification problem.
  2. SMOTE helps models learn patterns from minority churned users.
  3. Random Forest performs strongly on tabular customer data.
  4. Recall and F1-score are critical metrics for churn use cases.
  5. Feature importance bridges ML predictions with business decisions.

Conclusion

Customer churn prediction is a classic example of how machine learning directly supports business outcomes.

By combining SMOTE for imbalance handling and Random Forest for modeling, we build a practical, real-world churn prediction system.

This project demonstrates the transition from academic ML examples to production-relevant ML workflows, making it a strong addition to any applied machine learning toolkit.


Code Snippet:

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE

# 1. Simulate an imbalanced churn-style dataset
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    weights=[0.75, 0.25],   # churn imbalance
    random_state=42
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
df["churn"] = y

# 2. Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("churn", axis=1),
    df["churn"],
    test_size=0.3,
    stratify=df["churn"],
    random_state=42
)

# 3. Balance the training set with SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# 4. Train the Random Forest
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    random_state=42
)
model.fit(X_resampled, y_resampled)

# 5. Evaluate on the untouched test set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# 6. Inspect feature importance
importances = pd.Series(
    model.feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)
print(importances.head())
