🧠 AI with Python – 📉 Customer Churn Prediction (RF + SMOTE)
Posted on: January 6, 2026
Description:
Customer churn is one of the most impactful problems businesses face today. Retaining an existing customer is often far cheaper than acquiring a new one, which makes early churn prediction a critical business capability.
In this project, we build a customer churn prediction model using Random Forest while addressing a common real-world challenge: class imbalance. To handle it, we use SMOTE (Synthetic Minority Over-sampling Technique) so the model can learn churn patterns effectively.
Understanding the Problem
In churn datasets, the number of customers who do not churn is usually much higher than those who do. This imbalance creates two major issues:
- Models become biased toward the majority (non-churn) class
- Accuracy appears high even when churned customers are poorly detected
To solve this, we must:
- Balance the dataset
- Use evaluation metrics that focus on minority-class performance (the sketch below shows why raw accuracy misleads)
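As a minimal illustration (a sketch with a hypothetical 90/10 class split, not drawn from this project's dataset), consider a baseline that never predicts churn: accuracy looks strong while churn recall is zero.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
# Hypothetical labels: 900 retained (0), 100 churned (1)
y_true = np.array([0] * 900 + [1] * 100)
# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)
print(accuracy_score(y_true, y_pred))  # 0.9 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0 -- not a single churner caught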
1. Preparing a Churn-Like Dataset
For demonstration, we simulate a churn-style dataset with imbalance.
from sklearn.datasets import make_classification
import pandas as pd
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    weights=[0.75, 0.25],  # ~75% retained vs ~25% churned
    random_state=42
)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
df["churn"] = y
This structure closely mirrors real churn datasets where churned users form a minority.
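A quick sanity check (a one-line sketch) confirms the imbalance; note that make_classification treats the weights as approximate, so the proportions land close to, not exactly on, 75/25.
# Roughly 75% non-churn vs 25% churn
print(df["churn"].value_counts(normalize=True))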
2. Train/Test Split with Stratification
We preserve class proportions while splitting the data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("churn", axis=1),
    df["churn"],
    test_size=0.3,
    stratify=df["churn"],  # keep the churn ratio identical in both splits
    random_state=42
)
Stratification prevents accidental class skew in the test set.
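To verify that stratification worked, here is a short sketch comparing churn rates across the two splits; both numbers should be nearly identical.
# The mean of a 0/1 label is the churn rate in each split
print(y_train.mean(), y_test.mean())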
3. Handling Class Imbalance with SMOTE
SMOTE generates synthetic minority-class examples by interpolating between existing churn samples and their nearest neighbors, rather than simply duplicating rows.
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
This helps the model learn decision boundaries for churned customers. Note that SMOTE is fit only on the training split; resampling before the split would leak synthetic information into the test set and inflate evaluation metrics.
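A before-and-after count makes the effect visible; a minimal sketch using collections.Counter:
from collections import Counter
# SMOTE equalizes the two classes in the training data
print("before:", Counter(y_train))
print("after: ", Counter(y_resampled))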
4. Training a Random Forest Model
Random Forest is well-suited for tabular business data with non-linear relationships.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    random_state=42
)
model.fit(X_resampled, y_resampled)
Ensemble methods like Random Forest are robust and handle feature interactions naturally.
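As an aside, class weighting is a common alternative (or complement) to SMOTE. The sketch below uses scikit-learn's built-in class_weight="balanced" option, which penalizes mistakes on the rare churn class more heavily instead of generating synthetic rows; the weighted_model name is ours, for illustration.
from sklearn.ensemble import RandomForestClassifier
# Train on the original imbalanced data, but weight classes
# inversely to their frequency
weighted_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    class_weight="balanced",
    random_state=42
)
weighted_model.fit(X_train, y_train)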
5. Evaluating the Churn Model
Evaluation focuses on minority class performance.
from sklearn.metrics import classification_report, confusion_matrix
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
Metrics like recall and F1-score are more important than raw accuracy for churn prediction.
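Since churn is the positive class, these numbers can also be pulled out directly; a minimal sketch:
from sklearn.metrics import f1_score, recall_score
# Recall: share of actual churners the model catches
print("churn recall:", recall_score(y_test, y_pred, pos_label=1))
# F1: balance between catching churners and avoiding false alarms
print("churn F1:", f1_score(y_test, y_pred, pos_label=1))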
6. Interpreting Feature Importance
Understanding why customers churn is as important as predicting it.
import pandas as pd
importances = pd.Series(
    model.feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)
print(importances.head())
This insight helps teams design targeted retention strategies.
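One caveat: the impurity-based importances above can overstate high-cardinality or correlated features. Permutation importance on the held-out test set is a standard cross-check; a sketch using sklearn.inspection:
from sklearn.inspection import permutation_importance
# Measure the test-set score drop when each feature is shuffled
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)
perm = pd.Series(result.importances_mean, index=X_test.columns)
print(perm.sort_values(ascending=False).head())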
Key Takeaways
- Customer churn prediction is inherently an imbalanced classification problem.
- SMOTE helps models learn patterns from minority churned users.
- Random Forest performs strongly on tabular customer data.
- Recall and F1-score are critical metrics for churn use cases.
- Feature importance bridges ML predictions with business decisions.
Conclusion
Customer churn prediction is a classic example of how machine learning directly supports business outcomes.
By combining SMOTE for imbalance handling and Random Forest for modeling, we build a practical, real-world churn prediction system.
This project demonstrates the transition from academic ML examples to production-relevant ML workflows, making it a strong addition to any applied machine learning toolkit.
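One refinement worth noting for production work (a sketch going beyond the walkthrough above): wrapping SMOTE and the classifier in an imblearn Pipeline, so that resampling happens inside each cross-validation fold and validation data is never resampled.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# SMOTE runs only on each fold's training portion
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)),
])
scores = cross_val_score(pipe, df.drop("churn", axis=1), df["churn"], cv=5, scoring="f1")
print(scores.mean())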
Code Snippet:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# 1. Simulate an imbalanced churn-style dataset
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    weights=[0.75, 0.25],  # churn imbalance
    random_state=42
)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
df["churn"] = y

# 2. Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("churn", axis=1),
    df["churn"],
    test_size=0.3,
    stratify=df["churn"],
    random_state=42
)

# 3. Oversample the minority (churn) class in the training data only
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# 4. Train the Random Forest on the balanced training data
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    random_state=42
)
model.fit(X_resampled, y_resampled)

# 5. Evaluate on the untouched test set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# 6. Inspect feature importance
importances = pd.Series(
    model.feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)
print(importances.head())