🧠 AI with Python – 🍷🌲 Predict Wine Quality using RandomForestClassifier
Posted on: November 4, 2025
Description:
Wine tasting might be an art — but what if we let machine learning do the judging? 🍷
In this post, we’ll use a Random Forest Classifier to predict the quality of wine based on its chemical properties such as acidity, residual sugar, and alcohol content.
The goal is to explore how ensemble learning can classify wines into quality tiers — demonstrating the power of Random Forests in handling structured data.
Why Random Forests?
Random Forests are ensemble models that combine multiple decision trees to improve accuracy and generalization.
Each tree learns from a random subset of features and data samples, reducing overfitting and improving robustness.
Advantages of Random Forests:
- Handle both classification and regression tasks
- Tolerate missing values and noise
- Provide built-in feature importance scores
- Require little to no feature scaling
These qualities make them an excellent choice for tabular datasets like wine quality analysis; the short comparison below shows the ensemble effect in action.
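Here's a minimal sketch (nothing beyond scikit-learn needed) that cross-validates a single decision tree against a 300-tree forest on the built-in wine data:

# Compare one tree vs. an ensemble of trees via 5-fold cross-validation
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

tree = DecisionTreeClassifier(random_state=42)
forest = RandomForestClassifier(n_estimators=300, random_state=42)

print(f"Single tree  : {cross_val_score(tree, X, y, cv=5).mean():.3f}")
print(f"Random forest: {cross_val_score(forest, X, y, cv=5).mean():.3f}")

On this dataset the forest usually posts a clearly higher mean score, which is exactly the variance reduction that ensembling buys you.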
Dataset Overview
The original Wine Quality dataset (from the UCI Machine Learning Repository) contains two variants — red and white wines.
Each record includes physicochemical properties (inputs) and a quality score (output), rated on a 0–10 scale; in practice the observed scores range from about 3 to 9.
If the dataset isn’t available locally, the script uses scikit-learn’s load_wine() dataset as a fallback.
This ensures the notebook runs smoothly in any environment.
from sklearn.datasets import load_wine
import pandas as pd

# Load the built-in wine dataset as a DataFrame
wine = load_wine(as_frame=True)
df = wine.frame

# Separate the features (chemical measurements) from the class label
X = df.drop(columns=["target"])
y = df["target"]
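If you'd rather train on the real quality scores, the UCI CSVs can also be fetched over HTTP. A quick sketch (the URL below is the long-standing UCI mirror, but treat it as an assumption that may move):

import pandas as pd

# Red-wine variant; the UCI files are semicolon-separated
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df_uci = pd.read_csv(url, sep=";")
print(df_uci.shape)  # expected (1599, 12): 11 physicochemical features + quality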
Training the Random Forest Model
Once the data is ready, we train a RandomForestClassifier with multiple trees (estimators) to make collective decisions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hold out 20% for testing; stratify keeps class proportions consistent across splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# 300 trees vote on each prediction; n_jobs=-1 uses all CPU cores
model = RandomForestClassifier(n_estimators=300, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)
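To see where we stand, we can score the trained model on the held-out split; a minimal check using scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score

# Predict on the unseen 20% and report plain accuracy
y_pred = model.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")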
A high accuracy score indicates that the model effectively captures the relationship between chemical features and wine quality.
Feature Importance Visualization
One of the major strengths of Random Forests is their ability to rank features by importance.
This tells us which chemical properties contribute the most to quality prediction.
import pandas as pd
import matplotlib.pyplot as plt

# Impurity-based importances, averaged across all trees, largest first
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

importances.plot(kind="barh", figsize=(8, 5), color="maroon")
plt.gca().invert_yaxis()  # barh plots bottom-up, so flip to put the top feature on top
plt.title("Top Feature Importances – Wine Quality Prediction")
plt.xlabel("Importance Score")
plt.tight_layout()
plt.show()
Interpretation:
- On the UCI data, features like alcohol content, volatile acidity, and sulphates often rank high.
- Higher alcohol and balanced acidity typically correlate with better-quality wines (see the quick check below).
- These insights help wineries and researchers understand what drives a higher quality score.
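As a quick sanity check of the alcohol claim, a sketch assuming the UCI winequality-red.csv is available locally; the mean alcohol content generally climbs with the quality score:

import pandas as pd

# Average alcohol content per quality score on the red-wine data
df = pd.read_csv("winequality-red.csv", sep=";")
print(df.groupby("quality")["alcohol"].mean().round(2))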
Key Takeaways
- Random Forests provide high accuracy and interpretability for structured datasets.
- No feature scaling is needed, simplifying the preprocessing workflow.
- The feature importance plot provides actionable insights into real-world properties.
- You can easily adapt the same approach for regression on the raw quality scores using RandomForestRegressor; a minimal sketch follows.
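For that last point, here is a minimal regression sketch, again assuming the UCI winequality-red.csv is available locally (the file path is an assumption; adjust as needed):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Treat quality as a continuous score instead of discrete classes
df = pd.read_csv("winequality-red.csv", sep=";")  # assumed local path
X, y = df.drop(columns=["quality"]), df["quality"].astype(float)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1)
reg.fit(X_train, y_train)
print(f"R^2 on the test split: {reg.score(X_test, y_test):.3f}")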
Conclusion
Machine learning isn’t just for numbers and pixels — it can even evaluate taste. 🍷
By using a Random Forest Classifier, we can predict wine quality with impressive accuracy and understand which factors make a wine “better.”
This combination of accuracy, interpretability, and ease of use makes Random Forests one of the most practical algorithms in any ML toolkit.
Code Snippet:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay
# Try local UCI CSVs first
csv_candidates = ["winequality-red.csv", "winequality-white.csv"]
csv_path = next((p for p in csv_candidates if os.path.exists(p)), None)

if csv_path:
    df = pd.read_csv(csv_path, sep=";")  # UCI CSVs are semicolon-separated
    print(f"✅ Using local dataset: {csv_path}")
    # UCI wine quality is an integer score (typically 3–9); we model it as multiclass classification
    X = df.drop(columns=["quality"])
    y = df["quality"].astype(int)
    feature_names = X.columns.tolist()
    dataset_name = f"UCI Wine Quality ({'red' if 'red' in csv_path else 'white'})"
else:
    print("⚠️ Local UCI CSV not found. Falling back to scikit-learn `load_wine()`.")
    wine = load_wine(as_frame=True)
    df = wine.frame.copy()
    # Use the built-in class labels (0/1/2) as quality tiers for a drop-in classification demo
    X = df.drop(columns=["target"])
    y = df["target"]
    feature_names = X.columns.tolist()
    dataset_name = "sklearn load_wine (class tiers as quality)"

df.head()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(
    n_estimators=300,    # more trees → stabler estimates
    max_depth=None,      # let trees grow fully; the ensemble curbs overfitting
    min_samples_leaf=1,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Dataset: {dataset_name}")
print(f"Accuracy: {acc:.3f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred))
disp = ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap="Blues")
disp.ax_.set_title("Confusion Matrix — RandomForest (Wine Quality)")
plt.tight_layout()
plt.show()
importances = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)
plt.figure(figsize=(8, 5))
importances.head(15).plot(kind="barh")
plt.gca().invert_yaxis()
plt.title("Top Feature Importances — RandomForest")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
importances.head(15)