🧠 AI with Python – 📊 Multi-class Classification Evaluation with classification_report
Posted On: October 23, 2025
Description:
When working on classification problems, evaluating how well your model performs is just as important as training it.
While accuracy gives a single overall measure, it doesn’t reveal how the model performs on each individual class — especially when classes are imbalanced or overlapping.
The classification_report from sklearn.metrics provides a detailed, class-by-class breakdown of performance using Precision, Recall, F1-Score, and Support — all in one place.
Understanding Multiclass Evaluation
In binary classification, we usually look at metrics like accuracy, precision, recall, and F1-score for two classes.
However, in multiclass classification, these metrics are calculated for each class separately and then averaged (macro, micro, or weighted).
This allows you to see which classes your model predicts well and which ones need improvement.
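To make the averaging schemes concrete, here is a small sketch using `precision_score` on hand-made toy labels (the label values below are arbitrary examples, not from the Iris dataset):

```python
from sklearn.metrics import precision_score

# Toy multiclass labels (illustrative values only)
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2]

# Macro: average the per-class precisions, each class counted equally
macro = precision_score(y_true, y_pred, average="macro")

# Micro: pool all decisions first; for single-label multiclass this equals accuracy
micro = precision_score(y_true, y_pred, average="micro")

# Weighted: average per-class precisions, weighted by each class's support
weighted = precision_score(y_true, y_pred, average="weighted")

print(f"macro={macro:.3f} micro={micro:.3f} weighted={weighted:.3f}")
```

Here class 0 has precision 1.0, class 1 has 0.5, and class 2 has 1.0, so the macro average treats them equally while the weighted average leans toward class 0, which has the most samples.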
Dataset and Model Setup
We’ll use the Iris dataset, one of the most common multiclass datasets in machine learning.
It contains three flower classes — setosa, versicolor, and virginica — and four features describing each flower’s dimensions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train a multiclass classifier
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
Generating the Classification Report
Once the model is trained, you can evaluate it with just a few lines of code using classification_report().
from sklearn.metrics import classification_report
# Make predictions
y_pred = model.predict(X_test)
# Generate report
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print("Classification Report:\n")
print(report)
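If you need the metrics programmatically rather than as printed text, `classification_report` also accepts `output_dict=True`, which returns a nested dictionary. A minimal sketch with toy labels (the labels here are placeholders, not the Iris predictions above):

```python
from sklearn.metrics import classification_report

# Toy labels for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]

# output_dict=True returns a nested dict instead of a formatted string
report = classification_report(
    y_true, y_pred,
    target_names=["setosa", "versicolor", "virginica"],
    output_dict=True,
)

# Per-class metrics are keyed by class name; averages by "macro avg" etc.
print(report["versicolor"]["recall"])  # recall for the versicolor class
print(report["accuracy"])              # overall accuracy
```

This form is convenient for logging metrics, building dashboards, or asserting on thresholds in tests.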
Interpreting the Report
The classification report provides the following metrics for each class:
| Metric | Meaning |
|---|---|
| Precision | Out of all samples predicted for a class, how many were correct. |
| Recall | Out of all actual samples of a class, how many were identified correctly. |
| F1-Score | The harmonic mean of Precision and Recall — balances both metrics. |
| Support | The number of true samples in the dataset for that class. |
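The definitions in the table can be verified by hand from a confusion matrix, where each row is a true class and each column a predicted class. A small sketch using toy labels (synthetic values, for illustration only):

```python
from sklearn.metrics import confusion_matrix

# Toy labels (synthetic, for illustration)
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = true class, cols = predicted

for i in range(cm.shape[0]):
    tp = cm[i, i]
    precision = tp / cm[:, i].sum()  # correct / everything predicted as class i
    recall = tp / cm[i, :].sum()     # correct / everything truly class i
    f1 = 2 * precision * recall / (precision + recall)
    support = cm[i, :].sum()         # number of true samples of class i
    print(f"class {i}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")
```

Running this reproduces exactly the per-class numbers `classification_report` would print for the same labels, which is a useful sanity check when learning the metrics.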
Sample Output
A typical run produces output like the following (your exact numbers may differ slightly depending on the environment):
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        16
  versicolor       1.00      0.93      0.96        14
   virginica       0.93      1.00      0.96        15

    accuracy                           0.97        45
   macro avg       0.98      0.98      0.97        45
weighted avg       0.97      0.97      0.97        45
This table gives a detailed look at how well the model performs for each class individually, along with averaged metrics at the bottom.
Why This Matters
- Class-wise evaluation helps you identify specific weaknesses in your model.
- Macro average treats all classes equally — useful when each class is equally important.
- Weighted average considers class frequency — useful when dealing with imbalanced datasets.
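To see why this distinction matters, consider a deliberately imbalanced toy example in which a degenerate "model" always predicts the majority class (all values below are synthetic):

```python
from sklearn.metrics import f1_score

# 90 majority-class samples, 10 minority-class samples (synthetic)
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # always predicts the majority class

# zero_division=0 silences the warning for the never-predicted minority class
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"macro f1:    {macro_f1:.3f}")     # dragged down by the failed class
print(f"weighted f1: {weighted_f1:.3f}")  # dominated by the majority class
```

The macro F1 falls below 0.5 because the minority class scores zero, while the weighted F1 stays above 0.85, masking the failure. Accuracy alone (0.90 here) would hide it entirely.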
The classification_report surfaces all of these per-class and averaged metrics in a single table, making it an essential step in any classification workflow.
Key Takeaways
- The classification_report is a comprehensive summary of your model’s performance across all classes.
- Use it to evaluate models beyond simple accuracy, especially for multiclass problems.
- Each metric (Precision, Recall, F1) provides unique insights into quality, coverage, and balance of predictions.
Conclusion
Model evaluation is more than a single number.
By understanding Precision, Recall, F1-score, and Support through classification_report, you gain a clearer view of where your model excels and where it struggles.
This deeper understanding helps improve your model iteratively — turning results into real insight.
Code Snippet:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train the model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Generate the classification report
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print("Classification Report:\n")
print(report)