🧠 AI with Python – 📊 Multi-class Classification Evaluation with classification_report
Posted On: October 23, 2025
Description:
When working on classification problems, evaluating how well your model performs is just as important as training it.
While accuracy gives a single overall measure, it doesn’t reveal how the model performs on each individual class — especially when classes are imbalanced or overlapping.
The classification_report from sklearn.metrics provides a detailed, class-by-class breakdown of performance using Precision, Recall, F1-Score, and Support — all in one place.
Understanding Multiclass Evaluation
In binary classification, we usually look at metrics like accuracy, precision, recall, and F1-score for two classes.
However, in multiclass classification, these metrics are calculated for each class separately and then averaged (macro, micro, or weighted).
This allows you to see which classes your model predicts well and which ones need improvement.
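To make the averaging schemes concrete, here is a small sketch using `precision_score` on hand-made toy labels (the label values below are arbitrary examples, not from the Iris dataset):

```python
from sklearn.metrics import precision_score

# Toy multiclass labels (illustrative values only)
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2]

# Macro: average the per-class precisions, each class counted equally
macro = precision_score(y_true, y_pred, average="macro")

# Micro: pool all decisions first; for single-label multiclass this equals accuracy
micro = precision_score(y_true, y_pred, average="micro")

# Weighted: average per-class precisions, weighted by each class's support
weighted = precision_score(y_true, y_pred, average="weighted")

print(f"macro={macro:.3f} micro={micro:.3f} weighted={weighted:.3f}")
```

Here class 0 has precision 1.0, class 1 has 0.5, and class 2 has 1.0, so the macro average treats them equally while the weighted average leans toward class 0, which has the most samples.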
Dataset and Model Setup
We’ll use the Iris dataset, one of the most common multiclass datasets in machine learning.
It contains three flower classes — setosa, versicolor, and virginica — and four features describing each flower’s dimensions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train a multiclass classifier
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
Generating the Classification Report
Once the model is trained, you can evaluate it with just a few lines of code using classification_report().
from sklearn.metrics import classification_report
# Make predictions
y_pred = model.predict(X_test)
# Generate report
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print("Classification Report:\n")
print(report)
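If you need the metrics programmatically rather than as printed text, `classification_report` also accepts `output_dict=True`, which returns a nested dictionary. A minimal sketch with toy labels (the labels here are placeholders, not the Iris predictions above):

```python
from sklearn.metrics import classification_report

# Toy labels for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]

# output_dict=True returns a nested dict instead of a formatted string
report = classification_report(
    y_true, y_pred,
    target_names=["setosa", "versicolor", "virginica"],
    output_dict=True,
)

# Per-class metrics are keyed by class name; averages by "macro avg" etc.
print(report["versicolor"]["recall"])  # recall for the versicolor class
print(report["accuracy"])              # overall accuracy
```

This form is convenient for logging metrics, building dashboards, or asserting on thresholds in tests.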
Interpreting the Report
The classification report provides the following metrics for each class:
| Metric | Meaning |
|---|---|
| Precision | Out of all samples predicted for a class, how many were correct. |
| Recall | Out of all actual samples of a class, how many were identified correctly. |
| F1-Score | The harmonic mean of Precision and Recall — balances both metrics. |
| Support | The number of true samples in the dataset for that class. |
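The definitions in the table can be verified by hand from a confusion matrix, where each row is a true class and each column a predicted class. A small sketch using toy labels (synthetic values, for illustration only):

```python
from sklearn.metrics import confusion_matrix

# Toy labels (synthetic, for illustration)
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = true class, cols = predicted

for i in range(cm.shape[0]):
    tp = cm[i, i]
    precision = tp / cm[:, i].sum()  # correct / everything predicted as class i
    recall = tp / cm[i, :].sum()     # correct / everything truly class i
    f1 = 2 * precision * recall / (precision + recall)
    support = cm[i, :].sum()         # number of true samples of class i
    print(f"class {i}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")
```

Running this reproduces exactly the per-class numbers `classification_report` would print for the same labels, which is a useful sanity check when learning the metrics.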
Sample Output
A typical run produces output like the following (your exact numbers may differ slightly depending on the environment):
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        16
  versicolor       1.00      0.93      0.96        14
   virginica       0.93      1.00      0.96        15

    accuracy                           0.97        45
   macro avg       0.98      0.98      0.97        45
weighted avg       0.97      0.97      0.97        45
This table gives a detailed look at how well the model performs for each class individually, along with averaged metrics at the bottom.
Why This Matters
- Class-wise evaluation helps you identify specific weaknesses in your model.
- Macro average treats all classes equally — useful when each class is equally important.
- Weighted average considers class frequency — useful when dealing with imbalanced datasets.
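To see why this distinction matters, consider a deliberately imbalanced toy example in which a degenerate "model" always predicts the majority class (all values below are synthetic):

```python
from sklearn.metrics import f1_score

# 90 majority-class samples, 10 minority-class samples (synthetic)
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # always predicts the majority class

# zero_division=0 silences the warning for the never-predicted minority class
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"macro f1:    {macro_f1:.3f}")     # dragged down by the failed class
print(f"weighted f1: {weighted_f1:.3f}")  # dominated by the majority class
```

The macro F1 falls below 0.5 because the minority class scores zero, while the weighted F1 stays above 0.85, masking the failure. Accuracy alone (0.90 here) would hide it entirely.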
The classification_report surfaces all of these per-class and averaged metrics in a single table, making it an essential step in any classification workflow.
Key Takeaways
- The classification_report is a comprehensive summary of your model’s performance across all classes.
- Use it to evaluate models beyond simple accuracy, especially for multiclass problems.
- Each metric (Precision, Recall, F1) provides unique insights into quality, coverage, and balance of predictions.
Conclusion
Model evaluation is more than a single number.
By understanding Precision, Recall, F1-score, and Support through classification_report, you gain a clearer view of where your model excels and where it struggles.
This deeper understanding helps improve your model iteratively — turning results into real insight.
Code Snippet:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train the model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Generate the classification report
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print("Classification Report:\n")
print(report)