🧠 AI with Python – 📊 Precision, Recall, and F1-Score Explained


Description:

When evaluating a machine learning model, accuracy is often the first metric we look at.

However, in real-world scenarios — especially when dealing with imbalanced datasets — accuracy alone can be misleading.

Metrics like Precision, Recall, and F1-Score give a much deeper insight into how well your model is performing across different types of errors.


Why Accuracy Isn’t Always Enough

Imagine a dataset where 95% of samples belong to one class.

A model that simply predicts everything as that class would achieve 95% accuracy, yet it completely fails to identify the minority class.

That’s where Precision, Recall, and F1-Score come in — helping us understand how a model performs, not just how often it’s right.
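
To make this concrete, here is a minimal sketch of the accuracy trap. The 95/5 class split and the always-majority "model" below are illustrative assumptions, not part of the Iris example that follows:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 95% negative (0), 5% positive (1)
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant for this demo

# A "model" that always predicts the majority class
majority = DummyClassifier(strategy="most_frequent")
majority.fit(X, y)
y_pred = majority.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2f}")  # 0.95, looks great
print(f"Recall:   {recall_score(y, y_pred):.2f}")    # 0.00, misses every positive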


Dataset and Model Setup

We’ll use the Iris dataset, restricting it to two classes for simplicity.

A Logistic Regression model will serve as our classifier.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Keep only classes 0 and 1 (setosa vs. versicolor) for binary classification
X, y = X[y != 2], y[y != 2]

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

Evaluating Model Predictions

After training the model, we can compute Precision, Recall, and F1-Score using scikit-learn’s built-in metrics.

from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

y_pred = model.predict(X_test)

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")

print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))

The classification report provides per-class metrics, along with overall averages — making it a one-stop summary for model evaluation.
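
The report's macro and weighted averages can also be computed directly, since each metric function accepts an average parameter. A short sketch, continuing from the y_test and y_pred variables above:

from sklearn.metrics import f1_score

# One F1 value per class (matches the report's per-class rows)
print(f1_score(y_test, y_pred, average=None))

# 'macro': unweighted mean over classes; treats every class equally
print(f"Macro F1:    {f1_score(y_test, y_pred, average='macro'):.2f}")

# 'weighted': mean over classes weighted by support; favors frequent classes
print(f"Weighted F1: {f1_score(y_test, y_pred, average='weighted'):.2f}")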


Understanding the Metrics

  • Precision = TP / (TP + FP): of all predicted positives, how many were correct.
  • Recall = TP / (TP + FN): of all actual positives, how many were identified correctly.
  • F1-Score = 2 × (Precision × Recall) / (Precision + Recall): the harmonic mean of Precision and Recall, balancing the two.

The terms in these formulas come from the confusion matrix:

  • TP (True Positive): the model predicted positive and was correct.
  • FP (False Positive): the model predicted positive, but the actual class was negative.
  • FN (False Negative): the model predicted negative, but the actual class was positive.
  • TN (True Negative): the model predicted negative and was correct.

The sketch below recovers the same metric values by hand from these counts.
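
Continuing from the y_test and y_pred variables of the evaluation code, this sketch recomputes precision, recall, and F1 directly from the confusion matrix (it assumes the model predicts at least one positive, so no division by zero occurs):

from sklearn.metrics import confusion_matrix

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Precision: {precision:.2f}")  # matches precision_score
print(f"Recall:    {recall:.2f}")     # matches recall_score
print(f"F1-Score:  {f1:.2f}")         # matches f1_score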

When to Focus on Each Metric

  • High Precision Needed:

    When false positives are costly (e.g., spam detection, where flagging a legitimate email as spam is worse than letting an occasional spam message through).

  • High Recall Needed:

    When missing a positive instance is more critical (e.g., fraud detection, cancer screening).

  • F1-Score Balances Both:

    Useful when both false positives and false negatives are important (the threshold sketch below shows how the two trade off).
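
Because Logistic Regression outputs class probabilities, you can move along this precision/recall trade-off by changing the decision threshold instead of using the default 0.5. A minimal sketch, assuming the model and X_test from the setup above; the threshold values are arbitrary illustrative choices:

from sklearn.metrics import precision_score, recall_score

# Probability of the positive class for each test sample
proba = model.predict_proba(X_test)[:, 1]

# A higher threshold predicts positive less often: precision tends up, recall down
for threshold in (0.1, 0.5, 0.9):
    y_at_t = (proba >= threshold).astype(int)
    p = precision_score(y_test, y_at_t, zero_division=0)
    r = recall_score(y_test, y_at_t, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")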


Example Output

An illustrative result (note that the two Iris classes used here are linearly separable, so an actual run of this code will typically score a perfect 1.00 on every metric):

Precision: 1.00
Recall: 0.93
F1-Score: 0.96

              precision    recall  f1-score   support
           0       0.94      1.00      0.97        16
           1       1.00      0.93      0.96        14
    accuracy                           0.97        30

Key Takeaways

  • Precision measures the quality of positive predictions.
  • Recall measures how completely the actual positives are captured.
  • F1-Score balances both, which matters especially when data is imbalanced.
  • Use all three metrics together for a complete evaluation picture.

Conclusion

Accuracy may tell you how often your model is correct, but Precision, Recall, and F1-Score reveal why it performs the way it does.

These metrics are indispensable in classification — especially when the cost of errors differs across outcomes.

By understanding and tracking them, you move from just “building models” to truly evaluating intelligence.


Code Snippet:

# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report


# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Use only two classes (binary classification)
X, y = X[y != 2], y[y != 2]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)


# Make predictions
y_pred = model.predict(X_test)

# Calculate metrics
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")

# Complete classification report
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
