🧠 AI with Python – 📊 Precision, Recall, and F1-Score Explained
Posted On: October 21, 2025
When evaluating a machine learning model, accuracy is often the first metric we look at.
However, in real-world scenarios — especially when dealing with imbalanced datasets — accuracy alone can be misleading.
Metrics like Precision, Recall, and F1-Score give a much deeper insight into how well your model is performing across different types of errors.
Why Accuracy Isn’t Always Enough
Imagine a dataset where 95% of samples belong to one class.
A model that simply predicts everything as that class would achieve 95% accuracy, yet it completely fails to identify the minority class.
That’s where Precision, Recall, and F1-Score come in — helping us understand how a model performs, not just how often it’s right.
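To see this concretely, here is a minimal sketch (using a synthetic label array, separate from the Iris example below) of a majority-class predictor: accuracy looks excellent while recall on the minority class is zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- every positive is missed
```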
Dataset and Model Setup
We’ll use the Iris dataset, restricting it to its first two classes (setosa and versicolor) for simplicity.
A Logistic Regression model will serve as our classifier.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the full Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Keep only classes 0 and 1 for binary classification
X, y = X[y != 2], y[y != 2]

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a Logistic Regression classifier
model = LogisticRegression()
model.fit(X_train, y_train)
```
Evaluating Model Predictions
After training the model, we can compute Precision, Recall, and F1-Score using scikit-learn’s built-in metrics.
```python
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

# Predict on the held-out test set
y_pred = model.predict(X_test)

# Binary metrics treat class 1 as the positive class by default
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")

print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
```
The classification report provides per-class metrics, along with overall averages — making it a one-stop summary for model evaluation.
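If you need those numbers programmatically rather than as printed text, classification_report also accepts output_dict=True and returns a nested dictionary instead; a quick sketch, continuing with y_test and y_pred from the code above:

```python
# Same report as a nested dict, keyed by class label and average type
report = classification_report(y_test, y_pred, output_dict=True)
print(report["1"]["recall"])            # recall for class 1
print(report["macro avg"]["f1-score"])  # macro-averaged F1
```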
Understanding the Metrics
| Metric | Formula | Meaning |
|---|---|---|
| Precision | TP / (TP + FP) | Out of all predicted positives, how many were correct. |
| Recall | TP / (TP + FN) | Out of all actual positives, how many were identified correctly. |
| F1-Score | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of Precision and Recall — balances both metrics. |
- TP (True Positive): Model predicted positive and was correct.
- FP (False Positive): Model predicted positive but was wrong.
- FN (False Negative): Model predicted negative but missed a positive.
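To tie the formulas back to these counts, here is a small sketch that pulls TP, FP, and FN out of scikit-learn’s confusion_matrix and recomputes the metrics by hand; the results should match the built-in scoring functions used above.

```python
from sklearn.metrics import confusion_matrix

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * (precision_manual * recall_manual) / (precision_manual + recall_manual)

print(f"Precision: {precision_manual:.2f}")  # matches precision_score
print(f"Recall: {recall_manual:.2f}")        # matches recall_score
print(f"F1-Score: {f1_manual:.2f}")          # matches f1_score
```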
When to Focus on Each Metric
- High Precision Needed: when false positives are costly (e.g., spam detection, where flagging a legitimate email as spam means it is lost).
- High Recall Needed: when missing a positive instance is more critical (e.g., fraud detection, cancer screening).
- F1-Score Balances Both: useful when both false positives and false negatives carry real costs.
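If one side of the trade-off matters more, scikit-learn’s fbeta_score generalizes F1: beta > 1 weights recall more heavily, beta < 1 favors precision, and beta = 1 recovers the F1-Score. A quick sketch, reusing y_test and y_pred from above:

```python
from sklearn.metrics import fbeta_score

f2 = fbeta_score(y_test, y_pred, beta=2)     # recall-oriented
f05 = fbeta_score(y_test, y_pred, beta=0.5)  # precision-oriented

print(f"F2-Score: {f2:.2f}")
print(f"F0.5-Score: {f05:.2f}")
```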
Example Output
A sample result from the Iris dataset:
```
Precision: 1.00
Recall: 0.93
F1-Score: 0.96

Classification Report:

              precision    recall  f1-score   support

           0       0.94      1.00      0.97        16
           1       1.00      0.93      0.96        14

    accuracy                           0.97        30
```
Key Takeaways
- Precision measures quality of positive predictions.
- Recall measures completeness: how many of the actual positives were captured.
- F1-Score balances both, especially when data is imbalanced.
- Use all three metrics together for a complete evaluation picture.
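As a convenience, all three metrics can also come from a single call to scikit-learn’s precision_recall_fscore_support; a minimal sketch:

```python
from sklearn.metrics import precision_recall_fscore_support

# With average="binary", the support element is returned as None
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary"
)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```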
Conclusion
Accuracy may tell you how often your model is correct, but Precision, Recall, and F1-Score reveal why it performs the way it does.
These metrics are indispensable in classification — especially when the cost of errors differs across outcomes.
By understanding and tracking them, you move from just “building models” to truly evaluating intelligence.
Code Snippet:
```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Use only two classes (binary classification)
X, y = X[y != 2], y[y != 2]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate metrics
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")

# Complete classification report
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
```