In many real-world problems, some classes are much rarer than others.
A model that always predicts “no fraud” could still achieve >99% accuracy!
This is why accuracy can be misleading in imbalanced datasets.
We need metrics that differentiate types of errors.
DummyClassifier: Confusion matrix
Which type of error would be most critical for the bank to address: missing a fraud case, or flagging a legitimate transaction as fraud?
LogisticRegression: Confusion matrix
Are we doing better with logistic regression?
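A minimal sketch of this comparison; the imbalanced dataset below is synthetic and stands in for the course's fraud data, so the exact numbers are illustrative:

```python
# Sketch: majority-class baseline vs. logistic regression on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data: roughly 1% positive class.
X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.99], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

dummy = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
dummy.fit(X_train, y_train)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# The dummy model looks very accurate even though it never predicts fraud.
print("Dummy accuracy :", dummy.score(X_test, y_test))
print("LogReg accuracy:", lr.score(X_test, y_test))

# scikit-learn convention: rows = true class, columns = predicted class.
print(confusion_matrix(y_test, dummy.predict(X_test)))
print(confusion_matrix(y_test, lr.predict(X_test)))
```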
Understanding the confusion matrix
TN \(\rightarrow\) True negatives
FP \(\rightarrow\) False positives
FN \(\rightarrow\) False negatives
TP \(\rightarrow\) True positives
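In scikit-learn, these four counts can be unpacked directly from the confusion matrix. A small sketch with made-up labels (1 = fraud, 0 = non-fraud, values chosen only for illustration):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = fraud, 0 = non-fraud.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=5, FP=1, FN=2, TP=2
```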
Practice: confusion matrix terminology
Confusion matrix questions
Imagine a spam filter model where emails labeled 1 = spam, 0 = not spam.
If a spam email is incorrectly classified as not spam, what kind of error is this?
A false positive
A true positive
A false negative
A true negative
Confusion matrix questions
In an intrusion detection system, 1 = intrusion, 0 = safe.
If the system misses an actual intrusion and classifies it as safe, this is a:
A false positive
A true positive
A false negative
A true negative
Confusion matrix questions
In a medical test for a disease, 1 = diseased, 0 = healthy.
If a healthy patient is incorrectly diagnosed as diseased, that’s a:
A false positive
A true positive
A false negative
A true negative
Metrics other than accuracy
Now that we understand the different types of errors, we can explore metrics that better capture model performance when accuracy falls short, especially for imbalanced datasets.
We’ll start with three key ones:
Precision
Recall
F1-score
Precision and recall
Let’s revisit our fraud detection scenario. The circle below represents all transactions predicted as fraud by an imaginary toy model designed to detect fraudulent activity.
Intuition behind the two metrics
Precision: Of all the transactions predicted as fraud, how many were actually fraud?
High precision \(\rightarrow\) few false alarms (low false positives).
Recall: Of all the actual fraud cases, how many did the model catch?
High recall \(\rightarrow\) few missed frauds (low false negatives).
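Both metrics are one function call away in scikit-learn; a small sketch with hypothetical predictions (1 = fraud):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical predictions, 1 = fraud (values chosen for illustration).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Precision: of the 3 transactions flagged as fraud, 2 really are -> 2/3.
print("precision:", precision_score(y_true, y_pred))
# Recall: of the 4 actual frauds, 2 were caught -> 2/4.
print("recall:", recall_score(y_true, y_pred))
```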
Trade-off between precision and recall
Increasing recall often decreases precision, and vice versa.
Example:
Predict “fraud” for every transaction \(\rightarrow\) perfect recall, terrible precision.
Predict “fraud” only when 100% sure \(\rightarrow\) high precision, low recall.
The right balance depends on the application and cost of errors.
F1-score
Sometimes, we want a single metric that balances precision and recall.
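The F1-score does this by taking the harmonic mean of the two, so it is high only when both precision and recall are reasonably high:

\[
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]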
Select all of the following statements which are TRUE.
In medical diagnosis, false positives are more damaging than false negatives (assume “positive” means the person has a disease, “negative” means they don’t).
In spam classification, false positives are more damaging than false negatives (assume “positive” means the email is spam, “negative” means it’s not).
If method A gets a higher accuracy than method B, that means its precision is also higher.
If method A gets a higher accuracy than method B, that means its recall is also higher.
Counter examples
Method A - higher accuracy but lower precision

|                 | Predicted negative | Predicted positive |
|-----------------|--------------------|--------------------|
| Actual negative | 90                 | 5                  |
| Actual positive | 5                  | 0                  |

Accuracy = 90/100 = 0.90; precision = 0/(0 + 5) = 0.
Method B - lower accuracy but higher precision

|                 | Predicted negative | Predicted positive |
|-----------------|--------------------|--------------------|
| Actual negative | 80                 | 15                 |
| Actual positive | 0                  | 5                  |

Accuracy = 85/100 = 0.85; precision = 5/(5 + 15) = 0.25.
Takeaway
Accuracy summarizes overall correctness but hides class-specific behaviour.
You can have high accuracy but poor precision or recall,
especially in imbalanced datasets.
Always check multiple metrics before deciding which model is better.
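One convenient way to inspect several metrics at once is scikit-learn's classification_report; a small sketch with made-up labels:

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration; in practice, pass y_test and the model's predictions.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1]

# Prints precision, recall, and F1 per class, plus overall accuracy.
print(classification_report(y_true, y_pred, target_names=["non-fraud", "fraud"]))
```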
Threshold-based classification
Predicting with logistic regression
Most classification models don’t directly predict labels. They predict scores or probabilities.
To get a label (e.g., “fraud” or “non-fraud”), we choose a threshold (often 0.5). If the threshold changes, the predictions change, and so do the errors.
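A minimal sketch of thresholding by hand; the probabilities below are made up (in practice they would come from something like `lr.predict_proba(X_test)[:, 1]`, using the fitted model from the earlier sketch):

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class
# (in practice: lr.predict_proba(X_test)[:, 1]).
proba = np.array([0.05, 0.20, 0.35, 0.48, 0.51, 0.63, 0.80, 0.95])

# For logistic regression, .predict() corresponds to a 0.5 probability threshold.
print((proba >= 0.5).astype(int))  # [0 0 0 0 1 1 1 1]

# Lowering the threshold flags more transactions as fraud:
# recall tends to go up, precision tends to go down.
print((proba >= 0.3).astype(int))  # [0 0 1 1 1 1 1 1]
```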
What happens to precision and recall if we change the probability threshold?
Calculate precision and recall (TPR) at every possible threshold and graph them.
Top left \(\rightarrow\) Very high threshold (strict model = high precision)
Bottom right \(\rightarrow\) Very low threshold (lenient model = high recall)
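A sketch of building this curve with scikit-learn's precision_recall_curve; the labels and scores below are hypothetical:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9]

# One (precision, recall) pair per candidate threshold.
precision, recall, thresholds = precision_recall_curve(y_true, scores)

plt.plot(recall, precision, marker=".")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("PR curve (toy example)")
plt.show()
```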
PR curve different thresholds
Which of the red dots are reasonable trade-offs?
Average Precision (AP) Score
The AP score summarizes the PR curve by calculating the area under it.
It measures the ranking ability of a model: how well it assigns higher probabilities to positive examples than to negative ones, regardless of any specific threshold.
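A sketch with average_precision_score, reusing the same kind of hypothetical labels and scores as in the PR-curve sketch above:

```python
from sklearn.metrics import average_precision_score

# Hypothetical labels and scores, as in the PR-curve sketch above.
y_true = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9]

# Summarizes the PR curve across all thresholds; a perfect ranking gives 1.0.
print("AP:", average_precision_score(y_true, scores))
```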
Clicker Exercise 9.2
Choose the appropriate evaluation metric for the following scenarios:
Scenario 1: Balance between precision and recall for a threshold.
Scenario 2: Assess performance across all thresholds.
F1 for 1, AP for 2
AP for 1, F1 Score for 2
AP for both
F1 for both
Clicker Exercise 9.3
Select all of the following statements which are TRUE.
If we increase the classification threshold, both true and false positives are likely to decrease.
If we increase the classification threshold, both true and false negatives are likely to decrease.
Lowering the classification threshold generally increases the model’s recall.
Raising the classification threshold can improve the precision of the model if it effectively reduces the number of false positives without significantly affecting true positives.
ROC Curve
Compute the True Positive Rate (TPR) and False Positive Rate (FPR) at every possible threshold, and plot TPR vs FPR.
How well does the model separate positive and negative classes in terms of predicted probability?
A good choice when the dataset is reasonably balanced; for extremely imbalanced problems (e.g., fraud detection, disease diagnosis), the PR curve is often more informative.
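A minimal sketch of computing and plotting the curve with scikit-learn's roc_curve; labels and scores are hypothetical:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Hypothetical labels and scores.
y_true = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9]

# One (FPR, TPR) pair per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)

plt.plot(fpr, tpr, marker=".")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance-level diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.title("ROC curve (toy example)")
plt.show()
```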
ROC Curve example
Bottom-left \(\rightarrow\) very high threshold (almost everything predicted negative: low recall, low FPR).
Top-right \(\rightarrow\) very low threshold (almost everything predicted positive: high recall, high FPR).
AUC
The area under the ROC curve (AUC) represents the probability that the model, if given a randomly chosen positive and negative example, will rank the positive higher than the negative.
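A sketch with roc_auc_score on hypothetical inputs, matching the ROC sketch above:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, as in the ROC sketch above.
y_true = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9]

# Probability that a randomly chosen positive is ranked above a randomly chosen negative.
print("ROC AUC:", roc_auc_score(y_true, scores))
```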
ROC AUC questions
Consider the points A, B, and C in the following diagram, each representing a threshold. Which threshold would you pick in each scenario?
If false positives (false alarms) are highly costly
If false positives are cheap and false negatives (missed true positives) highly costly
For this demo, each student should click this link to create a new repo in their account, then clone that repo locally to follow along with today’s demo.