01 Input space & score distribution
02 Outcomes
| Metric | Meaning | Definition |
|---|---|---|
| TPR (Recall) | how well the classifier "recalls" all actual positives | $\text{TP}/(\text{TP}+\text{FN})$ |
| FPR | how often the classifier raises a false alarm on negatives | $\text{FP}/(\text{FP}+\text{TN})$ |
| Precision | how trustworthy a positive prediction is | $\text{TP}/(\text{TP}+\text{FP})$ |
| F1 | harmonic mean of precision & recall | $2\cdot\text{Prec}\cdot\text{Rec}/(\text{Prec}+\text{Rec})$ |
| Specificity | how well the classifier correctly rejects actual negatives | $\text{TN}/(\text{TN}+\text{FP}) = 1-\text{FPR}$ |
| Accuracy | overall fraction of correct predictions | $(\text{TP}+\text{TN})/N$, with $N = \text{TP}+\text{FP}+\text{TN}+\text{FN}$ |
| Error Rate | overall fraction of incorrect predictions | $(\text{FP}+\text{FN})/N = 1-\text{Acc}$ |
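The definitions in the table can be checked with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `confusion_metrics`, the example counts, and the convention of reporting an undefined ratio as 0.0 are all assumptions made for illustration.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the table's metrics from one set of confusion-matrix counts."""
    n = tp + fp + tn + fn

    def ratio(num, den):
        # Assumed convention: a ratio with a zero denominator is reported as 0.0.
        return num / den if den else 0.0

    tpr = ratio(tp, tp + fn)          # recall
    fpr = ratio(fp, fp + tn)
    precision = ratio(tp, tp + fp)
    return {
        "tpr": tpr,
        "fpr": fpr,
        "precision": precision,
        "f1": ratio(2 * precision * tpr, precision + tpr),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, n),
        "error_rate": ratio(fp + fn, n),
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name:12s} {value:.4f}")
```

With these counts the two identities in the table hold by construction: specificity equals 1 − FPR, and accuracy plus error rate equals 1.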
03 Threshold sweep
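Precision and recall share TP in the numerator but divide by different totals: recall by the actual positives (TP + FN), which the threshold cannot change, and precision by the predicted positives (TP + FP), which grow as the threshold drops. That is why the two tend to move in opposite directions as you move the threshold.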
Lower the threshold → more positives predicted → recall can only rise, precision typically falls (more false alarms).
Raise the threshold → fewer positives predicted → precision typically rises, recall can only fall (more misses).
Predict positive on everything and recall is perfect, but precision collapses to the share of positives in the data. Predict positive only when certain and precision is high, but you miss many true positives. The F1 score and the PR curve make this trade-off explicit so you can pick the operating point that fits your cost structure.
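The trade-off is easy to see on a toy example. The sketch below assumes the rule "predict positive when score ≥ threshold"; the helper `precision_recall_at`, the scores, and the labels are hypothetical, and reporting precision as 1.0 when nothing is flagged is just one common convention.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    # Assumed convention: precision is 1.0 when no item is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical scores and ground-truth labels (1 = positive, 0 = negative).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

sweep = {t: precision_recall_at(scores, labels, t) for t in (0.0, 0.5, 0.9)}
for t, (p, r) in sweep.items():
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.0  precision=0.50  recall=1.00
# threshold=0.5  precision=0.75  recall=0.75
# threshold=0.9  precision=1.00  recall=0.25
```

At threshold 0.0 everything is flagged, so recall is 1.0 and precision equals the base rate (4 positives out of 8). At 0.9 only the single most confident item is flagged, so precision is 1.0 and recall drops to 0.25.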
04 Youden's J: optimal threshold
$J(\tau) = \mathrm{TPR}(\tau) - \mathrm{FPR}(\tau)$ measures how much better the classifier is than random at a given cut-off. A random classifier has $J = 0$; a perfect one has $J = 1$.
The Youden threshold $\tau^\star = \arg\max_\tau J(\tau)$ maximises the vertical distance between the ROC curve and the diagonal chance line — it is the point on the ROC curve furthest from the no-skill baseline.
This criterion implicitly weights sensitivity and specificity equally, since $J = \mathrm{TPR} - \mathrm{FPR} = \text{Sensitivity} + \text{Specificity} - 1$. If a false negative is much costlier than a false positive (e.g. cancer screening), you may prefer a lower threshold than $\tau^\star$ even though $J$ is slightly smaller there.
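A brute-force search makes the definition of $\tau^\star$ concrete. This is a sketch under stated assumptions: the function name `youden_threshold` and the toy data are hypothetical, candidate thresholds are restricted to the observed score values, both classes must be present, and ties are broken in favour of the lowest threshold.

```python
def youden_threshold(scores, labels):
    """Return (tau_star, J) maximising J = TPR - FPR over the observed scores.

    Assumes labels contain at least one positive (1) and one negative (0).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_tau, best_j = None, float("-inf")
    for tau in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # strict comparison: ties keep the lowest threshold
            best_tau, best_j = tau, j
    return best_tau, best_j


# Hypothetical toy data: 4 positives, 5 negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0]

tau_star, j_star = youden_threshold(scores, labels)
print(tau_star, round(j_star, 3))  # 0.5 0.8
```

Here the cut-off 0.5 catches all four positives (TPR = 1.0) at the cost of one false alarm out of five negatives (FPR = 0.2), giving $J = 0.8$. The quadratic loop is fine for a demonstration; a sorted single pass would be the usual choice on large data.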
05 Low false-alarm regime
The full AUC weighs every operating point equally — but many production systems can only tolerate a tiny false-alarm budget. Fraud alerts, medical screening, content moderation queues: every flag costs a human review, so only the very low-FPR slice of the ROC is operationally relevant. Two classifiers with identical AUC can behave very differently in that slice.
Partial AUC $\mathrm{pAUC}(\alpha)$ integrates the ROC only over $\mathrm{FPR} \in [0, \alpha]$ — the shaded strip on the left. The normalised value $\mathrm{pAUC}/\alpha$ rescales it back to $[0, 1]$ so you can compare it to the full AUC at a glance. A model that earns its AUC out in the high-FPR region (where you'd never operate) gets penalised here.
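One way to compute $\mathrm{pAUC}(\alpha)$ is to build the empirical ROC points and integrate with the trapezoid rule, clipping the last segment at $\mathrm{FPR} = \alpha$. The sketch below assumes exactly that; `roc_points`, `partial_auc`, and the toy data are hypothetical names and values, and both classes are assumed to be present.

```python
def roc_points(scores, labels):
    """Empirical ROC points (FPR, TPR), from (0, 0) to (1, 1)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Consume every item tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def partial_auc(points, alpha):
    """Trapezoidal area under the ROC curve for FPR in [0, alpha]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= alpha:
            break
        if x1 > alpha:
            # Clip the segment at FPR = alpha by linear interpolation.
            y1 = y0 + (y1 - y0) * (alpha - x0) / (x1 - x0)
            x1 = alpha
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data: 4 positives, 5 negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0]

points = roc_points(scores, labels)
alpha = 0.2
full_auc = partial_auc(points, 1.0)
normalised_pauc = partial_auc(points, alpha) / alpha
print(round(full_auc, 3), round(normalised_pauc, 3))  # 0.95 0.75
```

On this data the full AUC is 0.95 but the normalised partial AUC at $\alpha = 0.2$ is only 0.75, because the fourth positive is ranked below the first negative. If you use scikit-learn instead, be aware that `roc_auc_score(..., max_fpr=alpha)` returns a standardised partial AUC (McClish correction), which is a different quantity from $\mathrm{pAUC}/\alpha$.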
Success@K reframes the same idea from the analyst's seat: "of the top $K$ items the model flagged, how many were real?" (precision) and "how many of all the positives did we catch?" (recall). The marker tracks the current threshold's $K$ — slide the probability cut-off to see how the candidate list grows.
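The two questions behind Success@K correspond to precision and recall over the top $K$ ranked items. The following is a minimal sketch, assuming items are ranked by descending score and $K \ge 1$; the helpers `precision_recall_at_k` and `k_at_threshold` and the toy data are hypothetical.

```python
def precision_recall_at_k(scores, labels, k):
    """Precision and recall over the k highest-scoring items (k >= 1)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits = sum(y for _, y in ranked[:k])  # true positives in the top k
    n_pos = sum(labels)
    return hits / k, hits / n_pos


def k_at_threshold(scores, threshold):
    """Size of the candidate list when flagging every score >= threshold."""
    return sum(1 for s in scores if s >= threshold)


# Hypothetical toy data: 4 positives, 5 negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0]

print(precision_recall_at_k(scores, labels, 3))  # (1.0, 0.75)
print(precision_recall_at_k(scores, labels, 5))  # (0.8, 1.0)
print(k_at_threshold(scores, 0.5))               # 5
```

Growing the list from 3 to 5 items catches the last positive (recall 0.75 → 1.0) but adds one false alarm to the review queue (precision 1.0 → 0.8).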