01 Input space & score distribution

Predictions @ threshold · top: input space · bottom: score distribution

02 Outcomes

Confusion matrix

| | Pred + | Pred − |
|---|---|---|
| Actual + | TP | FN |
| Actual − | FP | TN |

| Metric | Definition | Formula |
|---|---|---|
| TPR (Recall) | how well the classifier "recalls" all actual positives | $\text{TP}/(\text{TP}+\text{FN})$ |
| FPR | how often the classifier raises a false alarm on negatives | $\text{FP}/(\text{FP}+\text{TN})$ |
| Precision | how trustworthy a positive prediction is | $\text{TP}/(\text{TP}+\text{FP})$ |
| F1 | harmonic mean of precision & recall | $2\cdot\text{Prec}\cdot\text{Rec}/(\text{Prec}+\text{Rec})$ |
| Specificity | how well the classifier dismisses actual negatives | $\text{TN}/(\text{TN}+\text{FP}) = 1-\text{FPR}$ |
| Accuracy | overall fraction of correct predictions | $(\text{TP}+\text{TN})/N$ |
| Error Rate | overall fraction of incorrect predictions | $(\text{FP}+\text{FN})/N = 1-\text{Acc}$ |

Here $N = \text{TP}+\text{FP}+\text{TN}+\text{FN}$ is the total number of examples.
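
The formulas in the table can be checked numerically with a small sketch. It is only an illustration: the function names `confusion_counts`, `safe_div` and `metrics`, the toy labels and scores, the rule "predict positive when score ≥ threshold", and the choice to return 0 for a zero denominator are all assumptions, not part of the original page.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, TN, FN, predicting positive when score >= threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero (an assumed convention)."""
    return num / den if den else 0.0


def metrics(tp, fp, tn, fn):
    """Evaluate every row of the metric table from the four confusion counts."""
    n = tp + fp + tn + fn
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "tpr": recall,
        "fpr": safe_div(fp, fp + tn),
        "precision": precision,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(tn, tn + fp),
        "accuracy": safe_div(tp + tn, n),
        "error_rate": safe_div(fp + fn, n),
    }


# Hypothetical toy data: 1 = actual positive, 0 = actual negative.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

counts = confusion_counts(y_true, scores, threshold=0.5)
result = metrics(*counts)
for name, value in result.items():
    print(f"{name:12s} {value:.3f}")
```

On this toy data the identities from the table hold: specificity equals 1 − FPR and error rate equals 1 − accuracy.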

03 Threshold sweep

Precision · Recall · F1 vs threshold
ROC curve
Precision–Recall curve
Why precision and recall trade off

Both share TP in the numerator, but their denominators behave differently as you move the threshold. Recall divides by the number of actual positives (TP + FN), which is fixed; precision divides by the number of predicted positives (TP + FP), which grows or shrinks with the threshold.

Lower the threshold → more positives predicted → recall rises (or stays flat), precision typically falls (more false alarms).

Raise the threshold → fewer positives predicted → precision typically rises, recall falls (more misses).

Predict positive on everything and recall is perfect, but precision collapses to the base rate of positives. Predict positive only when certain and precision is high, but you miss many true positives. The F1 score and the PR curve make this trade-off explicit so you can pick the operating point that fits your cost structure.
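
The trade-off can be seen in a small threshold sweep. This is a minimal sketch under assumptions: the helper `precision_recall_at`, the toy labels and scores, and the convention that precision is 1.0 when nothing is flagged are all made up for illustration.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    # Assumed convention: precision is 1.0 when no item is flagged at all.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data with a 50% positive base rate.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25]

sweep = {}
for threshold in (0.9, 0.7, 0.5, 0.3, 0.0):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    sweep[threshold] = (precision, recall, f1)
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  "
          f"recall={recall:.3f}  f1={f1:.3f}")
```

As the threshold drops from 0.9 to 0.0 on this data, recall climbs from 0.25 to 1.0 while precision slides from 1.0 down to the base rate of 0.5. Precision is not guaranteed to move monotonically on real data; recall is.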

04 Youden's J: optimal threshold

Youden's J vs threshold
$J(\tau) = \mathrm{TPR}(\tau) + \mathrm{TNR}(\tau) - 1 = \mathrm{TPR}(\tau) - \mathrm{FPR}(\tau)$
Why Youden's J gives the optimal threshold

$J(\tau) = \mathrm{TPR}(\tau) - \mathrm{FPR}(\tau)$ measures how much better the classifier is than random at a given cut-off. A random classifier has $J = 0$; a perfect one has $J = 1$.

The Youden threshold $\tau^\star = \arg\max_\tau J(\tau)$ maximises the vertical distance between the ROC curve and the diagonal chance line — it is the point on the ROC curve furthest from the no-skill baseline.

This criterion implicitly weights sensitivity and specificity equally. If a false negative is much costlier than a false positive (e.g. cancer screening), you may prefer a lower threshold than $\tau^\star$ even though $J$ is slightly smaller there.
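
One way to locate $\tau^\star$ is a brute-force scan over the distinct scores, sketched below. The function name `youden_threshold`, the toy data, the "score ≥ threshold" rule and the tie-breaking choice (keep the highest threshold among equal $J$ values) are assumptions for illustration; the sketch also assumes both classes are present.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximising J = TPR - FPR over the distinct scores.

    Assumes y_true holds 0/1 labels and contains at least one of each class.
    Ties in J are resolved in favour of the highest threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, float("-inf")
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Hypothetical toy data: 4 positives, 4 negatives.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.2, 0.1]

best_threshold, best_j = youden_threshold(y_true, scores)
print(best_threshold, best_j)
```

On this data the scan settles on a threshold of 0.7 with $J = 0.75$ (TPR 0.75, FPR 0). A cost-sensitive variant would weight the two terms unequally instead of maximising their plain difference.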

05 Low false-alarm regime

Partial AUC — ROC restricted to FPR ≤ α
$\mathrm{pAUC}(\alpha) = \int_{0}^{\alpha} \mathrm{TPR}\bigl(\mathrm{FPR}^{-1}(u)\bigr)\,du$
Success @ K — top-K precision & recall
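$\mathrm{P@K} = \tfrac{\mathrm{TP}_K}{K}, \quad \mathrm{R@K} = \tfrac{\mathrm{TP}_K}{P}$  ·  rank items by score, keep the top $K$; $\mathrm{TP}_K$ counts the actual positives among them and $P$ is the total number of actual positives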
Why low-false-alarm metrics matter

The full AUC weighs every operating point equally — but many production systems can only tolerate a tiny false-alarm budget. Fraud alerts, medical screening, content moderation queues: every flag costs a human review, so only the very low-FPR slice of the ROC is operationally relevant. Two classifiers with identical AUC can behave very differently in that slice.

Partial AUC $\mathrm{pAUC}(\alpha)$ integrates the ROC only over $\mathrm{FPR} \in [0, \alpha]$, the shaded strip on the left. The normalised value $\mathrm{pAUC}/\alpha$ rescales it back to $[0, 1]$ so you can set it beside the full AUC, with one caveat: a random classifier scores $\alpha/2$ on this scale rather than $0.5$. A model that earns its AUC out in the high-FPR region (where you'd never operate) gets penalised here.
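
A rough sketch of the computation follows. It assumes the ROC is built from the distinct scores and integrated with the trapezoidal rule, cutting the last segment at $\mathrm{FPR} = \alpha$ by linear interpolation. The names `roc_points` and `partial_auc` and the toy data are hypothetical.

```python
def roc_points(y_true, scores):
    """ROC points (FPR, TPR), from the strictest threshold to the loosest.

    Starts at (0, 0); assumes 0/1 labels with both classes present.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def partial_auc(points, alpha):
    """Trapezoidal area under the ROC for FPR in [0, alpha]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= alpha:
            break
        if x1 > alpha:
            # Cut the segment at FPR = alpha using linear interpolation.
            y1 = y0 + (y1 - y0) * (alpha - x0) / (x1 - x0)
            x1 = alpha
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data: 4 positives, 4 negatives.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.2, 0.1]

points = roc_points(y_true, scores)
full_auc = partial_auc(points, alpha=1.0)
pauc = partial_auc(points, alpha=0.1)
normalised = pauc / 0.1
print(full_auc, pauc, normalised)
```

Here the full AUC is 0.875, while the normalised partial AUC at $\alpha = 0.1$ is about 0.75, because the classifier reaches only 75% TPR before its first false positive.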

Success@K reframes the same idea from the analyst's seat: "of the top $K$ items the model flagged, how many were real?" (precision) and "how many of all the positives did we catch?" (recall). The marker tracks the current threshold's $K$ — slide the probability cut-off to see how the candidate list grows.
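
The two ratios, and the link between a probability cut-off and its $K$, can be sketched as follows. The helper `precision_recall_at_k` and the toy data are assumptions; tied scores keep their input order here, which is an arbitrary choice, and the sketch assumes $K \ge 1$ and at least one actual positive.

```python
def precision_recall_at_k(y_true, scores, k):
    """P@K and R@K after ranking by descending score and keeping the top k.

    Assumes 0/1 labels, k >= 1 and at least one actual positive.
    Python's sort is stable, so tied scores keep their input order.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    tp_k = sum(label for _, label in ranked[:k])
    total_positives = sum(y_true)
    return tp_k / k, tp_k / total_positives


# Hypothetical toy data: 4 positives, 4 negatives.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.2, 0.1]

p_at_3, r_at_3 = precision_recall_at_k(y_true, scores, k=3)

# A probability cut-off implies a K: the number of items scoring at or above it.
threshold = 0.5
k_at_threshold = sum(s >= threshold for s in scores)
p_at_k, r_at_k = precision_recall_at_k(y_true, scores, k_at_threshold)
print(p_at_3, r_at_3, k_at_threshold, p_at_k, r_at_k)
```

With $K = 3$ every flagged item is a real positive (P@3 = 1.0, R@3 = 0.75). Lowering the cut-off to 0.5 grows the list to $K = 5$ and drops precision to 0.6 without catching any additional positives.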