Calibration (predicted vs. observed)

Module 8 · Risk Tool Lessons

Ranking well is not enough.

A risk assessment tool can rank people correctly and still produce inaccurate probabilities. That is where calibration comes in.

Key takeaway

A tool can rank people well and still produce misleading probabilities.

What Calibration Means

Calibration asks a different question than AUC:

The idea: Do predicted probabilities actually match observed outcomes?

In a well-calibrated tool, predicted risk closely matches what actually happens. For example, a predicted 30% risk corresponds to about 30% observed recidivism.

Illustration

Calibration compares prediction to reality

A well-calibrated tool produces predicted probabilities that closely match observed outcomes.

Examples of well-calibrated, under-predicting, and over-predicting risk assessment tools

Calibration asks whether predicted risk matches what actually happens across different score levels.

How Tools Mis-Calibrate

Tools can mis-calibrate in different ways:

Under-prediction → actual outcomes are higher than predicted
Over-prediction → predicted probabilities are too high
Mis-specified slopes → errors change across risk levels

Why This Matters

Decisions are often based on probabilities, not just rankings
Poor calibration can distort supervision and classification decisions
Tools may calibrate differently across populations and jurisdictions

Why Recalibration Matters

If a state or agency has adopted a tool developed elsewhere, has its calibration been evaluated?

Tools may perform differently when moved to new settings, populations, or jurisdictions.

Bottom Line

A tool can rank people well and still produce misleading probabilities. Calibration asks whether predicted risk actually matches observed outcomes.

← Back to Modules