What is a good forecast?
Forecasts are like friends: Trust is the most important factor (you don’t ever want your friends to lie to you), but among your trustworthy friends, you prefer meeting those who tell you the most interesting stories.
What do I mean by this metaphor? We want forecasts to be “good,” “accurate,” and “precise.” But what do we mean by that? Let’s sharpen our thoughts to better articulate and visualize what we want from a forecast. There are two independent ways in which forecast quality can be measured, and you need to consider both – calibration and sharpness – to get a satisfactory understanding of your forecast’s performance.
Forecast calibration
For simplicity, let’s start with binary classification: The forecasted outcome can only take two values, “true or false,” “0 or 1,” or similar.
To be more concrete, let’s consider emails, and whether they will be tagged as spam by the mailbox user. A predictive system produces, for each email, a percentage probability that this email would be considered spam by the user (which we take to be the ground truth). Above a certain threshold, say 95%, the email then ends up in the spam folder.
To evaluate this system, you can first check the calibration of the forecast: For those emails that are assigned an 80% spam-probability, the fraction of true spam should be around 80% (or at least not differ from it in a statistically significant way). For those emails that were assigned a 5% spam-probability, the fraction of true spam should be around 5%, and so forth. If this is the case, we can trust the forecast: An alleged 5% probability is indeed a 5% probability.
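As a minimal sketch of such a calibration check (the forecasts and labels below are hypothetical, and binning into ten probability buckets is just one reasonable choice), one can group emails by their predicted spam probability and compare, per bucket, what the forecast claims with what actually happened:

```python
import numpy as np

def calibration_table(predicted_prob, is_spam, n_bins=10):
    """Group forecasts into probability bins and compare the mean
    predicted spam probability with the observed spam fraction."""
    predicted_prob = np.asarray(predicted_prob, dtype=float)
    is_spam = np.asarray(is_spam, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # np.digitize assigns each forecast to a bin; the clip keeps p = 1.0 in the last bin
    bin_ids = np.clip(np.digitize(predicted_prob, bins) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.sum() == 0:
            continue
        rows.append((bins[b], bins[b + 1],
                     predicted_prob[mask].mean(),  # what the forecast claims
                     is_spam[mask].mean(),         # what actually happened
                     int(mask.sum())))
    return rows

# Hypothetical example: forecasts for six emails and the user's spam labels.
# A real check needs many emails per bin to be statistically meaningful.
forecasts = [0.02, 0.05, 0.81, 0.79, 0.97, 0.99]
labels    = [0,    0,    1,    1,    1,    1   ]
for lo, hi, claimed, observed, n in calibration_table(forecasts, labels):
    print(f"[{lo:.1f}, {hi:.1f}): claimed {claimed:.2f}, observed {observed:.2f}, n={n}")
```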
A calibrated forecast allows us to make strategic decisions: For instance, we can set the spam-folder threshold appropriately, and we can estimate the number of false positives and false negatives upfront (it’s unavoidable that some spam makes it to the inbox, and some important emails end up in the spam folder).
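A sketch of that upfront estimate, assuming the probabilities are calibrated (the numbers and the 95% threshold are illustrative): the expected false positives are the summed non-spam probabilities of the emails sent to the spam folder, and the expected false negatives are the summed spam probabilities of the emails kept in the inbox.

```python
import numpy as np

def expected_errors(predicted_prob, threshold=0.95):
    """For a calibrated forecast, estimate the expected error counts:
    - false positives: non-spam emails that land in the spam folder,
      i.e. the sum of (1 - p) over emails with p >= threshold;
    - false negatives: spam emails that reach the inbox,
      i.e. the sum of p over emails with p < threshold."""
    p = np.asarray(predicted_prob, dtype=float)
    to_spam_folder = p >= threshold
    expected_fp = np.sum(1.0 - p[to_spam_folder])
    expected_fn = np.sum(p[~to_spam_folder])
    return expected_fp, expected_fn

# Hypothetical batch of forecasts
probs = [0.02, 0.05, 0.81, 0.97, 0.99]
fp, fn = expected_errors(probs, threshold=0.95)
print(f"expected false positives: {fp:.2f}, expected false negatives: {fn:.2f}")
```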
Forecast sharpness
Is calibration all there is to forecast quality? Not quite! Imagine a forecast that assigns the overall spam probability – 85% – to every email. That forecast is well calibrated, since 85% of all emails are spam or otherwise malicious. You can trust that forecast; it’s not lying to you – but it is quite useless: You can’t base any useful decision on the trivial, repeated statement “the probability that this email is spam is 85%.”
A helpful forecast is one that assigns very different probabilities to different emails – 0.1% spam probability to the email from your boss, 99.9% to dubious pharmaceutical ads – while remaining calibrated. This usefulness property is called sharpness by statisticians, as it refers to the width of the predictive distribution of outcomes, given a forecast: The narrower, the sharper.
An unindividualized forecast that always produces the spam probability 85% is maximally unsharp. Maximal sharpness, in turn, means that the spam-filter assigns only 0% or 100% spam probability to every email. This maximal degree of sharpness – determinism – is desirable, but it’s unrealistic: Such a forecast will (very probably) not be calibrated; some emails marked with 0% spam probability will turn out to be spam, and some emails marked with 100% spam probability will turn out to be from your significant other.
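To make the comparison concrete, here is a small sketch that uses the average width (variance) of the Bernoulli predictive distributions, p · (1 − p), as a sharpness measure – one possible choice, not the only one – with hypothetical forecasts: the lower the average width, the sharper the forecast.

```python
import numpy as np

def mean_predictive_width(predicted_prob):
    """Average variance p * (1 - p) of the Bernoulli predictive distributions.
    Lower means sharper; 0 would mean a fully deterministic (0%/100%) forecast."""
    p = np.asarray(predicted_prob, dtype=float)
    return float(np.mean(p * (1.0 - p)))

# The unindividualized forecast: the overall spam rate for every email
constant_forecast = [0.85] * 6
# A forecast that separates the clear cases (and would still need to be calibrated)
individual_forecast = [0.001, 0.01, 0.95, 0.98, 0.999, 0.999]

print(mean_predictive_width(constant_forecast))    # 0.1275: the same width for every email
print(mean_predictive_width(individual_forecast))  # much smaller, i.e. much sharper
```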
What’s the best forecast then? We don’t want to give up on trust, so the forecast needs to remain calibrated, but among the calibrated forecasts, we want the sharpest one. This is the paradigm of probabilistic forecasting, formulated by Gneiting, Balabdaoui and Raftery in 2007 (J. R. Statist. Soc. B 69, Part 2, pp. 243–268): Maximize sharpness, but don’t jeopardize calibration. Make the strongest possible statement, provided it remains true. As with our friends: tell me the most interesting story, but don’t lie to me. For a spam-filter, the sharpest calibrated forecast assigns values like 1% to the quite-clearly-not-spam emails, 99% to the quite-clearly-spam emails, and some intermediate value to the difficult-to-decide cases (of which there should not be too many).





