# How to measure time calibration skill?

I’m not actually using TagTime for this, but @dreev suggested I post in this category anyways, since it’s the closest to relevant.

I’m working on my time estimation calibration, so I’m recording data in the form of task,low_estimate,high_estimate,actual. For example, this morning I ironed a shirt, for which my data was

iron shirt,0:07:00,0:08:00,0:06:38


I want to capture some notion of both precision and accuracy in my estimation, to see if I can improve at it over time given practice and recording. What’s the right math to do on this data to capture those? Actually, first of all, am I even measuring the right data? I can get some notion of “percent error” by looking at how far the “actual” time is from the nearest endpoint, but I don’t really know what to do if/when the actual time is inside my range. I could compare to the average of the endpoints, but I want to “score better” in some meaningful sense for being close with a tight range: I could estimate that all tasks will take between 1 and 6,000 minutes, and always be inside my range, but that’s obviously a garbage estimation. Even if the task did take exactly 3,000.5 minutes, I should have some sort of mathematical way to say that my later estimation, that the task would take between 3,000 and 3,001 minutes, was better in a meaningful sense than the first estimate.

I’ve read Evidence Based Scheduling – Joel on Software, wherein he does a similar thing, but the “estimate” there is a single point. I could certainly do that, especially if it makes analysis significantly easier, but again I feel like there’s value in learning to tighten an estimate range over time? But maybe I’m wrong, or rather that recording that isn’t helpful, because getting better in general will also come with a natural range-tightening, and actually most bosses (e.g.) want a single point-in-time estimate, not a range, so maybe calibrating my point-estimate is more useful?


Love this. my first thought is to be explicit about what that confidence interval is. i recommend an 80% confidence interval. so you should be 80% sure that the actual time falls between low and high.

then the simplest calibration exercise is to make sure you’re actually correct 80% of the time.
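A minimal sketch of that check in Python, assuming rows in the `task,low_estimate,high_estimate,actual` format from the first post with durations written as `H:MM:SS` and no commas in the task name. The function names are made up for illustration, and every row except the shirt one is invented.

```python
def parse_duration(text):
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def hit_rate(rows):
    """Fraction of rows whose actual time falls inside [low, high]."""
    hits = 0
    for row in rows:
        _task, low, high, actual = row.split(",")
        if parse_duration(low) <= parse_duration(actual) <= parse_duration(high):
            hits += 1
    return hits / len(rows)


# Only the first row comes from the thread; the rest are invented examples.
records = [
    "iron shirt,0:07:00,0:08:00,0:06:38",
    "wash dishes,0:10:00,0:15:00,0:12:30",
    "write email,0:05:00,0:12:00,0:09:10",
    "fix bike tire,0:20:00,0:40:00,0:31:00",
    "grocery run,0:30:00,0:50:00,0:44:00",
]

print(hit_rate(records))  # 0.8 (the shirt is the one miss)
```

With 80% intervals the target is a hit rate near 0.8 over many tasks; noticeably higher suggests the ranges are too wide, noticeably lower that they are too narrow.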


Hmm, that makes sense, and there’s clear value there (and it’s certainly easy to do in a spreadsheet), but does that mean that if I estimate “6-7 minutes” and the actual time is “7:01 minutes”, that’s exactly as much of a “fail” as if the actual time was “15 hours”? I feel like there ought to be some elegant way to capture all that in one calibration, or at least have two different measures that I’m looking at. Maybe testing actual-calibration in one report (where that 7:01 would be counted as a failure) and also calculating some sort of error-score, maybe assuming that the time-to-do-the-task is some sort of normal-ish distribution described by my bounds, then seeing how many st.dev. away the actual time would be? That’s, obviously, extraordinarily vague, but hopefully if that’s in the shape of a good idea you can help me fill it in.

The important insight here is that you want a wider estimate range to count against you if you fall within it, and for you if you fall outside it.

If you claimed to be super-confident about how long it would take you, but you were wrong, that’s way worse than if you said upfront that you weren’t so sure. And if you’re confident and correct, that should count more in your favor than if you were vague and correct.

This means you’re probably going to need some sort of branching logic in your scoring function. You don’t want to be penalizing yourself for giving yourself a narrow target if you meet it, but it should count against you if you miss it. (And vice versa for a wide target range, of course.)


Hmm. You certainly might be right, but that doesn’t make intuitive sense, because it’s gameable: “I think every task will take between four days and eight years—oh, wow, it only took 25 minutes, I’m so good at time estimation!”

“Confident and correct… should count more [than] vague and correct” is definitely intuitively right, but I don’t think actively rewarding vague-but-wrong makes sense to me as a way to achieve that? Or actively penalizing vague-but-right, really.


I don’t think we disagree, I think I just didn’t explain myself well enough.

I think you should lose points for that, but less than if you hadn’t admitted your uncertainty. You should only gain points if you’re right.

Perhaps that’s the thing I didn’t explain properly: I’m not saying you should ever give yourself a positive score when wrong, or a negative score when right. Rather, the magnitude of your positive score for being correct should be higher the tighter the range you predicted, and vice versa for your negative score for being wrong.

(Hm, perhaps I should have spelled out explicitly that you need to get a negative score for being wrong, instead of just the lack of a positive score (or a lower positive score). For much the same reasons you describe, if you gave yourself a positive score even when you are wrong you could easily game the system, if only by making a whole lot of nonsense predictions.)

I did not at all mean you should actively reward vague-but-wrong in an absolute sense, only relative to precise-but-wrong. (And likewise vague-but-right should be penalized relative to precise-and-right, not penalized in an absolute sense.)


Ah ha, yes, definitely! We are in agreement.

I feel like there should be some nice math-y way to smoothly adjust “goodness points” based on interval size/correctness, rather than having to set some arbitrary “tightness” threshold: something like $\frac{1}{\textrm{delta in minutes}}$ points if right should work nicely (maybe times 100 and floored for aesthetic reasons). Probably stick some exponential or log in there to capture that being right at one-minute scale is way better than being right at 10-minute scale, which is somewhat better than being right at 100-minute scale, which is barely better than being right at 1000-minute scale. Advice welcome.

If-wrong is fiddly, since I want to capture both size-of-interval and amount-of-wrongness: estimating (7:00, 8:00) for actual 6:59 is less wrong than estimating (7:00, 8:00) for actual 16:00, but also less wrong than estimating (7:00, 99:00) for actual 6:59. Maybe for an estimated interval with low duration $a$ and high duration $b$ (in minutes) and actual duration $\alpha$, the wrongness points could be something like $1-\frac{1}{(b-a)\,\lvert\textrm{avg}(a, b) - \alpha\rvert}$? I’m not sure multiplying is the right thing there. Or if this is even remotely reasonable. I haven’t plugged in many examples to make sure this looks sensible yet; I’m just shooting from the hip and hoping that I’m close enough to right that someone else will come in with more rigorous ideas.

I’d like to find some clean way to combine these such that I have a single formula for Correctness Points that applies to right estimates and wrong estimates alike.


In case I don’t manage to respond more fully, let me chime in that “proper scoring rule” is a relevant search term for this question. I think you want to turn your [low, high] interval into an implicit full probability distribution over the possible outcomes. Then you can apply a standard scoring rule and that gives you the goodness function you’re looking for.

I think that elegantly captures all the desiderata – making the interval as tight as possible, centered at your best estimate of the true duration, etc.

(On the implicit distribution: probably by making it a normal distribution where “low” and “high” are plus and minus however many standard deviations it takes to make [low, high] an 80% confidence interval.)


Oh, I love that, and then the number of standard deviations between the actual result and the mean, assuming a normal distribution characterized by my estimates for all possible task completion times, could be the score on its own! I guess its absolute value.
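A rough sketch of that standard-deviations idea, assuming [low, high] is read as a symmetric 80% interval of a normal distribution, so the endpoints sit at the 10% and 90% quantiles. `Z80` and `z_score` are names invented here, and times are in seconds.

```python
from statistics import NormalDist

# Number of standard deviations from the mean to the 90% quantile (about 1.2816).
Z80 = NormalDist().inv_cdf(0.9)


def z_score(low, high, actual):
    """Standard deviations between `actual` and the midpoint of [low, high]."""
    mean = (low + high) / 2
    stdev = (high - low) / (2 * Z80)
    return (actual - mean) / stdev


# The shirt example: estimated 7:00 to 8:00, actually took 6:38.
print(round(z_score(420, 480, 398), 2))  # -2.22
```

Taking `abs()` of the result gives a "distance from perfect" where 0 is best. Note that this number alone can be gamed with very wide intervals, since a huge range shrinks every z-score toward 0.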

…actually a normal distribution is obviously wrong since there’s a hard cut at 0. Some quick googling indicates I might want some flavor of gamma distribution.

I’m not sure exactly how to do this (I’ll have to sit down with a whiteboard, a Wolfram|Alpha page, and maybe @poisson), but it appears that I should be able to define a gamma distribution for any time estimate range such that that range covers 80% of the distribution, then determine the whatever-the-gamma-equivalent-of-standard-deviation for the actual time spent, and use that as a goodness score, aiming for 0.

It occurs to me that we could avoid the “tasks can’t take negative time” problem the same way that most statisticians solve the “heights can’t be negative” problem: use units small enough that the values you run into in the real world occur with only ε frequency near the discontinuities: that is to say, I suspect no human with a height of less than, for example, two centimeters, has ever been recorded.

tl;dr if we measure in seconds or ms a normal distribution is probably fine and also much easier to think about than a gamma distribution and also also probably a reasonable distribution for task time completion to actually occur in for just about any given task.

Edit: wait, no, I’m wrong. About changing units, at least. But pretending a normal distribution works is probably still a good start because tasks really close to zero time aren’t going to happen very often. Probably. Or, rather, if I think it’ll take more time to record the estimation than to do the task, I probably won’t bother.

Ok, so I don’t know exactly what the “best” thing to do is, but here are some rambling bullshit thoughts.

Consider some task that we must do. For example, “solve this puzzle”. Solving the puzzle requires having a sequence of K insights, and we don’t really know how long each will take to happen; perhaps each comes spontaneously with a probability of y per unit time. The distribution of waiting times for such a process is a gamma distribution with shape parameter K and scale parameter 1/y.

However, let’s model it as being log-normal distributed. This maintains our desired qualitative shape but I think it’s easier to make something out of it because it’s easier to relate the confidence interval to the implicit parameters of the distribution. If you are willing to guess the mean and standard deviation instead of strictly guessing the 80% confidence interval, then the gamma distribution becomes easier to assume.

As Danny said, we should think our prediction is implicitly trying to guess the parameters of the distribution for the given task. We want to at the very least ensure that the optimal guess is to correctly guess the parameters of the distribution.

One puzzle, from this perspective - suppose a task is genuinely hard to guess, i.e. it genuinely has a large variance in the amount of time it will take to perform an identical task. Unfortunately, since we don’t know whether or not tasks are genuinely “identical”, it is hard to correct for this - most likely, we will end up picking a scoring rule that just gives low scores for a vague prediction, even if the task is “genuinely” hard to predict.

Anyway, let’s say you believe this log-normal shit. Then if you make a prediction of [60 min, 80 min], you are predicting that the (natural) log of the time (in minutes) is between 4.09 and 4.38. As is convention, use μ, σ to denote the mean and standard deviation of the log. Let’s take this, according to Danny’s suggestion, to be a guess of the 80% confidence interval, which is roughly μ ± 1.282σ. So we are guessing that the log of time has a mean of 4.238 and a standard deviation of 0.11.
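To check that arithmetic, here is a small sketch that turns an interval into the implied μ and σ of the log, assuming the interval is a central confidence interval and using the exact normal quantile (about 1.2816 for 80%). `lognormal_params` is a name made up for this example.

```python
import math
from statistics import NormalDist


def lognormal_params(low, high, confidence=0.80):
    """Mean and standard deviation of log(time) implied by a central interval."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * z)
    return mu, sigma


mu, sigma = lognormal_params(60, 80)
print(round(mu, 3), round(sigma, 3))  # 4.238 0.112
```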

Following Danny’s suggestion I looked up “proper scoring rules”. It seems like one such rule is the log of the density assigned to the observed value. I am not really sure of the relative merits of different scoring rules but let’s try that. I should reemphasize this so the reader doesn’t just get absorbed in the equations. The scoring rule we are using is - convert the guess into the implied guess about the probability distribution. The idea is that the score is assigned based on how high a probability you assigned to the measured time - thus, if the distribution you assign is too broad, you can’t get ANY POINTS because you assign such low probability to any individual time. This is how it conforms to what we want.

The implied PDF is

$$p(T) = \frac{1}{\sqrt{2 \pi} \sigma T} \exp\left(-\frac{\left(\log(T) - \mu \right)^2}{2 \sigma^2}\right).$$

So if we observe a time T, the score is

$$S = - \log{(\sqrt{2 \pi} \sigma T)} - \frac{\left(\log(T) - \mu \right)^2}{2 \sigma^2}.$$

If we call the interval [a,b], then again in the above expression

$$\mu =\frac{ \log{a} + \log{b}}{2}$$

and

$$\sigma \approx \frac{ \log{b} - \log{a} }{2.56}.$$
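Here is a sketch of that score as a function, assuming [low, high] is an 80% interval and all times share one unit (minutes below). The function name and the example intervals are illustrative, not something from the thread.

```python
import math
from statistics import NormalDist

# Standard deviations from the mean to the 90% quantile (about 1.2816).
Z80 = NormalDist().inv_cdf(0.9)


def lognormal_log_score(low, high, actual):
    """Log of the log-normal density at `actual`, given an 80% interval."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z80)
    z = (math.log(actual) - mu) / sigma
    return -math.log(math.sqrt(2 * math.pi) * sigma * actual) - z * z / 2


tight = lognormal_log_score(60, 80, 70)     # confident and right, about -2.98
vague = lognormal_log_score(1, 6000, 70)    # vague and right, about -6.39
missed = lognormal_log_score(60, 80, 120)   # confident and wrong, about -15.5
print(tight > vague > missed)  # True
```

The ordering matches what was asked for earlier in the thread: a tight correct range beats a garbage-wide one, and a confident miss is punished hardest.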

If we want to use the gamma distribution as our implied distribution instead, let’s instead make our guess for the mean $\langle T \rangle$ and the standard deviation $\sigma_T$.

If we do so, then the implied estimate of the parameters of the distribution is $\theta = \sigma^2_T/ \langle T \rangle$ and $k = \langle T \rangle^2 / \sigma_T^2$. (Look at the last entry of the table in the Wikipedia article on the gamma distribution.)

Thus, the logarithmic scoring rule gives

$$S = - \frac{T \langle T \rangle}{\sigma_T^2} + \left(\frac{\langle T \rangle^2}{\sigma_T^2} -1 \right) \log{T} - \frac{\langle T \rangle^2}{\sigma_T^2} \log{\left( \frac{\sigma_T^2}{\langle T \rangle} \right)} - \log{\Gamma{\left(\frac{\langle T \rangle^2 }{ \sigma_T^2}\right)}}.$$
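A sketch of the gamma version, written directly as the log of the standard gamma density with shape k and scale θ, where k and θ come from the guessed mean and standard deviation as above. The function name and the example numbers are assumptions for illustration.

```python
import math


def gamma_log_score(mean, stdev, actual):
    """Log of the gamma density at `actual`, given a guessed mean and stdev."""
    k = mean**2 / stdev**2     # shape
    theta = stdev**2 / mean    # scale
    return ((k - 1) * math.log(actual) - actual / theta
            - k * math.log(theta) - math.lgamma(k))


sharp = gamma_log_score(70, 8, 72)    # guessed 70 +/- 8 minutes, took 72
vague = gamma_log_score(70, 40, 72)   # guessed 70 +/- 40 minutes, took 72
print(sharp > vague)  # True
```

`math.lgamma` computes log Γ directly, which avoids overflow when the shape parameter is large (tight guesses give large k).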

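One unfortunate feature of this log scoring rule is that the numbers are almost always negative. With discrete probabilities the best you can do is get 0, if you guess with 100% confidence and get it right; with a density like the ones above, the score is positive only when the density at the observed time exceeds 1, which takes a very tight interval. Anyway this is all both low-confidence and poorly explained, so lemme know what you think.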

EDIT: If you want positive numbers, the “spherical scoring rule” $p(T) / \sqrt{\int_{0}^{\infty} p^2(t)\,dt}$ may be better. I need to do other stuff for now but I am 99% sure we can get a closed-form expression for that for the gamma distribution. Less sure for the log-normal distribution.
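For what it’s worth, the log-normal case does seem to have a closed form. Substituting u = log t and completing the square gives ∫ p²(t) dt = exp(σ²/4 − μ) / (2σ√π). That is a derivation made for this sketch, not something from the linked paper, so treat it as an assumption to verify. The function names are invented, and [low, high] is again read as an 80% interval.

```python
import math
from statistics import NormalDist

Z80 = NormalDist().inv_cdf(0.9)  # about 1.2816


def lognormal_pdf(t, mu, sigma):
    """Log-normal density at t for log-mean mu and log-stdev sigma."""
    z = (math.log(t) - mu) / sigma
    return math.exp(-z * z / 2) / (t * sigma * math.sqrt(2 * math.pi))


def lognormal_sq_norm(mu, sigma):
    """Closed form for the integral of p(t)^2 over t > 0 (derived, see above)."""
    return math.exp(sigma**2 / 4 - mu) / (2 * sigma * math.sqrt(math.pi))


def spherical_score(low, high, actual):
    """Spherical score p(actual) / sqrt(integral of p^2); always positive."""
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z80)
    return lognormal_pdf(actual, mu, sigma) / math.sqrt(lognormal_sq_norm(mu, sigma))


print(spherical_score(60, 80, 70) > spherical_score(1, 6000, 70))  # True
```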

EDIT2: https://www.jstor.org/stable/2629907?seq=3#metadata_info_tab_contents was the source for scoring rules

EDIT3: If you really want to both use the gamma distribution and to interpret the low / high values as where the CDF is 0.1 and 0.9, then I think we could work out a way to get the score via solving the system of 2 equations CDF(a, k, θ) = 0.1 and CDF(b, k, θ) = 0.9 for k, θ on each individual case. But this sounds annoying to implement except in a real programming language or Mathematica.

EDIT4: Why might a log-normal distribution be on the right track? Well, I think it satisfies our intuitive notion that the scale of the prediction is irrelevant. So if you predict [2 min, 4 min] and the real time is 5 min, you land at the same place relative to your distribution as if you predict [2 hour, 4 hour] and the real time is 5 hour. (In log-space, this is just an overall translation of the distribution and of the specific place we are looking on the T-axis to find our predicted P(T), so the density of log T is identical. The score S above differs between the two cases only through the −log T term, i.e. by log 60 if both are measured in minutes.)


Comment regarding what @rperce called amount-of-wrongness (AOW) and @dreev’s confidence intervals (CIs) (zoomed out from the precise calculations):

When you work with a confidence level of 80 %, all your estimates should aim to have an 80 % chance of hitting and a 20 % chance of missing. If you are within your estimate 95 % of the time, that’s bad. It means your estimation intervals are too large (for a confidence level of 80 % - they actually capture 95 % of the outcomes). This is easy to understand but tricky to really “feel” because we always try to aim for “perfect” scores. But here, good means being close to your desired confidence level: 75 % is better than 95 %.

While CIs disregard AOW in the single estimate (you’re either within or without your estimate), they imply a notion of AOW. It’s implied in the average relative width [1] of your time estimate that you need to be well-calibrated. Let’s say you start out needing a 100 % relative width [2] to be perfectly-calibrated and one year later you only need a 50 % relative width [3] to be perfectly-calibrated, then your AOW got better (lower, less wrongness).

1. Example: true duration = 10, estimate range = 6 to 12 --> relative width = 60 %

2. i.e. for a task of 30 minutes your average estimate range is 30 minutes

3. i.e. for a task of 30 minutes your average estimate range is 15 minutes
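The relative-width measure above can be sketched in a few lines, assuming each record is a `(low, high, actual)` tuple in the same time unit and that relative width means (high − low) / actual, as in the first footnote. The function name is made up for this example.

```python
def mean_relative_width(records):
    """Average of (high - low) / actual over (low, high, actual) tuples."""
    widths = [(high - low) / actual for low, high, actual in records]
    return sum(widths) / len(widths)


# The footnote example: true duration 10, estimate range 6 to 12.
print(mean_relative_width([(6, 12, 10)]))  # 0.6
```

Tracked next to the hit rate, this gives the two numbers the comment describes: stay near 80 % hits while the average relative width shrinks over time.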


I think this is a brilliant point by @howtodowtle and in the spirit of MVPs, it’s probably worth focusing on a version with just confidence intervals and not going down the rabbit hole of inferring whole probability distributions and whatnot.

Though that’s extremely fun and we should totally continue that as well, for the sheer academic joy of it (and eventual pragmatic use too! not that I’m volunteering to build anything but would love to guinea-pig things like this).

Scattered notes from talking to @poisson and @rperce that may be subsumed by @poisson’s latest edits above:

1. i think for my own estimating i’d rather give my 80% confidence interval. that’s meaningful to me and useful in itself. it makes the calibration exercise straightforward, which is maybe the most important thing.
2. i think we can make an assumption about the shape of the distribution – gamma or lognormal or truncated normal – and it should be straightforward to compute the distribution parameters from the 80% confidence interval.
3. if you want to use the log scoring rule and don’t want negative scores, just add a big constant.
4. i think the brier score is a bounded one, and still proper. but i think it’s ok to not be bounded. main thing is that you never assign probability exactly zero or exactly one to anything. you deserve your negative infinity score if you do that.
5. “proper” means that you maximize your score by providing your true probability distribution. so that’s a key property.
6. i recommend either (a) log score with a big enough constant added so scores are always positive, and where the probability distribution has support over all possible outcomes $[0, \infty)$ so there’s no worries about infinite scores. or (b) brier score: see the Wikipedia article on the Brier score (but see notes from @poisson about generalizing to continuous probability distributions)
7. sky-pie: let the user literally draw the distribution, maybe like with the UI metaculus uses. (after all, sometimes your true estimate is “this will either take 5 minutes or a couple weeks”). oh man their UI is pretty: Date Carbon Capture Costs <$50 Per Ton | Metaculus

New thing on LessWrong that sounds pretty relevant to this question:
