Laws of Succession
Rajiv Prabhakar presents us with a hypothetical:
> You and your friend are walking by a magic store and find a trick coin. You toss
> it 14 times and end up with 10 heads. Your friend thinks at least one of the
> next two tosses will end up tails, and is willing to offer you $10 in an
> even-money bet on it. Should you take him up?
This is a fancy way of asking,
> Is the probability of two heads on the next two tosses greater than 50 %?
Rajiv claims to look at this from a frequentist perspective, arriving at a 51 %
probability, and then a Bayesian perspective, arriving at 48.5 %. Therein lies
the puzzle: one says “take the bet!” and the other says “ask for better odds!”
How can statistics betray us like that in our time of greatest need?
Fortunately, the apparent contradiction is based on flawed analysis.
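For reference, and with the caveat that Rajiv’s exact working is not reproduced
here: the frequentist figure presumably comes from squaring the observed
frequency of heads,
\[\left(\frac{10}{14}\right)^2 \approx 0.51,\]
while the Bayesian figure starts from a uniform prior over the bias \(p\),
updates it to a \(\mathrm{Beta}(11, 5)\) posterior after 10 heads and 4 tails,
and asks for the expected probability of two heads under that posterior,
\[\mathbb{E}\left[p^2\right] = \frac{11 \cdot 12}{16 \cdot 17} \approx 0.485.\]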
## Laplace’s law of succession
The problem with that analysis is that a true frequentist is unable to answer
the question we have asked. The question of the probability of the next two
tosses doesn’t even make sense from a frequentist perspective. This is why
frequentists test hypotheses by performing contortionist tricks like asking,
> Assuming heads and tails are equally likely, what would be the likelihood of
> observing 10 heads in the next 14 tosses?
and then comparing that likelihood to an arbitrary threshold. This is not a
strawman position on frequentism – this is the very thing that plagued the
father of frequentism, Ronald Fisher, for all his life.
Frequentists can, of course, answer a trivial question like
> What was the probability of heads in the past 14 tosses?
and it is tempting to extrapolate the answer to that straight into the future,
but again, that would not be a valid operation according to a frequentist.
Frequentists cannot assign probabilities to future events; they can only compare
probabilities of past events.
Now, not all is lost. There’s a neat trick that blends faux frequentism with
Bayesian reasoning. If our prior probability is uniformly distributed, we can
update it with Laplace’s rule of succession, which tells us to imagine there
have been two more trials, one of which was successful. Then the success ratio
corresponds to the mean of the Bayesian posterior. In this case, that means
imagining there
were 11 heads in 16 tosses. The new ratio is \(\frac{11}{16}\) and the posterior
probability of two heads would then be \(\left(\frac{11}{16}\right)^2 \approx
0.47\). This is much easier and gives almost the same result as the actual
Bayesian calculation and the computer approximation. Why is this a slightly
different number than the actual value of 48.5 %? Because we have converted the
posterior distribution to a point estimate and then squared that point estimate.
Jensen’s inequality tells us we will arrive at a lower number that way than if
we first square the posterior distribution and then take a point estimate.
This is a law of succession most forecasters are familiar with: add one success
and one failure to get the Bayesian posterior probability of the next flip. It’s
called Laplace’s law of succession.
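To make the arithmetic concrete, here is a minimal Python sketch (mine, not from
the original post) comparing the rule-of-succession shortcut with the exact
Bayesian answer under a uniform prior:

```python
heads, tails = 10, 4

# Laplace's rule of succession: pretend there were two extra tosses,
# one of them heads, and use the resulting ratio as a point estimate.
p_laplace = (heads + 1) / (heads + tails + 2)                       # 11/16 = 0.6875
print(f"rule-of-succession estimate:       {p_laplace:.4f}")
print(f"two heads, point estimate squared: {p_laplace ** 2:.4f}")   # ~0.4727

# Exact Bayesian answer with a uniform prior: the posterior over the bias is
# Beta(a, b) with a = heads + 1 and b = tails + 1, and the probability of two
# heads on the next two tosses is E[p^2] = a(a + 1) / ((a + b)(a + b + 1)).
a, b = heads + 1, tails + 1
e_p_squared = a * (a + 1) / ((a + b) * (a + b + 1))
print(f"two heads, posterior expectation:  {e_p_squared:.4f}")      # ~0.4853
```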
## Agresti–Coull confidence intervals
Here, have a pair of frequentist goggles; I want to take away your ability to
estimate probabilities of future events again.
Even as frequentists, we can still perform a standard hypothesis test to
figure out whether we should believe the coin is biased at all. For this we need
a 95 % confidence interval around the bias. The quick way to whip one of those
up is to first pretend there were four more trials: two successes and two
failures. Then we compute the size of two standard errors the normal way:
\[\hat{p} = \frac{10+2}{14+4} \approx 0.67\]
\[2\hat{se} = 2\sqrt{\frac{\hat{p}\left(1 - \hat{p}\right)}{14+4}} \approx 0.22\]
This gives us a confidence interval of 45 %–89 % for the bias. Since that
includes 50 %, our null hypothesis, we cannot reject it and have to contend with
the fact that the coin might not be biased at all. This is
probably a reason not to take the bet. See, Rajiv, frequentists are not so dumb
after all!
This procedure is called an Agresti–Coull confidence interval, and it’s an
approximation that’s very close to an actual confidence interval. The drawback
is that this simple version only lets us create two-sided 95 % confidence
intervals. The paper by Agresti and Coull contains the details to construct
other confidence intervals, but it gets a little more arithmetically complicated.
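In code, the simplified recipe is only a few lines; this sketch (the function
name is mine) reproduces the numbers above:

```python
from math import sqrt

def approx_95_interval(successes: int, trials: int) -> tuple[float, float]:
    """Simplified Agresti-Coull: add two successes and two failures, then take
    two standard errors on each side of the adjusted proportion."""
    p_hat = (successes + 2) / (trials + 4)
    half_width = 2 * sqrt(p_hat * (1 - p_hat) / (trials + 4))
    return p_hat - half_width, p_hat + half_width

low, high = approx_95_interval(10, 14)
print(f"~95 % interval for the bias: ({low:.2f}, {high:.2f})")  # about (0.44, 0.89)
# 0.5 falls inside the interval, so we cannot reject the fair-coin hypothesis.
```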
## Poisson-based law of succession
Imagine now someone tosses the coin once a week, and in the past year, it has
landed on heads four times. What’s the probability it lands on tails next week?
We can use what we already know here. There have been 52 trials and 4 heads, so
that’s a posterior of \(\frac{5}{54} \approx 0.093\) for heads. The complement
of this is the probability that it does not land on heads in any given week:
90.7 %.
What if we had misheard, and the coin was actually tossed every day? This makes
the coin extremely biased. It has been tossed for 365 days and only landed heads
four times! (A calendar year actually has 365.2425 days, but here we use “year”
to mean “the past 365 days”.) By the same approach, we have the posterior
\(\frac{5}{367} \approx 0.014\). On the other hand, during the next week, it
will have more opportunities to come up heads as well. To get the weekly
probability of no heads, we have to take the complement of the posterior and
raise it to the power of 7. We get 90.8 %.
Oh but wait, I meant to say it is flipped every minute, not every day! Assuming
there are 525,600-something minutes per year, we get a weekly probability of no
heads of 90.9 %.
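Here is a small sketch reproducing the three computations above: apply the plain
Laplace rule at the chosen granularity, then ask for no heads during a week’s
worth of future trials.

```python
def prob_no_heads_next_week(trials_per_year: int, trials_per_week: int,
                            heads_observed: int = 4) -> float:
    """Laplace's rule on a chosen subdivision of time, followed by the chance
    of zero heads across a week's worth of future trials."""
    p_heads = (heads_observed + 1) / (trials_per_year + 2)
    return (1 - p_heads) ** trials_per_week

# The same yearly data (4 heads), viewed at three different granularities.
print(prob_no_heads_next_week(52, 1))            # weekly tosses:     ~0.9074
print(prob_no_heads_next_week(365, 7))           # daily tosses:      ~0.9084
print(prob_no_heads_next_week(525_600, 10_080))  # per-minute tosses: ~0.9086
```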
The general trend starts to show: when we pick finer subdivisions of time, the
probability goes up, because we get more certain about our estimate. The
posterior is shaped more and more by the data rather than by the uniform prior.
We could compute the probability if the coin were flipped every second, every
millisecond, and so on. But we won’t, because there’s a limit this tends to:
\[\left(1 + \frac{t}{T}\right)^{-(S + 1)}\]
Here, \(t\) is the future timespan we are interested in, \(T\) is the timespan in
the past we have observed for, and \(S\) is the number of successes observed in
that timespan. Unlike in the previous Laplace case, the units of the timespans
do not matter here, because they cancel out. With our coin, taking weeks as the
natural unit:
\[\left(1 + \frac{1}{52}\right)^{-(4 + 1)} \approx 0.909\]
This is the probability that a continuously flipped coin shows no heads at any
point during the next week, if it has shown heads four times in the past year.
This is nearly the same probability as when we computed Laplace’s law of
succession using minutes as the time increment, and this will be the case more
generally. If we pick a time increment that is small compared to the future
period we’re interested in, Laplace’s law of succession will give an answer very
close to this limit. The benefit of knowing how to compute the limit is that we
can use more convenient units of time when doing so.
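To check this numerically, the following sketch (function names are mine)
compares the closed-form limit against the per-minute Laplace computation from
before; both round to 0.909.

```python
def continuous_no_success(t: float, T: float, successes: int) -> float:
    """Probability of no successes during a future span t, after observing
    `successes` over a past span T (t and T in the same units)."""
    return (1 + t / T) ** -(successes + 1)

# Closed-form limit with weeks as the unit: t = 1, T = 52, S = 4.
print(f"continuous limit:   {continuous_no_success(1, 52, 4):.4f}")  # ~0.9092

# Plain Laplace rule with per-minute tosses, for comparison.
minutes_per_year, minutes_per_week = 525_600, 7 * 24 * 60
p = (4 + 1) / (minutes_per_year + 2)
print(f"per-minute Laplace: {(1 - p) ** minutes_per_week:.4f}")      # ~0.9086
```

(The tiny gap that remains is the same point-estimate-versus-full-posterior
effect that Jensen’s inequality explained earlier.)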
For more examples of this continuous limit, see the LessWrong article that
introduces it.
## Key points
We don’t really have to learn the continuous limit. If we’re fine with
arithmetic using big numbers, we can use the plain Laplace rule even for
continuous processes, as long as we pick a subdivision of time that is small
enough.
What needs to be remembered is that with \(s\) successes in \(n\) trials, the
posterior probability is
\[\frac{s+1}{n+2}.\]
But also, if we want a 95 % confidence interval, we compute
\[\hat{p} = \frac{s+2}{n+4}\]
and
\[2\hat{se} = 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n+4}}\]
and then the interval is
\[\left(\hat{p}-2\hat{se}, \hat{p}+2\hat{se}\right)\]
It is easy to get these two methods confused, because they both add virtual
successes and failures. Try not to!
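Since the two recipes differ only in how many virtual tosses they add, a
side-by-side sketch (function names are mine) may help keep them apart:

```python
from math import sqrt

def laplace_posterior(successes: int, trials: int) -> float:
    """Posterior point estimate: add one success and one failure."""
    return (successes + 1) / (trials + 2)

def approx_95_interval(successes: int, trials: int) -> tuple[float, float]:
    """Rough 95 % confidence interval: add two successes and two failures."""
    p_hat = (successes + 2) / (trials + 4)
    half_width = 2 * sqrt(p_hat * (1 - p_hat) / (trials + 4))
    return p_hat - half_width, p_hat + half_width

print(laplace_posterior(10, 14))   # 11/16 ~ 0.6875
print(approx_95_interval(10, 14))  # roughly (0.44, 0.89)
```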