Mauro
The name 'Bayesian inference' is intimidating, but in the end the method is very simple and only requires multiplications and divisions. If you want a more formal (and probably more accurate) explanation you can start from Wikipedia: Bayesian inference. I tried to keep things as simple as I could (without losing too much rigour, I hope) and, being an engineer, I focused more on how to do it in practice than on math or philosophy.
THE VERY BASICS & DEFINITIONS
We need at least two competing hypotheses; we could call them H and K, but luckily in the simplest case one is the negation of the other (i.e.: it was a cat, or it was not a cat). Since this is the easier case I'm going to use it in what follows, so instead of calling the hypotheses H and K (and then L, M, N...) I'll stick to H and notH from now on.
P(H) means 'the probability that H is true' while P(notH) means 'the probability that notH is true', or equivalently 'the probability that H is false'.
Notations: 70% probability is the same as 70/100 = 0.7, I'll use both notations.
THE BAYESIAN ALGORITHM: HOW TO MAKE THE CALCULATIONS
Let's start from the very beginning: we have the two hypotheses but nothing else at all, no information, nothing. Which numbers should we assign to P(H) and P(notH)? We have no reason to prefer H over notH, or vice versa, so the answer is P(H) = 50% (0.5) and P(notH) = 50%. We call these numbers prior probabilities, for reasons which will become clear later. Knowing the prior probabilities we can now calculate the prior odds of H being true vs. H not being true (or vice versa, if you prefer): it is P(H)/P(notH) = 0.5/0.5 = one to one, a fair bet.
We then get some evidence (let's call it E) which may help us to shed light on the mystery of H vs. notH. What we need now is a way to account for the new evidence in the calculation of our odds. To do that we need two more numbers:
- P(E, given H) = the probability that, if H is true, we will get evidence E
- P(E, given notH) = the probability that, if H is not true, we will get evidence E nonetheless
Finding these two numbers is the tricky part! For now, let's assume we have found those two numbers and let's see how we should use them to revise our prior probability. This is the formula, it looks nasty, but hold on and it will become simple:
P(H, given E)/P(notH, given E) = P(E, given H)/P(E, given notH) * P(H)/P(notH)
The formula is a consequence of a fundamental theorem of probability theory called "Bayes' theorem" (from which the name 'Bayesian inference'). For the demonstration, see the appendix at the end, but I want to first explain what all those Ps, Hs and Es actually are.
- P(H, given E)/P(notH, given E) : this is the number we want to find! In plain language it is the odds (the ratio) of the probability of H being true given evidence E, versus the probability of notH being true given the same evidence.
- P(E, given H) and P(E, given notH) are the two tricky-to-find numbers that we assumed we had somehow determined before
- P(H)/P(notH) are the prior odds we started from
So we can just plug the numbers we know into the formula and calculate the new odds we want to find!
One simple example to fix ideas. I have two hypotheses: there was a cat in my backyard (H) and there was no cat in my backyard (notH). For some reason I haven't looked into my backyard in ages and I have no idea if there were cats, nor do I have any idea of how frequently cats come to my backyard, nor anything else: the only thing I can soundly say at this point is P(there was a cat) = P(there were no cats) = 50%. Then my neighbour shows me a photograph of a proud tabby tomcat sitting in my backyard: this is E, the evidence. Now the tricky part: how probable is it for that picture to exist if there really was a cat in my backyard? And how probable if there were no cats? Well, in this case I could reasonably say P(cat picture, given there was a cat) = 99%, P(cat picture, given there were no cats) = 1%. But this is not a given: for instance my neighbour could be a notorious serial prankster, and then I'd better say P(cat picture, given there was a cat) = 60%, P(cat picture, given there were no cats) = 40% [I told you this was the tricky part!]. But anyway, let's stick with the original 99% and 1% and apply the formula:
The odds that there really was a cat in my backyard, given the picture, against 'there really were no cats', is = (0.99/0.01) * (0.5/0.5) = 99 times to 1. I bet it was a cat!
What if my neighbour is a prankster? The odds become (0.6/0.4) * (0.5/0.5) = only 1.5 times to 1. Should I bet now? Hmmmm...
What if my neighbour shows me a picture of my backyard with a blurry smudge which could be any kind of small animal (or even no animal at all)? Name your two numbers, P(blurry picture, if there was a cat) and P(blurry picture, if there were no cats), put them in the formula and calculate.
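The whole calculation above is just one division and one multiplication, so it fits in a couple of lines of code. Here is a minimal Python sketch of the odds-form formula, applied to the two neighbour scenarios (the function name is mine, not standard):

```python
def posterior_odds(p_e_given_h, p_e_given_noth, prior_odds=1.0):
    """Bayes' rule in odds form: posterior odds = likelihood ratio * prior odds."""
    return (p_e_given_h / p_e_given_noth) * prior_odds

# Honest neighbour: P(picture, given cat) = 0.99, P(picture, given no cat) = 0.01
print(posterior_odds(0.99, 0.01))   # about 99 to 1
# Prankster neighbour: 0.60 vs. 0.40
print(posterior_odds(0.60, 0.40))   # about 1.5 to 1
```

Note that with a 50%/50% prior the prior odds are 0.5/0.5 = 1, which is why the default prior_odds is 1.0.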
And now here comes the magic of Bayesian inference: the new odds we just calculated become our new prior odds. We can start from them and factor in another piece of evidence, using the same formula as before, and then more and more pieces of evidence after that, until we have exhausted all the evidence we have and are left with our final odds (often called the posterior or consequent odds, even if, should we find yet one more piece of evidence, they'll become the new priors from which to start a new round of calculations).
There is another, psychological, good thing about using Bayes: one is compelled to think about both his preferred hypothesis (H) and the hypothesis he doesn't like (notH), because he needs to calculate both P(E, given H) and P(E, given notH). And this helps a lot in keeping things in perspective and in not being led astray by what one would like the final answer to be. Just try reasoning this way (even if you don't do any calculations, having to follow an algorithm helps a lot), then you'll tell me.
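The chaining of evidence described above is just repeated multiplication: each piece of evidence contributes one likelihood ratio, and the posterior after one piece becomes the prior for the next. A small sketch (the second piece of evidence, paw prints with a 0.8 vs. 0.2 split, is a number I invented for illustration):

```python
def update_odds(prior_odds, likelihood_ratios):
    """Chain several pieces of evidence: each posterior becomes the next prior."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds = lr * odds   # the same formula as before, applied once per piece of evidence
    return odds

# First the picture (0.99 vs. 0.01, i.e. 99:1), then, hypothetically,
# fresh paw prints on the lawn (say 0.8 vs. 0.2, i.e. 4:1).
print(update_odds(1.0, [0.99 / 0.01, 0.8 / 0.2]))   # about 396 to 1
```

The order of the evidence doesn't matter: multiplication is commutative, so you end up with the same final odds whichever piece you factor in first.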
THE TRICKY PART
We still need to address the 'tricky part': how do we find the two numbers we need, the probability of getting our evidence E, first in case H is true, then in case H is not true? In the example with the cat reasonable numbers were easy to find, and no one will yell at you from a forum if you cannot justify why you think P(cat picture, given there was a cat) is 99%. But in a real-life discussion one cannot simply shoot probabilities at random or, worse, juggle the numbers so they fit his theory. This problem is difficult and there is no general formula for it; there are, however, some sound methods to approach it (and many unsound ones).
1) FORMAL ARGUMENTS
From many points of view this method is ideal; the problem is that it's rarely applicable in practice. If you can find a correct formal logic argument which allows you to directly calculate the probabilities (starting from some undisputable premises) the problem is solved. In practice, this is much easier said than done.
2) THE REFERENCE CLASS
A much more broadly applicable method is the 'reference class'. Easy example: a friend of mine has taken a picture of a random person on the street; what is the probability it was a female? One possible reference class is easy to find: 'persons'. Do we have any data on how many persons are male and how many are female? Sure we do, we have lots of data, and so we can assign a probability to the picture representing a female, which will be about 50%. First rule for a reference class to be a sound one: we need to have (reliable) data about the members of the class. If we don't, our reference class is useless.
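In code, a reference class estimate is nothing more than an observed frequency: count the members with the property, divide by the class size. A minimal sketch (the survey counts below are made up for illustration):

```python
def class_frequency(members_with_property, total_members):
    """Estimate a probability as the observed frequency in a reference class."""
    return members_with_property / total_members

# Hypothetical survey of the 'persons' class: 505 out of 1000 random persons were female.
print(class_frequency(505, 1000))   # 0.505, i.e. about 50%
```

All the real work, of course, is in choosing the class and gathering reliable counts; the division is the trivial part.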
Now imagine that my friend took his picture in the streets of Mount Athos, a place in Greece which women are forbidden to enter (it's an absurd thing, I know, but it's true). This completely invalidates the previous assumption: the 50% probability we determined before is completely wrong! Second rule for a reference class to be a sound one: it must be as specific as possible, or in other words it must include all the available, relevant information we have. If we leave out the information about the picture being taken at Mount Athos we make a big mistake. Our reference class cannot be 'persons' anymore, it must become 'persons at Mount Athos' to be meaningful.
3) LAPLACE'S RULE OF SUCCESSION
This is even more broadly applicable than reference classes. What is the probability that the cat in my backyard had a bright pink coat and a blue tail? There are no known confirmed examples of pink cats with blue tails, but it would be a mistake to answer 'zero percent': this would mean denying the not-hypothesis by definition, or, if you prefer, begging the question that no cat is or ever will be pink with a blue tail, a bad logical mistake, akin to saying 'no aliens have ever been confirmed to exist, thus no aliens exist'. Laplace's rule of succession comes to the rescue: we can start examining cats (or aliens) one by one and see if we find a pink blue-tailed one. After, say, 10000 cats and (say, who knows?) zero weirdos found we apply Laplace's formula:
Probability = (N+1)/(S+2)
What does that mean? N is the number of pink blue-tailed cats we found, S is the number of cats we sampled. From the evidence given by our sampling, the probability that the next cat is a pink blue-tailed one is (0+1)/(10000+2) =~ 1/10000.
This formula can be used generally of course (it's in effect a recipe for building a reference class from scratch), but when N=0 it's the only possible way to rationally assign a meaningful probability.
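The rule is a one-liner in code. A sketch, using the numbers from the cat example:

```python
def laplace_rule(n_found, s_sampled):
    """Laplace's rule of succession: P(next one matches) = (N + 1) / (S + 2)."""
    return (n_found + 1) / (s_sampled + 2)

# 10000 cats examined, zero pink blue-tailed ones found:
print(laplace_rule(0, 10000))   # (0+1)/(10000+2), about 1/10000
```

Notice the two built-in guard rails: with zero samples the formula gives 1/2 (the maximally ignorant prior), and no amount of sampling can ever drive the result to exactly 0 or exactly 1.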
4) GUESSWORK
When everything else fails, guesswork is the last resort. It's also the most dangerous way of proceeding. A way to lessen the risk of being yelled at is to keep wide margins: i.e. not from 70% to 75%, but maybe from 55% to 85%, and to try to lean a lot against your favoured hypothesis. In the end, it's more a matter of achieving consensus than a matter of precision. The good news is that, surprisingly, even very broad estimates of the probabilities often yield strong odds in favor of (or against) one of the hypotheses.
5) THERE IS NO WAY TO KNOW
If something is totally unknown and unknowable, even by applying the broadest and least reliable method, guesswork, it's an unknown, and it cannot be used as evidence for, nor against, H (or notH). The probability of something unknown is 50%/50%, just like the very first prior probabilities from which all the calculations started: the effect is to cancel out in the formula and change nothing at all, as it should be. And here lies another beauty of Bayesian inference: you stop worrying about what you can't know and concentrate instead on what you do know, and on what you may come to know (and hopefully on how to get it).
TIPS & PITFALLS
Remember probabilities are numbers which must be between zero and one: any other number cannot be a probability. This is important to know because it allows you to spot (big) mistakes in the premises: e.g. if I find the probability of something is -2 or 1.5 I surely made some error somewhere. Odds, on the contrary, can be any positive number. And try not to confuse probabilities with odds: they are two very different beasts even if closely related (I need to always repeat that to myself too!). Probabilities carry more information, but odds are more tractable; they're both needed.
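Since probabilities and odds are so easy to mix up, it may help to see the conversion between the two written out. A minimal sketch (for two complementary hypotheses H and notH):

```python
def probability_to_odds(p):
    """Odds of H vs. notH: P(H) / P(notH) = p / (1 - p)."""
    return p / (1 - p)

def odds_to_probability(odds):
    """Inverse conversion: p = odds / (1 + odds)."""
    return odds / (1 + odds)

print(probability_to_odds(0.7))    # 0.7/0.3, i.e. 7 to 3, about 2.33
print(odds_to_probability(99.0))   # 99/100 = 0.99
```

You can see from the formulas why odds can be any positive number: as p approaches 1, the denominator (1 - p) shrinks toward zero and the odds blow up without bound.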
If you have two mutually exclusive and exhaustive hypotheses, H and notH, then P(notH) = 1 - P(H): if it's a cat with 70% probability, then it's not a cat with 30% probability. But take care that the two hypotheses really cover all the possibilities: if it's a cat with 70% probability you cannot conclude it's a dog with 30% probability, because it could be any other kind of small animal (or even not an animal at all). This mistake is not always as easy to spot as in this example, beware. And, while we are on this, please notice that the two tricky numbers, P(E, given H) and P(E, given notH), do not need to add up to one: they are two totally different things. E.g. the probability of the sun rising tomorrow is ~100% if today is an odd day, but it is ~100% also if today is an even day: they add up to ~2 (and they'll cancel out when calculating the new odds on such a silly piece of evidence).
Probability zero and probability one. Wonderful quote from my University textbook: probability zero does not mean impossible and probability one does not mean certain. This is very true! One should never find nor suggest a 0% or 100% probability (that would be begging the question), even if it's perfectly legal to find and propose numbers such as 0.00000000000001% or 99.9999999999999%. In these cases writing just '0%' and '100%' is customary, but always keep in mind that 'impossibility' and 'certainty' can never be granted; to avoid confusion, it's better to add a tilde: ~0%, ~100%.
Beware of correlations. Instead of using the full Bayesian calculation algorithm it's often faster to start from some 'given' probabilities and calculate the first prior by just multiplying them together. For instance, if the probability of a cat in my backyard is 50% and the probability of a cat being black is 10%, then the probability of a black cat in my backyard is 0.5 * 0.1 = 5%. This is a correct way to do things, except when there is a correlation between the two probabilities we are multiplying together. In the case of cats and backyards I could have installed an automatic door with a camera which only allows black cats in: the two variables ('cat in my backyard' and 'cat is black') are now correlated, and multiplying them would be a (very bad) mistake. Yet again, this is easy to see with cats and computer-controlled doors, but not easy at all in many, many practical cases.
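A quick simulation makes the correlation trap concrete. Below is a toy model of the camera door (all numbers invented for illustration): each trial a cat approaches, it's black with probability 10%, and the door admits only black cats. The naive product P(in backyard) * P(black) badly underestimates the true joint probability, because here every cat that gets in is black:

```python
import random

random.seed(42)

trials = 100_000
in_backyard = 0
black_and_in_backyard = 0
for _ in range(trials):
    cat_is_black = random.random() < 0.10   # assumed: 10% of cats are black
    cat_gets_in = cat_is_black              # the camera door only opens for black cats
    in_backyard += cat_gets_in
    black_and_in_backyard += cat_is_black and cat_gets_in

p_in = in_backyard / trials
p_joint = black_and_in_backyard / trials
print(p_in * 0.10)   # naive product: wrong under correlation
print(p_joint)       # actual joint probability: equals p_in, ten times larger
```

In the extreme correlation of this toy model P(black, given in backyard) is 1, so the joint probability is simply P(in backyard); the naive product is off by a factor of ten.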
Chasing butterflies. When I talked about reference classes I stressed that, to be meaningful, a class "must be as specific as possible, or in other words it must include all the available, relevant information we have". There the focus was on the 'all the available', but the 'relevant' part is very important too, and unfortunately forgetting it is a sure method to demonstrate whatever one prefers the answer to be, which is a really bad thing. Example: I start from the 'persons at Mount Athos' reference class, but wait... there's a butterfly in the picture, shouldn't I multiply everything by the probability of having a butterfly in the picture? That exact butterfly maybe? At that exact position? Then I should conclude that my friend, actually, never took any picture: it's too improbable to even exist! The absurdity of doing this is pretty clear, but yet again, it may not be as easy to spot in real-life cases: chasing butterflies is indeed a common trick used to 'demonstrate' the most absurd rubbish.
APPENDIX: DEMONSTRATION OF THE BASIC FORMULA TO CALCULATE ODDS
We start from Bayes' theorem, which can be written this way:

P(A, given B) = P(B, given A) * P(A) / P(B)

I rewrite it, using the same notation I used before:

P(H, given E) = P(E, given H) * P(H) / P(E)

Mathematically it's an astonishing theorem, really, but don't focus on it, we just use it as a stepping stone. We now write two copies of Bayes' theorem, one for H and one for notH:

1) P(H, given E) = P(E, given H) * P(H) / P(E)
2) P(notH, given E) = P(E, given notH) * P(notH) / P(E)

And now we divide 1) by 2), getting rid of P(E):

P(H, given E)/P(notH, given E) = P(E, given H)/P(E, given notH) * P(H)/P(notH)

which is the formula for calculating odds given before, q.e.d.
Edit: added the 'Guesswork' chapter
Edit: corrected a mistake in the labelling of cats and pictures probabilities (thanks to @Mendel who first noticed this)
Edit: added the 'There is no way to know' chapter
Edit: added the 'Chasing butterflies' paragraph in 'Tips & Pitfalls'
Edit: removed an unnecessary and wrong sentence from the Appendix (thanks to @Mendel again and to @jplaza)