Bayesian Statistics

Many people have heard of the "Monty Hall Problem", popularized by Marilyn vos Savant. The problem is based on a game show, in which the contestant is required to pick one out of three identical closed doors. Behind two doors, there is a goat, but behind the third is a car. If the contestant happens to pick the door with the car, he gets the car.

The problem has an added twist: after the contestant has made his selection, the host opens one of the two remaining doors (the ones that the contestant did not pick), and behind that door is a goat. The host then asks the contestant if he wants to switch his choice or remain with his initial choice. The contestant may switch or not, as he wishes. Then all three doors are opened, and if the contestant made the correct choice, he wins the car.

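There are many variations on this, which may or may not affect the outcome. I have described the most common variant above, and it assumes:

  1. The host knows which door hides the car.
  2. The host always opens one of the doors the contestant did not pick, and always reveals a goat.
  3. The host always offers the contestant the chance to switch.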

This problem created a controversy because the answer seems counter-intuitive. The chain of reasoning goes something like this: initially, you have a 1 in 3 chance (33%) of picking the correct door, since all 3 doors are identical and there is no way to tell which one has the car behind it. Therefore, your chance to win is 33% and your chance to lose is 66% (approximately).

Picking one door randomly out of 3 gives you a 33.3% chance of guessing right.

Now when the host opens one of the doors that the contestant did NOT select, and shows a goat behind it, many people assume that nothing has changed so far as the problem goes. Each door is still associated with a 33% chance of having the car, therefore the two remaining doors (the one that the contestant initially chose and the other which the host did not open) still have a 33% chance of having the car.

Other people acknowledge that the opened door has changed the probabilities on the two remaining doors. Since the car HAS to be behind one of the two remaining doors, they divide the probability between them, assigning 100/2 = 50% probability to each of the remaining doors. What they have actually done is to take the open door and distribute its probability equally between the two remaining doors. Before the host opened the door, it had a 33.3% probability of having the car, the same as any other door. After the host opened the door and showed a goat behind it, the probability went from 33.3% to 0% for that door. People distribute this probability equally between the two remaining closed doors, adding 33.3/2 ≈ 16.7% probability to each of the remaining doors, making the probability 33.3 + 16.7 = 50% for each of the remaining doors.

After the host opens one door to show the goat, people intuitively divide the chances equally between the two remaining doors.

Since they have assigned equal probability to the remaining two closed doors, there is no clear reason to accept the host's offer to switch. Since both doors have 50% probability of having the car, the contestant's odds are not improved by switching.

In fact, this is wrong. The contestants correctly recognize that the initial probability of each door having the car is 33% each. They further correctly identify that when the host opens one door and shows a goat behind it, the odds on the remaining doors have changed. Information has been added to the "system", and this information must be taken into account when calculating the new probabilities. The mistake is in how the new probabilities are calculated.

If you simply divide the open door's initial probability between the two remaining doors (making 50% each for the remaining doors as explained above), you have acted upon an assumption: namely, that the host opens one of the two doors at random. If this were true, then the host would sometimes reveal the car: in half of the games where the contestant's first pick was a goat, or one game in three overall. But this does not happen. According to the rules, the host always opens a door and shows a goat behind it. In other words, the host has knowledge of which door hides the car (and deliberately avoids it), and this knowledge has also entered the "system". Look at the host's constraints:

  1. The host cannot open the door the contestant selected.
  2. The host cannot open the door with the car behind it.

This means the host has to pick one of the doors the contestant has not selected and which does not have a car behind it. The two doors that the contestant did not pick had a combined probability of 66% of having the car, as can be seen on the first diagram. When one of them is opened and a goat seen behind it, its probability goes to 0%, and the remaining door assumes the entire 66% probability which the two doors had in sum before one was opened.
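To see how much the host's knowledge matters, it can help to simulate the variant that the 50/50 intuition silently assumes. The sketch below is an illustration only: the function name `ignorant_host_rates`, the trial count and the seed are all made up for this example. It models a host who opens one of the two unpicked doors at random, and keeps only the rounds in which a goat happens to be revealed.

```python
import random

def ignorant_host_rates(trials=200_000, seed=1):
    """Simulate a host who does NOT know where the car is.

    Returns a tuple:
      (share of rounds in which a goat was revealed,
       win rate when staying, given a goat was revealed,
       win rate when switching, given a goat was revealed)
    """
    rng = random.Random(seed)
    goat_rounds = stay_wins = switch_wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = rng.randrange(3)
        # The host opens one of the two unpicked doors at random.
        opened = rng.choice([d for d in range(3) if d != pick])
        if opened == car:
            continue  # the car was revealed; this round does not match the puzzle
        goat_rounds += 1
        remaining = 3 - pick - opened  # door indices 0, 1, 2 sum to 3
        stay_wins += (pick == car)
        switch_wins += (remaining == car)
    return (goat_rounds / trials,
            stay_wins / goat_rounds,
            switch_wins / goat_rounds)

if __name__ == "__main__":
    print(ignorant_host_rates())
```

Under this random-host assumption the two closed doors really do come out about even, which is why the constraint on the host is the part of the rules that changes the answer.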

Picking a door at the start means you have a 66.66% chance of being wrong. That 66.66% probability is split between the two doors you didn't choose. The host opens one of these doors, but is constrained to open only one which has a goat. Therefore, your 66.66% probability of being wrong now moves to the single remaining door which you didn't pick and the host didn't open. You can now change your choice and turn that 66.6% chance of being wrong into a 66.6% chance of being right.

So after the door is opened to show a goat, the contestant has a 33% chance to be right if he stays with his original choice, and a 66% chance to be wrong. Therefore, he can double his chances of being right if he accepts the switch.
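The 1/3 versus 2/3 split can be checked empirically. The following is a minimal Monte Carlo sketch of the game as described, not a proof; the helper names `play` and `win_rate`, the trial count and the seed are assumptions made for this example. When the host has two goat doors to choose from, the sketch assumes he picks one of them at random.

```python
import random

def play(switch, rng):
    """Play one round; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, trials=100_000, seed=0):
    """Fraction of simulated rounds won with a fixed stay/switch strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    print("stay  :", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

With enough trials, the stay strategy should settle near 1/3 and the switch strategy near 2/3.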

Here is another way to look at this by listing all possible outcomes.

The letters C and G represent the car and the goats. The yellow circle around a letter indicates the contestant's initial choice. Since there are 3 doors, there are 3 possible initial choices, indicated by the 3 gray boxes above. In each case, the next step is that the host opens a door which the contestant did not pick, and reveals a goat. The contestant is then asked whether he wants to switch his initial choice or not. The outcome of not switching is shown by the black arrow, while the outcome of switching is shown by the red arrow. As can be seen, in two possible cases (where the contestant picked a door with a goat), switching will enable him to win. In one case only, where the contestant picked the right door to begin with, will switching cause him to lose. Therefore, the odds of winning are 2:1 if he switches, or 66.6% to 33.3% in favor of winning if he switches.
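The same case-by-case listing can be written as a few lines of code. This is a small sketch under the assumption that the car is fixed behind door 0 (by symmetry the other placements give the same counts); the function name `enumerate_outcomes` is made up for this example.

```python
from fractions import Fraction

def enumerate_outcomes():
    """List the 3 equally likely initial picks and count wins for each strategy."""
    car = 0
    stay_wins = switch_wins = 0
    for pick in range(3):
        # The host opens a goat door other than the pick. If the pick is the
        # car he has two options, but switching loses either way, so take the first.
        opened = next(d for d in range(3) if d != pick and d != car)
        switched = next(d for d in range(3) if d != pick and d != opened)
        stay_wins += (pick == car)
        switch_wins += (switched == car)
    return Fraction(stay_wins, 3), Fraction(switch_wins, 3)

if __name__ == "__main__":
    stay, switch = enumerate_outcomes()
    print("stay:", stay, "switch:", switch)  # stay: 1/3 switch: 2/3
```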

So how does this lead to Bayesian Statistics?

Thomas Bayes (1702-1761) was a British mathematician and Presbyterian minister. He formulated a special case of what is today known as Bayes Theorem. This was later reformulated to state a more general case by Laplace.

Bayes Theorem is well proven and widely accepted. It is not the same as Bayesian Probability, which is a more controversial and somewhat philosophical topic. However, they do share a similar conceptual basis. Bayesian probability approaches the concept of probability as a measure of the state of knowledge about something, and is therefore an epistemological rather than purely metaphysical approach.

Classically, probability was developed from a study of games of chance. Laplace defined it as follows:

The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought. The ratio of this number to that of all the cases possible is the measure of this probability, which is thus simply a fraction whose numerator is the number of favorable cases and whose denominator is the number of all the cases possible.

In other words, if an experiment has a certain number of mutually exclusive and equally likely outcomes, then the probability of event A is simply the number of outcomes favorable to A as a fraction of the total number of possible outcomes. Or, P(A) = NA / N, where NA is the number of favorable outcomes and N is the total number of outcomes.

Although this definition looks very objective and does not depend on degrees of belief or knowledge, it is in fact circular. The words "equally likely" in the definition use the concept of probability to define probability.

Frequentism or Frequentist Probability is essentially the same thing, phrased differently. Frequentist Probability is simply the relative frequency of occurrence of an event over time. The underlying physical cause(s) governing the events is assumed to be random. It may be random in the sense we don't have sufficient information to make a prediction (though a prediction could be made if we acquired the information), or that the phenomena are fundamentally unpredictable (such as radioactive decay).

This way of looking at probability avoids the problem of calling events "equally likely". For a coin toss, classical probability would call the occurrence of heads or tails "equally likely". But Frequentism simply says that if you keep flipping a coin repeatedly, the occurrence of heads will converge towards 50% as the number of tosses increases. However, the fact is that one cannot repeat an experiment an infinite number of times. Therefore, for any given experiment, the probability of an event may well differ on different occasions. If a coin is tossed 100 times for the first experiment, it's possible to get 45 heads and 55 tails. If the experiment is then repeated for another 100 tosses, it's possible to get 57 heads and 43 tails. The true probability is therefore hard to determine, and each experiment has an error in the probability measurement, which is itself probabilistic. So the frequentist definition doesn't completely avoid circularity either.
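The run-to-run variation is easy to observe with a simulated coin. The sketch below assumes a fair coin modeled by Python's pseudo-random generator; the helper name `heads_fraction` and the seed are arbitrary choices for illustration.

```python
import random

def heads_fraction(n_tosses, rng):
    """Relative frequency of heads in n_tosses simulated fair-coin flips."""
    return sum(rng.random() < 0.5 for _ in range(n_tosses)) / n_tosses

rng = random.Random(42)
# Two short experiments of 100 tosses each, then progressively longer ones.
for n in (100, 100, 10_000, 100_000):
    print(n, heads_fraction(n, rng))
```

The two 100-toss runs will generally disagree with each other, while the longer runs drift closer to 0.5 without ever being guaranteed to hit it exactly.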

Conditional Probability: the probability of event A, given some other event B. That is, it's the probability of A occurring provided that B occurs. It's written P(A|B).

Joint Probability: the probability of both A and B happening together. It's written as P(A,B).

Marginal Probability: the probability of A regardless of whether or not B happened. It can be obtained by summing or integrating the joint probability over all outcomes of the other variable. So in this case it would be P(A,B) + P(A,B'), adding the probability of A with B and A without B; equivalently, P(A|B) P(B) + P(A|B') P(B'). Calculating this probability is called marginalization.
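A tiny numeric table can make the three definitions concrete. The joint probabilities below are invented purely for illustration, and the function names are hypothetical; the only requirement is that the four entries sum to 1.

```python
from fractions import Fraction

# Hypothetical joint distribution over two binary events A and B.
joint = {
    (True, True): Fraction(3, 10),    # P(A, B)
    (True, False): Fraction(1, 10),   # P(A, B')
    (False, True): Fraction(2, 10),   # P(A', B)
    (False, False): Fraction(4, 10),  # P(A', B')
}

def marginal_a(joint):
    """P(A): sum the joint probability over both outcomes of B."""
    return sum(p for (a, _), p in joint.items() if a)

def marginal_b(joint):
    """P(B): sum the joint probability over both outcomes of A."""
    return sum(p for (_, b), p in joint.items() if b)

def conditional_a_given_b(joint):
    """P(A|B) = P(A, B) / P(B)."""
    return joint[(True, True)] / marginal_b(joint)

if __name__ == "__main__":
    print("P(A)   =", marginal_a(joint))             # 2/5
    print("P(B)   =", marginal_b(joint))             # 1/2
    print("P(A|B) =", conditional_a_given_b(joint))  # 3/5
```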

Bayesians are sometimes called subjectivists, because they propose Epistemic Probability - which considers probability to be a degree of belief in a certain hypothesis, or in the outcome of some experiment. This leads to differences in interpretation. Frequentists assign objective probability values to hypotheses, based on past experience. For example, a researcher might be studying the effect of variable x on the outcome of some experiment. He might do a t-test on the results, and assign a p value to the results (the p value is the probability of obtaining a value of the test statistic at least as extreme as the one that was actually observed, given that the null hypothesis is true). A scientific journal might decide that they will consider a p value of 0.05 or less sufficient to reject the null hypothesis, meaning they declare the results to be significant. These are all frequentist interpretations, based on objective numbers and prior knowledge. Most, or all, frequentists would agree on what significance values should be used to accept or reject a hypothesis.

Bayesian methods depend on assigning a prior probability to some event. A prior probability is simply a probability assigned before some specific evidence (usually the variable(s) under observation in the experiment) is taken into account. It might just be a good guess, or an "expert opinion". Posterior probability is calculated based on the prior AND the evidence (the experimental variable). Since it is possible to assign different priors to a situation, it is possible that Bayesian methods will produce different probabilities for an event, because they were based on different priors. This is not acceptable to frequentists. Bayesians, on the other hand, say that assigning priors may be subjective, but is not arbitrary. If several priors can be assigned, it simply means that the situation is ambiguous, and should not be treated as if it were not.

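Bayes theorem states that:

  P(A|B) = P(B|A) P(A) / P(B)

where:

  P(A) is the prior or marginal probability of A, before B is taken into account
  P(A|B) is the conditional (posterior) probability of A, given B
  P(B|A) is the conditional probability of B, given A
  P(B) is the prior or marginal probability of B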

Bayes Theorem "conditions" the variables. Conditioning means updating probabilities based on new information. In other words, it updates our beliefs about observing A, having observed B.
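As a rough illustration of conditioning, the sketch below updates beliefs over a set of competing hypotheses using the standard form of the theorem, P(A|B) = P(B|A) P(A) / P(B). The function name `posterior` and the coin example are hypothetical, chosen only to show the mechanics and to show that two different priors lead to two different posteriors from the same evidence.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(hypothesis | evidence) for each hypothesis.

    priors:      {hypothesis: P(hypothesis)}
    likelihoods: {hypothesis: P(evidence | hypothesis)}
    """
    # P(evidence), obtained by marginalizing over the hypotheses.
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}

# Hypothetical example: a coin is either fair or two-headed, and one toss shows heads.
likelihoods = {"fair": Fraction(1, 2), "two_headed": Fraction(1)}

# Two observers with different priors see the same evidence.
even_prior = {"fair": Fraction(1, 2), "two_headed": Fraction(1, 2)}
trusting_prior = {"fair": Fraction(9, 10), "two_headed": Fraction(1, 10)}

if __name__ == "__main__":
    print(posterior(even_prior, likelihoods))      # fair: 1/3, two_headed: 2/3
    print(posterior(trusting_prior, likelihoods))  # fair: 9/11, two_headed: 2/11
```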

Let's apply this to the Monty Hall Problem. Let's say there are 3 doors: red, blue and green. In the absence of prior knowledge, the contestant has to pick a door at random, and all 3 doors have equal probability of having the car behind them. We'll call this variable A from the above formula, so Ar, Ab and Ag are the events that the car is behind the red, blue or green door, and the prior or marginal probabilities are P(Ar) = P(Ab) = P(Ag) = 1/3. Let's say the contestant picks the red door.

Now let's look at the host's role, which is adding the conditional probability (we'll call it B, as in the equation above). If we didn't assume that the host had any prior knowledge of which door held the car, then he would be equally likely to open either of the two remaining doors. So the prior or marginal probability that he would open the blue door, for example, would be 50% or 1/2. Therefore P(B) is 1/2.

But the host does know which door has the car, and since he can't open either the door the contestant selected or the door with the car, what is the probability that the host will open the blue door? It depends on where the car is:

  P(B|Ar) = 1/2 (the car is behind the red door, so the host may open either blue or green)
  P(B|Ab) = 0 (the host never opens the door with the car)
  P(B|Ag) = 1 (blue is the only door the host is allowed to open)

So putting these numbers in the equation:

  P(Ar|B) = P(B|Ar) P(Ar) / P(B) = (1/2 × 1/3) / (1/2) = 1/3
  P(Ab|B) = P(B|Ab) P(Ab) / P(B) = (0 × 1/3) / (1/2) = 0
  P(Ag|B) = P(B|Ag) P(Ag) / P(B) = (1 × 1/3) / (1/2) = 2/3

Here P(Ar), P(Ag) and P(Ab) are the prior estimates of the car being behind the red, green or blue doors. P(B) is the prior estimate of the host opening either one of the two remaining doors, IF the host didn't know which door had the car. The conditional estimates are based on the host's knowledge of which door had the car.

RESULTS:

So if the contestant has picked the red door and the host opens the blue door to show the goat, then the probability of the car being behind the red door is 1/3, behind the green door 2/3, and behind the blue door 0.

Therefore, the contestant can increase his chances of winning the car by switching to the green door.

The same result applies no matter which color door he picks initially. He will always double his chances of winning in this situation by switching.

This is an example of how Bayes Theorem allows us to use new information to recalculate probabilities. Lots more on Bayes Theorem at the Wikipedia page.
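The calculation for the red/blue/green case can also be carried out mechanically. The sketch below is one possible way to set it up, with made-up names (`p_host_opens`, `likelihood`, `posterior`); it assumes that when the host has two goat doors available he opens either with equal probability, and it obtains P(B) by marginalizing over the three car positions rather than by assuming it.

```python
from fractions import Fraction

doors = ("red", "blue", "green")
picked, opened = "red", "blue"

# Prior: the car is equally likely to be behind each door.
prior = {d: Fraction(1, 3) for d in doors}

def p_host_opens(opened, car, picked):
    """P(host opens `opened` | car location, contestant's pick)."""
    allowed = [d for d in doors if d != picked and d != car]
    # Assumption: with two goat doors available, the host picks one at random.
    return Fraction(1, len(allowed)) if opened in allowed else Fraction(0)

likelihood = {car: p_host_opens(opened, car, picked) for car in doors}
p_b = sum(likelihood[c] * prior[c] for c in doors)  # marginal P(B)
posterior = {c: likelihood[c] * prior[c] / p_b for c in doors}

if __name__ == "__main__":
    print("P(B) =", p_b)  # 1/2
    for door in doors:
        print(door, posterior[door])  # red 1/3, blue 0, green 2/3
```

Changing `picked` and `opened` to any other valid pair gives the same 1/3 versus 2/3 split between staying and switching.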

When Should We "Add Information"?

As we saw in the Monty Hall problem, the opening of one door to reveal a goat has added information to the problem, and therefore it requires a recalculation of probabilities based on the new data. This is the reason why our intuitive estimate of probabilities is incorrect - because it failed to take the new information into account.

This leads to an important question. When should we take such new information into account? Always? Sometimes? How do we choose?

Consider this popular question on math quizzes: Mr. Jones has two children, at least one of whom is a son. What is the probability that both are boys? Let's say we can make the following two assumptions - (1) roughly the same number of boys and girls are born, so the probability of either is 50%, and (2) the sex of one child does not influence the sex of the other child (the two events are independent). This is like a coin toss. The probability of either heads or tails is 50%, and the results of one coin toss do not affect the results of another coin toss.

In this case, the intuitive answer to "what is the probability of both children being boys" is 50%. We already know that one child is a boy. The second has a 50% chance of being a boy, which would make both children boys. Therefore, the probability of both being boys is 50%.

Now consider this in a Bayesian sense. We know that Mr. Jones has 2 children. We know that one of them is a boy. What are the possibilities regarding the sex of the two children? They are:

  1. boy + girl
  2. boy + boy
  3. girl + boy

The 4th possibility, girl + girl, is ruled out because we already know that one of the children is a boy. So we have 3 possibilities, all of which are equally likely, but only one of which represents 2 boys. Therefore, the chance of Mr. Jones having 2 boys is 1/3. This is a different answer from our "intuitive" estimate of 1/2. Which answer is right?
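The counting argument can be reproduced by listing the four equally likely families and discarding girl + girl. This is a short sketch of that enumeration only; the variable names are arbitrary, and it relies on the two assumptions stated earlier (boys and girls equally likely, and the two births independent).

```python
from fractions import Fraction
from itertools import product

# All (first child, second child) combinations, each equally likely.
families = list(product("BG", repeat=2))

# Rule out girl + girl: keep families with at least one boy.
at_least_one_boy = [f for f in families if "B" in f]
both_boys = [f for f in at_least_one_boy if f == ("B", "B")]

p_both_boys = Fraction(len(both_boys), len(at_least_one_boy))
print(p_both_boys)  # 1/3
```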

If the question had simply been "guess if Mr. Jones' child is a son", then both the intuitive and Bayesian answer would have been "50% odds of being a son". However, the question actually adds some information by saying that Mr. Jones has 2 children, one of whom is a son, and we are asked to guess the gender of the other child. This is why the intuitive and Bayesian answers diverge. The question, though, is whether this information is relevant and ought to be considered in our probability estimate.

The answer is, we don't know from this precise statement of the problem whether the information is relevant or not. It depends upon how the person asking the question proceeded. If he picked a random name from the phonebook (Mr. Jones was randomly picked), and if Mr. Jones happened to have 2 children, and the questioner then learned about one of them, who happened to be a boy, then the chance of him having two boys is really 1/2, our intuitive guess. This is because no new information was added by the questioner - he had no control over the data, because he picked a name at random.

On the other hand, if the questioner picks names from the phonebook (at random or otherwise) but then selects from those only the people who have at least one boy, then the questioner has added information to the problem. By being selective, he has changed the odds. In that case, the Bayesian answer is correct, and the chance of the second child also being a boy is 1/3.

This is critical in Bayesian calculations. We need to know if information was added or not, and it's not always evident from the question. In the Monty Hall problem, it's clear that information was added, because the rules of the game specifically say that the game show host opens a door behind which he knows there is a goat. It's clear that the host knows which door has a goat behind it, and opens that door, revealing this information to the contestant. Therefore, the contestant can take advantage of the new information in recalculating the odds.

In other examples, such as the question about Mr. Jones, the new information is not evident. It depends upon the sampling technique used by the questioner to locate Mr. Jones. If the technique was random, no new information was added, and the correct answer is 1/2. If the technique was non-random, if Mr. Jones was selected on the basis that he already had one son (and other people were rejected if they didn't have at least one son), then the correct answer is 1/3.
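The effect of the sampling technique can be explored with a simulation. The sketch below compares two procedures and rests on an interpretation that is an assumption of this example, not something the question itself states: the "random" case is modeled as looking at one randomly chosen child of a random two-child family and keeping the family if that child is a boy, while the "selective" case keeps every family that has at least one boy. The function name `two_boys_rate` and the procedure labels are made up for illustration.

```python
import random

def two_boys_rate(procedure, trials=200_000, seed=3):
    """Estimate P(both children are boys) among families kept by a procedure."""
    rng = random.Random(seed)
    kept = both = 0
    for _ in range(trials):
        kids = [rng.choice("BG"), rng.choice("BG")]
        if procedure == "filter":
            # Selective sampling: keep only families with at least one boy.
            if "B" not in kids:
                continue
        elif procedure == "observe_one":
            # Look at one child chosen at random; keep the family if it is a boy.
            if rng.choice(kids) != "B":
                continue
        else:
            raise ValueError(f"unknown procedure: {procedure}")
        kept += 1
        both += (kids == ["B", "B"])
    return both / kept

if __name__ == "__main__":
    print("filter     :", two_boys_rate("filter"))
    print("observe_one:", two_boys_rate("observe_one"))
```

Under these modeling assumptions the selective procedure should come out near 1/3 and the observe-one procedure near 1/2, which is the divergence the two answers reflect.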

As we can see, it's not always possible to know which method to use. The question we need to ask is "does new information enter the calculation"? This can only be answered if the context of the question and how it was formulated is properly explained.