Friday, June 26, 2009

Clark pointed out, in response to my last post on the Washington Post's statistical analysis of the Iranian election results, that we can't conclude from an event with a 4-in-1000 probability of occurring that there is a 996-in-1000 probability that the event was faked. This is true, of course, and I should have been clearer in my remarks.

The information we want is the probability that the Iranian election results were legitimate, given the data we see. What the WaPo has given us, however, is the probability of seeing the data that we see, given a legitimate election. They sound like the same thing, but they're not.
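(For the notation-inclined: Bayes' theorem is what connects the two. It says P(legitimate | data) = P(data | legitimate) x P(legitimate) / P(data). The P(legitimate) term, the prior probability of a legitimate election before we look at any returns, is exactly the piece the WaPo's analysis can't supply.)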

The Wikipedia article on conditional probability gives the classic example of why this is the case. Suppose you have a test for a disease that is 99% accurate: it returns a wrong result 1% of the time, whether or not the patient has the disease. That sounds pretty good, but if you use it to screen for a disease that affects only 1% of the population, then the probability that a person actually has the disease, given a positive test result, drops to 50%. So the probability of seeing a positive test result, given the presence of the disease (99%), is not the same as the probability of having the disease, given a positive test result (50%).
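To make the arithmetic concrete, here's a quick Python sketch of the disease example (my own illustration, not from the Wikipedia article). It assumes the test is wrong 1% of the time in both directions:

```python
# Posterior probability of disease given a positive test (Bayes' theorem).
def p_disease_given_positive(prevalence, accuracy=0.99):
    true_pos = accuracy * prevalence               # sick and tested positive
    false_pos = (1 - accuracy) * (1 - prevalence)  # healthy but tested positive
    return true_pos / (true_pos + false_pos)

print(p_disease_given_positive(0.01))  # 0.5 -- a "99% accurate" test, but only 50% reliable here
```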

The Prosecutor's Fallacy describes the application of this logic in the real world. Here, the minuscule probability of certain evidence emerging, given the innocence of the accused, is used to argue that the probability of the accused being innocent must be just as tiny. This is what the WaPo appears to be engaging in with their statistical analysis.

Or is it? The thing to remember here is that the gap between the two conditional probabilities comes from the prior probability they're both built on. For example, in our disease scenario, the test becomes exactly as accurate as it appears if 50% of the population is affected by the disease: the probability of seeing a positive test result given the presence of the disease, and the probability of having the disease given a positive test result, are then both 99%. It's the fact that the disease is so rare to begin with that makes all the difference.
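Plugging the 50% prevalence into the same arithmetic confirms it:

```python
# Same 99%-accurate test, but now half the population has the disease.
p = (0.99 * 0.5) / (0.99 * 0.5 + 0.01 * 0.5)
print(p)  # 0.99 -- the test is now exactly as reliable as it sounds
```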

So with the Iranian election, if we already knew that there was a 50% chance the election results were faked, then the analysis described in the WaPo would be strong evidence of shenanigans. I'm not saying we know that, but there does seem to be plenty of evidence of irregularities beyond the statistical analysis alone.
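Here's a rough sketch of how the prior drives the conclusion, with loud caveats: the 0.004 is the WaPo's figure for seeing the anomalous digits given a clean election, but the 0.5 for seeing them given a fraudulent one is a number I made up purely for illustration, since nobody knows how often fraudsters produce such patterns.

```python
# Posterior probability of fraud for several priors (illustrative only).
p_data_given_legit = 0.004  # the WaPo's figure
p_data_given_fraud = 0.5    # hypothetical -- an assumption, not a known value

for prior_fraud in (0.01, 0.1, 0.5):
    num = p_data_given_fraud * prior_fraud
    den = num + p_data_given_legit * (1 - prior_fraud)
    print(f"prior {prior_fraud:4} -> posterior {num / den:.3f}")
# prior 0.01 -> posterior 0.558
# prior  0.1 -> posterior 0.933
# prior  0.5 -> posterior 0.992
```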

One more example to illustrate what I mean. Suppose 99 out of every 100 swans are white, and one is black. If you're walking along in the park and you see a black swan, you've just witnessed a rare event, with a probability of 1%. But this does not mean you can be 99% sure that the swan is fake. The probability that a given swan is fake depends on something else entirely: how common fake swans are in the first place, which is usually nowhere near 50%.

Suppose, however, that you heard on the news this morning that people have been out spray-painting white swans black, and that half of all the swans in the park have been painted. Now, when you see a black swan, the odds of it being fake have climbed to roughly 99%, because you know the prior odds of any given swan being fake are 50%.
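The same little calculation, translated to swans (again just a sketch; I'm assuming genuine swans are black 1% of the time and that every painted swan is black):

```python
# Posterior probability that a black swan is fake, for two priors.
p_black_given_genuine = 0.01  # 1 in 100 genuine swans is black
p_black_given_fake = 1.0      # assumption: every painted swan is black

for prior_fake in (0.001, 0.5):
    num = p_black_given_fake * prior_fake
    den = num + p_black_given_genuine * (1 - prior_fake)
    print(f"prior {prior_fake} -> posterior {num / den:.3f}")
# prior 0.001 -> posterior 0.091 (rare fakes: a black swan is probably real)
# prior 0.5   -> posterior 0.990 (after the news report: probably painted)
```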

1 comment:

Clark said...

That was quite a discussion from my little comment! The difficulty with this line of reasoning is that the more confident we are that the Iran election was fudged, the more credence we should give to the anomalous final digits in the results. It's a self-reinforcing cycle. (This isn't anyone's fault; it's just how it is.) Unfortunately, it's tough to quantify things like "it doesn't make sense that he lost his home town" and "it doesn't make sense that he won by the same margin almost everywhere".
