But let's be concrete. I find it easier to think of a particular example first, and then think about more general cases. (Some people prefer the other way around; people are different. But they're not writing this blog :-) Consider tossing a coin 5 times. What are the chances that we get 3 heads?
There's only 1 way to get 5 heads -- get a head on the first throw, and the second, and the third, and the fourth, and the fifth. 'And' is an important word in probability, meaning that we should multiply the probabilities of the individual events involved (as long as the events are independent, which coin tosses are). Since we're using a fair coin with a 1/2 chance of turning up heads, this means 1/2 * 1/2 * 1/2 * 1/2 * 1/2. So, 1/32 chance of turning up 5 heads in 5 throws.
But let's think about getting 4 heads in 5 throws. We have 5 different ways of getting that. I'll list them off with 'H' meaning heads and 'T' meaning tails. They are:
THHHH or HTHHH or HHTHH or HHHTH or HHHHT.
Each one of these has a 1/32 chance of happening. Here the other important probability word shows up: 'or' means to add the chances. So there is a 5/32 chance of getting 4 heads in 5 throws of a fair coin (or of your team winning 4 games out of 5 between evenly matched teams, or of a player hitting 4 shots out of 5 when he has a 50% shooting percentage, and so on).
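The 'and' and 'or' rules can be checked by brute force. Here is a minimal sketch in Python that lists every possible sequence of 5 tosses, keeps the ones with the right number of heads, and adds up their chances. The function name chance_of_heads is made up for this illustration.

```python
from fractions import Fraction
from itertools import product


def chance_of_heads(num_heads, num_tosses=5):
    """List the toss sequences with exactly num_heads heads (fair coin)
    and return them together with their combined chance."""
    # Every possible sequence of H and T, e.g. 'HHTHT'.
    sequences = ["".join(s) for s in product("HT", repeat=num_tosses)]
    matching = [s for s in sequences if s.count("H") == num_heads]
    # 'and': one specific sequence has chance 1/2 * 1/2 * ... = (1/2)^num_tosses
    per_sequence = Fraction(1, 2) ** num_tosses
    # 'or': add the chances of the matching sequences
    return matching, per_sequence * len(matching)


ways, chance = chance_of_heads(4)
print(ways)    # the five sequences listed above, in a different order
print(chance)  # 5/32
```

Listing everything works for 5 tosses (32 sequences), but the list doubles with every extra toss, so it is no use for anything like 493.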
It gets more complicated with 3 heads and 2 tails (there are 10 such orderings, so the chance is 10/32), and a lot more complicated when it's 493 tosses of the coin. That's where we want the general formula to do the legwork for us. When we get to cases where one outcome is more than 50% likely, it again becomes nice to have a more general formula. I've made up a little spreadsheet in Open Document format, if there's interest.
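For those who prefer code to a spreadsheet, here is a minimal sketch of the general formula in Python. It assumes the standard binomial formula: count the orderings with k successes, then multiply by the chance of any one such ordering. The name binomial_pmf is an illustrative label, and math.comb needs Python 3.8 or later.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Chance of exactly k successes in n independent tries,
    each with chance p of success."""
    orderings = comb(n, k)                    # how many ways ('or')
    one_ordering = p**k * (1 - p)**(n - k)    # chance of one way ('and')
    return orderings * one_ordering


print(binomial_pmf(3, 5, 0.5))  # 0.3125, which is 10/32
print(binomial_pmf(4, 5, 0.5))  # 0.15625, which is 5/32
```

The same function handles a weighted coin: just pass a value of p other than 0.5.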
For the case at hand, about 'Grumbine scientists', we're wondering what the chances are that there'd be 5 or more scientists in a group of 493 people. (By the way, speaking of family odds, the Bernoullis had an extraordinary run of mathematicians, hence the name for this bit of probability, and mathematical physicists, hence the 'Bernoulli effect' in fluid dynamics.) In the case of 'scientist', the coin is weighted heavily against coming up that way. I made up a probability of 99.9% that a given person (in the US) was not a scientist who had published in the scientific literature in the last 20 years. I don't know that this is correct, a point we'll return to. Nor, as we've already discussed and had comments on, are we confident that the number 493 is right or even very close.
With only 1 person in 1000 qualifying, we wouldn't be surprised to see 0 of 493 turn out to be scientists. The actual calculation says we'd expect 0 about 61% of the time, and only 1 about 30% of the time. Since the chance of something happening is 100% minus the chance of it not happening, having 2 or more scientists show up is only about a 9% chance. We shouldn't worship at the altar of the 5% level, but it's a good rule of thumb for getting started. With a 9% chance, we're not very impressed to see 2 or more scientists in this group of 493. But the chance of getting exactly 2 scientists is 7.4%. Subtracting that, and allowing for some rounding in the earlier figures, leaves only a 1.4% chance of finding 3 or more scientists in a group of 493 people. And that beats the 5% requirement handily. It's 0.2% for 4 or more, and 0.016% for 5 or more. This gets to a level where, as a matter of the probability, we'd be pretty confident that something real was going on.
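These figures can be reproduced with a few lines of Python. This is a sketch that assumes the standard binomial formula and uses the post's made-up 0.1% figure and the uncertain group size of 493. The function names are illustrative labels, not taken from the spreadsheet.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Chance of exactly k successes in n independent tries."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def chance_at_least(k, n, p):
    """Chance of k or more successes:
    100% minus the chance of fewer than k."""
    return 1.0 - sum(binomial_pmf(j, n, p) for j in range(k))


GROUP_SIZE = 493              # from the post; not known to be exact
SCIENTIST_FRACTION = 0.001    # the made-up 1-in-1000 figure

for k in range(3):
    exact = binomial_pmf(k, GROUP_SIZE, SCIENTIST_FRACTION)
    print(f"exactly {k}: {100 * exact:.1f}%")

for k in range(2, 6):
    tail = chance_at_least(k, GROUP_SIZE, SCIENTIST_FRACTION)
    print(f"{k} or more: {100 * tail:.3f}%")
```

The printed values should land close to the rounded percentages quoted above. Small differences in the last digit come from rounding at different stages.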
But there's a joker in this deck, and I'm it. There is a selection effect problem. Namely, this group contains me -- because I started looking at the subject on the grounds that I was in it. Any group that contains me is guaranteed to have at least 1 scientist, 1 left-handed person, 1 runner, and so on. If I'm the person selecting the group, then we have to not count me. That leaves us with only 4 identified Grumbine scientists for the purpose of our research. As that's still at the 0.2% level, we're still pretty confident that something real is going on. Or at least if not, it's a pretty surprising coincidence.
Suppose that the number I made up for the fraction of people who are scientists is too low. Let's say it's 0.3% instead of 0.1%. Then our chances of getting 4 or more scientists in the sample of 493 rise to 6.3%, which would not pass our standard for 'probably not chance'. So it is important to get a good idea of the figure. Now I'm pretty sure that the true figure isn't as high as that. It would mean about 1 million people in the US had published in the last 20 years, and that just doesn't seem plausible. Even 300,000 (the 0.1%) strikes me as high, but I was trying to err on the high side in the first place.
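To see how strongly the conclusion leans on the made-up fraction, one can rerun the calculation for several values. This is a sketch under the same binomial assumption. The 0.1% and 0.3% values come from the discussion above, while 0.05% and 0.2% are extra trial values added only for illustration.

```python
from math import comb


def chance_at_least(k, n, p):
    """Chance of k or more successes in n independent tries."""
    fewer = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k))
    return 1.0 - fewer


GROUP_SIZE = 493
COUNTED_SCIENTISTS = 4   # 5 identified, minus the person who picked the group
THRESHOLD = 0.05         # the 5% rule of thumb

# 0.05% and 0.2% are hypothetical trial values, not figures from the post.
for fraction in (0.0005, 0.001, 0.002, 0.003):
    chance = chance_at_least(COUNTED_SCIENTISTS, GROUP_SIZE, fraction)
    verdict = "under" if chance < THRESHOLD else "over"
    print(f"fraction {fraction:.2%}: {100 * chance:.2f}% ({verdict} the 5% level)")
```

Tripling the assumed fraction is enough to move the result from one side of the 5% level to the other, which is why pinning that figure down matters.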
Some odds and ends:
- We had some difficulty in finding data to work with.
- Once found, we saw that the data had some serious quality control problems.
- Our starting point included a selection bias problem.
- In trying to evaluate our conclusion, we discovered that the conclusion depended on an assumption we'd made without much evidence (the 0.1%).
Partly, this set of notes is meant to illustrate the Central Skill of a Scientist note. Although the original idea, that there are surprisingly many Grumbines in science, turned out to be probably acceptable, it could have turned out otherwise. Had it failed, the thing to do would have been to move on to other ideas. Scientists have many ideas, which is one of the secondary skills, so moving on to others is not a big deal.
Then again, having gotten this far, we are in a position to ask -- again -- "So what?". More politely, "What would be shown even if the idea were true?" If there really were some exceptional number of Grumbines (or Bernoullis or Darwins, to name some much better known families) in science, what would that mean? Unfortunately, nothing in particular. Maybe it is a sign that there's a genetic contribution to entering science. Maybe it's a sign that there are family environment features which, for some reason, are common in this crowd. And maybe it really is just chance. These are more reasons to not get too wedded to ideas.