13 May 2009

More noodling with numbers

I followed up quasarpulse's link to getting a handle on how many Grumbines there are in the US. This rapidly illustrated a truth that all scientists have to deal with, even if by way of avoidance. That is, data are messy and often ugly.

I started with states that I know have relatively many Grumbines, since the search has to be done state by state -- Pennsylvania, Maryland, and California (PA, MD, and CA). I found 104, 58, and 44, respectively. Those figures all made fair sense. An important part of first looking at data is thinking about whether they make sense. So far so good. Then I went on to large-population states -- Texas and Florida. 18 and 36. Both seem OK, with lower Grumbine rates than MD and PA, as expected. About 1 in 1 million.

Then I started working my way away from Pennsylvania. All was well until I hit Rhode Island. Returned 116 names. Wow! Rhode Island is not a big state, and there it is with even more Grumbines than Pennsylvania!

Sanity checker alarm goes off. Let's look a little more carefully. Hmm. Of that 116, 0 are shown in any cities that are in Rhode Island. Alert time: People may be shown as living in states (from the search) that they are not living in (from the residence data). If we were being rigorous here, we'd go back to the beginning and look carefully through all states' information for people who don't live in the state we're looking for at the moment. Since I'm not being rigorous, I merely note that the figures are going to be over-estimating how many Grumbines there are and take a look at how this over-estimate affects any conclusions we try to draw later. It turns out that 116 is the number (and it's the same set of names and places shown) given when there are actually zero.

Alerted by the Rhode Island result, I pay a bit more attention to where people are listed as being from. Not much more, and I take the simple numbers anyhow, aside from Nevada, which at first glance shows 8 Grumbines. But 3 are not living in Nevada. And 4 of them are Robert E. Grumbine, living in Carson City, Nevada. I simply don't believe that a single city in Nevada has 4 different guys by that first name and middle initial. (Now, as far as that goes, R. Grumbine is pretty common, 59 of them in the US by that site's search.)
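If I were being rigorous, the Nevada cleanup could be sketched in a few lines of Python. Everything here is illustrative: the Listing type, the clean function, and the placeholder records are my own inventions, shaped only to match the counts above (8 listings, 3 out of state, 4 copies of one man). The fifth in-state record is an assumption, as are all the placeholder names, cities, and states.

```python
from typing import NamedTuple


class Listing(NamedTuple):
    name: str
    city: str
    state: str


def clean(listings, searched_state):
    """Keep listings that are actually in searched_state, dropping exact repeats."""
    in_state = [r for r in listings if r.state == searched_state]
    return sorted(set(in_state))


# Hypothetical records: only the counts (8 total, 3 out of state, 4 repeats)
# come from the search result; every placeholder name and place is made up.
nevada = (
    [Listing("Robert E. Grumbine", "Carson City", "NV")] * 4
    + [Listing("Placeholder A", "Placeholder City", "NV")]
    + [
        Listing("Placeholder B", "Elsewhere", "PA"),
        Listing("Placeholder C", "Elsewhere", "MD"),
        Listing("Placeholder D", "Elsewhere", "CA"),
    ]
)

cleaned = clean(nevada, "NV")
print(len(nevada), "listed,", len(cleaned), "distinct and in state")
```

Matching on exact name and city is the crudest possible duplicate test. It would miss a 'Robert Grumbine' who is the same person as 'Robert E. Grumbine', so even the cleaned count is only a better estimate, not the truth.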

Total figure for Grumbines that I get, keeping in mind that it's an overestimate, is 493. That, versus 5 already-named Grumbines publishing in science in the last 20 years. The names for all Grumbines may well be over-estimating in a different way, now that I think about that. The web site shows many people with no age (estimate), estimated ages over 90, etc. There's a fair chance (in fact in one case I know it's true) that they're showing dead people. That, too, will inflate the totals. In the case of the 5 publishing in science, I know they're either currently or at least pretty recently alive.

So, divide our 5 Grumbine scientists (already known -- there might be more) by 493 Grumbines in the US, and we've got a 'scientist rate' of 1.01%. If the real number of scientists should be 10 (it's quite easy for me to have not found another 5 since I didn't look much), then the more accurate number would be 2%. If the real number of Grumbines is only 400 (given what I saw in Nevada, Wyoming, and Oklahoma, I'd be unsurprised by seeing about 20% of the listings being duplicates), the rate would be 1.25%.
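The three what-if cases above are just one division repeated with different inputs, which a short Python sketch makes explicit. The function name rate_percent and the case labels are mine; the counts are the ones in the text.

```python
def rate_percent(scientists, total):
    """Scientists as a percentage of all listed Grumbines."""
    return 100.0 * scientists / total


# (scientists, total Grumbines) for each scenario discussed in the text
cases = {
    "as counted": (5, 493),
    "10 scientists instead of 5": (10, 493),
    "only 400 real Grumbines": (5, 400),
}

for label, (scientists, total) in cases.items():
    print(f"{label:>28}: {rate_percent(scientists, total):.2f}%")
```

Laying the cases side by side shows the spread directly: plausible changes in either count move the answer anywhere from about 1% to about 2%.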

This is a different aspect of sanity checking -- look to see how much flex there is in numbers you are working with. It also points to my usual complaint about excess precision. That initial 1.01% is absurd. We don't have enough data to draw that fine a conclusion. Just 1 Grumbine scientist more or less changes that by 0.2%. If one data point more or less changes you in the tenths, the hundredths are not meaningful. In some classes, you encounter this as 'significant digits'. We only have 1 digit representing the number of Grumbine scientists. The 'rate' can't have more than that. So, go with 1% if you need a rate. Given that we're talking about modest numbers, better is to simply work with the numbers themselves.
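Here is a rough Python sketch of that precision argument: see how far the rate moves when the scientist count changes by one, then round to the one digit the data can support. The helper round_sig is my own, not a library function.

```python
import math


def round_sig(x, digits=1):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - exponent)


total = 493  # listed Grumbines, known to be an overestimate

# Effect of one scientist more or less on the rate, in percent
step = 100.0 / total
print(f"one scientist changes the rate by {step:.2f}%")

for scientists in (4, 5, 6):
    raw = 100.0 * scientists / total
    print(f"{scientists} scientists: raw {raw:.4f}%, reportable {round_sig(raw)}%")
```

With 4 or 6 scientists instead of 5, the raw rate moves in the first decimal place, so the '01' in 1.01% is noise.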

To complete the test, we then compare the number of Grumbine scientists to the number we'd expect if the rate were the same as for the rest of the population. I previously made up the figure 0.1% for the general population. That gives us a prediction of 0.493 Grumbine scientists, and shows a problem we need to address. Fractional people tend not to be available. Sorting that out takes an additional note.
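The prediction itself is a single multiplication, sketched here in Python with my own variable names. Keep in mind that the 0.1% is a made-up figure and the 493 is an overestimate, so the output is no better than those inputs.

```python
general_rate = 0.001   # the made-up 0.1% scientist rate for the general population
grumbines = 493        # listed Grumbines, known to be an overestimate
observed = 5           # Grumbine scientists found so far

expected = general_rate * grumbines
print(f"expected {expected:.3f} scientists, observed {observed}")
```

The expected value comes out as a fraction of a person, which is exactly the difficulty: a real count can only be 0, 1, 2, and so on, so the observed 5 can't be compared to 0.493 by simple subtraction.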

In the meantime, I'll point out that there's nothing special about 'Grumbine' and 'doing science'. That's the real reason for the detail and multiple posts. It's a very general matter of scientific approach. You could be looking instead at 'Americans' and 'with swine flu', and the folks at the Centers for Disease Control are doing exactly that, along with a ton of other examinations. In a trial for a new medicine, you'd be looking at 'people who took placebo and a) got better or b) got worse' vs. 'people who took drug and a) got better or b) got worse', and much more elaborate matters. For climate, we might take 'recording stations' and 'shows warming trend over the last 30 years'. And so on.

1 comment:

Bayesian Bouffant, FCD said...

If you can't beat 'em, join 'em. I think you should pack up and move to Carson City.