Sunday, May 31, 2009

Star Trek

Yes, mom, if you haven't seen it yet, do so. My wife and I, and our youngest son, enjoyed it.

I don't know if the rest of you will like it. But my mother and I have been watching Star Trek, on TV and in movies, since the start of its run on television (yes, I'm over 40 years old). So I know her tastes. If you're towards our end of the viewing spectrum, you'll enjoy it.

Our end is an enjoyment of the at times absurd but enthusiastic exploration that was central to the original series. The current movie regains a lot of that. If you liked, say, the Alice in Wonderland episode, you're set. The 'red matter' is, scientifically, piffle. But no matter. There is some adventure to be had, challenges to be met, some personalities to deal with. And it gets there with some humor. Not stand up comic humor, or pratfall (at least so much). But people (even if some are Vulcans or other things) have their foibles and you live with them. It's aimed at an older audience, in some senses (namely how much clothing some of the women have on), than the original TV series (which, by today's standards, may well get a 'G').

There are also some nice references to the originals, plus the original Spock appearing (Leonard Nimoy, though if I need to tell you that, just go see the movie and have fun with some space opera). If you insist on rigorous self-consistency and sensibility within the movie and with respect to the original series ... I understand Boole wrote a lovely book on Mathematical Logic, and you might enjoy Chandrasekhar's Ellipsoidal Figures of Equilibrium.

And yes, my mother is one of my readers. Where do you think I got the interest in the universe from? Who gave it a pretty free rein when I was growing up? But parenting kids to understand and appreciate science (and writing, and music, and ...) is a different post. I've tried to pass it on, which is a different matter than trying to make my kids become scientists.

Friday, May 29, 2009

Sea Ice Odds

I've been working out, finally, my thoughts on what the sea ice cover will be like for the Arctic, in September 2009. That is for the average for the month, rather than the minimum at any time.

But also a good time to return to some basics about sea ice. Two different questions we could ask about ice cover are: 1) what area of the earth (Arctic ocean, Hudson Bay, ...) has sea ice floating on it? 2) If I look at a grid of sea ice coverage, what is the area of the grid cells that has any noticeable ice in it? The former is sea ice area, and the latter is sea ice extent. See some of the web sites I link to on the right for samples.

In terms of measurement, extent is much easier to get than area. If there is sea ice in an area, it usually covers a large fraction of the surface around there. Extent doesn't worry about whether it was 70% covered or 95% covered. As long as, on the average, you can correctly distinguish between more than 15% and less than 15%, you're set. Sea ice area, on the other hand, is more challenging. Now you have to be confident that when the satellite says '10% water', all 10% are because of the ocean. Problem is, the water could also be ponds that are on the surface of the ice. (Melt ponds; we're not very creative in our vocabulary.) You can't have an ice floe entirely covered by melt ponds -- the water would run off. So extent is still safe, even if area could be off by 30% in a part of the pack with lots of melt ponds.
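To make the extent/area distinction concrete, here is a minimal sketch in Python. The grid values and cell size are made up for illustration, and applying the 15% cutoff to the area sum as well is my assumption, not something stated above. The second grid pretends melt ponds made the satellite read less ice in the two most heavily covered cells.

```python
# Hypothetical ice concentrations (fraction of each grid cell covered by ice).
# These numbers are invented for illustration only.
true_conc = [0.00, 0.10, 0.40, 0.70, 0.95, 1.00]
# Same cells, but melt ponds on the ice are misread as open water.
ponded_conc = [0.00, 0.10, 0.40, 0.70, 0.70, 0.75]

cell_areas_km2 = [625.0] * len(true_conc)  # assumed 25 km x 25 km cells
THRESHOLD = 0.15  # a cell counts as "ice covered" at 15% or more


def ice_extent(conc, areas, threshold=THRESHOLD):
    """Total area of the cells whose concentration meets the threshold."""
    return sum(a for c, a in zip(conc, areas) if c >= threshold)


def ice_area(conc, areas, threshold=THRESHOLD):
    """Ice-covered area: concentration times cell area, over the same cells."""
    return sum(c * a for c, a in zip(conc, areas) if c >= threshold)


for label, conc in (("true", true_conc), ("ponded", ponded_conc)):
    extent = ice_extent(conc, cell_areas_km2)
    area = ice_area(conc, cell_areas_km2)
    print(f"{label:>6}: extent = {extent:.0f} km^2, area = {area:.2f} km^2")
```

The extent is the same for both grids, because every affected cell stays above 15%. The area figure drops, which is the reason for preferring extent when you have the choice.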

My prediction, then, regards the extent of ice, rather than the area. Predict numbers you can measure more reliably if you have a choice. For other reasons, it's for the average extent for the whole month. (As computed by the NSIDC, and, again, Arctic only).

The prediction is: 4.92 million km^2, with a standard deviation of 0.47 million km^2. 2008 showed 4.67 million km^2, and 2007 showed 4.30 million km^2 for the month's average. I'm not saying that the sea ice is 'recovering'. Actually, if my method (which, notice, I haven't told you what it is) is right, then I also expect the Arctic ice to be going to zero sometime after 2022, which is on the early end of estimates. How much after requires the next step of sophistication in the method. That'll come later. Recovery would mean 6.67 million km^2 (the mean of the last 30 years; arguably, it should mean more like 7.4 million). All that's involved is that 2007 and 2008 were extraordinary, which we already knew, and remain extraordinary even after going to a better :-) model for the progression.

I don't make cash bets, or recommend doing so. I will, however, bet honor points. (Quatloos, a group I was in named it.) One place I'll be making such a bet is Stoat, William Connolley's site. Then again, we'll have to see if he finds it attractive. Given my prediction, one obvious bet is even money that the September average will be less than (or more than) the number I gave, 4.92 million km^2. Someone who thinks the sea ice has recovered (which does not include William) should be leaping up with glee to take the 'more than' side of the bet. Actually, though, if they really believed that ice had recovered, they should go for even odds that the ice would be more than 6.67 million km^2.

This is where we get to the 'odds' mentioned. If someone believes that there's been no real change in the sea ice cover, just some bouncing around, or if they think it has recovered, this means they have a particular statistical model in mind. Namely, that sea ice in any given year should be about the mean. And there's a degree of spread (the standard deviation). This model says, given the last 30 years of observations, that we should see (every year) 6.67 million km^2, with a standard deviation of 0.87. The 0.87 is the measure of spread. One wouldn't be surprised to see ice cover be less than 5.8 million km^2; about 1 year in 6 should do so. On the other hand, only 1 year in 40 should be more than 2 standard deviations below normal, 4.9 million km^2. Someone who believes that the Arctic ice has recovered, then, should be offering me 20:1 odds that the Arctic ice extent will be more than 4.97 million km^2 (working this number out with full detail).
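For those who want to check the 'no change' model's numbers, here is a rough sketch. It assumes the year-to-year scatter follows a normal (bell curve) distribution, which is my reading of the '1 year in 6' and '2 standard deviations' figures rather than something spelled out above. The function name normal_cdf is just for this illustration.

```python
import math


def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution with the given mean and sd."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


MEAN = 6.67  # million km^2, 30-year September mean quoted in the text
SD = 0.87    # million km^2, standard deviation quoted in the text

# Thresholds from the text: 1 sd below, 2 sd below, and the 4.97 bet level.
tail = {}
for threshold in (5.80, 4.93, 4.97):
    tail[threshold] = normal_cdf(threshold, MEAN, SD)
    print(f"P(extent < {threshold:.2f}) = {tail[threshold]:.4f}, "
          f"about 1 year in {1.0 / tail[threshold]:.0f}")
```

Under these assumptions the 5.80 threshold comes out near 1 year in 6, and the 4.97 threshold comes out close to the 1-year-in-40 level.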

So I'll invite those who've been saying that Arctic sea ice has recovered to take the 20:1 bet -- they owe me 20 quatloos if the Arctic ice extent averages less than 4.97 million km^2 for September, as computed by NSIDC. I give them 1 if it's more. I'll also invite you to refer people who say that the Arctic (or who just say 'sea ice') has recovered to this note. Or, if they prefer, I'll give 20 quatloos to their 1 that the figure will be less than 6.67 million km^2. By their thinking, it's even chance that it'll be more than 6.67. By mine, it's awfully unlikely.

Also open to other offers. Add them to the comments. I might be slow to respond, but offers will be considered through the end of June. If you think the trend is a simple straight line, your prediction is for 5.46 million km^2, with standard deviation 0.53 (again based on 30 years). You should offer your 5 quatloos to my 3 that the figure will be greater than my prediction of 4.92. I'll give 5 to your 3 that it will be less than 5.46. Notice that there's a symmetry to the odds.

Oh well. I hope I haven't just confused most everyone, and bored to tears the rest.

Currently having technical difficulties, but I'm adding a poll at the bottom about how it'll turn out. The numbers are centers of the ranges. 5 means 4.75 through 5.25, for instance.

Wednesday, May 20, 2009

Binomial probabilities

This also goes under the name of 'Bernoulli trials'. The idea is that you have a circumstance in which one of two things will happen. And you're going to try out the process many times. This could be tossing coins, or dice, or a batter going to the plate (or wicket) many times, a basketball or hockey player taking a number of shots, and so on. After a bunch of trials, you then ask what the chances are that you got that many heads (or sixes, hits, baskets, ...) or more. Even better is when you try to figure out the chances before you start the trials.

But let's be concrete. I find it easier to think of a particular example first, and then think about more general cases. (Some people prefer the other way around; people are different. But they're not writing this blog :-) Consider tossing a coin 5 times. What are the chances that we get 3 heads?

There's only 1 way to get 5 heads -- get a head on the first throw, and the second, and the third, and the fourth, and the fifth. 'And' is an important word in probability, meaning that we should multiply the probability of the individual events involved. Since we're using a fair coin with a 1/2 chance of turning up heads, this means 1/2 * 1/2 * 1/2 * 1/2 * 1/2. So, 1/32 chance of turning up 5 heads in 5 throws.

But let's think about getting 4 heads in 5 throws. We have 5 different ways of getting that. I'll list them off with 'H' meaning a heads, and 'T' meaning tails. They are:
THHHH or HTHHH or HHTHH or HHHTH or HHHHT.
Each one of these has a 1/32 chance of happening. The other important probability word is present. 'or' means to add the chances. So there is a 5/32 chance of getting 4 heads in 5 throws of a fair coin. (Or for your team to win 4 games out of 5 between evenly matched teams, or for a player to hit 4 shots out of 5 when he has a 50% shooting percentage, and so on.)
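If you'd rather let the computer do the listing, here is a small sketch that writes out all 32 possible sequences of 5 tosses and counts how many have each number of heads. It is brute force, so it only works for small numbers of tosses, but it follows the 'and'/'or' reasoning directly.

```python
from collections import Counter
from itertools import product

N_TOSSES = 5

# Every possible sequence of heads (H) and tails (T): 2**5 = 32 of them,
# each equally likely for a fair coin.
sequences = ["".join(s) for s in product("HT", repeat=N_TOSSES)]

# How many sequences have exactly k heads?
heads_counts = Counter(seq.count("H") for seq in sequences)

for k in range(N_TOSSES, -1, -1):
    print(f"{k} heads: {heads_counts[k]}/{len(sequences)}")

# The sequences with exactly 4 heads, matching the list above.
print([seq for seq in sequences if seq.count("H") == 4])
```

The same count answers the question we started with: 10 of the 32 sequences have exactly 3 heads.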

It gets more complicated with 3 heads, 2 tails. And a lot more complicated when it's 493 tosses of the coin. That's where we want the general formula to do the legwork for us. When we get to cases where one event is more than 50% likely, again, it becomes nice to have a more general formula. I've made up a little spread sheet in Open Document format, if there's interest.

For the case at hand, about 'Grumbine scientists', we're wondering what the chances are that there'd be 5 or more scientists in a group of 493 people. By the way, speaking of family odds, the Bernoullis had an extraordinary run of mathematicians (hence the name for this bit of probability) and mathematical physicists ('Bernoulli effect' in fluid dynamics). In the case of 'scientist', the coin is weighted heavily against coming up that way. I made up a probability of 99.9% that a given person (in the US) was not a scientist who had published in the scientific literature in the last 20 years. Don't know that this is correct, which we'll return to. Nor, as we've already discussed and had comments on, are we confident that the number 493 is right or even very close.

With only 1 person in 1000 qualifying, we wouldn't be surprised to see 0 of 493 turn out to be scientists. The actual calculation gives 61% of the time that we'd expect 0. 30% of the time, we'd expect only 1. Conversely, the chance of something happening is 100% minus the chance of it not happening. Having 2 or more scientists show up, then, is only about a 9% chance. We shouldn't worship at the altar of the 5% level, but it's a good rule of thumb for getting started. With a 9% chance, we're not very impressed to see 2 or more scientists in this group of 493. But chances of getting 2 scientists are 7.4%. Between that, and some rounding in the earlier figures, there's only a 1.4% chance of finding 3 or more scientists in a group of 493 people. And that beats the 5% requirement handily. It's 0.2% for 4 or more, and 0.016% for 5 or more. This gets to a level where, as a matter of the probability, we'd be pretty confident that something real was going on.
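Here is a sketch of that calculation using the standard binomial formula: the chance of exactly k successes in n trials is (number of ways to choose k of the n) times p^k times (1-p)^(n-k). The function names are mine, the 0.1% is still the made-up figure from above, and math.comb needs Python 3.8 or later.

```python
from math import comb


def binom_pmf(k, n, p):
    """Chance of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_tail(k, n, p):
    """Chance of k or more successes: 100% minus the chance of fewer than k."""
    return 1.0 - sum(binom_pmf(i, n, p) for i in range(k))


N_PEOPLE = 493        # Grumbines counted (known to be an overestimate)
P_SCIENTIST = 0.001   # made-up 1-in-1000 rate for the general population

for k in range(6):
    print(f"exactly {k}: {binom_pmf(k, N_PEOPLE, P_SCIENTIST):8.4%}   "
          f"{k} or more: {binom_tail(k, N_PEOPLE, P_SCIENTIST):8.4%}")
```

Running it should reproduce the figures quoted above to within rounding.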

But there's a joker in this deck, and I'm it. There is a selection effect problem. Namely, this group contains me -- because I started looking at the subject on the grounds that I was in it. Any group that contains me is guaranteed to have at least 1 scientist, 1 left-handed person, 1 runner, and so on. If I'm the person selecting the group, then we have to not count me. That leaves us with only 4 identified Grumbine scientists for the purpose of our research. As that's still at the 0.2% level, we're still pretty confident that something real is going on. Or at least if not, it's a pretty surprising coincidence.

Suppose that the number I made up for fraction of people who are scientists is too low. Let's say it's 0.3% instead of 0.1%. Then our chances of getting 4 or more scientists in the sample of 493 rises to 6.3%, which would not pass our standard for 'probably not chance'. So it is important to get a good idea of the figure. Now I'm pretty sure that the true figure isn't as high as that. It would mean 1 million people in the US had published in the last 20 years, and that just doesn't seem plausible. Even 300,000 (the 0.1%) strikes me as high, but I was trying to err on the high side in the first place.
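A quick sketch of that sensitivity check, comparing the two made-up rates (0.1% and 0.3%) for the chance of seeing 4 or more scientists in 493 people. The helper name is mine; the 5% line is the rule of thumb used above, not a law.

```python
from math import comb


def chance_at_least(k, n, p):
    """Binomial chance of k or more successes in n independent trials."""
    fewer = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - fewer


N_PEOPLE = 493
RULE_OF_THUMB = 0.05  # the 5% level

results = {}
for rate in (0.001, 0.003):
    results[rate] = chance_at_least(4, N_PEOPLE, rate)
    verdict = "probably not chance" if results[rate] < RULE_OF_THUMB else "could be chance"
    print(f"rate {rate:.1%}: chance of 4 or more = {results[rate]:.2%} -> {verdict}")
```

Tripling the assumed rate moves the result from comfortably under the 5% line to just over it, which is why the assumption matters.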

Some odds and ends:
  • We had some difficulty in finding data to work with.
  • Once found, we saw that the data had some serious quality control problems.
  • Our starting point included a selection bias problem.
  • In trying to evaluate our conclusion, we discovered that the conclusion depended on an assumption we'd made without much evidence (the 0.1%).
All of this is common. To do science, you have to be ready to go back over your whole process to verify that where each step was not extremely strong, it at least doesn't change your conclusions. If it could, you have to mention this. Something that happens, though, in science is that once it's been mentioned, issues don't necessarily get rehashed in every paper for ever after. Inexperienced readers of science sometimes complain, because of this, about scientists 'hiding' problems. It isn't hidden, it's there in the scientific literature. It's just that the scientific literature assumes you're all big kids and have done your homework in reading the previous work. Citations aren't there for decoration.

Partly, this set of notes is to illustrate the Central Skill of a Scientist note. Although it turned out that the original idea, of there being surprisingly many Grumbines in science, is probably acceptable, it could have turned out otherwise. At that point, move on to other ideas. Scientists have many ideas, which is one of the secondary skills. So moving on to others is not a big deal.

Then again, having passed this far, we are in a position to ask -- again -- "So what?". More politely, "What would be shown even if the idea were true?" If there really were some exceptional number of Grumbines (or Bernoullis or Darwins to name some much better known families) in science, what would that mean? Unfortunately, nothing particular. Maybe it is a sign that there's a genetic contribution to entering science. Maybe it's a sign that there are family environment features which, for some reason, are common in this crowd. And maybe it really is just chance. These are more reasons to not get too wedded to ideas.

Monday, May 18, 2009

The central skill of a scientist

"Another beautiful theory slain by an ugly fact." I believe that is from Julian Huxley, but it could be T. H. Huxley or JBS Haldane instead.

In any case, being able to say that and move on to the next theory or idea is the central skill of a scientist. The math, experiments, field observations, and so on, are only tools. Central is to know that what you think, perhaps even very strongly, to be the case today could run in to some ugly facts tomorrow ... and then you'll have to change your mind in accord with this new evidence.

One of my encounters with this involved a notion (far too early in the process to call it a theory) about clouds. A friend was studying clouds and the snowfall from them. The clouds were in a bunch of parallel bands. This was no surprise. The surprise was that spaced regularly down the bands were 'knots' of particularly high snowfall rates. Why should such a thing happen? I happened to be looking at hydrodynamic stability problems at the time, and one looked to be about right. So I mentioned it, and he said 'Bob, there's no known surface tension effect in clouds.' Ok, I knew that, but maybe we had the first observations that there was such a surface tension effect in clouds. The thing was, the mechanism made definite predictions about how far apart the knots would be. So, we checked. The knots were not spaced right for that idea, and were nowhere near. Ugly fact slayed my beautiful theory.

Oh well, time to move on to newer and better ideas. And there is the hard part to doing science. People -- scientists included -- get attached to their ideas, or to ideas they learned long ago. Letting go of one because the data just aren't there, new data come along and show the old data was bad, or any of the host of reasons that leads us to change our minds in science ... that's hard. This doesn't mean that you must accept whatever new thing anyone presents. That would be foolish; you're certainly entitled to check under the hood, kick the tires, take the new out for a test drive. But, after challenging the new (data, model, theory, inference, ...) if you can't shake it, you have to grant it at least tentative acceptance -- even if it is contrary to something else that you like better.

Wednesday, May 13, 2009

More noodling with numbers

I followed up quasarpulse's link to getting a handle on how many Grumbines there are in the US. This rapidly illustrated a truth that all scientists have to deal with, even if by way of avoidance. That is, data are messy and often ugly.

I started with states, since it has to be done state by state, that I know have relatively many Grumbines -- Pennsylvania, Maryland, and California (PA, MD, and CA). Found 104, 58, and 44, respectively. Those figures all made fair sense. Important part to early looking at data is thinking about whether they make sense. So far so good. Then I went to looking at large population states -- Texas and Florida. 18 and 36. Both seem ok, lower Grumbine rates than MD and PA, as expected. About 1 in 1 million.

Then I started working my way away from Pennsylvania. All was well until I hit Rhode Island. Returned 116 names. Wow! Rhode Island is not a big state, and there it is with even more Grumbines than Pennsylvania!

Sanity checker alarm goes off. Let's look a little more carefully. Hmm. Of that 116, 0 are shown in any cities that are in Rhode Island. Alert time: People may be shown as living in states (from the search) that they are not living in (from the residence data). If we were being rigorous here, we'd go back to the beginning and look carefully through all states' information for people who don't live in the state we're looking for at the moment. Since I'm not being rigorous, I merely note that the figures are going to be over-estimating how many Grumbines there are and take a look at how this over-estimate affects any conclusions we try to draw later. It turns out that 116 is the number (and it's the same set of names and places shown) given when there are actually zero.

Alerted by the Rhode Island result, I pay a bit more attention to where people are listed as being from. Not much, and I take the simple numbers anyhow aside from Nevada, which at first glance shows 8 Grumbines. But 3 are not living in Nevada. And 4 of them are Robert E. Grumbine, living in Carson City, Nevada. I simply don't believe that a single city in Nevada has 4 different guys by that first name and middle initial. (Now, as far as that goes, R. Grumbine is pretty common, 59 of them in the US by that site's search.)

Total figure for Grumbines that I get, keeping in mind that it's an overestimate, is 493. That, versus 5 already-named Grumbines publishing in the last 20 years in science. The names for all Grumbines may well be over-estimating in a different way, now that I think about that. The web site shows many people with no age (estimate), estimated ages over 90, etc. There's a fair chance (in fact in one case I know it's true) that they're showing dead people. That, too, will inflate the totals. In the case of the 5 publishing in science, I know they're either currently or at least pretty recently alive.

So, divide our 5 Grumbine scientists (already known -- there might be more) by 493 Grumbines in the US, and we've got a 'scientist rate' of 1.01%. If the real number of scientists should be 10 (it's quite easy for me to have not found another 5 since I didn't look much), then the more accurate number would be 2%. If the real number of Grumbines is only 400 (given what I saw in Nevada, Wyoming, and Oklahoma, I'd be unsurprised by seeing about 20% of the listings being duplicates), the rate would be 1.25%.

This is a different aspect of sanity checking -- look to see how much flex there is in numbers you are working with. It also points to my usual complaint about excess precision. That initial 1.01% is absurd. We don't have enough data to draw that fine a conclusion. Just 1 Grumbine scientist more or less changes that by 0.2%. If one data point more or less changes you in the tenths, the hundredths are not meaningful. In some classes, you encounter this as 'significant digits'. We only have 1 digit representing the number of Grumbine scientists. The 'rate' can't have more than that. So, go with 1% if you need a rate. Given that we're talking about modest numbers, better is to simply work with the numbers themselves.
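The flex in the rate is easy to see with a few lines of arithmetic. This sketch recomputes the rate for the scenarios mentioned above, plus one scientist more or less, and then rounds to one significant digit. The scenario labels are mine.

```python
def scientist_rate(scientists, total):
    """Scientists as a percentage of all Grumbines counted."""
    return 100.0 * scientists / total


# (scientists, total Grumbines) for each scenario discussed in the text.
scenarios = {
    "as counted": (5, 493),
    "one fewer scientist": (4, 493),
    "one more scientist": (6, 493),
    "10 scientists": (10, 493),
    "only 400 Grumbines": (5, 400),
}

for label, (scientists, total) in scenarios.items():
    rate = scientist_rate(scientists, total)
    # .1g keeps a single significant digit, which is all the data supports.
    print(f"{label:>20}: {rate:.2f}%  -> about {rate:.1g}%")
```

One scientist either way shifts the rate by about 0.2 percentage points, so quoting hundredths of a percent claims more than we know.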

To complete the test, we then compare the number of Grumbine scientists to the number we'd expect if the rate were the same as for the rest of the population. I previously made up the figure 0.1% for the general population. That gives us a prediction of 0.493 Grumbine scientists, and shows a problem we need to address. Fractional people tend not to be available. That's also why it takes an additional note.

In the meantime, I'll point out that there's nothing special about 'Grumbine' and 'doing science'. That's the real reason for the detail and multiple posts. It's a very general matter of scientific approach. You could be looking instead at 'Americans' and 'with swine flu', and the folks at the Centers for Disease Control are doing exactly that, along with a ton of other examinations. In a trial for a new medicine, you'd be looking at 'people who took placebo and a) got better b) got worse' vs. 'people who took drug and a) got better or b) got worse', and much more elaborate matters. For climate, we might take 'recording stations' and 'shows warming trend over the last 30 years'. And so on.

Tuesday, May 12, 2009

Assortment of Grumbines

A couple of coincidences turn thoughts towards the topic of Grumbines. A few weeks ago I heard from R. Edward Grumbine. Turns out we are indeed related, though perhaps no closer than my 5-great grandfather (Leonhart Krumbein). Also in recent visitation, it seems some people looking for Dr. Francis Grumbine found themselves here. Francis is a medical doctor, I'm a PhD in geoscience, Ed (R. Edward) is a PhD and professor of ecology. One more, if not recent, coincidence: I was an undergraduate at Northwestern University and graduate student at the University of Chicago. While at Northwestern, I worked in the office (what once was) of William Krumbein, another relative, who was a noted geologist. (He's another 3 generations back from me, and I've never talked to Francis though he is in the area here.)

If you look some more, you'll find more Grumbines (also Krumbein, Crumbine, ...) in science and medicine. Rather a surprising number, at least to me.

Ok, a surprise. What should a scientist do with a surprise? Start thinking about how surprising it really is, of course. Keeping it to current Grumbines (throw in a Richard who is also involved in natural science), we've got at least 4 scientific researchers (Francis publishes in the scientific literature, though I've also run in to a patient or two of his so he must double in clinical practice).

Now, Grumbine is a very uncommon name. Among the most uncommon, in fact, in the US. So maybe there's a science gene we Grumbines carry? After all, here we've got 4 of the name doing science and probably almost none of you had ever heard of the name before stopping by here.

How would we test an idea about there being particularly many Grumbines doing science? We really want the same kind of numbers that I looked at in dismissing the bogus petition -- how many people are there, and how many have the characteristic we're interested in? It would not be terribly hard to come up with good numbers on how many Grumbines are publishing in science: Just do a scientific literature search, or even a Google Scholar search, and start counting. There'll be some fuzziness as apparently different names (R vs. Robert vs. Robert W. vs. R. W., for instance) might all be me, or maybe not. Same for the others.

But how to get a sense of how many Grumbines there are in the US? That's a problem. The listing I saw of name frequency only gave the rank order, not how many there were. You might be tempted to do a general web search on the name. But then you run in to a very strong selection effect. The ease of finding people on the web depends heavily on what it is they do. Scientists are typically extremely easy to find, so get represented well. On the other hand, carpenters are probably relatively hard to find (aside from Bill Grumbine, who seems extremely well-known in the world of bowl turning; some lovely pictures out there of his work). By this sort of thing you could come up with 4 scientists, 1 bowl-turner, and 1 stand-up comedian (Peter, and another profession likely to be overrepresented). Now, if the majority of Grumbines were scientists, that would clearly be different from the general population.

This is what makes selection effects a problem. With a small sample that is biased to finding the sort of person we're trying to test a hypothesis about, you're in trouble. In the US population as a whole, scientists are something like 1 in 1000. If there were actually about 4000 Grumbines in the US, then the 4 I've named would be about par for the course. If there were 40,000, then the easily found 4 actually show it uncommon for Grumbines to be scientists. But most people* don't leave very much web trace, so the additional 4000, or 40,000, would be much harder to find than the 4. (Well, at least 5 -- there's a David Grumbine in physics.)
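The arithmetic behind 'par for the course' is just the population size times the base rate. A tiny sketch, using the rough 1-in-1000 figure from above and the two hypothetical population sizes:

```python
ONE_IN = 1000       # rough figure from the text: 1 scientist per 1000 people
FOUND_SO_FAR = 4    # Grumbine scientists named so far


def expected_scientists(population, one_in=ONE_IN):
    """Scientists we'd expect if the group matched the general population."""
    return population / one_in


for population in (4000, 40000):
    expected = expected_scientists(population)
    print(f"{population:>6} Grumbines: expect about {expected:.0f} scientists, "
          f"found {FOUND_SO_FAR} so far")
```

The catch, as described above, is that the 4 found are the easy ones to find, so the comparison only means something once we have a count of Grumbines that doesn't depend on web visibility.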

So, anyone have clever ways of putting some limits on just how many Grumbines there are? Er, that may not have come out right. Finding ways of estimating with some confidence that there are between X and Y Grumbines?


I'm not proposing any genetic link, nor that if there is one, my family has it+. Rather, the idea is to illustrate an early step or two in doing science. Have some notion, from whatever unlikely source, and then start looking at what kind of data you would need to test the notion. If there can't be data to test the idea, it can be good and interesting, but not science. In this case, there clearly can be such data. We then move on to the next step -- how can I get hold of it? If I can't get hold of what I really want for data (accurate counts of how many Grumbines there are, and how many are publishing in science), can I find something close to it that will let me test the idea anyhow?


*Ok, maybe I should say most people of my generation and older.

+ Unlike for teaching, where, if there can be a genetic disposition to teaching, I'll definitely submit my genealogy in candidacy for illustrating it.

Sunday, May 10, 2009

Happy Mother's Day

Late in the day, but Happy Mother's Day to all the moms out there.

I'll also mention a virtual 'Mothers' project, for women who are scientists to pass word on to women (their virtual daughters) who would be scientists -- the Letters to our Daughters project. There have been a number of letters already. Whichever side of that equation you're on, it's probably worth a look.