Data are messy, and all data have problems. There's no two ways about that. Any time you set about working seriously with data (as opposed to knocking off some fairly trivial blog comment), you have to sit down to wrestle with that fact. I've been reminded of that from several different directions recently. Most recent is Steve Easterbrook's note on Open Climate Science. I will cannibalize some of my comment from there, and add things more for the local audience.
One of the concerns in Steve's note is 'openness'. It's an important concern and related to what I'll take up here, but I'll actually shift emphasis a little. Namely, suppose you are a scientist trying to do good work with the data you have. I'll use data for sea ice concentration analysis for illustration because I do so at work, and am very familiar with its pitfalls.
There are very well-known methods for turning a certain type of observation (passive microwaves) into a sea ice concentration. So we're done, right? All you have to do is specify what method you used? Er, no. And thence come the difficulties, issues, and concerns about reproducing results. The important thing here, and my shift of emphasis, is that it's about scientists trying to reproduce their own results (or me trying to reproduce my own). That's an important point in its own right -- how much confidence can you have if you can't reproduce your own results, using your own data, and your own scripts and programs, on your own computer? Clearly that is a good starting point for doing reliable, reproducible science.
This turns out, irrespective of any arguments about the honesty of scientists, to be a point of some challenge even as software engineering. Some of this was prompted by a discussion I had some months ago at a data-oriented meeting -- where someone was asserting that once the data were archived, future researchers could 'certainly' reproduce the results you got today. I was not, shall we say, impressed.
We'll start by assuming that the original data have been archived. (Which I've done for my work, at least for the most recent period.) Ah, but did you also archive the data decoder? It turns out that even though the data were archived exactly as originally used, the decoder itself is allowed to give slightly different results when acting on the same data (or maybe there was a bug in the old decoder that was fixed in the newer one?). So, even with the same data, merely bringing it out of the archive format into some format you can work with can introduce some changes. Now, do you archive all the data decoding programs along with the data? Or use the modern decoders? (But when is modern? If this year's decoder gives different answers than the one from 5 years ago, or than the one 5 years from now, what should the answer be today?)
Having decoded the data to something usable, now we run our program that translates the data we have into something that is meaningful. In my case, this means translating 'brightness temperatures' (themselves the result of processing the actual satellite observations into something that is meaningful to people like me) into sea ice concentrations. The methods are 'published'. Well, some of the method is published. The thing is, the basic algorithm (the rules) for translation is published -- it's the NASA Team algorithm from 1995 through 23 August 2004, then my variation from then to August 2006, and then my variation on top of NASA Team2 from there to the present. One issue is that my variation is too minor to be worth its own paper in the peer-reviewed scientific literature (though I confess I've been reconsidering that statement lately, as more trivial-to-me papers are being published). So where, exactly, is the description of the methods? Er. In the programs themselves.
That's not a problem in itself. I have, I think, saved all versions of my programs. Related, though, is that the algorithms don't really stand on their own. There is also the matter -- seldom publishable in the peer-reviewed literature, but vital to being able to reproduce the results -- of what quality control criteria were used, and what, exactly, the weather filtering was. To elaborate: As I said, data are messy and ugly. One of the problems is that the satellite can report impossible values, or at least values that can't possibly correspond to an observation of the sea ice pack. Maybe it's a correct observation, but of deep space instead of the surface of the earth. Maybe the brightness temperatures are correct, but the latitude and longitude of the observation are impossible (latitude greater than 90 N, for instance). And maybe there was just a corruption in the data recorder and garbage came through. In any case, I have some filters to reject observations on these sorts of grounds before bothering the sea ice concentration algorithm with them. And then there's the matter of filtering out things that might be sea ice cover (at least the algorithm thinks so) but which are probably just a strong storm that's got high rain rates, or is kicking up high waves ('weather').
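To make the idea of a gross-error filter concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Observation` class, the function names, and especially the numeric limits are hypothetical and are not taken from the actual programs. The weather filter is not shown at all.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical limits, chosen only so the example runs. They are NOT the
# thresholds used in any operational sea ice analysis.
MIN_TB_K = 50.0    # colder than this looks like deep space, not the surface
MAX_TB_K = 350.0   # warmer than this is not a plausible surface observation


@dataclass
class Observation:
    lat: float  # degrees north
    lon: float  # degrees east
    tb: float   # brightness temperature, kelvin


def passes_qc(obs: Observation) -> bool:
    """Return True if the observation could be a real view of the surface.

    Chained comparisons are False for NaN, so corrupted (NaN) fields
    are rejected along with out-of-range values.
    """
    if not (-90.0 <= obs.lat <= 90.0):
        return False
    if not (-180.0 <= obs.lon <= 360.0):
        return False
    if not (MIN_TB_K <= obs.tb <= MAX_TB_K):
        return False
    return True


def split_by_qc(
    observations: Iterable[Observation],
) -> Tuple[List[Observation], List[Observation]]:
    """Separate observations into (kept, rejected) lists."""
    kept: List[Observation] = []
    rejected: List[Observation] = []
    for obs in observations:
        (kept if passes_qc(obs) else rejected).append(obs)
    return kept, rejected
```

The point for reproducibility is that every constant in a filter like this is part of the method. Change `MIN_TB_K` by a degree and the set of observations reaching the concentration algorithm changes with it.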
But, of course, these quality control criteria, and weather filter, have changed over the years. Again, not publishable results on their own, but something you need in order to reproduce my results. And, again, the documentation, ultimately, is the program itself. Since I've saved (I think) all the versions of the relevant program(s), you might figure we're home free -- fully reproducible results.
Or, at least the results would be exactly reproducible if you also had several other things.
One of them is that the programs rely on some auxiliary data sets. For instance, the final results like to know where the land is in the world. So I have land masks. If you want my exact results, you need the exact set of land masks I used that day. Again, I've saved those files. Or at least I think I have. As many a person has discovered the hard way at home -- sometimes your system backup won't restore. Or what you actually saved wasn't what you meant to save.
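One inexpensive guard against "what you saved wasn't what you meant to save" is to record a checksum for each auxiliary file at the time of the run and check it again after restoring from the archive. The sketch below uses SHA-256 from Python's standard `hashlib`. The function names and the idea of keeping the manifest as a plain dictionary are assumptions for illustration, not a description of how my files are actually archived.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths: Iterable[Path]) -> Dict[str, str]:
    """Map each file name to its digest at the time of the original run."""
    return {str(p): sha256_of(Path(p)) for p in paths}


def changed_files(manifest: Dict[str, str]) -> List[str]:
    """List files that are missing or whose contents no longer match."""
    problems = []
    for name, expected in manifest.items():
        path = Path(name)
        if not path.is_file() or sha256_of(path) != expected:
            problems.append(name)
    return sorted(problems)
```

A matching checksum only says the bytes are the same ones you recorded. It says nothing about whether they were the right bytes in the first place.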
It's worse than that, though. You probably can't run my programs on your computer. At least not exactly my programs. I wrote them in some high-level language (Fortran/C/C++) and then a compiler translated my writing into something my computer (at that time) could use: the exact computer that I was running on that day. Further, there are mathematical libraries that my program uses (things that define what, exactly, it means to compute the sine of an angle, or a cosine), in the versions that were in place the day the program originally ran.
What you can do is compile my program's source code (probably, I'm pretty aggressive about my programs being able to be compiled anywhere) on your computer. But ... that uses your compiler, and compilers don't have to give exactly the same results. And it's on your cpu, which doesn't have to be exactly the same as mine. And it uses your computer's math libraries, which also don't have to be exactly the same as mine.
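One concrete reason different compilers can legitimately disagree is that floating-point addition is not associative, so anything that changes the order of operations can change the last bits of the answer. The sketch below is a stand-in written in Python rather than Fortran or C, and `sum_in_order` is a hypothetical helper. It only demonstrates the arithmetic effect on IEEE 754 doubles, not what any particular compiler does.

```python
def sum_in_order(values):
    """Add values strictly left to right, as a naive loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same three numbers, added in two different orders.
values = [1.0e16, 1.0, -1.0e16]
forward = sum_in_order(values)                                 # 0.0
reordered = sum_in_order([values[0], values[2], values[1]])    # 1.0

# A less extreme case: regrouping changes only the last bit.
left_grouped = (0.1 + 0.2) + 0.3    # 0.6000000000000001
right_grouped = 0.1 + (0.2 + 0.3)   # 0.6

if __name__ == "__main__":
    print(forward, reordered)
    print(left_grouped, right_grouped, left_grouped == right_grouped)
```

Neither ordering is wrong in the sense of breaking a rule. They are different legal ways of carrying out the same sum.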
So all is lost? Not really. The thing is, these sorts of differences are small, and you can analyze your results with some care (and effort) to decide that the differences are indeed because compilers, processors, libraries, operating systems, etc., don't always give exactly the same answers. I've done this exercise before myself, as I was getting different answers from another group. I eventually tracked down a true difference -- something beyond just the processors (etc.) doing things slightly differently (but different in a legal way). The other group was doing some rounding in a way that I thought was incorrect, and which gave answers that differed from mine in the least significant bit about three quarters of the time.
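When chasing this kind of discrepancy it helps to count how many values differ and by how much, measured in units of the last bit rather than as a decimal difference. The following is a minimal sketch for 64-bit floats using only the standard library. The names `ulp_distance` and `compare_runs` are hypothetical, and it is not the procedure actually used in the comparison described above.

```python
import struct
from typing import Dict, Iterable


def ordered_bits(x: float) -> int:
    """Map a double to an integer whose ordering matches the float ordering."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats have the sign bit set; fold them onto negative integers
    # so that +0.0 and -0.0 both map to 0.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of representable doubles between a and b (0 if identical)."""
    return abs(ordered_bits(a) - ordered_bits(b))


def compare_runs(reference: Iterable[float],
                 candidate: Iterable[float]) -> Dict[str, int]:
    """Tally pairs that are identical, one bit apart, or further apart."""
    counts = {"identical": 0, "one_ulp": 0, "larger": 0}
    for a, b in zip(reference, candidate):
        distance = ulp_distance(a, b)
        if distance == 0:
            counts["identical"] += 1
        elif distance == 1:
            counts["one_ulp"] += 1
        else:
            counts["larger"] += 1
    return counts
```

A tally dominated by `one_ulp` points toward rounding or library differences. A large `larger` count suggests something more substantive, such as a different filter or a different land mask.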
With that understood, we could get exactly the same answers in all bits, all of the time, in spite of the different processors and such. But, it was a lot of work to get to that point. And this is a relatively simple system (in terms of the data and programs).
So are you lost on more complex systems, like general circulation models? Again, no. The thing is, if your goal is science -- understanding some part of nature -- you understand as well that computers aren't all identical and have done some looking into how those differences affect your results. The catchphrase here is "It's just butterflies". Namely, the line goes that because weather is chaotic, a butterfly flapping its wings in Brazil today can lead to a tornado in Kansas five days from now. What the catchphrase is referring to is that small differences (can't even call them errors, just different legal ways of interpreting or carrying out commands) in the computer can lead to observable changes down the road. They don't change the main results -- if you're looking at climate, the onset of a tornado at 12:45 PM, April 3 1974 is not meaningfully different from 3:47 PM on the same day (though it certainly is if you live in the area!) -- but they do change some of the details.
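The butterfly effect is easy to see in a toy system. The sketch below uses the logistic map, a standard textbook example of chaos, as a stand-in. It is emphatically not a weather or climate model, and the function name, starting value, and perturbation size are arbitrary choices for illustration. Two runs that start a hair apart soon disagree completely in their details, while their long-run averages stay close.

```python
from typing import List


def logistic_trajectory(x0: float, steps: int, r: float = 4.0) -> List[float]:
    """Iterate x -> r * x * (1 - x), returning the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


if __name__ == "__main__":
    run_a = logistic_trajectory(0.2, 5000)
    run_b = logistic_trajectory(0.2 + 1.0e-15, 5000)  # the "butterfly"

    # Details: the step-by-step values drift apart until they are unrelated.
    for step in (10, 40, 80):
        print(step, abs(run_a[step] - run_b[step]))

    # "Climate": the long-run averages of the two runs remain similar.
    print(mean(run_a), mean(run_b))
```

The analogy is loose, but it captures the distinction in the text: the timing of individual events is sensitive to tiny differences, while the statistics are much less so.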
What do we do, then? At the moment, and I invite the software and hardware engineers out there to provide some education and corrections, what you need for exact reproducibility is to archive all the data, all the decoders, all the program source, all the compilers, all the system libraries, and all the hardware, exactly. The full hardware archive is either impossible or close enough as makes no difference. The system software archives (compilers and system libraries) are at least extraordinarily difficult and lie outside the hands of scientists (there's a reason they're called system tools). The scientists' data and programs, not as easy as you might think, but doable. Probably.
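Short of archiving the compilers and hardware themselves, a scientist can at least record what they were. The sketch below writes down a few facts about the environment using Python's standard `platform` and `sys` modules. It describes the Python interpreter it runs under, so for a Fortran or C program you would need the equivalent information from your own build system. The function names and the choice of fields are assumptions for illustration.

```python
import json
import platform
import sys
from typing import Dict


def environment_record() -> Dict[str, object]:
    """Collect a small description of the software and hardware in use."""
    return {
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "python_compiler": platform.python_compiler(),
        "operating_system": platform.platform(),
        "machine": platform.machine(),
        "float_mantissa_digits": sys.float_info.mant_dig,
        "float_epsilon": sys.float_info.epsilon,
    }


def environment_json() -> str:
    """Serialize the record so it can be stored next to the results."""
    return json.dumps(environment_record(), indent=2, sort_keys=True)


if __name__ == "__main__":
    print(environment_json())
```

A record like this does not make the old environment reproducible. It does tell your future self which differences to suspect first when the numbers fail to match.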
As you (I) then turn around and try to work with some archived processing system, then, when (not if) you get different results than the reference result, your first candidate for the difference is 'different operating system/system libraries/compilers/...', not dishonesty. That means we have work to do, unfortunately. I hope the software engineers have some good ideas/references/tools for making it easier. I can say, though, from firsthand experience working with things I wrote 5-15 years ago, that there is an astonishingly large number of ways you can get different answers -- even when you wrote everything yourself. If you go on a fault-finding expedition, you'll find fault. If you try to understand the science, though, even those 5-15 years' worth of changes don't hide the science.
There's nothing special to data about this; models have the same issues. Nor is there anything special to weather and climate. The data and models used to construct a bridge, car, power plant, etc., also have these same issues. If you've got some good answers to how to manage the issues, do contribute. I'll add, though, that Steve (in his reply to my comments, a separate post, and a professional article) has found that climate models, from a software engineering perspective, appear to be higher quality than most commercial software.