Data are messy, and all data have problems. There's no two ways about that. Any time you set about working seriously with data (as opposed to knocking off some fairly trivial blog comment), you have to sit down and wrestle with that fact. I've been reminded of this from several different directions recently, the most recent being Steve Easterbrook's note on Open Climate Science. I will cannibalize some of my comment from there, and add more for the local audience.
One of the concerns in Steve's note is 'openness'. It's an important concern and related to what I'll take up here, but I'll shift the emphasis a little. Namely, suppose you are a scientist trying to do good work with the data you have. I'll use sea ice concentration analysis for illustration because I work on it at my job and am very familiar with its pitfalls.
There are very well-known methods for turning a certain type of observation (passive microwaves) into a sea ice concentration. So we're done, right? All you have to do is specify what method you used? Er, no. And thence come the difficulties, issues, and concerns about reproducing results. The important thing here, and my shift of emphasis, is that it's about scientists trying to reproduce their own results (or me trying to reproduce my own). That's an important point in its own right: how much confidence can you have if you can't reproduce your own results, using your own data and your own scripts and programs, on your own computer? Being able to do that is clearly a good starting point for doing reliable, reproducible science.
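To make "reproducing my own results" a bit more concrete, here is a minimal sketch in Python of one way to check it. It is purely illustrative and is not the procedure I use at work: the function names (`fingerprint`, `run_record`, `differences`) are made up for this example, and it assumes that the input data, the script, and the output can each be read as bytes. The idea is to record a checksum of all three for every run, so that a rerun can be compared against the original.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def run_record(input_bytes: bytes, script_bytes: bytes, output_bytes: bytes) -> dict:
    """Summarize one run as checksums of its input, script, and output.

    Hypothetical helper: the three-part record is an assumption of this
    sketch, not an established convention.
    """
    return {
        "input": fingerprint(input_bytes),
        "script": fingerprint(script_bytes),
        "output": fingerprint(output_bytes),
    }


def differences(old: dict, new: dict) -> list:
    """List which parts of the record changed between two runs (sorted)."""
    return sorted(key for key in old if old[key] != new.get(key))


if __name__ == "__main__":
    # Stand-in byte strings; in practice these would be read from files.
    first = run_record(b"brightness temps", b"algorithm v1", b"ice concentration")
    rerun = run_record(b"brightness temps", b"algorithm v1", b"ice concentration")
    print(differences(first, rerun))
```

If the list comes back empty, the rerun matched. If only `output` differs while `input` and `script` match, then something outside the record (compiler, libraries, machine) has changed, which is exactly the kind of surprise worth knowing about.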