Monday, November 30, 2009

Data set reproducibility

Data are messy, and all data have problems.  There are no two ways about that.  Any time you set about working seriously with data (as opposed to knocking off some fairly trivial blog comment), you have to sit down and wrestle with that fact.  I've been reminded of it from several different directions recently.  Most recent is Steve Easterbrook's note on Open Climate Science.  I will cannibalize some of my comment from there, and add things more for the local audience.

One of the concerns in Steve's note is 'openness'.  It's an important concern, and related to what I'll take up here, but I'll actually shift emphasis a little.  Namely, suppose you are a scientist trying to do good work with the data you have.  I'll use sea ice concentration analysis for illustration, because I do it at work and am very familiar with its pitfalls.

There are very well-known methods for turning a certain type of observation (passive microwaves) into a sea ice concentration.  So we're done, right?  All you have to do is specify what method you used?  Er, no.  And thence come the difficulties, issues, and concerns about reproducing results.  The important thing here, and my shift of emphasis, is that it's about scientists trying to reproduce their own results (or me trying to reproduce my own).  That's an important point in its own right -- how much confidence can you have if you can't reproduce your own results, using your own data, and your own scripts and programs, on your own computer?  Being able to do that is clearly a good starting point for doing reliable, reproducible science.



This turns out, irrespective of any arguments about the honesty of scientists, to be a real challenge purely as software engineering.  Some of this was prompted by a discussion I had some months ago at a data-oriented meeting -- where someone was asserting that once the data were archived, future researchers could 'certainly' reproduce the results you got today.  I was not, shall we say, impressed.

We'll start by assuming that the original data have been archived.  (Which I've done for my work, at least for the most recent period.)  Ah, but did you also archive the data decoder?  It turns out that even though the data were archived exactly as originally used, the decoder itself is allowed to give slightly different results when acting on the same data (or maybe there was a bug in the old decoder that was fixed in the newer one?).  So, even with the same data, merely bringing it out of the archive format into some format you can work with can introduce some changes.  Now, do you archive all the data decoding programs along with the data?  Use the modern decoders?  (But when is modern?  If this year's decoder gives different answers than the one from 5 years ago, or the one from 5 years from now, what should the answer be today?)

Having decoded the data to something usable, we now run the program that translates the data we have into something that is meaningful.  In my case, this means translating 'brightness temperatures' (themselves the result of processing the actual satellite observations into something that is meaningful to people like me) into sea ice concentrations.  The methods are 'published'.  Well, some of the method is published.  The thing is, the basic algorithm (the rules) for the translation is published -- it's the NASA Team algorithm from 1995 through 23 August 2004, then my variation from then to August 2006, and then my variation on top of NASA Team2 from there to the present.  One issue is that my variation is too minor to be worth its own paper in the peer-reviewed scientific literature (though I confess I've been reconsidering that statement lately, as more trivial-to-me papers are being published).  So where, exactly, is the description of the methods?  Er.  In the programs themselves.

That's not a problem in itself.  I have, I think, saved all versions of my programs.  Related, though, is that the algorithms don't really stand on their own.  There is also the matter -- seldom publishable in the peer-reviewed literature, but vital to being able to reproduce the results -- of what quality control criteria were used, and what, exactly, the weather filtering was.  To elaborate: as I said, data are messy and ugly.  One of the problems is that the satellite can report impossible values, or at least values that can't possibly correspond to an observation of the sea ice pack.  Maybe it's a correct observation, but of deep space instead of the surface of the earth.  Maybe the brightness temperatures are correct, but the latitude and longitude of the observation are impossible (latitude greater than 90 N, for instance).  And maybe there was a corruption in the data recorder and garbage came through.  In any case, I have some filters to reject observations on these sorts of grounds before bothering the sea ice concentration algorithm with them.  And then there's the matter of filtering out things that might be sea ice cover (at least the algorithm thinks so) but which are probably just a strong storm that's got high rain rates, or is kicking up high waves ('weather').

But, of course, these quality control criteria, and weather filter, have changed over the years.  Again, not publishable results on their own, but something you need in order to reproduce my results.  And, again, the documentation, ultimately, is the program itself.  Since I've saved (I think) all the versions of the relevant program(s), you might figure we're home free -- fully reproducible results.

Or, at least the results would be exactly reproducible if you also had several other things.

One of them is that the programs rely on some auxiliary data sets.  For instance, the final results like to know where the land is in the world.  So I have land masks.  If you want my exact results, you need the exact set of land masks I used that day.  Again, I've saved those files.  Or at least I think I have.  As many a person has discovered the hard way at home -- sometimes your system backup won't restore.  Or what you actually saved wasn't what you meant to save.

It's worse than that, though.  You probably can't run my programs on your computer.  At least not exactly my programs.  I wrote them in some high level language (Fortran/C/C++), and then a compiler translated my writing into something my computer could use -- the exact computer that I was running on that day.  Further, there are mathematical libraries that my program uses (code that spells out what, exactly, it means to compute the sine of an angle, or a cosine) -- the versions in place the day the program originally ran.

What you can do is compile my program's source code (probably, I'm pretty aggressive about my programs being able to be compiled anywhere) on your computer.  But ... that uses your compiler, and compilers don't have to give exactly the same results.  And it's on your cpu, which doesn't have to be exactly the same as mine.  And it uses your computer's math libraries, which also don't have to be exactly the same as mine.

So all is lost?  Not really.  The thing is, these sorts of differences are small, and you can analyze your results with some care (and effort) to decide that the differences are indeed because compilers, processors, libraries, or operating systems, etc., don't always give exactly the same answers.  I've done this exercise before myself, as I was getting different answers from another group.  I eventually tracked down a true difference -- something beyond just the processors (etc.) doing things slightly differently (but different in a legal way).  The other group was doing some rounding in a way that I thought was incorrect, and which gave answers that differed from mine in the least significant bit about 3/4ths of the time.

With that understood, we could get exactly the same answers in all bits, all of the time, in spite of the different processors and such.  But, it was a lot of work to get to that point.  And this is a relatively simple system (in terms of the data and programs).

So are you lost on more complex systems, like general circulation models?  Again, no.  The thing is, if your goal is science -- understanding some part of nature -- you understand as well that computers aren't all identical, and you have done some looking into how those differences affect your results.  The catchphrase here is "It's just butterflies".  Namely, the line goes that because weather is chaotic, a butterfly flapping its wings in Brazil today can lead to a tornado in Kansas five days from now.  What the catchphrase is referring to is that small differences (can't even call them errors, just different legal ways of interpreting or carrying out commands) in the computer can lead to observable changes down the road.  They don't change the main results -- if you're looking at climate, the onset of a tornado at 12:45 PM, April 3, 1974 is not meaningfully different from 3:47 on the same day (though it certainly is if you live in the area!) -- but they do change some of the details.

What do we do, then?  At the moment, and I invite the software and hardware engineers out there to provide some education and corrections, what you need for exact reproducibility is to archive all the data, all the decoders, all the program source, all the compilers, all the system libraries, and all the hardware, exactly.  The full hardware archive is either impossible or close enough as makes no difference.  The system software archives (compilers and system libraries) are at least extraordinarily difficult and lie outside the hands of scientists (there's a reason they're called system tools).  The scientists' data and programs, not as easy as you might think, but doable.  Probably.

As you (I) then turn around and try to work with some archived processing system, then, when (not if) you get different results than the reference result, your first candidate for the difference is 'different operating system/system libraries/compilers/...', not dishonesty.  That means we have work to do, unfortunately.  I hope the software engineers have some good ideas/references/tools for making it easier.  I can say, though, from firsthand experience working with things I wrote 5-15 years ago, that there is just an astonishingly large number of ways you can get different answers -- even when you wrote everything yourself.  If you go on a fault-finding expedition, you'll find fault.  If you try to understand the science, though, even those 5-15 years worth of changes don't hide the science.

There's nothing special to data about this; models have the same issues.  Nor is there anything special to weather and climate.  The data and models used to construct a bridge, car, power plant, etc., also have these same issues.  If you've got some good answers to how to manage the issues, do contribute.  I'll add, though, that Steve (in his reply to my comments, a separate post, and a professional article) has found that climate models, from a software engineering perspective, appear to be higher quality than most commercial software.

30 comments:

Judith Curry said...

Bob, this is just a superb essay, thank you for this!

skanky said...

One thing that can help with your own stuff is source control tools. Most (can't say all, as I haven't used them all) allow you to put a "label" on a set of files at a specified revision (probably, but not necessarily, the current one). If you did that for your papers (say), then to reproduce the results for a paper, you'd restore to that revision.

You would need to add everything to it, scripts, code, filters, references, documentation etc.

At my company, we also add compilers, some libraries and external tools etc. for our releases, thus we can (for our purposes) exactly recreate a release. That won't resolve all your issues, and it's certainly not a case of "just" doing something, but a decent source control system allows you to manage things like that much more easily. Like any tool, it comes down to being aware of how it works, where it works, what its limitations are, but more importantly in these cases, where it can be used beyond the obvious.

FWIW we use (and recommend) Perforce, which costs but has some great release and branching mechanisms. However, I'm sure most well known ones will do a decent job, if used properly.

MT recommends this course:
http://software-carpentry.org/

Which will be low-level and a bit too basic for some scientists, but a revelation for some others. The reason I point it out is for these questions in the introduction:

* Do you use version control?
* Can you rebuild everything with one command?
* Do you build the software from scratch daily?
* Can you trace everything you release (not just software) back to its origins?
* Can you set up a development environment on a fresh machine without heroic effort?
* Is there a searchable archive of discussions about the project?

Which seem especially pertinent to this post.

There are never any silver bullets or one-size-fits-all solutions, though.

Robinson said...

No, you don't need to archive everything. You just need to archive your original, raw data, alongside the method you've used to analyse it and include the method used to capture the raw data in the first place. You don't even need to archive your program.

In Computer Science (I have honours here), we have formalisms for describing what a program does, such that it should produce the same result wherever it's implemented.

There may be rounding errors and other such minor issues between implementations, but in general your work should be published as an abstraction: data, method, results. Other researchers don't need to run your exact program to get your results. If that is the case, then your results are as suspect as a Cold Fusion experiment that only runs in Fleischmann and Pons' lab!

Anonymous said...

Your post gives even more credibility to those who say we should not be discussing how to re-order the world's economy based on this very uncertain "science"

Bob

Chris R. Chapman said...

Building on skanky's comments, you need to either learn professional software engineering techniques or second/buy it from somewhere.

First place to start: Automated unit tests and test driven development (TDD). Unit tests are analogous to a test "scaffold" or "harness" to prove-out what your program is doing at a functional level. Good frameworks allow you to easily construct test programs that can then be run automatically to validate your code against expected outcomes and norms.

While not a replacement for comments and clean coding standards, unit tests can provide "living" documentation to demonstrate how your code was intended to work out of your lab. It can also be used to validate behaviours on different platforms.

Take this to the next level by practicing test-first development, and your code becomes even more robust, as it forces you to write your methods in discrete parts to be easily testable.

Good unit tests will try to prove out the following criteria:

1) Are the results RIGHT? (do they look right at first blush)
2) Are the boundary conditions correct?
3) Can you check inverse relationships?
4) Can you cross-check results by other means?
5) Can you force error conditions to happen?
6) Are performance characteristics within bounds?

For examples of potentially applicable unit test frameworks, see:

Fortran Unit Test Framework
http://fortranwiki.org/fortran/show/Fortran+Unit+Test+Framework

Robinson said...

I don't agree with Chris or skanky. I don't see this essay as a discussion on "how to engineer software", it's more a discussion on how to present your science in an accessible manner.

As I said above, you don't even need to float your software out to the masses in order to get this right. They are quite capable of writing their own code. What they need are your conclusions, your raw data, your assumptions and your methods.

edaniel said...

Go here to get this code and let us know what you think.

skanky said...

My answer is not about getting code out to the masses (as code should be irrelevant to them, as you say), but about scientist A being able to re-run his programs to reproduce his results five years after he first calculated them. Even better, simply by running a single script (that does the source control sync, then the build, then the results generation).

That's an ideal, and would make anything else easier as it's a subset and helps to document what's needed for someone else to do it.

Chris R. Chapman said...

As long as there is custom code required to reproduce the results, it cannot be rationally separated from the outcomes - it is an integral part of the climate scientist's "laboratory".

Unit tests aren't cumbersome and they fit skanky's model of having one-touch deployment and verification.

As we're seeing from other sources, the source data may have several layers of interpretation against them depending on the code used.

Ergo, tests that can set up reproducible scenarios aren't dispensable to understanding the outcomes. They're indispensable.

steve said...

I think the tools skanky points to are important (and Software Carpentry is run by my colleague Greg, so naturally I'm a huge supporter).

But there's a lot in Bob's account of the research process that these tools don't help with, and quite often there's a poor fit for many of these tools/techniques and scientific workflows. I had a good discussion on this with my student, Jon, this morning, and he identified as a key issue the fact that most of the time you're not sure what research steps will work, you're exploring lots of dead ends, whereas most software engineering tools assume you are "engineering" - i.e. following a (moderately) rational process from specification to implementation.

See, the thing about all software engineering tools is that they come with a cost. You can code faster without them, but they pay off in the long term when it comes to modifying your code, sharing it with others, coordinating changes, etc. Many SE tools should in theory deliver this payoff, but in practice they don't, and a lot of the time it depends on context. The scientific software development context is sufficiently different from commercial software practice that it would be wrong to have the same expectations for which tools will pay off, and which won't.

Greg's goal with the Software Carpentry course is to get scientists who build code up to speed with good software development tools and practices, to reduce a lot of the accidental complexity of scientific coding. But some of what Bob describes is not accidental complexity. The challenge is to separate the two, prescribe appropriate tools to fix the problems that can be fixed easily, and think hard about the ones that cannot.

Oh, and tool adoption itself comes with a big price in the initial hit on your productivity. So you've got to be pretty sure a tool is worth adopting before you commit to it. This is where software engineers need to learn more about what scientists do and need, and scientists have to learn more about what software engineers have to offer.

Gareth Rees said...

I'd like to add my voice to that of your other commenters. The problem of archiving software and data so that computations can be reliably replicated at a later date is one that software engineers have good solutions for (at least on small timescales like a decade or two: centuries, I think, are still beyond the state of the art).

In particular, there's a whole discipline of software configuration management, and pretty much all software engineering professionals use the techniques and tools of this discipline.

Just today, for example, I took out some code and data that was archived in 2005, and ported it to a new platform. There were no difficulties.

I guess the question is, what's the appropriate way for scientists to make use of the techniques and tools of software configuration management? How can software engineers provide the necessary help? Maybe there needs to be a new specialist position of "lab software engineer" (like a lab technician, but for software).

John Norris said...

If you archive all your data, code, and scripts, you are leaving at least the opportunity for replication. If your process is somewhat complex, and you don't archive all your data, code, and scripts, you instantly make it difficult for the replicator.

Looking at your example of later trying to replicate your own work, how much harder are you making it on yourself if you didn't archive all your original data, code, and scripts?

steve said...

Gareth: Not so fast with the simple prescriptions. An example. The Hadley Centre have a state of the art configuration management system, which we have documented in two papers:
http://dx.doi.org/10.1109/MCSE.2008.144
and
http://doi.ieeecomputersociety.org/10.1109/MCSE.2009.156

It means they can rebuild any model configuration at any time. But it doesn't guarantee that a model configuration will compile and run on anything (the hardware changed! the operating system was upgraded! a new Linux patch was applied!). And as soon as you modify the code to get an old model to run on a modified platform, you break bit-level reproducibility, and you're back to many of the issues Bob raises.

Greg reckons the typical half-life of a figure in a scientific paper is of the order of months. I.e. within a few months of publication, half the figures in published papers could no longer be reproduced exactly.

Bryan Lawrence said...

My day job is trying to help preserve the data part of the scientific record, and do so in a way that facilitates the doing of science ...

All of the issues Bob raises are real, but as someone noted, we can't throw up our hands in horror and give up either. Reproducibility isn't about getting the same numbers; it's about coming to the same conclusions. Being able to checkpoint progress by examining raw data, auxiliary data, and algorithms is a key part of doing that.

I believe that the average scientist is pretty bad at using simple tools to help checkpoint their own progress, which makes it doubly hard to reproduce others' work.

If I had a magic wand, the first thing I'd do is make every scientist use a configuration management tool to manage their code, and some measure of their data provenance. I'd also put in place better systems for keeping (and deciding when to throw out) intermediate data. I'd make sure that whenever anything was published, the key numbers were published alongside them ...

Oale said...

Ah, thank you for this insight to your work and difficulties therein. I'm clearly not up-to-date on climate models and was assuming (foolishly enough) that your work was on the type of simple models I was somewhat familiar with during the 1990s. Please bear with my (somewhat silly?) questions and rants, which are going to come your way much less after this.

Professor Mandia said...

Robert,

Thank you for this excellent piece, but I am afraid it may fall on deaf ears.

I really do not think most of the skeptics out there really want to see the data nor do they wish to replicate the results. This is just a red herring.

I believe Gavin at RC states that GISS code and data have been available for many years and yet not one person from the outside has contributed comments.

All of these requests are really just to stall scientists from doing their work. Does the Data Quality Act ring a bell?

http://www.boston.com/news/globe/ideas/articles/2005/08/28/interrogations/?page=full

Gareth Rees said...

I didn't mean to suggest that software configuration management is at all "simple" (and I don't think I did), just that it's a fairly mature discipline and the problems described by Bob are routinely solved in practice.

There are different approaches to keeping up with constantly-changing tools and operating systems. One is to track all your dependencies, including OS and hardware. Another is virtualization. A third is to accept some risk that your software will be broken by configuration change but ameliorate that risk by careful engineering.

Again, I'm not suggesting that configuration management and reproducibility are simple problems, just that there are well-known techniques for solving them.

I'd be interested to read the papers you cited above. Do you have open access versions of them?

Penguindreams said...

Source version systems are already fairly common. They're a little new to the group I'm now working with, but we're implementing one (Subversion). As Steve suggests, though, they don't solve all our issues. Compilers and math libraries, for instance, still lie outside the system. There is also an overhead involved in using them, which means time spent doing something other than getting better answers from the programs -- and that time is always in short supply.

One realm that Subversion doesn't manage well at all (perhaps there is a version control system that does?) is data files.  Plain text files, as we would use for source code and control scripts, are managed well.  But the binary files that carry the land mask, or climatology, or ..., it doesn't handle well at all.  Yet such files certainly do go through versions.  At minimum, we get better algorithms for where the land is, climatology files get updated every 10 years, the resolution changes, ....

I confess I engaged in some literary license.  My own programs I have been reusing with little difficulty for about 30 years now, and I can still use the ones I wrote as an undergraduate (except those I wrote in Pascal; it seems a little hard these days to get a free Pascal compiler).  A few times they've even turned out to be useful.  My real problems have usually been in dealing with other people's programs.  Missing libraries (site-specific libraries) are a common problem.

A different vein of my thinking did get taken up in comments -- how much reproducibility do you want?  Robinson is ok with a much lower degree than some others here, and certainly than many in particular parts of the blogosphere.  The ideal is absolute, bitwise reproducibility.  But as I tried to illustrate, you can't have that, and to the extent you could, large chunks of it are not in the control of the scientists.

A step that helped reproducibility was supported by the IEEE -- their standard on floating point representation.  That's some years ago now, but it did help when it came out.  Perhaps something analogous can be done for system math libraries?  This is a serious problem for us at work, where it looks like the libraries in the latest upgrade differ from the previous versions by more than anything we've seen in at least 20 years.

edaniel:
It's really best to give some real text for your links, or at least in the comments.  The links are to the NASA GISS site.  I'd already pulled down the models (both the EdGCM and the '1988' model).  At lunch I had a go at both.  Neither compiled on my system, but then neither was claimed to.  I may take this up in its own post.  In the meantime, what are the results of your examination of the software?  I'll note that the more modern one (per the suggestions of many) is using a version control system (CVS).

Professor Mandia and Anonymous (Bob):
I'm not very concerned about the extremists.

Folks who would be pointing to this issue as their 'reason' to not '...re-order the world's economy based on this very uncertain "science"' should be honest -- and also not use cars, planes, power plants, banks, and computers. All of which have the same reproducibility problems in their design and management.

Similarly, those who complain of the lack of 'openness', while at the same time never bothering to download and work with/on the GISS models, or the NCAR CCSM, aren't really skeptics. Skeptics do real work. Deniers just whine about what was done (without looking seriously at it themselves). No matter what scientists do, deniers will whine.

On the other hand, I would like to see it be much easier for someone like me to pull down the GISS models and get them running on my system.

Judy:
Thank you. Your recent essays were, of course, another part of what prompted this note.

skanky said...

"Compilers and math libraries, for instance, still lie outside the system."

On our system they (or their equivalents) do get source controlled.

Binaries are an issue in source control, as it's difficult to store just the differences. Our system (Perforce) handles them adequately enough, but it can be an issue when syncing over a slow connection, if there's a lot to download. There are ways round that, though (scheduled syncs to reduce the JIT syncing during the working day, for example). It may not be the best system for binaries, however.

Incidentally, and this isn't a complaint but just for future reference for myself as I can't recall any part of it that may be problematical, which part of my last comment has caused it to be blocked?

gmcrews said...

Hi Robert,

Great post. Let me expand on something Gareth Rees mentioned. I have encountered a similar problem for nuclear safety related scientific software that is able to run on a desktop. (Not the stuff that must run on mainframes/clusters.) The software quality assurance plan required that strict configuration management be placed on safety applications and their operating environments. As you describe in your post, it is a practical impossibility to exactly specify and maintain multiple versions of a scientific application or its operating environment (operating system, libraries, configuration files, etc.). It was a dilemma. My solution (again, for desktop applications) was the use of a virtual machine (VM).

Let me briefly outline how it could work for the decoder program you talked about. You could first install a Type 2 hypervisor (for example, Sun's VirtualBox) on your desktop host. Using the hypervisor's VM management program, create a VM (which is just a big file to the host computer) and install your decoder program along with its required operating system (which can be different from your desktop host's) and the necessary libraries, configuration files, etc. Then run the VM and make sure everything works.

Then the only configuration you have to manage is the host's big VM file. You can change the host computer's operating environment or version of the hypervisor, and the VM's decoder will perform exactly as before.

This VM approach is actually quite easy to set up. In fact, when performing IV&V on nuclear safety software, I would even request that the scientists submit their programs already embedded in a VM. (Not that they would pay too much attention to the software guys! Bless their hearts!)

George

dhogaza said...

"Computer Science (I have honours here), we have formalisms for describing what a program does, such that it should produce the same result wherever it's implemented.

There may be rounding errors and other such minor issues between implementations..."

These are the kind of things Robert is talking about, of course. But different compilers will optimize code differently, too, and differing order-of-execution of floating point operations will often yield slightly different results.

In some cases extremely different results - with floating point hardware you can "prove" that 1 + 1E-200 = 1.

It's gotten a lot better with the adoption of the IEEE floating point standard, with its carefully thought out and bounded rounding specification.

One of the reasons FORTRAN is still commonly used for scientific programming is that there's a tradition in the implementation community of providing carefully written math libraries and the like. As the co-author of what was once a very popular Pascal compiler for the PDP-11, I can tell you that many C, Pascal, etc. implementation teams didn't pay much attention to the finer details of the implementation of math libraries. (I was fortunate: a friend of mine did ours, and he also wrote the first math libraries for another now-defunct company called Floating Point Systems -- it was his area of speciality.)

Traditional CS courses of study - even honours ones - tend not to devote much time to nuts-and-bolts stuff such as all the subtle areas you can screw up while writing libraries that implement transcendentals for floating point representations, etc.

Specifying compiler version, math library, and underlying hardware is really about the best you can do. Also use minimal optimization: some compilers provide (or at least used to, back when I cared a lot more than I do now) a list of optimizations to avoid if you want maximum faithfulness to your code in terms of the order of execution of expression terms, etc. Follow those recommendations.

steve said...

Gareth,
My own versions of the papers are here:
http://www.cs.toronto.edu/~sme/papers/2008/CiSE-FCMpaper.pdf
http://www.cs.toronto.edu/~sme/papers/2008/Easterbrook-Johns-2008.pdf

They differ slightly from the published versions, but mainly because they haven't been edited for house style and length (which makes them better, in my mind).

Gareth Rees said...

Thanks, steve.

Horatio Algeranon said...

Nobel Prize winning physicist Richard Feynman used to make sure he could derive the same result in several different ways before he would have confidence that it was correct.

Something most non-scientists fail to appreciate is that "reproducibility of results" in science is not about precisely "repeating" all the steps that were taken by a particular scientist.

It's about getting the same result (within some margin of error), possibly with different methods. In fact, it's actually preferable to use an entirely different method to get the result.

Finding that the two results are the same using different methods increases one's confidence in the correctness of the answer.

On the other hand, getting a different answer leads one back to the drawing board to see just why the answers differ.

Sometimes that leads to yet a third (fourth, fifth, etc) method to discover which of the first two is correct (if either).

Sometimes this also leads to discovery of completely new phenomena that had previously been overlooked.

All this has obvious relevance to claims by some to be "auditing" scientific results.

"Auditing" scientific results is (or at least should be) approached quite differently from the run-of-the-mill auditing that accountants do.

Accountants have standard procedures that they follow to make sure everyone is on the same page.

A scientific 'audit' (if you can even call it that) is best done by using a totally independent method and seeing if the results nonetheless come out the same.

Anonymous said...

"A scientific 'audit' (if you can even call it that) is best done by using a totally independent method and seeing if the results nonetheless come out the same."

Ability to duplicate results by different researchers doing identical things in different laboratories has been a mainstay (cold fusion comes to mind).

I see people object to the "audit" as being worthless, but these same people seem to be gung ho about science. I guess it's an intellectual version of wave-particle duality.

Anonymous said...

FWIW, back when version control and software development tools were spotty, we'd use the Unix "script" utility (available on some systems), which logs to a file named "typescript" by default. This created a text file of everything that came across the screen. It gave about as complete a record of algorithm development and exploratory data analysis as you can get. At the close of the session, the file was piped through some scripts for formatting, and comments were added with a text editor as needed.

Not very elegant, but you could ALWAYS see exactly what was done and what the result was, even if a lot of things were being tried on the fly.
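The same session-logging idea still works today; a minimal sketch with the Unix `script` command (here `analyze.py` and `session.log` are made-up names for illustration):

```shell
# Start recording: everything that appears on the terminal, commands
# and output alike, is written to session.log.
script session.log
# ... run the exploratory analysis interactively ...
exit                # ends the recording

# Or, non-interactively (util-linux script), log a single run:
script -c "python analyze.py" session.log
```

The resulting log is exactly the kind of warts-and-all record described above: every command tried, in order, with its output.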

Penguindreams said...

skanky: I don't see any comment of yours being blocked. Sometimes I'm slow about approving comments, but that's usually the extent of it.

Steve: Thanks for the links to your papers. I was going to ask too.

gmcrews: Interesting. I was wondering about a virtual machine approach, myself.

The drawback is the same one against not using the less common optimization flags -- climate models basically by definition need all the computational speed that can be wrung from a system. VM systems, and minimal compiler flags, don't give you that. Or at least not the ones I'm familiar with from some years back.

anon-audit:
In the early days of the 'audits' against Michael Mann, the self-appointed auditors took the business audit model and language. Maybe they've changed their language since then, but their approach still seems to be using business as their ideal, rather than science.

One thing I see as an error of description is shown by your example. Your example is fine for a lab science. Climate is seldom a lab science. In trying to reproduce the cold fusion results, for instance, the other scientists were working much more in the vein Horatio Algeranon was describing. Namely, take the description of what was done, and -- using their own equipment -- set about reproducing the results. But the reproducers did not seize the original authors' equipment. They used their own test tubes, electrodes, etc.

For computational science, that's what Horatio was describing. Use your own computer, your own programming, and a copy of the inputs. The inputs are, themselves, generally available (rather like getting a bottle of a reagent from a chemistry supply house). I would like for them to be even more readily available, and even easier to get hold of. But it's not that hard.

If you're doing a business audit, however, rather different standards apply. As well they should; a friend did do business auditing. Business isn't science, however.

EliRabett said...

Business auditing doesn't work that way either. You start by sampling the records (constructing the plan of how to sample the records is a major part of the art), you then look for inconsistencies as well as errors. If you find problems you ask for more data, but you now have an idea of the systematic problems to look for. This is both effective and limits the burden on the business being examined. Any auditor who demanded all of the data would be fired summarily.

A good audit provides reasonable certainty that the records are in good shape without tying up the firm forever.

McIntyre's "audit" demands all of the records at the start. The purpose is to burden the scientist. He then yells and screams about every little jot and dot, almost 99% of which is due to his not understanding what was being done. At the end maybe one or two points remain. October's Briffa Fest is an excellent example.

Horatio Algeranon said...

"Ability to duplicate results by different researchers doing identical things in different laboratories has been a mainstay (cold fusion comes to mind)."

Actually, when it comes to reproducing lab results, cold fusion [sic] may be the best example of a case where it is not a good idea to precisely "repeat" the original procedure.

Pons and Fleischmann were remiss in properly monitoring for the nuclear byproducts of fusion, which would have told them that they had in all likelihood not observed fusion in their test tubes.

The most important refutation of their fusion claim came precisely from researchers who did take care to properly monitor for such by-products -- and did not find what would be expected above background.

skanky said...

"skanky: I don't see any comment of yours being blocked. Sometimes I'm slow about approving comments, but that's usually the extent of it."

Ah, okay, it was probably an error on my part then (I may have missed a word-verification timeout). Apologies for that.

It was a long post and I forget all I wrote, but I did use Steve Easterbrook's mention of the UKMO to remind myself of the MIDAS project that replaced the CDB (a flat-file DB that made use of MF architecture) at the UKMO. It holds their obs data, and it was moved to CA-IDMS (more recently it has apparently been moved to Oracle).

The relevance was in getting data in a very legacy, very proprietary format into a relational database, thus making it more portable, shareable, etc.

It was a several-year project for a dedicated team of about six (plus various support staff), and quite a lot of hardware. Few science organisations could afford that.

Some of the dataset is available on BADC, I think.

There might have been something in there about binary file support too, and how difficult it is, plus that some version control systems support them at least adequately.

This is now probably as long as the original. ;)