Frequent Errors In Scientific Software May Undermine Many Published Results

from the it's-a-bug-not-a-feature dept

It’s a commonplace that software permeates modern society. But it’s less appreciated that it increasingly permeates many fields of science too. The move from traditional analog instruments to digital ones that run software brings with it a new kind of issue. Although analog instruments can be — and usually are — inaccurate to some degree, they don’t have bugs in the same way as digital ones do. Bugs are much more complex and variable in their effects, and can be much harder to spot. A study in the F1000Research journal by David A. W. Soergel, published as open access using open peer review, tries to estimate just how much of an issue that might be. He points out that software bugs are really quite common, especially for hand-crafted scientific software:

It has been estimated that the industry average rate of programming errors is “about 15-50 errors per 1000 lines of delivered code”. That estimate describes the work of professional software engineers — not of the graduate students who write most scientific data analysis programs, usually without the benefit of training in software engineering and testing. The recent increase in attention to such training is a welcome and essential development. Nonetheless, even the most careful software engineering practices in industry rarely achieve an error rate better than 1 per 1000 lines. Since software programs commonly have many thousands of lines of code (Table 1), it follows that many defects remain in delivered code — even after all testing and debugging is complete.

To take account of the fact that even when there are bugs in code, they may not affect the result meaningfully, and that there’s also the chance that a scientist might spot them before they get published, Soergel uses the following formula to estimate the scale of the problem:

Number of errors per program execution =
total lines of code (LOC)
* proportion executed
* probability of error per line
* probability that the error meaningfully affects the result
* probability that an erroneous result appears plausible to the scientist.
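To make the arithmetic concrete, here is a minimal Python sketch of how those factors multiply out. The parameter values plugged in below are illustrative guesses chosen to land near the two headline figures discussed next; they are not Soergel's exact numbers.

```python
def expected_errors(loc, frac_executed, p_error_per_line,
                    p_meaningful, p_plausible):
    """Soergel-style estimate: expected number of errors that silently
    change the output of a single program run."""
    return (loc * frac_executed * p_error_per_line
            * p_meaningful * p_plausible)

# Illustrative guesses for a "typical medium-scale bioinformatics analysis":
# 100,000 lines, 20% of them executed, 10 errors per 1,000 lines,
# 10% of errors change the result, 10% of wrong results look plausible.
medium = expected_errors(100_000, 0.2, 0.01, 0.1, 0.1)
print(medium)   # 2.0  -> a wrong output is essentially certain

# Illustrative guesses for a "small focused analysis, rigorously executed":
# 2,000 lines, all executed, 1 error per 1,000 lines,
# 10% of errors meaningful, 25% of those slip past the scientist.
small = expected_errors(2_000, 1.0, 0.001, 0.1, 0.25)
print(small)    # 0.05 -> roughly a 5% chance of a wrong output
```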

He then considers some different cases. For what he calls a “typical medium-scale bioinformatics analysis”:

we expect that two errors changed the output of this program run, so the probability of a wrong output is effectively 100%. All bets are off regarding scientific conclusions drawn from such an analysis.

Things are better for what he calls a “small focused analysis, rigorously executed”: here the probability of a wrong output is 5%. Soergel freely admits:

The factors going into the above estimates are rank speculation, and the conclusion varies widely depending on the guessed values.

But he rightly goes on to point out:

Nonetheless it is sobering that some plausible values can produce high total error rates, and that even conservative values suggest that an appreciable proportion of results may be erroneous due to software defects — above and beyond those that are erroneous for more widely appreciated reasons.

That’s an important point, and is likely to become even more relevant as increasingly complex code starts to turn up in scientific apparatus, and researchers routinely write even more programs. At the very least, Soergel’s results suggest that more research needs to be done to explore the issue of erroneous results caused by bugs in scientific software — although it might be a good idea not to use computers for this particular work….

Follow me @glynmoody on Twitter or identi.ca, and +glynmoody on Google+


Comments on “Frequent Errors In Scientific Software May Undermine Many Published Results”

46 Comments
Mark Wing (user link) says:

It’s not uncommon to find bugs in financial and insurance software 10-15 years into a production system that is online every day with people paying close attention to what it’s doing.

The people who own these systems love money more than scientists love science, and scientists are known to be terrible about admitting they are wrong, so I can only imagine how buggy scientific software is.

Harald K (profile) says:

Re: Re:

Sure. But going from that to “you can’t trust your bank receipt” is a stretch. Most of the bugs are going to be in code that doesn’t really matter for the outcomes you care about, and even then, most of the time it’s going to be really obvious (if your bank balance suddenly is 10^20, or temperature ends up as NaN in your weather model).

The real heavy lifting in scientific code goes on in numerical libraries. Those are written not just by specialist programmers, but programmers who are specialists on that specific topic. They are heavily scrutinized.

Anonymous Coward says:

Re: Re: Re:

They are heavily scrutinized.

Sure. But by whom?

Everyone who programs knows that it’s really hard to find your own errors: if it was easy, they’d never escape your gaze. Asking your colleague(s) to check your code is better because they’re not you…but they’re NOT completely independent, even if they make a good-faith effort to be so.

So if review stops there and never goes any further, that is, it never extends to people who are completely independent, then “heavily scrutinized” comes down to “me and the people I work with”.

And that’s not very thorough.

Mason Wheeler (profile) says:

Re: Re: Re: Re:

Many (though not all!) of the most widely-used scientific and numerical computing libraries are open-source packages that anyone can look into. This is known as Linus’s law: “with enough eyes, all bugs are shallow [ie. easy to detect]”, and it explains the high quality of popular open-source software.

Anonymous Coward says:

Re: Re: Re:4 Re:

Ah! Bystander Effect. I thought there was something nagging at the back of my mind. Anyone ever figure out the weirdness w/ OpenSSL’s lack of scrutiny? I personally wouldn’t ever bother looking at the code, ’cause to me crypto and security just feel very specialized and, frankly, intimidating. Maybe I’m not as alone in that feeling as I thought.

Lawrence D’Oliveiro says:

Re: Re: Re:4 OpenSSL was an extreme outlier though, and hardly typical.

OpenSSL seemed to suffer from an insistence on carrying around a whole lot of legacy baggage, combined with a shortage of funding to maintain the mess.

Which is why LibreSSL was forked off.

Certainly other popular open-source projects get a lot of scrutiny. Look at what Coverity does, for example.

Nathaniel Brown (profile) says:

All bugs are not considered equal.

When people estimate that there are 10 to 50 errors per 1000 lines of code, rarely if ever will those bugs change the outcome of normal data being used in a normal way.

These bugs are easy to find, easy to reproduce and easy to fix. They are also easy to test for and are considered serious bugs. You generally don’t ship code with serious bugs.

The kinds of bugs that generally exist for stable programs are things like the following REAL bug that currently exists in EVERY version of Microsoft Access:

Create an Access database in Access (any version). Open the database in a Chinese version of Microsoft Access, create a new form, then create a macro or VBA code, and save the database. If you now open the database in an English version of Access, it can open the form but not run it. You can change the name of the form from a Chinese name to an English name, which will fix the form, BUT you cannot change the code that Access generated for you, which is in Chinese and does not work on Windows without Chinese installed. You have to create a new form, copy everything over, and delete the old form.

This is a real bug. It is really annoying, but data is never corrupted. These are the kinds of low-level bugs that generally are not fixed. It involves using multiple computers with different languages and only happens in a very specific situation. It can also be worked around easily. Fixing the issue is also difficult and likely to create new bugs.

Saying that there is a 10% chance that a bug “meaningfully affects the result” is crazy, because with a little bit of effort you can figure out a more accurate number. Look at Linux, Mozilla or a smaller project, classify the bugs, and find out how many actually change or corrupt the data as opposed to crashing the program.

Richard (profile) says:

Re: All bugs are not considered equal.

When people estimate that there are 10 to 50 errors per 1000 lines of code rarely if ever will those bugs change the outcome of normal data being used in a normal way.

I’m not so sure. Some years ago I heard a presentation by an expert on bugs who recounted the following story:

He took the mission-critical data analysis software of 3 or 4 different major oil/oil exploration companies and configured each so that they would run exactly the same set of algorithms. He then fed the same input data to each. The agreement was only to one decimal place.

Also remember the design error in the 8087 that made the floating point evaluation stack unusable. I was still getting program crashes caused by this bug in the mid 90’s (via more than one different compiler).

JoeCool (profile) says:

Re: Re: Re: All bugs are not considered equal.

The only possible cause of the lack of agreement was software errors.

Not at all! There are plenty of other ways to get different results without a single error. It depends on how the programmer interprets different algorithms. Maybe one is approximating part of the problem as linear when it’s not because linear problems are easier to solve. Maybe another realizes linear isn’t very accurate in the case, so he does piecewise linear for better results. Maybe yet another uses a quadratic that’s even better over part of the range, but much worse outside the range.

You don’t need bugs to have problems with accuracy. It’s one of the things that my engineering classes at the Uni covered. Certain problems cannot be evaluated directly, so you make approximations and then justify those over the range of inputs you expect to receive. Specially designed computers and programs are often used in physics to solve certain problems DIRECTLY, and they can take over a year to solve one problem. That’s not acceptable for many folk, so they MUST approximate the solution.
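As a toy sketch of how two bug-free implementations can legitimately disagree, compare a linear and a quadratic approximation of the same quantity. The function and sample points below are purely for illustration:

```python
import math

def f_exact(x):
    return math.exp(x)

def f_linear(x):        # one team approximates e^x as 1 + x
    return 1.0 + x

def f_quadratic(x):     # another keeps the next term: 1 + x + x^2/2
    return 1.0 + x + 0.5 * x * x

for x in (0.01, 0.1, 0.5, 1.0):
    exact = f_exact(x)
    print(f"x={x}: linear off by {abs(f_linear(x) - exact):.4f}, "
          f"quadratic off by {abs(f_quadratic(x) - exact):.4f}")
# Near x = 0 all three agree to several decimal places; by x = 1 the
# answers visibly diverge -- with no bug anywhere in either program.
```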

Anonymous Coward says:

Re: Re: Re:2 All bugs are not considered equal.

Did you also cover the inherent problems that arise when doing numerical calculations on computers? Size of operands and order of operations in all algorithms and approximations can end up giving completely bogus results. Particularly with 0 bugs in the code.

Very careful analysis of the information flow is also required beyond just looking for bugs.

JoeCool (profile) says:

Re: Re: Re:3 All bugs are not considered equal.

Yep. Even the difference between floats and doubles can totally change an output. You have to pay particular attention to integer/fixed point math in graphics rendering and compression. Compression specs usually have very specific tests and requirements for integer/fixed point implementations to be sure that the output is accurate to less than a bit.
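To make that concrete, here is a minimal sketch of both effects; the single-precision half assumes numpy is installed:

```python
import numpy as np

# Single precision runs out of integer resolution at 2**24,
# so adding 1 to 16,777,216 is silently lost:
x32 = np.float32(16_777_216)
print(x32 + np.float32(1.0) == x32)   # True  -- the +1 vanished
x64 = np.float64(16_777_216)
print(x64 + 1.0 == x64)               # False -- double precision keeps it

# Order of operations matters too, because float addition is not associative:
a, b, c = 1.0e16, -1.0e16, 1.0
print((a + c) + b)   # 0.0 -- the small value is swallowed by the big one first
print((a + b) + c)   # 1.0 -- cancel the big values first and the 1.0 survives
```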

Anonymous Coward says:

Re: Re:

But are the weirdnesses there because of errors in the simulation software running the models, or the fact that the models themselves have plenty of ‘iffy’ bits?

To wander out into digression-land, I had a professor who had once worked at Los Alamos National Labs. He was put on a team that was working on climate models. Funny thing, when testing the models he couldn’t help noticing that the climate behaved very much like a nuclear explosion.

Anonymous Coward says:

Re: Re: Re:

Who cares whether or not he is; his point needs to be acknowledged. Over 30 years of coding and experience tells me that user-generated code (that is, code written by users who are specialist non-programmers) will have many bugs, lurking and not so lurking, in the code base. Code they expect to be executed all the time is never executed, code that should be executed on an irregular basis is executed all the time, and everything in between.

The interesting point for me is how much of the modelling being used in all scientific fields has the potential (if not the actuality) of not working as it is believed to be working. I have seen too many software systems that have appeared to give reasonable results (and people have made major decisions on those results) and yet the models have been flawed.

Don’t forget that belief is a strong motivator in accepting the results produced. If you believe the results are correct and they sort of match your expectation then you will believe the software is doing its job.

This is why extensive regression tests are important. This is why extensive review by expert non-involved parties is important. This is why extensive analysis of data and data flows using alternative means is important. This is why extensive analysis of algorithms used and testing of those algorithms is important.
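As a toy illustration of the regression-test point: pin a hand-checked reference result so a later change can’t silently move it, and make bad input fail loudly rather than return something plausible. The analysis function and numbers below are made up purely to show the pattern (pytest-style test functions):

```python
import math

def growth_rate(counts):
    """Hypothetical analysis step: mean log-ratio between successive counts."""
    ratios = [math.log(b / a) for a, b in zip(counts, counts[1:])]
    return sum(ratios) / len(ratios)

def test_growth_rate_reference_case():
    # Reference input and output captured from a run that was checked by hand.
    # If a later "cleanup" changes this value, the test fails and forces a look.
    counts = [100, 150, 225, 340]
    assert abs(growth_rate(counts) - 0.4073) < 1e-3

def test_growth_rate_rejects_single_point():
    # Guard the silent-garbage case: fail loudly, don't return nonsense.
    try:
        growth_rate([5])
    except ZeroDivisionError:
        pass
    else:
        assert False, "expected an error for a single-point series"
```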

Many of the programs used in the scientific community (particularly the models built up by scientific teams) are developed piecemeal. Accretion of new facilities is a fact of life. As such, these specialists (who often are quite over-optimistic about their programming abilities) haven’t a clue that their piece of software is a piece of junk.

JoeCool (profile) says:

Re: Not just software

It was the Pentium, and it had a problem with a certain range of numbers in certain operations. The “fix” was to look for the problem numbers first, then do a different operation that wasn’t erroneous. For example, if x*y would give an error because x was one of those problem numbers, the code might compute (2*x)*y/2 instead to avoid the problem. Needless to say, that made those processors slower, since they had to check for “bad” numbers. They did fix the problem, but the damage had been done.

Anonymous Coward says:

Detectability of software defects

We’re all now painfully aware that many eyeballs don’t necessarily make deep bugs shallow — but it’s worth noting that at least all those eyeballs have the opportunity to conduct independent peer review of anything/everything whose source code is published and freely available.

Contrast with closed-source software, where no such opportunity exists. It’s not an option. Yet, as we’ve seen on those few occasions where such source code has been leaked, the defect rates are often much higher than for open-source code. There’s absolutely no reason to believe for even a moment that the programmers at Microsoft and Oracle and Adobe and elsewhere are any better at what they do than the programmers working on Mozilla or Wireshark or Apache HTTPD.

So the author of the referenced paper is right, but arguably doesn’t go far enough: we don’t just have to worry about the code being written by random graduate students, we need to worry about the code being USED by random graduate students. If it’s never been independently peer-reviewed, then it can’t be trusted.

The irony of this is that the entire scientific research process is founded on the concept of peer review. Yet researchers will uncritically use a piece of software that’s essentially an unreviewed black box and trust the results that it gives them.

Mason Wheeler (profile) says:

Re: Detectability of software defects

We’re all now painfully aware that many eyeballs don’t necessarily make deep bugs shallow

I assume you’re referring to Heartbleed? That could actually be written up as a textbook failure of the open-source development process: in the OpenSSL project, the many eyeballs simply weren’t there. People who looked into it found that very few people besides the authors were actually doing anything to review the code before news of Heartbleed became public.

Anonymous Coward says:

Re: Re: Detectability of software defects

Not just Heartbleed. We’ve seen enough examples of open-source code that’s heavily-used and in some cases heavily-hacked-on…but still had glaring problems that escaped everyone. I think we now know that eyeballs are good, independent eyeballs are better, clueful independent eyeballs better yet, many clueful independent eyeballs still better — but none of these are a panacea.

In other words, independent peer review is necessary, but not sufficient. “Sufficient” is TBD, but it probably looks like “serious in-depth audit”, e.g., what Truecrypt went through recently. I think that’s probably the best that we can do today. I’m hopeful that we’ll evolve better methods of coding and review, and that we’ll create better automated analysis tools — and both of those are happening, so maybe, just maybe, we might be asymptotically approaching “quality code”.

On a good day. 😉

Lawrence D’Oliveiro says:

Re: Re: Re: but still had glaring problems that escaped everyone

As soon as the problems were spotted, they were very quickly fixed.

As opposed to proprietary software, where known problems can go unfixed for years.

I had this peculiar conversation with someone once, who refused to use a well-known piece of actively-developed open-source software because it was “unsupported”, while in the next breath he complained about how his preferred proprietary alternative would keep crashing all the time.

I wondered what his idea of “support” was…

Richard (profile) says:

NAAA

although it might be a good idea not to use computers for this particular work….

You think that manual methods are less suspect!

With a computer program at least the same result (correct or erroneous) happens every time the line is executed. With a manual system you get to put in a whole new error every time! I’d say that’s a whole lot worse!

Anonymous Coward says:

Re: NAAA

Your point is valid, but:

With a computer program at least the same result (correct or erroneous) happens every time the line is executed.

That’s not necessarily true. Sometimes flaky code winds up giving different results every time; sometimes it gives a right result most of the time and a wrong one some of the time; sometimes it…well, you get the point.

Repeatable and obvious errors are the easy ones to find and fix. Semi-random weirdness can be difficult to even notice, let alone diagnose and fix.
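As a contrived example of that kind of semi-random weirdness (purely illustrative): Python randomises string hashes per process, so the iteration order of a set can differ between runs, and because floating-point addition is not associative, the “same” program can print different totals on different runs:

```python
# Summing a set of (label, value) pairs: the iteration order of a set keyed
# by strings depends on per-process hash randomisation, and float addition
# is not associative, so this total can vary from one run to the next.
readings = {("sensor_a", 1.0e16), ("sensor_b", 1.0), ("sensor_c", -1.0e16)}
total = 0.0
for _, value in readings:
    total += value
print(total)   # sometimes 0.0, sometimes 1.0, depending on iteration order
```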

Bob Webster (profile) says:

That's not a study.

That is not a study. It is an opinion article. It’s stupid to believe that there are that many significant errors in production code (with the possible exception of phone apps).

The defects referred to in David Soergel’s reference (“Code Complete”, written more than 10 years ago) include things such as misaligned output, insufficient error trapping, invalid data input filters, user interface problems, and many other errors that do not cause “wrong answers”.

Furthermore, debugging is a major process in software development. There are many, many errors that appear in any non-trivial software project as it is being written and tested. Many of these prevent the software from running in the first place. With testing, these are largely eliminated, particularly those that can significantly affect results.

For example, in a scientific application with a limited number of users, it may be completely acceptable for the application to crash on invalid input. It may take more programming time than it’s worth to add elegant error handling. This is a “software defect”, yet it has zero effect on the results. Many of the defects referred to in “Code Complete” are of this nature.

In addition, the statistics quoted above (and all over the internet) are mere guesswork. The article even states “The factors going into the above estimates are rank speculation, and the conclusion varies widely depending on the guessed values,” and this in the “rigorous analysis”!

In my opinion, this does not merit appearance in Tech Dirt, and certainly lowers the average quality of this site. It’s a typical scare-hype article, all too common today.

Anonymous Coward says:

Re: That's not a study.

It’s stupid to believe that there are that many significant errors in production code (with the possible exception of phone apps).

It’s actually more stupid to believe that there aren’t many significant errors in production code. As someone who has been involved in software review, testing, design, development and implementation, I know that your statement misses the mark completely.

include things such as misaligned output, insufficient error trapping, invalid data input filters, user interface problems, and many other errors that do not cause “wrong answers”.

Let’s take these one by one shall we.

Misaligned output – I have seen examples where misaligned output is then used as input to another program, and errors are generated because a wrong value is read in at the next stage. This is a problem for pipes (you know, that feature used in unix-based machines).

Insufficient error trapping – if an error is not caught properly (or not caught at all), the program can continue as if nothing has happened. For example, if values to be used have a valid range and there is no out-of-range detection, then an out-of-range value can change the results in significant ways that still appear reasonable but are wrong (see the short sketch after this list of points).

Invalid data input filters – how often do we see this problem arising? SQL injection, anyone? Incorrect string processing, conversion of strings to numbers going wrong because floating-point conversions are initiated, and so forth.

User interface problems – how many errors have been generated because the user interface doesn’t work correctly: allowing wrong data to be entered, not giving feedback that a value is out of range, indicating a process has completed when it hasn’t, or indicating that a process hasn’t completed when it has, so that the action is initiated a second time, which leads to processing errors.
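To make the error-trapping and input-filter points concrete, here is a minimal sketch (the sensor scenario and numbers are invented) of how a missing range check lets one corrupted value slide into an average and produce a result that can still look plausible:

```python
def mean_temperature_unchecked(readings):
    # No validation: a mis-parsed value like -999 (a common "missing data"
    # sentinel) is silently averaged in and drags the mean down.
    return sum(readings) / len(readings)

def mean_temperature_checked(readings, low=-90.0, high=60.0):
    # With a range filter, the bad value is caught instead of blended in.
    bad = [r for r in readings if not (low <= r <= high)]
    if bad:
        raise ValueError(f"out-of-range readings: {bad}")
    return sum(readings) / len(readings)

readings = [21.3, 22.1, -999.0, 20.8, 21.7]   # one corrupted entry
print(mean_temperature_unchecked(readings))    # about -182.6: obviously wrong
# here, but with 10,000 readings one sentinel would shift the mean only
# slightly and the result would still look entirely plausible.
# mean_temperature_checked(readings)           # would raise ValueError instead
```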

There are many, many errors that appear in any non-trivial software project as it is being written and tested. Many of these prevent the software from running in the first place. With testing, these are largely eliminated, particularly those that can significantly affect results.

True. However, one of the most common kinds of error is an incorrect test on a conditional statement. I have seen too many programs where the wrong branch of a conditional is executed and the bulk of the code (which should have been executed) is not. Getting conditions right is actually quite difficult, particularly in highly complex situations, and testing can miss much of this. It is quite tedious to create a complete conditional testing map, and as a consequence it is quite easy to miss particular branches. This is why there is a rise in having pattern matching (often called match in the relevant programming languages) analysed by the compiler so that such errors are reported. That is not available in old languages such as C and C++.
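A tiny illustration of how an incorrect conditional can survive testing (the quality-filter scenario is made up): the boundary case is the only place the bug shows, and a quick happy-path test never touches it:

```python
def keep_read(quality_score, threshold=30):
    # Intended rule: keep reads whose quality is AT LEAST the threshold.
    # The comparison below is accidentally strict, so reads scoring exactly
    # 30 -- often a large slice of the data -- are silently thrown away.
    return quality_score > threshold      # should be >=

# A quick "does it work?" test passes, because it never probes the boundary:
assert keep_read(45) is True
assert keep_read(10) is False
# The silent failure only shows up at the edge case:
print(keep_read(30))   # False, but the intent was True
```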

For example, in a scientific application with a limited number of users, it may be completely acceptable for the application to crash on invalid input. It may take more programming time than it’s worth to add elegant error handling. This is a “software defect”, yet it has zero effect on the results. Many of the defects referred to in “Code Complete” are of this nature.

If the results are needed for publication and/or further analysis, then it behooves the people in question to ensure that their program doesn’t vomit inelegantly on errors. That can significantly affect the reliability of those results. How can we trust them if the system cannot handle errors properly?

In addition, the statics quoted above (and all over the internet), are mere guesswork. The article even states “The factors going into the above estimates are rank speculation, and the conclusion varies widely depending on the guessed values,” and this in the “rigorous analysis”!

With regard to software systems, the less reliable you consider a system, the more likely you are to challenge its results and check the software for correctness. Much as you might be happy to just believe, I would prefer more work done on checking and testing. Too many software systems today are being used in situations that significantly affect the lives of people, whether these be banking and financial systems, medical care and medical research, traffic and traffic management, etc. Software systems are complex and need to be scrutinised more closely than they currently are.

In my opinion, this does not merit appearance in Tech Dirt, and certainly lowers the average quality of this site. It’s a typical scare-hype article, all too common today.

You are entitled to your opinion. However, I believe the opposite about this article. It is certainly not scare-hype, but a reflection of the reality of the systems that are used to create the scare-hype pushed by various parties. If we actually did more critical analysis of our software systems, we would be more likely to have less scare-hype arising, and we could be more confident of our software systems in the areas of reliability and security.

Mark Wing (user link) says:

Some languages also invite more errors. It’s been said that C++ is a loaded gun pointed at your head by default. If you do something bad with a pointer, the software has undefined behavior, meaning it might work fine, then stop, then go back to working fine. Ten people looking at those millions of lines of code may not see it. Run it through the debugger a dozen times and it looks fine. A shitty programmer can do lasting damage on big systems in languages like C++.

nasch (profile) says:

Hand crafted

He points out that software bugs are really quite common, especially for hand-crafted scientific software:

Is there some other kind of software? Software produced on an assembly line maybe? Or are you drawing a distinction between custom made software and “shrink-wrapped” (a less relevant term now that almost everything is distributed electronically but I’m not sure if there’s a new term to replace it)?

Anonymous Coward says:

Re: Hand crafted

Yes there is. It is called using compilers, etc.

But seriously, haven’t you written or even used systems that take specifications or configurations and generate code? You don’t hand-craft the code; you let the computer generate it for you. There are a myriad of software systems that require only a specification to be designed and they will automate the software generation for you.

Few people write in binary these days as it is so much harder to get it right. Even LISP and its ilk use macros to generate code.

But I do see your point. The amount of hand-crafting can be quite minimal for some environments. That is one reason for the growing use of libraries of software, so the various parts don’t have to be hand-crafted.

nasch (profile) says:

Re: Re: Hand crafted

There are a myriad of software systems that require only a specification to be designed and they will automate the software generation for you.

Yeah that’s true.


Few people write in binary these days as it is so much harder to get it right.

You don’t have to write in binary or even assembler for it to be hand crafted.

That is one reason for the growing use of libraries of software, so the various parts don’t have to be hand-crafted.

Not exactly. The libraries are mostly hand-crafted. We use them so that functionality doesn’t have to be created over and over again.

Anonymous Coward says:

Re: Re: Re: Hand crafted

You raised the point of hand-crafted. Hand-crafted generally means you have done most if not all of the work using relatively simple tools. It then becomes a matter of opinion as to whether you consider using libraries to provide functionality as being hand-crafted or you write the code yourself to provide that specific functionality.

The comparison, I suppose, can be likened to hand-crafting a Damascus steel blade. Is it hand-crafted if you use a power hammer to beat the steel, or do you need to actually use a blacksmith’s anvil and hammer to shape the material? Different people will have different views on that, and we cannot really say either is right or wrong. They both have some merit.

nasch (profile) says:

Re: Re: Re:2 Hand crafted

It then becomes a matter of opinion as to whether you consider using libraries to provide functionality as being hand-crafted or you write the code yourself to provide that specific functionality.

I don’t consider “hand-crafted” to mean I made it, it just means someone made it. The library was hand-crafted, just not by me.

scorpion136 (profile) says:

No-code-generated bug

The most-executed bit of code in a mainframe operating system had a hidden bug that was causing processes to get “lost”, i.e. they were not on any dispatchable queue. They were effectively zombie processes which would never run again, could not be communicated with, and would never execute another instruction. I helped find the bug.

We were profiling the code to see which instructions were using the most time, so we could further optimize the routine. (It was already carefully handcrafted in assembler, aligned on cache lines, etc. to be as fast as possible.)

In the bar graph of time used per instruction, there was one instruction which never got executed. The strange part – it was in a section of code with no branches. An impossibility – linear execution of a series of instructions where one of them is never found executing when the ones before and after it are found executing millions of times.

The bug turned out to be a comment in the previous line of code. The comment extended one column too far, into the “continuation” column. This made the comment continue onto the next line, which contained the instruction that never got executed. Of course, even though the instruction was in the listing, it was not in the generated binary shown to the left of each line in the assembler output. In other words it was in the source but never assembled. Nobody looked over there, we were reviewing the code, not the binary instruction stream generated from the code.

That one missing instruction caused all the lost-process problems.
