One of the first reported times that people died
as a direct result of software errors.
The Therac-25 was a radiation therapy machine made by Atomic Energy Canada Limited (later called Theratronics International Limited). The Therac-25 was a follow-on product to products called the Therac-6 and Therac-20, which were developed in conjunction with the French company GGR. Both the Therac-6 and Therac-20 had optional computer controls, but were primarily designed to be operated manually, and as such had numerous safety interlocks. The Therac-25 was designed to be entirely computer controlled. As such, it was decided not to include the same hardware safety interlocks, instead relying on the computer for all of this functionality.
Between June 1985 and January 1987, the Therac-25 was involved in 6 known accidents, several of them fatal, and even the non-fatal ones having gruesome side effects, like skin falling off, and shoulders and hips becoming completely immobilized.
[I saw a photo of one of the victims who died from overexposure. On their back was a perfect circle of pure black skin (the patient was very light skinned otherwise). The mind boggles at the dosage that would have been required to give that effect.]
Someone want to summarize the major defects in the Therac-25 software?
The fundamental errors were caused by RaceCondition
s between various concurrent activities within the system. Basic techniques of concurrent programming were either not known or ignored. This was exacerbated by project management that did not put sufficient emphasis on analysis of the design or testing of the implementation.
Perhaps an even more fundamental error was the reliance on software interlocks (aka if statements) for what was previously done with hardware interlocks...the race conditions couldn't have killed without this extra level of trust in the software. Indeed, one of the bugs was found in an earlier version of the Therac, but didn't cause any damage because a hardware interlock caught it. -- AdamBerger
[Additionally, the removal of hardware interlocks violates all recognized practices of electronics design. In any kind of control system you want sensors to tell the controller when an axis is near its limit. There also needs to be a switch or other cutout device that physically prevents the axis from crashing at the limit. For instance, a mechanical arm driven by a DC motor might have a switch with a diode across it at the end of travel. When the arm nears end of travel the controller slows down the arm and stops it before it hits the mechanical stop. If the controller fails the switch cuts off the drive current, but the diode allows the drive to be reversed and drive the arm away from its limit.
My first medical device gig was with H.G. Fischer (Fischer X-Ray). This was at the time they were developing their first computer-controlled X-ray system. The beta machine was at the University of Chicago Medical Center. The doctors wanted clean images despite the effects imaging would have on the patients -- who were mainly indigents, homeless, and other poor folk. The Beta machine had limits calculated for certain exposure values, and defaults of "none" once those bounds were exceeded. We couldn't believe it when the UCMC doctors were complaining that the machine was croaking at the high end of their studies until
we saw what kind of exposure values they were using. There were a few jokes circulating around Fischer about the x-ray imager/metal cutter we had provided them.
All mechanisms have limits. All mechanisms need those limits recognized by both the hardware and software. If a mechanism's limits aren't easily transposed to software then the software needs safely low limits set into it before it is ever even tested.]
For details, see:
Some of the hardware flaws:
- The dose measuring hardware could be overloaded. When overloaded, it produced a low dose reading, not a high dose reading.
- The exact position of a turntable mattered. This turntable did not have detents restricting it to the three or so valid locations. The sensing for the turntable location was based on three switch locations, not on a potentiometer reading.
- There was no dead-man switch to make sure that the patient was still in the treatment position.
- The controls were intended to be operated from a separate room, with no requirement that the operator could hear complaints by the patient.
Some of the software bugs:
- An error flag was implemented by incrementing a variable, instead of setting the variable to a known error condition. The variable happened to be an 8-bit integer, and could roll over to the "OK" value of 0.
- The system had a lag of 8 seconds while various magnets were adjusted. The user interface was not locked out during this time -- even though changes made during this time would be silently ignored.
- The error numbers were never intended to be read by the operators, even though errors were frequent.
- No documentation was available to the users about the error numbers.
- Even though different modes had electron beam power differences of order 1000x, the software assumed that a problem dose could be safely readministered up to 5 times.
- The system did not have any logging. (It did have a manual print feature.)
- The high and low bytes of a two-byte memory location were used for completely different purposes by different subsystems. This meant that there were unnecessary race conditions involving this memory location.
Some of the documentation problems:
- No documentation was available to the users about the error numbers.
- The investigation summary does not clearly distinguish between particle energy (e.g. 25 MeV), instantaneous electron beam power, average electron beam power, output beam power, and output beam power density. The investigation summary does not say whether the system documentation discussed these issues clearly.
- Presumably, the system documentation did not say that the system was theoretically capably of delivering 1000 times the intended dose.
- The system procedures did not say to take the system off-line completely pending an accident investigation if a patient had symptoms consistent with a large overdose.
Some of the process problems:
- The manufacturer assumed that software problems were nearly impossible.
- The software was assumed to be correct, even though it frequently reported errors.
- The frequently reported errors were considered a measure of the system's safety, not a sign that something was wrong with the system.
- The manufacturer did not proactively perform a root cause analysis of the frequent errors.
- The manufacturer did not have a software test plan.
- The manufacturer did not have any automated software tests.
- The system did not have any logging.
- Pointed questions after the first incident by one eleventh of the customer base did not trigger any internal investigation.
- The first lawsuit did not trigger any internal investigation.
- The lawsuits were settled out-of-court, and the results of the lawsuits were not disclosed.
Perhaps related to CategoryQuality