The inquiry board's report on the failure of the ArianeFive
's maiden flight is available from:
It's a most informative read.
- The maiden flight of the ArianeFive launcher, on 4 June 1996, ended in the erratic maneuvering and, 40-odd seconds after lift-off, subsequent automatic self-destruction of the craft.
- The problem was in the software.
- There are two main computers on board ArianeFive, each of which has a hot backup; the inertial reference system (SRI), which integrates data from inertial sensors to keep track of where the craft is, and the on-board computer (OBC), which, using the SRI's reports, controls the engines to steer the craft. The computers are connected by a data bus.
- There was an instrument on board measuring horizontal acceleration, a value used during the pre-launch calibration process, which reported values as 64-bit floats which were subsequently converted in the SRI software to 16-bit signed integers.
- During the flight, the value reported became too large to be encoded in a short (a FixedQuantityOverflowBug), resulting in an exception being thrown; the SRI wasn't programmed to catch the exception, and it crashed. Because this was a flaw in the software, the hot backup crashed at the same time as the primary. The failed SRI started writing diagnostic output onto the data bus (as opposed to reporting errors OutOfBand). The OBC, however, continued to read the SRI's output as if it were navigational data; consequently, it made essentially random maneuvers, leading to the destruction of the craft.
It's not clear where the point of failure was. The horizontal accelerometer wasn't needed after take-off; it could have been shut down. The conversion from doubles to shorts is one that usually generates a checked exception (the software was written in the AdaLanguage
), but the programmer had decided that the exception couldn't happen, and so had suppressed the check; he could have handled it properly. The unexpected exception crashed the SRI; it could have had a recovery strategy, like restarting if something 'impossible' happened. The hot backup crashed as well, because it was an exact duplicate; the backups could have been running different software (?). The SRI-OBC communication channel (bus + protocol + software at each end) allowed the OBC to become confused; it could have signalled that the SRI's output was diagnostic rather than navigational data. The OBC treated the bizarre SRI data as gospel; it could have sanity-checked it. Any one of those 'could's (especially the earlier ones) could have saved the mission, Ariane
Space's face, and the 3 billion franc Cluster mission it was carrying.
Incidentally, the software was reused from ArianeFour?
has a different take-off trajectory, with less horizontal acceleration, so even though the potential existed for this bug to manifest itself, the conditions were never right. The correct operation of the software in ArianeFour?
seems to have led the ArianeFive
engineers to think that they didn't need to test the software as heavily as if it had been new.
The really interesting questions are "What could they have done differently to prevent this happening?", "What other totally unforeseen problems are there?", "How can they prevent those
?" and "Am I in any better a position than them?". Perhaps the best quote in the report is "The Board is in favour of the [...] view, that software should be assumed to be faulty until applying the currently accepted best practice methods can demonstrate that it is correct."!
Apparently, the double-to-short conversion wasn't checked because it was thought to be too computationally expensive; the SRI had a tight CPU budget. However, there is no indication that this decision was made on the basis of profiling - it looks like PrematureOptimization.
I believe the conversion had been proven never to overflow. However, the proof was valid only for Ariane IV. Ariane V was more powerful and had a flatter trajectory - the proof was no longer valid.
There was no proof:
n) During design of the software of the inertial reference system used for Ariane 4 and Ariane 5, a decision was taken that it was not necessary to protect the inertial system computer from being made inoperative by an excessive value of the variable related to the horizontal velocity, a protection which was provided for several other variables of the alignment software. When taking this design decision, it was not analysed or fully understood which values this particular variable might assume when the alignment software was allowed to operate after lift-off.
o) In Ariane 4 flights using the same type of inertial reference system there has been no such failure because the trajectory during the first 40 seconds of flight is such that the particular variable related to horizontal velocity cannot reach, with an adequate operational margin, a value beyond the limit present in the software.
The analysis that showed that the horizontal velocity could not have overflowed in Ariane IV was done after the Ariane V failure, not during the Ariane IV development.
See Also: FixedQuantityOverflowBug