10-03-2023 | By Robin Mitchell
As the engineering world continues to increase its reliance on software, engineers must recognise that while software is great for most applications, those involved with safety must always consider a hardware approach. In this article, we will explore just how engineers have come to rely on software, why software reliance can be detrimental, and how engineers should consider changing their attitude towards safety and hardware.
During the early years of computing, it was generally believed that the value of a computer lay in the hardware it was built from and not in the software it ran. This was especially true when early microprocessors made up the highest cost in even the simplest systems, RAM was severely limited, and little software was available for such systems. This belief was made abundantly clear when IBM allowed Microsoft to keep the rights to distribute MS-DOS for non-IBM computers, thinking that the operating system held little value.
However, fast forward to 2023, and it is evident that software is where the money is. Hardware used in computing is merely a means to run the software, which gets users to their end goal. For example, it doesn’t matter if a computer uses an Intel, AMD, or Apple SoC processor; it only matters if the user can run their favourite OS, web browser, games, and applications. This is becoming even more true thanks to the introduction of cloud computing, whereby entire applications are web-based, with the heaviest processing being done at a data centre, and the hardware used by such centres is entirely transparent to the user.
This growing dominance of software has also impacted how engineers develop new solutions, and the use of software to solve problems over dedicated hardware is increasing in popularity. For example, a microcontroller that is required to communicate with an external device via UART may not have any free UART ports, but instead of sourcing a new device, a UART port can be bit-banged in software. Another example, albeit more extreme, is the development of self-driving vehicles that, instead of using LiDAR and RADAR to measure depth physically, use AI software to infer depth from multiple cameras. Such software solutions are also capable of reading road signs, detecting the status of traffic lights, and making decisions based on perceived conditions.
Using software to solve problems that would otherwise use hardware presents engineers with several advantages. The first is that software-based solutions can be tested and refined over time to improve their performance without making any changes to the underlying hardware. The second advantage is that errors in the software can be addressed with future updates, even if the target has already been deployed. This makes software-based solutions highly adaptable, while purely hardware-based solutions are rigid and unchanging.
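To make the bit-banging idea concrete, here is a minimal sketch of how a UART frame can be generated in software. It assumes 8N1 framing (one start bit, eight data bits sent least-significant bit first, one stop bit), which the article does not specify, and it is written in Python purely for readability; on a real microcontroller this would typically be C with cycle-accurate delays. The `set_pin` and `delay_bit` callbacks are hypothetical stand-ins for a GPIO write and a one-bit-period delay.

```python
from typing import Callable, List


def uart_frame_bits(byte: int) -> List[int]:
    """Return the line levels for one 8N1 frame: start bit, data bits, stop bit."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0..255")
    # UART sends the least-significant bit first.
    data_bits = [(byte >> i) & 1 for i in range(8)]
    # The line idles high, so the start bit is low and the stop bit is high.
    return [0] + data_bits + [1]


def bit_bang_send(
    byte: int,
    set_pin: Callable[[int], None],   # hypothetical GPIO write
    delay_bit: Callable[[], None],    # hypothetical delay of one bit period
) -> None:
    """Drive a pin through one UART frame, holding each level for one bit period."""
    for level in uart_frame_bits(byte):
        set_pin(level)
        delay_bit()
```

Because the timing of every bit now depends on the processor doing nothing else at that moment, an interrupt or a slow code path can corrupt a frame, which is exactly the kind of trade-off that software substitutes for hardware introduce.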
While there is no doubt that software plays a critical role in modern design, relying on software to solve all problems can have serious repercussions, some of which are potentially fatal. A prime example of a failed software solution is the Therac-25, a computer-controlled radiotherapy machine manufactured in 1982 and designed to generate different sources of radiation at varying power. The idea behind the Therac-25 was that combining multiple radiation sources into a single machine, all controlled by computer software, could help reduce the overall treatment cost while providing doctors with more options during treatment.
Under normal conditions, the Therac-25 could provide one of three different modes of operation: light mode, electron-beam therapy, and megavolt x-ray therapy. The light mode created a collimated source of visible light that operators could use to align the system. The electron-beam therapy mode used a low-current beam of high-energy electrons to treat specific areas, with magnetic fields guiding the beam. The megavolt x-ray mode generated a high-energy beam of x-rays for directed treatment.
However, due to race conditions in the software used by the Therac-25, it was possible for the electron source to output the energy needed to generate x-rays while the machine was in electron-beam mode. This meant that for some unlucky patients, the Therac-25 would briefly fire an electron beam of over 100 times the intended dose, causing extreme pain and radiation burns. Radiation poisoning would then set in, which in several cases resulted in death.
After a thorough investigation, it was discovered that numerous factors contributed to the failure of the Therac-25. Firstly, the engineers responsible for the design hadn’t performed due diligence in checking the software for potential issues, nor had they arranged any third-party analysis. Secondly, the engineers of the Therac-25 didn’t consider failure modes and how the machine would behave under fault. Thirdly, AECL, the company that developed the Therac-25, assured operators that it was impossible to receive an overdose from the system, likely due to the computer-controlled nature of the system. Fourthly, the Therac-25 was never tested with both the hardware and software combined before being assembled at the hospital.
To make matters worse, investigators even discovered that the manual for the Therac-25 didn’t include error code explanations, meaning that operators couldn’t distinguish minor machine faults from potentially dangerous situations. Finally, the Therac-25 lacked any hardware interlock mechanisms to prevent the system from arming in an unsafe configuration, even though its software was copied over from earlier models, the Therac-6 and Therac-20, which did have hardware interlocks to fall back on.
The Therac-25 was a machine from the early 1980s, so, understandably, it suffered from software issues that had never been faced before. However, there are countless examples of modern software solutions that have been inappropriately used in safety-critical applications.
For example, Tesla EV owners can experiment with Tesla’s Full Self-Driving mode, but while this works the vast majority of the time, there are numerous incidents of the system failing, resulting in crashes and, in the worst cases, fatalities. While older Tesla vehicles were fitted with RADAR and ultrasonic parking sensors to get real-world measurements, newer vehicles are camera-only platforms that rely entirely on software to make safety decisions.
Another example of using software in a safety application is the Boeing 737 MAX, whose automated anti-stall software could unexpectedly force the plane into a nosedive. Simply put, two sensors on the front of the Boeing 737 MAX measure the angle between the wing and the oncoming airflow (i.e., the angle of attack), and the output of these sensors is supposed to feed an automated system called the Manoeuvring Characteristics Augmentation System (MCAS). However, in reality, MCAS relied on only one of these sensors at a time, and if that sensor malfunctioned, the system would conclude that the nose was pitched too high and force the plane to nosedive.
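A common mitigation for this kind of single-sensor dependency is to cross-check redundant sensors and refuse to act automatically when they disagree. The sketch below illustrates that general idea only; it is not Boeing's implementation or its later fix, and the function names, the 5-degree disagreement limit, and the 15-degree threshold are all hypothetical values chosen for the example.

```python
import math
from typing import Optional

# Hypothetical limits chosen for illustration only.
MAX_DISAGREEMENT_DEG = 5.0
HIGH_ANGLE_THRESHOLD_DEG = 15.0


def validated_angle(left_deg: float, right_deg: float) -> Optional[float]:
    """Return the mean of two sensor readings, or None if they cannot be trusted."""
    # Reject non-numeric readings such as NaN or infinity from a failed sensor.
    if not (math.isfinite(left_deg) and math.isfinite(right_deg)):
        return None
    # If the sensors disagree, there is no way to know which one is right.
    if abs(left_deg - right_deg) > MAX_DISAGREEMENT_DEG:
        return None
    return (left_deg + right_deg) / 2.0


def automatic_correction_allowed(left_deg: float, right_deg: float) -> bool:
    """Permit automatic intervention only on a trusted, high angle reading."""
    angle = validated_angle(left_deg, right_deg)
    return angle is not None and angle > HIGH_ANGLE_THRESHOLD_DEG
```

With this structure, a single faulty sensor reporting an extreme angle causes the automation to stand down and leave control with the human operator, rather than acting on bad data.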
If these examples teach us anything, it is that software and safety don’t generally mix well. In the case of the Therac-25, engineers should have put safeguards in hardware that were fool-proof, preventing patients from being exposed to the electron beam during a switchover of modes. In the case of Tesla, vehicles should incorporate more measurement systems that can operate even if cameras are disabled, and the Boeing 737 MAX MCAS should never have been able to override pilot controls.
For engineers who are required to use software in safety-critical applications, it goes without saying that extensive testing is needed, but it also helps to have third parties test solutions, as they may spot bugs that would otherwise go unnoticed. At the same time, engineers must think about how their code operates and what should happen under worst-case scenarios such as unexpected power loss, stack overflows, and out-of-memory conditions. Finally, code should be structured carefully to eliminate infinite loops, using finite state machines and timeouts as much as possible, and should always include complete error catching that resets the system to a known safe state.
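As a rough illustration of that last point, the sketch below combines a small finite state machine, a timeout, and a catch-all error handler that drops back to a safe state. The `Controller` class, its three states, and the two-second arming timeout are hypothetical and exist only for this example; a real safety-critical system would pair logic like this with hardware interlocks and a watchdog rather than rely on it alone.

```python
import enum
import time
from typing import Callable


class State(enum.Enum):
    SAFE = "safe"
    ARMING = "arming"
    ACTIVE = "active"


class Controller:
    """Hypothetical controller that falls back to SAFE on timeout or error."""

    def __init__(self, arming_timeout_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = arming_timeout_s
        self._clock = clock          # injectable so the timeout can be tested
        self._started = 0.0
        self.state = State.SAFE

    def request_arm(self) -> None:
        # Arming is only permitted from the known safe state.
        if self.state is State.SAFE:
            self._started = self._clock()
            self.state = State.ARMING

    def stop(self) -> None:
        self.state = State.SAFE

    def step(self, read_ready: Callable[[], bool]) -> State:
        """Advance the state machine once; never blocks and never loops."""
        try:
            if self.state is State.ARMING:
                if self._clock() - self._started > self._timeout:
                    self.state = State.SAFE      # took too long: give up
                elif read_ready():
                    self.state = State.ACTIVE
        except Exception:
            # Any unexpected failure (e.g. a sensor read error) forces SAFE.
            self.state = State.SAFE
        return self.state
```

Each call to `step` does a bounded amount of work, so there is no loop that can hang while waiting for a condition that never arrives, and every failure path ends in the same known state.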