Safe and Secure Systems
Industrial Computers in Safety-Critical Applications: Out of the Ordinary
In safety-critical environments, requirements go well beyond the usual demands. Safety is becoming a concern in more and more areas of computing, but costs don’t have to explode: COTS solutions exist for CompactPCI and VMEbus.
SUSANNE BORNSCHLEGL, MEN MIKRO ELEKTRONIK
Wherever you go these days, chances are you are putting your faith in some type of electronic system that controls the equipment and amenities around you, and you expect the service provider to take the necessary safety precautions. Systems like these are subject to many stringent, safety-related and market-specific standards, such as those found in the railway and avionics markets. People place a great deal of trust, often unconsciously or even blindly, in these systems to operate reliably, especially when their safety depends on proper system operation.
This is of course a bit unnerving, so the obvious question is: what types of computers do system integrators typically employ? Certainly not just “ordinary,” commercial off-the-shelf industrial computers, when safety-critical systems are involved. Manufacturers have been developing industrial SBCs on two tried and tested standard platforms, CompactPCI and VME, that are certifiably safe and yet COTS-based. Of course, these two platforms are far from ordinary (Figure 1).
Safety-critical applications can employ cost-effective COTS components that have met the certifications of relevant industry-specific standards.
Safety as the Standard
Any reasonable design for safety-critical applications uses a set of common techniques to achieve the high reliability that is demanded. One of the key items is redundancy. A system component that, in the event of a failure, will stop the entire system from working is called a single point of failure (SPOF). In the worst case, such a failure could cause great damage and endanger human life.
This is why critical components are always incorporated several times, increasing system redundancy. Being identical, these components can keep the system in an operational state, even if one fails, or put the system into a safe state.
Systems that are “fail-operational” continue to operate when an error occurs, such as in the control system of an airplane. “Fail-safe” systems are turned off in case of a failure; they are immediately put into a safe state. For these scenarios, there are different types of redundancy setups, from “1-out-of-2” (1oo2) systems up to “2-out-of-4” (2oo4) systems. A “voter” mechanism in the redundant structure compares and evaluates the output of each component. Depending on the redundancy type, the system then reacts in a defined way.
In a 2-out-of-3 system with three identical CPU cards based on CompactPCI, for example, the voter compares the three results, and the output on which the majority of the circuits agree is used to control the system. Building up this type of redundancy from COTS components in a standard bus system appears to be cost-efficient, but only if the complex software requirements and specialized I/O needed to synchronize the three systems and implement the voter are addressed (Figure 2).
A 2-out-of-3 voting arrangement uses a “majority-rule” voting approach to ensure continued operation in the event of an error from any one of these three redundant circuits.
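The majority rule can be illustrated with a minimal software sketch. It is not how any particular board implements its voter (the article places that function in hardware); the function name vote_2oo3 and the returned tuple layout are assumptions made for illustration only.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def vote_2oo3(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], Optional[int]]:
    """Majority-vote the outputs of three redundant channels.

    Returns (voted_value, faulty_channel_index).
    voted_value is None when no two channels agree; the caller is then
    expected to put the system into a safe state.
    faulty_channel_index is None when all three channels agree.
    """
    if len(outputs) != 3:
        raise ValueError("a 2oo3 voter needs exactly three channel outputs")

    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        return None, None          # no majority: all three channels differ
    if count == 3:
        return value, None         # unanimous result
    # Exactly two channels agree; report the index of the dissenting one.
    faulty = next(i for i, out in enumerate(outputs) if out != value)
    return value, faulty
```

With outputs such as [5, 7, 5], the sketch returns the value 5 and flags channel 1 as the dissenting channel, so operation continues while the fault is reported.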
There are currently CompactPCI and VME CPU boards that come with onboard triple redundancy in a so-called lock-step architecture: the identical components work synchronously and always do the same thing, so the software effectively sees them as a single unit. The same software can be used as for a single-CPU board, which eases software integration, keeps overhead low and reduces development costs.
In case of failures, simple code is sufficient to synchronize the three processors. And for upgrades of existing systems based on a single-CPU solution, it takes less effort to port existing, dedicated applications.
A 2oo3 design also tolerates hardware faults and transients. A typical problem experienced in avionics applications is a single event or multiple event upset (SEU or MEU). In these random events, cosmic radiation “flips” one or more bits. A single processor or memory bank is very susceptible to such upsets.
The triple-redundant dynamic memory found in 2oo3 systems automatically corrects such upsets. Reading and writing are always performed on all three memory banks, and a “scrubbing” mechanism reads out one memory cell through the voter and writes the voted data back into all three banks. This happens in every refresh cycle and prevents flipped bits from accumulating over time.
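The scrubbing step can be sketched in a few lines. This is a software model only, assuming each bank is a plain list of integer words; the names vote_word and scrub are hypothetical, and real boards perform this step in hardware.

```python
from typing import List


def vote_word(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority: each result bit takes the value
    that at least two of the three input words hold at that position."""
    return (a & b) | (a & c) | (b & c)


def scrub(banks: List[List[int]], address: int) -> int:
    """Read one cell from all three banks, vote, and write the voted
    word back into every bank so a flipped bit cannot persist."""
    voted = vote_word(banks[0][address], banks[1][address], banks[2][address])
    for bank in banks:
        bank[address] = voted
    return voted


# Three banks holding identical data (illustrative values).
banks = [[0b1010_1100, 0x55] for _ in range(3)]
# Model a single event upset: flip one bit in bank 1, address 0.
banks[1][0] ^= 0b0000_1000
scrub(banks, 0)  # all three banks hold 0b1010_1100 again
```

Because the vote is taken bit by bit, upsets that hit different bit positions in different banks are corrected as well; only two banks flipping the same bit would defeat it, which is why scrubbing has to run often enough to keep flips from accumulating.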
Critical functions such as the voter and memory management functions can be implemented in an onboard FPGA. Unlike the specially developed, expensive components commonly used in aerospace, an FPGA protects the design from obsolescence and is a smart way to lower overall costs in the long run. This is particularly true where complex functions like Avionics Full-Duplex Switched Ethernet (AFDX) are needed. To harden the FPGA component, design tools can make its internal memory structure triple redundant as well, making it virtually immune to cosmic radiation.
Safety Relies on Data Execution
Besides being fault-tolerant, safety-critical systems often require predictable execution times. The system must react to an external event within a defined time, and this reaction time must be met even under worst-case conditions. Because interrupts and DMA can compromise the system reaction time, systems need to avoid them to assure strictly deterministic operation. Additional diagnosis mechanisms within the system can help detect latent errors before they lead to a system error. They include extensive built-in test equipment (BITE) features such as ECC or monitoring of all internal voltages, further increasing safety and availability.
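As one small illustration of such a diagnostic, the sketch below checks measured supply voltages against tolerance bands. The rail names, nominal values and the 5 percent tolerance are assumptions chosen for the example and are not taken from any particular board.

```python
from typing import Dict, List, Tuple

# Hypothetical rails: name -> (nominal voltage, allowed relative tolerance).
RAILS: Dict[str, Tuple[float, float]] = {
    "3V3": (3.3, 0.05),
    "1V8": (1.8, 0.05),
    "5V0": (5.0, 0.05),
}


def check_rails(measured: Dict[str, float]) -> List[str]:
    """Return the names of rails that are missing or out of tolerance."""
    faults = []
    for name, (nominal, tolerance) in RAILS.items():
        value = measured.get(name)
        # A missing reading is treated as a fault, not silently ignored.
        if value is None or abs(value - nominal) > nominal * tolerance:
            faults.append(name)
    return faults
```

A check like this would typically be polled at a fixed point in the processing cycle rather than triggered by an interrupt, in keeping with the deterministic operation described above.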
Bumps on the Road to Redundancy
Still, one obstacle seems to block the way when developing a redundant system for critical applications: because redundant subsystems are identical, the same fault can occur in all of them at once, producing multiple parallel failures.
Applicable industry standards, such as DO-254 for avionics and EN 50129 for railways, demand dissimilarity within the system architecture. Employing boards that incorporate a resource-partitioning MMU is one effective means of avoiding this pitfall. Independently developed, dissimilar applications can run on two partitions, and different I/O hardware can be used. These two dissimilar hardware/software paths have to lead to the same result for a specific action to take place.
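The agreement requirement between two dissimilar paths can be sketched as follows. The lever-to-setpoint mapping, the table values and all function names are hypothetical; the point is only that two independently written implementations must produce the same result before anything is acted upon.

```python
from typing import Callable, Optional

# Hypothetical mapping: lever position (0..4) -> setpoint in percent.
SETPOINT_TABLE = (0, 25, 50, 75, 100)


def path_a(position: int) -> int:
    """Path A: computes the setpoint arithmetically."""
    return position * 25


def path_b(position: int) -> int:
    """Path B: written independently, uses a lookup table instead."""
    return SETPOINT_TABLE[position]


def cross_check(
    a: Callable[[int], int], b: Callable[[int], int], position: int
) -> Optional[int]:
    """Return the setpoint only if both dissimilar paths agree.

    None means the paths disagree and the action must not take place.
    """
    result_a, result_b = a(position), b(position)
    return result_a if result_a == result_b else None
```

A design fault in one path, such as a wrong table entry, then shows up as a disagreement instead of being duplicated across identical channels.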
The last stepping stone in developing safety-critical systems is the operating system. Luckily, several platforms specifically target safety-critical applications, among them Sysgo’s PikeOS and various flavors of Wind River’s VxWorks. In addition to its general-purpose real-time operating system (RTOS), Wind River also supplies VxWorks platforms that support safety certifications up to DAL-A as defined by RTCA DO-254 and RTCA DO-178B (avionics), and up to SIL 4 as defined by EN 50129 (railways). Both PikeOS and VxWorks support the resource partitioning mentioned above.
CPU cards with optimized booting characteristics enable significantly shorter application startup times, so that applications can operate almost immediately after power-on and restart quickly after a power interruption, a critical function in safety-critical environments.
Two triple-redundant cards (Figure 3) can be combined to form a high-availability (HA) cluster to make a system even more tolerant of failures. In a configuration like this, each channel operates on its own, but only one channel is active. If the active channel fails, the system automatically switches over to the second channel.
SBCs can come with triple redundancy built into one board to save integration costs and system space.
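The switchover behavior of such an HA cluster can be modeled in a few lines. This is a simplified sketch: the Channel and HaCluster classes and the healthy flag are assumptions for illustration, and how a real system detects a failed channel is not covered here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Channel:
    name: str
    healthy: bool = True


class HaCluster:
    """Active/standby pair: only one channel is in control at a time."""

    def __init__(self, first: Channel, second: Channel) -> None:
        self.channels = [first, second]
        self.active = 0  # index of the channel currently in control

    def step(self) -> Optional[str]:
        """Return the name of the channel in control, switching over to
        the standby if the active channel has failed.

        Returns None when neither channel is healthy.
        """
        if not self.channels[self.active].healthy:
            standby = 1 - self.active
            if not self.channels[standby].healthy:
                return None
            self.active = standby
        return self.channels[self.active].name
```

In this model the cluster stays on the second channel after a switchover, even if the first one recovers; whether a real system should switch back automatically is a separate design decision.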
Redundant SBCs for Safe Operation
Only a qualified SBC can be effectively employed in safety-critical applications, such as aircraft communication, navigation or display control, infotainment, flight control, weather systems, collision avoidance or other management systems. On the ground, airport-related infrastructure needs safety as well: communications, security systems, traffic control and radar systems all play mission-critical parts in air traffic. Similarly, SBCs used on board a train or other vehicle, as well as for wayside control in the various systems involved in railway traffic, need these exceptional levels of reliability and redundancy.
Apart from the two classic areas of railways and avionics, there are many other interesting fields where computers that are both industrial and safe may be needed. In general, this includes markets where failures may lead to high costs: commerce, logistics, production, the medical industry and server or telecommunication infrastructure.
In a world of ever increasing dependence on computer infrastructure, failure-safe operation has become an important asset. The “ordinary” industrial computer may not be able to live up to the high demands emerging from this.