Designing Very High Performance ATCA Systems
The demand for compute power in today’s high-performance applications brings a variety of technical issues, but a standards-based approach can achieve successful solutions.
THOMAS ROBERTS, MERCURY COMPUTER SYSTEMS
In today’s data-stream processing applications, a bandwidth of 10 gigabits per second (Gbits/s) is considered high. For voice-processing applications, 10 Gbits/s is equivalent to more than 150,000 standard voice channels under the G.711 standard, and data compression techniques can further increase system capacity. Video-processing applications require even more.
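The channel count above can be checked with a short calculation. This is a minimal sketch that assumes the standard G.711 rate of 64 kbits/s per channel (8,000 samples per second at 8 bits each) and ignores packet and framing overhead; the function name is illustrative only.

```python
# G.711 encodes one voice channel at 64 kbits/s (8,000 samples/s x 8 bits).
G711_BITS_PER_SECOND = 8_000 * 8


def voice_channels(link_bits_per_second: int) -> int:
    """Whole G.711 channels that fit in the raw link rate (no overhead modeled)."""
    return link_bits_per_second // G711_BITS_PER_SECOND


print(voice_channels(10_000_000_000))  # 156250
```

The result, 156,250 channels, is consistent with the "more than 150,000" figure quoted in the text.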
With sensor data, such as radar or camera data, the incoming data stream can be unrelenting. For example, when a camera scans an area, the data must be processed in real time, because there is too much of it to be stored. If 10 Gbits/s of data is coming in and there is even a minimal delay in processing it, storage (buffering) capacity runs out very quickly.
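To give a feel for how quickly buffering runs out, the sketch below computes the time until a buffer overflows when data arrives faster than it is processed. The 1 GiB buffer size and the function name are assumptions chosen for illustration; the article does not specify a buffer size.

```python
def seconds_until_full(buffer_bytes: int, ingress_bps: float, drain_bps: float) -> float:
    """Time until the buffer overflows; infinity if processing keeps up."""
    surplus_bps = ingress_bps - drain_bps
    if surplus_bps <= 0:
        return float("inf")
    return buffer_bytes * 8 / surplus_bps


# Hypothetical 1 GiB buffer, 10 Gbits/s arriving, processing stalled completely.
print(seconds_until_full(2**30, 10e9, 0.0))
```

Under these assumptions the buffer fills in less than one second, so even a brief processing stall loses data.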
Latency and determinism are critical characteristics of data-stream processing requirements as well. Processing latency, the time measured from ingress port to egress port, must be very low. What counts as a low-latency response depends on the application. For example, voice-processing applications must control the signal processing delay across the entire network, including both satellite transmission delays and intra- and inter-system processing delays, to keep a phone call understandable. The goal is a maximum end-to-end delay of 200 to 250 milliseconds. For industrial control-loop applications, which operate in a smaller physical environment, latencies are measured in microseconds rather than milliseconds.
These very low latencies must also be reliable. For the application to perform properly, each data processing step must be performed within a well-known, extremely small window of time, and this window must be the same each time the step is performed. This characteristic is referred to as determinism.
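One way to make this definition concrete is to test a set of measured step latencies against both a deadline and a jitter bound. The following is a minimal sketch; the sample values, the deadline, and the jitter bound are hypothetical and are not taken from any measurement in this article.

```python
def is_deterministic(latencies_ns, deadline_ns, max_jitter_ns):
    """True if every sample meets the deadline and the spread stays within the jitter bound."""
    worst = max(latencies_ns)
    jitter = worst - min(latencies_ns)
    return worst <= deadline_ns and jitter <= max_jitter_ns


# Hypothetical measurements of one processing step, in nanoseconds.
samples_ns = [192, 195, 193, 194]
print(is_deterministic(samples_ns, deadline_ns=250, max_jitter_ns=5))  # True
```

A low average latency is not enough to pass this check: a single slow sample or a wide spread between the fastest and slowest samples makes the step non-deterministic.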
Computational density, the processing power these applications need to manage their high-bandwidth, low-latency requirements, depends on how much processing must be performed on the data. In general, an application that has a lot of data coming in most likely needs a lot of processing power. However, the amount of processing power it requires can vary by multiple orders of magnitude, depending on what the application needs to do to the data.
Several design components are required when building systems to meet the high-bandwidth, low-latency and deterministic requirements of high-end data-stream processing applications. A successful system architecture does the following:
• Partitions the application across multiple processors, so each processing step is handled by the right type of processor.
• Uses a switch fabric to move the processing stream from processor to processor at high speeds with great efficiency.
• Builds on a physical system infrastructure that can support efficient power and thermal management for densely packed multiprocessor systems.
Choosing the optimal multiprocessor architecture for a specific application requires considering the type of processing needed and the amount of data to be processed. Different processing elements, such as FPGAs, DSPs and network processors, are designed to handle different processing requirements.
FPGAs are most effective for simple mathematical operations like add/multiply, because they perform the operations at the gate level as the data moves through, rather than moving the data from memory to a computational unit and back, as other types of processing elements must do. For beamforming applications, which require an enormous number of simultaneous mathematical calculations, FPGAs are far superior to other processor types. Operations that can be performed with Boolean logic are best done on FPGAs. FPGAs are also good for filtering operations, which extract desirable data from an incoming data stream or remove unwanted data. For example, in applications that process antenna-generated data, an FPGA can efficiently filter out carrier information from incoming data channels.
DSPs, on the other hand, are effective for data compression and decompression, known as codec operations. In addition, compression and decompression of data is often combined with echo cancellation, another strength of DSPs. Echo cancellation is a critical component of Voice over Internet Protocol (VoIP).
Network processors, which are specialized programmable ASICs, are best for in-depth packet inspection. The format of a packet is clearly defined, so the desired information can be extracted efficiently. A network processor in a router, for example, can decide, in real time, where to send an incoming packet, what its priority is, and, therefore, how promptly it must be processed. These and many other similar functions of network processors are based on in-depth analysis of the packet content.
When different types of processing must be applied to a data stream, it makes sense to match processing elements to specific processing needs. For example, in voice and video applications, the DSP engine compresses the data, while the network processor on the router identifies packets as voice or data-only, assigns a higher priority to voice packets, and sends them out on the network. Or, in a voice-processing application, an FPGA does waveform processing of the input signal from the antennas, while a network processor behind the FPGA performs packet-level processing.
Using Switched Fabrics
To make a partitioned processing job run at optimal speed, the data stream must move between processors with maximum efficiency. The application must be able to rely on data transfers that occur with very low latency in a very deterministic manner. Switch fabrics are superior to bus architectures for this purpose in several ways (Figure 1).
Because switched fabrics are fundamentally point-to-point, they avoid bus contention, and the current generation is well suited to high-bandwidth applications. At 10 Gbits/s, a serial switched fabric performs substantially better than a bus. Switched fabrics also simplify backplane routing, which makes them easier to implement, and they are significantly better at supporting fault detection and redundancy.
Very low latency is a hallmark of switched fabrics. For example, using serial RapidIO, latency is under 1 microsecond for a one-way trip across the backplane between any two endpoints in the system. In comparison, using Gigabit Ethernet, latency is in the 1 millisecond range and up. Low latency also helps make switched fabrics deterministic: latencies are consistent, so the arrival of data from point to point is predictable. A bus is nondeterministic unless a sophisticated set of priority control algorithms is built in. Reliability is high as well, because switched fabrics consist of many point-to-point links, so a single failing node does not bring down the entire system.
If needed, switched fabrics can also support parallel transactions between pairs of elements. With 10 elements, for example, a switch can carry five simultaneous transactions, while a bus can carry only one. A switch itself has potential blocking issues, but they are fewer than those of a full bus, and certain architectures allow for the creation of non-blocking switches.
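The five-versus-one comparison follows from simple pairing. The sketch below assumes a non-blocking switch and that each element takes part in at most one transaction at a time; the function name is illustrative.

```python
def max_simultaneous_transfers(endpoints: int, shared_bus: bool) -> int:
    """Upper bound on concurrent point-to-point transfers.

    Assumes each endpoint is in at most one transfer at a time and,
    for the switched case, a non-blocking switch.
    """
    if endpoints < 2:
        return 0
    return 1 if shared_bus else endpoints // 2


print(max_simultaneous_transfers(10, shared_bus=False))  # 5
print(max_simultaneous_transfers(10, shared_bus=True))   # 1
```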
Switched fabrics are also less costly. For a bus, the connector cost, driven by the number of pins on the connector and the backplane needed to route that many lines, is high. Multicast is also easy to implement, since nearly all switching silicon supports this functionality.
Multicast is particularly useful when data is processed in parallel. The application can use multicast to send the same data simultaneously to several different processing elements. For example, an application that performs different types of filtering on data arriving from an antenna can multicast the data to the multiple filter processors in the system. A router can use multicast to send a block of packets out for parallel processing. Multicast can also be used in a control plane for system-wide shutdown in the face of heat-related problems or other types of errors, because all affected processing elements simultaneously receive the shutdown message.
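The multicast pattern can be illustrated with a toy in-memory model. This is only a sketch of the idea: the `FabricModel` class, its methods, and the endpoint names are hypothetical and do not correspond to any real fabric or switch API, where the replication would be done by the switching silicon.

```python
from collections import defaultdict, deque


class FabricModel:
    """Toy in-memory model of a switch with multicast groups (not a real fabric API)."""

    def __init__(self):
        self.inbox = defaultdict(deque)  # endpoint name -> received payloads
        self.groups = defaultdict(set)   # group id -> member endpoint names

    def join(self, group, endpoint):
        """Add an endpoint to a multicast group."""
        self.groups[group].add(endpoint)

    def multicast(self, group, payload):
        """Deliver one payload to every group member; return the number of copies."""
        members = self.groups[group]
        for endpoint in members:
            self.inbox[endpoint].append(payload)
        return len(members)


fabric = FabricModel()
for name in ("lowpass", "bandpass", "notch"):  # hypothetical filter processors
    fabric.join("filters", name)

copies = fabric.multicast("filters", b"antenna-samples")
print(copies)  # 3
```

The sender issues a single send, and every filter processor in the group receives the same block of antenna data.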
Using an ATCA-Based Framework
A logical approach to designing a system that meets high-performance requirements is to build on an appropriate industry standard. The Advanced Telecom Computing Architecture (AdvancedTCA or ATCA) was conceived to specify a carrier-grade system infrastructure. It was built from the ground up to support a range of processors.
The form factor and architecture of ATCA boards enable an unprecedented number of I/O connections on the front and rear panels. The carrier blades at the front panel allow optical or other types of I/O connections to bring data in and out, while the rear transition module (RTM) form factor provides a large amount of physical space for I/O connections from the rear.
The ATCA carrier blade form factor supports well-balanced systems delivering teraOPS of processing power in a single sub-rack, and the architecture is flexible as to the types of processors that can co-exist in the system. An advanced mezzanine card (AMC), which connects to a carrier blade, can contain processing elements of very different natures. Processor types can include standard processors (for example, Freescale 8641D PowerPC or Intel single or dual-core processors), DSP engines, FPGAs, or digital-to-analog (DA)/analog-to-digital (AD) converters, if the application needs to handle incoming analog signals or signals that come from some type of measuring equipment. The AMC concept even allows for the creation of highly specialized compute engines for particular applications.
If application requirements change over time, as they often do, a previously deployed AMC can be removed from the carrier blade and replaced with a new AMC whose processing element better matches the new requirements. AMCs provide the flexibility to match the processing to what the application really needs to do with the data.
ATCA specifies a very sophisticated intelligent platform management interface (IPMI)-based infrastructure that allows for the construction of a consistent system management environment for alarms, configuration and diagnostics that can be run on a completely different medium from the application’s data and control planes. Based on multiple I2C connections, IPMI can be described as a system-management fabric.
The ATCA standard defines a power maximum of 200 watts per slot, exceeding the 120 watts per slot of legacy VME, and 174 watts per slot of VME64x. More significantly, the IPMI infrastructure provides a standards-based mechanism for monitoring internal system temperatures and implementing adaptive cooling techniques, such as adjusting fan speeds based on those internal temperatures (Figure 2).
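Adaptive cooling of this kind typically maps a measured temperature to a fan setting. The following is a minimal sketch of one such policy, a linear ramp between two thresholds. The thresholds, duty-cycle limits, and function name are hypothetical, and the sketch does not call any real IPMI interface; in a real system the temperature would come from IPMI sensor readings and the result would be applied through the shelf's management controller.

```python
def fan_duty_percent(temp_c, low_c=40.0, high_c=70.0, min_duty=30.0, max_duty=100.0):
    """Map a temperature reading to a fan duty cycle with a linear ramp.

    All thresholds are illustrative assumptions, not values from the ATCA
    or IPMI specifications.
    """
    if temp_c <= low_c:
        return min_duty
    if temp_c >= high_c:
        return max_duty
    fraction = (temp_c - low_c) / (high_c - low_c)
    return min_duty + fraction * (max_duty - min_duty)


print(fan_duty_percent(55.0))  # 65.0
```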
Serial RapidIO and GbE Both Have a Place
In addition to IPMI, the ATCA system defines two distinct fabrics: the data plane and the control plane. ATCA supports a fabric interface for the data plane, and Gigabit Ethernet (1000BASE-T) for the control plane. ATCA’s data-plane fabric interface can support many different fabrics, including 1 and 10 Gigabit Ethernet and serial RapidIO, among others.
Serial RapidIO, running at 3.125 Gbaud per lane, and 10 Gigabit Ethernet are very competitive with respect to high bandwidth. However, while their capabilities overlap in some areas, serial RapidIO and Ethernet were developed to solve different problems, and their underlying architectures differ in many respects.
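The bandwidth comparison can be made explicit with a short calculation. This sketch rests on assumptions the article does not state: that the 3.125 figure is the signaling rate of a single lane, that the link uses four lanes, and that 8b/10b line encoding carries 8 data bits in every 10 transmitted bits. The constant and function names are illustrative.

```python
LANE_BAUD = 3_125_000_000  # assumed per-lane signaling rate
LANES = 4                  # assumed four-lane (x4) link


def data_rate_bps(lane_baud: int, lanes: int) -> int:
    """Data rate after 8b/10b encoding (8 data bits per 10 line bits)."""
    return lane_baud * lanes * 8 // 10


print(data_rate_bps(LANE_BAUD, LANES))  # 10000000000
```

Under these assumptions a four-lane serial RapidIO link carries 10 Gbits/s of data, which is why it is compared directly with 10 Gigabit Ethernet.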
Serial RapidIO was designed for embedded applications, supporting chip-to-chip and board-to-board communications. Because most of its protocol is implemented in the hardware of its endpoints, serial RapidIO offers extremely low latency and deterministic performance, and it does not require software management to move the data. The latency of serial RapidIO switches is highly deterministic: 112 ns for unicast packets and 163 ns for multicast packets. For endpoints, the latency depends on endpoint design, but is likely to be under 40 ns. With serial RapidIO, an increase in latency occurs only when there is an enormous amount of traffic or when an endpoint in the network is too slow to process its incoming traffic.
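The switch and endpoint figures above can be combined into a rough one-way latency budget. This is a sketch under simplifying assumptions: latency is modeled as the sum of one sending endpoint, one receiving endpoint, and each switch traversed, using the 40 ns endpoint figure as an upper bound and ignoring wire propagation, serialization, and congestion. The names are illustrative.

```python
SWITCH_UNICAST_NS = 112    # per-switch unicast latency quoted in the text
SWITCH_MULTICAST_NS = 163  # per-switch multicast latency quoted in the text
ENDPOINT_NS = 40           # upper bound for an endpoint quoted in the text


def one_way_latency_ns(switch_hops: int, multicast: bool = False) -> int:
    """Sum of sender endpoint, receiver endpoint, and every switch traversed."""
    per_switch = SWITCH_MULTICAST_NS if multicast else SWITCH_UNICAST_NS
    return 2 * ENDPOINT_NS + switch_hops * per_switch


print(one_way_latency_ns(1))                  # 192
print(one_way_latency_ns(1, multicast=True))  # 243
print(one_way_latency_ns(2))                  # 304
```

Even with two switch hops, the modeled total stays in the hundreds of nanoseconds, well under 1 microsecond.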
Ethernet was originally designed as a way for multiple computers to communicate over a shared coaxial cable. The physical layer has evolved to point-to-point, but each endpoint is assumed to have a processor that is both available and capable of running software that implements the Ethernet (TCP/IP) protocol stack. Because its protocol is implemented in software, Ethernet implies higher latency, non-deterministic performance, and the need for software management. Ethernet is a “best-effort” transmission, unless quality of service (QoS) is built in. There is no guarantee data will arrive at any particular time, and packets can be dropped (and lost). With serial RapidIO, latency is in the hundreds of nanoseconds range. With Ethernet, it could be one microsecond or much more depending on the amount of traffic on the network. Although the Ethernet stack can be implemented in hardware, this approach locks in a particular version of the stack, losing the flexibility that is one of its main advantages.
On the other hand, Ethernet has become the unchallenged communications interconnect for wide area networks, because its stack is highly flexible and supports essentially unlimited numbers of endpoints. Ethernet is also off-the-shelf technology. Nearly everyone knows how to deploy it, whereas serial RapidIO is less well known.
Mercury Computer Systems