Getting What You Pay For: Optimizing PCIe Accelerator Card Designs
PCIe accelerator cards can improve system performance significantly. But are you getting the most out of your investment? Careful design and implementation are necessary to ensure that the accelerator delivers the maximum possible benefit.
MATTHEW DHARM, JUMPGEN SYSTEMS
According to an old story that periodically makes its way around the industry, a system consisting of a rack of 1U servers was being used in a small room to process data. As the processing requirements grew, more servers were added but the total performance of the system did not improve. What the designers didn’t account for was that as the number of servers increased, the temperature of the room increased. Each server then went into “thermal throttle back,” reducing power consumption to compensate for the higher temperatures. Thus, while there were more servers, each unit was running slower, and the total processing power of the room was (effectively) a constant. Most people take the moral of the story as “not every problem can be solved with more units.” But, perhaps the moral should be “make certain you get all the performance you are paying for.”
One common way to expand system capabilities is the addition of one or more “accelerator” cards. Such cards usually contain multiple network interfaces, some number of general-purpose CPU cores, and some number of hard-accelerator cores for specific tasks such as pattern matching, cryptography, compression, or even storage-related mathematical operations. Often, a system-on-chip (SoC) Network Processor (NPU) or other similar part contains much of this capability. The host CPU and the accelerator card are connected (generally) via a multi-lane PCIe link (Figure 1).
Figure 1. Example of a PCIe accelerator card. The JumpGen R7E-100 uses an 8-core Netlogic processor and can provide up to 10 Gbit/s of bulk encryption.
While accelerator cards can offer significant increases in system performance, the task of the software designer now includes the added responsibility of making certain that all subsystem components are being utilized to their maximum capability. Also, software must maximize throughput and minimize latency in the data path between the general-purpose CPU and accelerator card. Let’s take a look at some of the key design considerations of this sort of system.
The most obvious area of focus for developers interested in optimizing system performance is, perhaps, transporting data between an accelerator and host processor. Regardless of the type of accelerator in use, at some point data must flow from the host to the accelerator, or vice versa. Three areas are of critical importance when moving data across a PCIe bus: the use of Direct Memory Access (DMA) transfers rather than Programmed I/O (PIO), the use of writes instead of reads, and the efficient use of interrupt requests (IRQs). As with many things in the software world, there are many ways to move the data, and most of them are poor choices.
DMA vs. PIO
Many designers view the use of DMA as an opportunity to reduce the number of processing cycles devoted to moving data from one location to another. While that is one benefit, when dealing with multiple processing complexes (such as in a system with a host processor and an accelerator card), the real benefit is in maximizing the utilization of the interconnecting PCIe bus.
DMA involves the use of a dedicated “engine” specifically designed to copy data from one location to another. A processor sends commands to the engine, and the DMA engine often has a variety of transfer modes suitable for packets, multi-dimensional arrays, storage blocks, or other common data types. Some even support descriptor-based operation, allowing multiple jobs to be queued up and processed automatically.
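Descriptor-based operation is easier to picture with a small software model. The sketch below is not a driver for any real device: the Descriptor fields, the ToyDmaEngine class, and the 32-byte burst size are all assumptions made for illustration. It shows only the division of labor, in which the processor queues jobs and the engine drains them in bursts.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    """One queued copy job (all fields are hypothetical, for illustration)."""
    src_offset: int
    dst_offset: int
    length: int


class ToyDmaEngine:
    """Software model of a descriptor-driven DMA engine, not real hardware."""

    def __init__(self, src: bytearray, dst: bytearray, burst_bytes: int = 32):
        self.src = src
        self.dst = dst
        self.burst_bytes = burst_bytes   # assumed size of one bus access
        self.queue = deque()             # pending descriptors
        self.bus_accesses = 0            # bursts issued so far

    def submit(self, desc: Descriptor) -> None:
        """The processor's only job: queue a descriptor and move on."""
        self.queue.append(desc)

    def run(self) -> int:
        """Drain the queue without processor involvement; return jobs completed."""
        completed = 0
        while self.queue:
            d = self.queue.popleft()
            # Copy one burst at a time, counting each burst as a bus access.
            for off in range(0, d.length, self.burst_bytes):
                n = min(self.burst_bytes, d.length - off)
                s, t = d.src_offset + off, d.dst_offset + off
                self.dst[t:t + n] = self.src[s:s + n]
                self.bus_accesses += 1
            completed += 1
        return completed


host_memory = bytearray(range(256))
card_memory = bytearray(256)
engine = ToyDmaEngine(host_memory, card_memory)
engine.submit(Descriptor(src_offset=0, dst_offset=128, length=64))
engine.submit(Descriptor(src_offset=64, dst_offset=0, length=40))
jobs_done = engine.run()
print(f"{jobs_done} jobs completed in {engine.bus_accesses} bus accesses")
```

Both jobs are queued before the engine starts, which is the point of descriptors: the processor does not have to wait for one transfer to finish before describing the next.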
What is, perhaps, less obvious about DMA engines is that their base unit of operation is usually quite large. For example, a processor may be able to move a 64-bit value (8 bytes) in a single bus transaction, while a DMA engine may easily move 256 bits (32 bytes) or more in a single bus access, and many engines work in far larger units still. Because each access carries more data, it takes fewer bus transactions to complete a transfer. This reduces transaction overhead and increases throughput.
The difference between DMA and PIO, then, is transaction overhead on the PCIe bus. When using PIO to transfer data, each 8-byte chunk of data gets hit with the overhead of initiating and ending a transaction. When using DMA with 32-byte accesses, the same data needs one quarter as many transactions, so the overhead penalty is at least 4 times smaller than with PIO, and possibly 8 or 16 (or more) times smaller with larger accesses.
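The arithmetic behind that comparison can be sketched in a few lines. The 24-byte per-transaction overhead used here is an assumed, illustrative figure rather than a number taken from any specification; the real cost depends on the header format and link generation. The function names are likewise invented for this example.

```python
import math

# Assumed fixed cost per bus transaction, in bytes. Real PCIe overhead depends
# on header format and link generation; 24 is only an illustrative figure.
OVERHEAD_BYTES = 24


def transactions_needed(total_bytes: int, payload_bytes: int) -> int:
    """Number of bus transactions to move total_bytes in payload-sized units."""
    return math.ceil(total_bytes / payload_bytes)


def bus_efficiency(payload_bytes: int, overhead_bytes: int = OVERHEAD_BYTES) -> float:
    """Fraction of bytes on the link that are useful payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)


BLOCK = 4096  # bytes to transfer
for label, payload in (("PIO, 8-byte", 8), ("DMA, 32-byte", 32), ("DMA, 128-byte", 128)):
    count = transactions_needed(BLOCK, payload)
    print(f"{label:>14}: {count:4d} transactions, "
          f"{bus_efficiency(payload):.0%} of link bytes are payload")
```

Under these assumptions, moving a 4096-byte block takes 512 transactions with 8-byte PIO accesses and 128 with 32-byte DMA accesses, the factor of 4 described above.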
PCIe Writes, Not Reads
When transferring data between two entities, many designers treat the decision to read the data from one side or write it from the other as arbitrary, a matter of convenience. However, this choice has serious performance implications.
The issue at hand is fundamental to the operation of PCI-like buses, including PCIe. When writing data on such a bus, all intermediate bridge devices between the initiator and target are allowed to buffer the data being written. Since the bridges are buffering data, multiple writes can be happening through any given bridge simultaneously. Several rules apply to this buffering, including mandatory flushing of buffers under certain conditions to give the appearance of coherency.
However, reads performed on such buses are entirely different. In the classic PCI model, all intermediate bridge devices between an initiator and target must interlock simultaneously, providing a clear and continuous path for the two entities to exchange data until one decides to stop (or the latency timer expires). The overhead to set up the read interlock down the entire chain is significant, and no other reads through the intermediate bridges can happen during this time. PCIe splits a read into a request and a later completion rather than holding the path open, but the requester still waits a full round trip for its data, so reads remain more expensive than posted writes.
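A rough timing model shows why the choice matters. This is a deliberate simplification: it assumes writes can be issued back to back because the bridges buffer them, and that only one read is in flight at a time. The function names and the nanosecond figures are hypothetical and chosen only to show the shape of the result.

```python
def posted_write_time_ns(n: int, issue_ns: float, one_way_ns: float) -> float:
    """Writes are buffered, so the initiator issues them back to back;
    only the last one's transit time adds to the total."""
    return n * issue_ns + one_way_ns


def blocking_read_time_ns(n: int, issue_ns: float, one_way_ns: float) -> float:
    """Each read waits for its data to come back before the next is issued."""
    return n * (issue_ns + 2 * one_way_ns)


# Hypothetical timings chosen only to show the shape of the result.
N, ISSUE_NS, ONE_WAY_NS = 100, 10.0, 250.0
write_ns = posted_write_time_ns(N, ISSUE_NS, ONE_WAY_NS)
read_ns = blocking_read_time_ns(N, ISSUE_NS, ONE_WAY_NS)
print(f"writes: {write_ns:.0f} ns, reads: {read_ns:.0f} ns, "
      f"ratio: {read_ns / write_ns:.1f}x")
```

In this model the write total is dominated by the time to issue the transactions, while the read total is dominated by link latency, which grows with every bridge between the two ends.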
It is worth mentioning that many system bugs are caused by developers not properly understanding the rules for buffering PCI writes, or not properly applying these rules to their system design. This often leads developers to switch to the lower-performance “read” configurations. Familiarity with these rules, known as PCI Write Posting Rules, is recommended for anyone implementing a PCIe-based system.
Efficient Use of Interrupts
While most people are familiar with interrupts signaled from an accelerator card to a host CPU, it is also common for a host CPU to signal an interrupt to an accelerator card. These interrupts can carry any of a number of meanings, but most commonly they signal that data is ready for processing or that processing on a block of data has completed. Interrupts can be implemented in any number of ways, from physical interrupt request lines, to PCIe in-band interrupt messages, to message signaled interrupts, to mailbox interrupts.
Receiving an interrupt, however, is generally the same regardless of how or where it is generated. Also, receiving an interrupt is a costly process in terms of computational resources. Receiving and processing an interrupt generally requires a processor to switch contexts, handle the interrupt (at least minimally, often deferring most of the work until a later time), and then restore context to what it was before the interrupt happened. These operations involve a large amount of overhead and therefore high rates of interrupts can cripple a system. This phenomenon is the reason many systems have difficulty processing even a low-bandwidth data stream that is composed of extremely short packets (voice and video data have these characteristics); each packet causes an interrupt, and the overhead of the interrupts overwhelms the processing capacity of the system, even at a relatively low total bandwidth.
There are many ways to mitigate this issue. One such way is “interrupt coalescing,” where multiple interrupts that happen close together in time are issued as a single interrupt (this is usually done with some sort of hold-off or delay logic). Another technique is to change what triggers an interrupt, moving, for example, from an interrupt for every packet to an interrupt for every 1000 packets or for every flow setup/teardown. Whatever the method, limiting the processing power consumed by interrupts can be an important part of system performance.
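Hold-off logic of this kind can be modeled in a few lines. The sketch below is one possible scheme, not a description of any particular device. It assumes that the first event after an idle period starts a timer, and that every event arriving before the timer expires is folded into the same interrupt. The coalesce function and the microsecond figures are hypothetical.

```python
def coalesce(arrivals_us, holdoff_us):
    """Return the times at which interrupts fire.

    The first event after an idle period starts a hold-off timer; every event
    that arrives before the timer expires is folded into the same interrupt.
    """
    interrupts = []
    deadline = None
    for t in sorted(arrivals_us):
        if deadline is None or t > deadline:
            deadline = t + holdoff_us      # start a new hold-off window
            interrupts.append(deadline)    # one interrupt fires when it ends
    return interrupts


# Fifty short packets arriving 20 microseconds apart (hypothetical traffic).
arrivals = [i * 20 for i in range(50)]
per_packet = coalesce(arrivals, holdoff_us=0)
coalesced = coalesce(arrivals, holdoff_us=100)
print(f"{len(per_packet)} interrupts without hold-off, "
      f"{len(coalesced)} with a 100 us hold-off")
```

The price of fewer interrupts is latency: a packet can wait up to the full hold-off period before the processor is told about it, so the window must be tuned to the traffic.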
It is important to note, however, that reducing the total number of interrupts is not the only concern. The PCI Write Posting Rules also require that PCI bridge write buffers be flushed when an interrupt is passed through the bridge. Thus, if an accelerator transfers data to the host, an interrupt can be used to flush buffers and guarantee that the host sees all the data.
Dividing Up the Workload
While it is important for system designers to be mindful of how data is moved between a host processor and an accelerator card, it is also important to consider where and how the data is manipulated. Specifically, it may be a performance gain to pre- or post-process data on an accelerator card, even if that accelerator card has a different primary function. Also, when transferring data between a host processor and accelerator card, choosing a sensible format for the processor architectures involved can significantly improve performance.
Many designers using an accelerator-based architecture initially target the accelerator for very specific parts of the processing workload. For example, many accelerators have dedicated hardware to support compression or certain types of cryptography (Figure 2). These dedicated acceleration blocks are faster than either the host CPU or the processor on the accelerator card. While this does lead to improved system performance, the processor on the accelerator is often extremely underutilized, leaving a large opportunity to further improve performance.
Figure 2. A typical accelerator architecture on a PCIe card. Note how data can be processed by multiple blocks on the accelerator before/after being sent to/from the host.
When allocating what workload will be done on the host or on the accelerator, the total capabilities of the accelerator, including any general-purpose processing capabilities, should be considered. Moving a task as simple as a packet checksum calculation to an accelerator that is already handling that packet can free up valuable host processor resources and provide a significant performance boost.
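To give a sense of the work involved, here is a sketch of one common packet checksum, the 16-bit ones' complement sum described in RFC 1071. The article does not say which checksum is meant, so this choice is an assumption, and Python stands in here for whatever code would actually run on the card.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]   # add each big-endian 16-bit word
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(payload)
# A receiver verifies by summing the data together with the checksum field.
verified = internet_checksum(payload + checksum.to_bytes(2, "big")) == 0
```

The loop touches every byte of the packet. If the accelerator is already handling that data, running the loop there spares the host from fetching each byte a second time.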
Formatting Data between Subsystems
While efficiently moving data between subsystems is important, moving that data in the best possible format is equally important. Different CPUs have different strong and weak aspects of their architecture, and using the accelerator to change the format of the data to be more suitable for host CPU processing can yield performance improvements.
For example, many CPUs are more efficient when data is aligned on a 64-bit boundary, but Ethernet packets have many fields that are not aligned on any reasonable boundaries. Misaligned memory accesses can be a significant source of performance problems and are almost invisible to most debugging and profiling tools. However, if an accelerator card segmented the packet into multiple buffers such that fields of interest were aligned for the convenience of the host CPU, then the host CPU could process the data more efficiently at the relatively minor cost of a few bytes of memory. Alternatively, an accelerator card could extract the desired information from the packet and then transfer an additional data block to the host CPU that summarizes that information in an easy-to-access format.
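The summary-block idea can be sketched with Python's struct module. The 16-byte record layout, the little-endian host byte order, and the summarize function are assumptions made for this example. The frame is taken to be untagged Ethernet carrying IPv4 with TCP or UDP.

```python
import struct

ETH_HEADER_LEN = 14  # dst MAC (6) + src MAC (6) + EtherType (2), no VLAN tag

# Hypothetical summary layout for a little-endian host: two 32-bit addresses,
# two 16-bit ports, one protocol byte, and three pad bytes (16 bytes total).
SUMMARY = struct.Struct("<IIHHB3x")


def is_aligned(offset: int, width: int) -> bool:
    """True if a field of the given width starts on its natural boundary."""
    return offset % width == 0


def summarize(frame: bytes) -> bytes:
    """Extract the IPv4 and TCP/UDP flow fields into an aligned record."""
    ip = ETH_HEADER_LEN
    ihl_bytes = (frame[ip] & 0x0F) * 4            # IPv4 header length
    protocol = frame[ip + 9]
    src_ip, dst_ip = struct.unpack_from("!II", frame, ip + 12)
    src_port, dst_port = struct.unpack_from("!HH", frame, ip + ihl_bytes)
    return SUMMARY.pack(src_ip, dst_ip, src_port, dst_port, protocol)


# Build a minimal sample frame: Ethernet + IPv4 (no options) + UDP header.
eth = bytes(12) + b"\x08\x00"
ipv4 = struct.pack("!BBHHHBBHII", 0x45, 0, 28, 0, 0, 64, 17, 0,
                   0x0A000001, 0x0A000002)
udp = struct.pack("!HHHH", 5004, 5005, 8, 0)
frame = eth + ipv4 + udp

# In the raw frame the source address sits at byte 26, off a 4-byte boundary.
raw_aligned = is_aligned(ETH_HEADER_LEN + 12, 4)
record = summarize(frame)
fields = SUMMARY.unpack(record)
print(f"raw source address aligned: {raw_aligned}, summary fields: {fields}")
```

Because the 14-byte Ethernet header pushes the IPv4 addresses off their natural boundaries, the host would otherwise need unaligned loads or byte-by-byte assembly. In the summary record, every field starts on a multiple of its own size.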
Accelerator cards can offer significant performance gains in a variety of tasks on many platforms. The availability of such accelerator cards in multiple standards-based form factors makes them an attractive way to boost system performance, especially in systems where the increase in performance is viewed as a value-added option. However, accelerators are not a “silver bullet” that can be implemented without regard for the overall system design or the interface to the host CPU.