Using OpenCL for Network Acceleration

Investigating the practicality of using OpenCL to accelerate AES and DES encryption and decryption on the GPU engines of the APU reveals a range of possibilities for exploiting parallelism on hybrid processors.


Microprocessor designs are trending toward a higher number of cores per socket rather than increased clock speed. Increasingly, more cores are being integrated on the same die to take full advantage of high-speed interconnects for interprocessor communication. Companies like Advanced Micro Devices (AMD) are advancing high-performance computing by integrating graphics with x86 CPUs to create what AMD refers to as Accelerated Processing Units (APUs).

The advent of the APU creates opportunities for designers to develop solutions not possible a few years ago. These solutions utilize multiple languages and execute across hardware execution domains to enable a wide variety of new applications. One such application is the use of GPU resources as a massively parallel “off-load” engine for computationally intense algorithms in security and networking.

Software Acceleration on the GPU

The foundation of this potential is the OpenCL programming language. OpenCL was originally designed to execute complex graphics algorithms on a discrete GPU device, typically over a serial PCI Express interface, in concert with related code running on the host CPU. As such, there is an associated cost of moving data over PCI Express to and from the CPU for execution within the OpenCL domain. The APU has effectively reduced this cost to a fixed overhead, enabling algorithms to be executed more efficiently and with a much lower power budget. Over the past few years, AMD has enjoyed many design wins with this APU strategy, enabling customers to deliver products with better CPU and graphics performance and a lower overall power requirement.

Another potential use for the APU architecture is in networking acceleration. What if, instead of intensive graphics or H.264-style algorithms, we capture networking algorithms in OpenCL and use the GPU to execute this code? Before going too far on this topic, it is important to understand a few of the physical characteristics of the APU hardware. At a high level, there are up to four x86 CPU cores and hundreds of GPU cores in one APU device.

As for performance, while the CPU clock currently ranges from 1.0 to 2.4 GHz, the GPU is clocked at only one-fourth of the CPU's rate. This becomes important when deciding which compute engine should execute which algorithm. The other key characteristic of note is the overhead involved in transferring code execution from the host x86 CPU to the GPU. This overhead mainly consists of the extra cycles required to move the instructions and data to the memory supporting the GPU.

The performance value of the GPU lies in the massively parallel execution capability of the architecture. OpenCL is designed to exploit this capability by dividing the workload into smaller, more manageable units and distributing them across the large number of cores. The resulting data is then returned to the CPU and assembled for use in the originating program. The fundamental applicability of this type of code acceleration engine depends on the degree to which the code can be parallelized. Several theoretical laws describe the achievable speedup based on the parallel nature of the code. One such example is Amdahl’s Law, which is represented in Figure 1.

Figure 1
Amdahl’s Law—not surprisingly—shows that performance increases both as a function of the percentage of parallelism and the number of parallel units available.

This theoretical model is primarily targeted at CPU cores and ideally assumes no overhead for interprocessor communications (IPCs) or the servicing of interprocessor interrupts (IPIs). Amdahl’s Law takes on unique value when considering the massively parallel nature of the GPU architecture. GPU cores execute discrete segments of code captured in their own kernels and do not respond to interrupts or communicate with each other as CPU cores do. This allows for extremely isolated and uninhibited code execution.

The smallest number of GPU cores in the current AMD APU architecture is 80, so applying Amdahl’s Law suggests a theoretical speedup ranging from 2 to 16 times through parallelization alone. However, by leveraging the OpenCL programming language along with the GPU architecture, we can actually realize a much higher performance result than the model would indicate.

AES and DES Encryption and Decryption on both CPU and GPU 

In order to illustrate this performance acceleration, we created an escalating network traffic load on a CPU-only implementation of these algorithms. In doing so, the Linux-based platform created an increasing number of concurrent threads of execution to manage the network load. We captured the time required to support this load in relation to the number of concurrent threads. This “load” consists of a combination of the number and size of the network packets being processed through these algorithms.

Next, we repeated this process with a GPU-only implementation of these algorithms. Again, we captured the time required to execute in relationship to the number of concurrent threads (network load). Figure 2 shows a graphical representation of the results of this experiment.

Figure 2
AES / DES encryption and decryption results show the advantage gained using GPU (red lines) as the increasing traffic is distributed among parallel units.

While the results for each of the algorithms differed slightly on both the CPU (blue lines) and GPU (red lines), the overall trend for each was very consistent. When the network traffic load was relatively light—meaning few concurrent threads were required to support the algorithm—the CPUs were more than adequate and in fact more efficient than OpenCL and the GPU cores. However, as the workload and number of concurrent threads increased, OpenCL and the GPU proved to be a significantly better solution. It is important to note that the Y axis of each graph uses a log2 scale, so the seemingly slight gaps between the curves actually represent exponential differences in performance, and in the acceleration provided.

Further Innovations to Come

The beauty of the APU-based architecture is that it allows the designer to decide if, when, and how to use the GPU resources. The fact that this powerful resource is available is sparking many creative implementations of this architecture and paving the road for future innovations enabled by AMD’s next generation of Heterogeneous System Architecture (HSA) enabled devices. This new HSA architecture will remove much of the overhead involved in transferring work from the CPU to the GPU—or to other acceleration resources on the device.

Viosoft is investigating other areas in the networking space to leverage this architecture and combine it with its Teranium Acceleration Framework.  

The Teranium Framework enables PCIe-based networking resources to efficiently deliver packets to and from the user space of the CPUs. Customers currently investigating the use of AMD’s Opteron family of products are enjoying full 10G line-rate send and receive support without the need for a hardware-based network processor. Even packet forwarding is done at more than 50% of the 10G line rate. Future versions of the Teranium Framework could leverage other resources, such as the GPU, to deliver even higher throughput to network solutions based on AMD’s APU and HSA architectures.


San Jose, CA.
(508) 881-4254.