Graphics Processors Running General-Purpose Code Set to Revolutionize Embedded Computing
Originally designed to render graphics, today’s GPUs have become massively parallel devices that can also be programmed using a general-purpose C compiler to speed intense computational tasks that can execute in parallel.
BY SIMON COLLINS, GE INTELLIGENT PLATFORMS
The quest for ever higher performance in embedded computing has seen successive generations of microprocessor come and go, with quad core the latest to rise to the challenge. DSP and FPGA technology have been widely deployed. New architectures, such as the fabric-centric VPX, have gone some way to filling the need. And yet…
The requirement still exists for even greater degrees of processing capability for sophisticated, mission-critical applications. At the same time, possible solutions are increasingly compromised by the need to minimize space, weight and power for applications that are to be fielded in mobile, and even portable, platforms. That may be the most pressing requirement, but close behind it come the need to minimize time-to-market and the need to maintain maximum control over cost.
General-purpose processor clock speeds used to increase every few months, but in recent years the pace has slowed. A primary reason is that power consumption rises steeply with clock rate: dynamic power scales with frequency and with the square of supply voltage, and higher frequencies demand higher voltages. Although this can be mitigated to some extent by shrinking process geometries, most silicon vendors are turning to multicore as an alternative, with dual and quad core processors recently introduced and six and eight cores just around the corner. For game-changing performance, a different processing paradigm is required: a shift from sequential to parallel processing.
Enter general-purpose processing on graphics processing units (GPGPU), the latest in a line of technologies (from the Transputer to DSP and FPGA) that can deliver a level of performance not possible with general-purpose processors. Current-generation GPUs can have 96, 128, even 256 cores. As a potential way forward, the ability to run general-purpose code on a GPU has attracted growing interest over the last few years in a broad range of fields, including medical research, science and finance, as well as in areas more usually associated with the technology, such as signal processing and video processing. Feedback from early adopters has been strongly positive: for the right applications, processing speed and productivity can be dramatically improved.
Why is that? Simply because the architecture of a GPU is inherently highly parallel, the kind of architecture used extensively for the most demanding supercomputing applications of ten and twenty years ago. The difference is that those supercomputers of the early 1990s cost millions of dollars, whereas a high-end GPU board for a benign development environment costs only a few hundred dollars, and a rugged, field-deployable board only a few thousand. The fact is, there are classes of problems that lend themselves well to parallel processing: typically those that, like graphics applications, require substantial amounts of data to be processed (or smaller amounts of data to be repeatedly reprocessed), and where that data can be processed simultaneously rather than sequentially.
But GPUs have been around for many years—so why the recent surge in interest? The first GPUs were specifically designed to accelerate graphics applications, intended for traditional raster-based environments. Today’s GPUs, however, are fully programmable, massively parallel floating point processors with substantial flexibility.
Nvidia's first graphics chip, the NV1, was introduced in 1995 and featured one million transistors. In 2006, Nvidia launched the GeForce 8 Series with 681 million transistors, and 2008 saw the introduction of the GTX 280 with 1.4 billion. Anyone who has seen the transition from early PC games like Wolfenstein 3D to the lavish, highly detailed photo-realism of Crysis will be aware of the difference a modern GPU makes. GPU development has been driven by PC gaming, but the benefits are being felt much more widely.
It is largely this growing interest beyond the company's traditional expertise in graphics that led Nvidia to introduce the Compute Unified Device Architecture, or CUDA (Figure 1), and to make the changes necessary to enable the implementation of a C compiler, thus opening up GPU technology to a much wider potential application base. CUDA is described by Nvidia as a general-purpose parallel computing architecture that leverages the parallel compute engine in Nvidia GPUs, and includes the CUDA Instruction Set Architecture (ISA). Over 100 million CUDA-enabled GPUs have been sold to date, and thousands of software developers are already using the free CUDA software development tools. What's more, they are doing so on one of the most inexpensive development platforms around: the PC. While rugged, battle-ready CUDA-enabled platforms will inevitably be somewhat more expensive, the investment required in GPU-based hardware is significantly less than that involved in FPGA-based platforms (or DSP-based platforms, come to that), addressing a key concern in embedded computing.
With CUDA, a ‘traditional’ CPU and GPGPU combine to deliver potentially dramatic improvements in performance.
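To make the CUDA programming model concrete, the following is a minimal sketch, not taken from this article or from Nvidia's SDK: a C function marked __global__ (a "kernel") is compiled by nvcc and executed in parallel by thousands of GPU threads, while conventional C code running on the host CPU allocates memory and launches the kernel. The function and variable names are illustrative only.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Kernel: each GPU thread computes one element of c = a + b. */
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* unique thread index */
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;                  /* roughly one million elements */
    const size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    /* Allocate device memory and copy the inputs across the bus. */
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* Launch enough 256-thread blocks to cover all n elements. */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);          /* expect 3.0 */

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}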
One problem confronting engineers has been how to move applications successfully tested in the laboratory on commercial-grade GPUs into the harsh deployed world. One example of a product designed to bring the benefits of CUDA-based computing to military applications is the GE Intelligent Platforms rugged 3U VPX GRA111 graphics board. Featuring Nvidia's GT240 GPU in combination with an Intel Core2 Duo processor, it is the first member of a planned family of CUDA-enabled products (Figure 2).
The GRA111 from GE Intelligent Platforms is a rugged, field deployable CUDA-based platform.
There is more, however, to the cost-effectiveness of GPU-based solutions than just the hardware platform. Two other key elements are developer productivity and the cost of those developers. For many, FPGA is the technology of choice for solving challenging embedded computing problems. Good programmers who can be productive in the FPGA environment are, however, hard to come by, and their skills are more akin to those of a hardware or electronics engineer than a software engineer. CUDA has been described as "more accessible" to software programmers. The CUDA environment is widely taught in schools and universities, and is extensively used in R&D labs around the world. As such, skilled programmers can be expected to be plentiful, and more affordable (Table 1).
Productivity and time-to-market are functions of the software development tools available for a target hardware environment. While such tools have become widely available for FPGA programming, the range available for the CUDA environment is already larger. Nvidia provides extensive developer support for the CUDA environment. The CUDA Toolkit is a C language environment for CUDA-enabled GPUs, and includes the nvcc C compiler; CUDA FFT (CUFFT) and Basic Linear Algebra Subprograms (CUBLAS) libraries; a profiler; a GPU-aware version of the gdb debugger; a CUDA runtime driver; and a CUDA programming manual. The fact that the development environment is based on the C language, a language with which virtually every software developer is familiar, is an important advantage.
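As a hedged illustration of that library support, the fragment below performs a forward 1D complex-to-complex FFT. The cufftPlan1d, cufftExecC2C and cufftDestroy calls are standard CUFFT API; the surrounding function is illustrative and assumes the signal is already resident in GPU memory.

#include <cufft.h>

/* Sketch: in-place forward FFT of n complex samples already on the GPU. */
int forward_fft(cufftComplex *d_signal, int n)
{
    cufftHandle plan;
    if (cufftPlan1d(&plan, n, CUFFT_C2C, 1) != CUFFT_SUCCESS)
        return -1;                    /* plan creation failed */
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);
    cufftDestroy(plan);
    return 0;
}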
In addition to the extensive development and engineering support GE Intelligent Platforms is providing for developers of CUDA-based systems, third-party support is also becoming widespread. For example, Tech-X Corporation recently announced GPULib, a library of mathematical functions that facilitates the use of the high-performance computing resources available on modern GPUs. Tech-X says that GPULib allows users to access high-performance computing with minimal modification to their existing programs: by providing bindings for a number of Very High-Level Languages (VHLLs), including Matlab and IDL, GPULib can accelerate new applications or be incorporated into existing ones with minimal effort, and no knowledge of GPU programming or memory management is required.
Another effort that will help unlock the capabilities of a GPU in a general-purpose application is the OpenCL framework. Created by the Khronos Group with the participation of industry heavyweights such as Intel, NEC, IBM, Nokia, Freescale, GE and AMD, it is the first open standard for writing programs that execute across CPUs, GPUs and other types of processors. It includes a language for writing kernels (the functions that execute on OpenCL devices), defines APIs used to set up and control compute platforms, and supports parallel computing using both task-based and data-based parallelism.
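For comparison, the element-wise addition shown earlier might be written as an OpenCL kernel as follows. This is a generic sketch of OpenCL C, not code from any of the participants named above, and the host code that builds and enqueues the kernel is omitted.

/* OpenCL kernel: one work-item per array element. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    int i = get_global_id(0);   /* global work-item index */
    c[i] = a[i] + b[i];
}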
The Vector Signal Image Processing Library (VSIPL) is also acknowledged to be useful for exploiting the capabilities of GPUs, providing a high-level API for signal and image processing along with explicit memory access controls. The converse is also true: GPUs are a good fit for VSIPL because they speed the testing of algorithms during prototyping; their affordability means that more engineers can gain access to high-performance computing; and substantial increases in speed can be achieved without the need for explicit parallelism at the application level. Many existing MilAero applications are coded using VSIPL libraries optimized for PowerPC/AltiVec or Intel/SSE; these applications will port very readily by using a VSIPL library that is optimized for CUDA. Georgia Tech Research Institute's GPU VSIPL is one example of the growing third-party support for CUDA development tools.
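That portability follows from VSIPL's design: the parallelism lives inside the library, behind an opaque API, so the same application source can link unchanged against an AltiVec-, SSE- or CUDA-optimized implementation. A minimal sketch using the standard VSIPL 1.x C binding (the program itself is illustrative):

#include <vsip.h>

int main(void)
{
    vsip_init((void *)0);

    /* Create three 1024-element single-precision vector views. */
    vsip_vview_f *a = vsip_vcreate_f(1024, VSIP_MEM_NONE);
    vsip_vview_f *b = vsip_vcreate_f(1024, VSIP_MEM_NONE);
    vsip_vview_f *c = vsip_vcreate_f(1024, VSIP_MEM_NONE);

    vsip_vfill_f(1.0f, a);
    vsip_vfill_f(2.0f, b);
    vsip_vadd_f(a, b, c);   /* c = a + b; any parallelism is the library's business */

    vsip_valldestroy_f(a);
    vsip_valldestroy_f(b);
    vsip_valldestroy_f(c);
    vsip_finalize((void *)0);
    return 0;
}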
Beyond these frameworks, tools and libraries, there is also a substantial—and growing—code base of examples available to aid the development process. Nvidia provides a number of these as part of the CUDA developer SDK. The latest version of the SDK provides support for Microsoft’s Visual Studio.
Among the applications that will certainly benefit from GPGPU technology are software defined radio; encryption, decryption and cryptanalysis; radar and sonar; video processing and stabilization; video compression and decompression; target tracking; image enhancement; and sensor fusion. LIDAR (Light Detection and Ranging), an optical remote sensing technology that measures properties of scattered light to find the range and/or other information about a distant target, is a good example: it requires intense computational resources, and its processing can largely execute in parallel.
Image processing filters for electro-optical sensors are also a good fit for a GPU, as are the FFTs and other core signal processing algorithms that feature heavily in radar applications. GPGPU technology can deliver significantly more capable detection systems, increase the autonomy of unmanned vehicles and provide a wide-ranging improvement in survivability across a broad spread of applications. To take one example: a major defense prime contractor has ported a radar application to the CUDA environment and achieved a 15x improvement in performance (Figure 3). In another case, the productivity of the CUDA environment is illustrated by the brief time—just over two weeks—it took another prime contractor to migrate an application to the CUDA environment.
Radar is one of many applications that can benefit from GPGPU technology.
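To illustrate why such image-processing filters map so naturally onto a GPU (a generic sketch, not taken from any fielded system), a naive 3x3 box filter simply assigns one CUDA thread per output pixel:

/* Each thread averages the 3x3 neighborhood around one pixel.
   Border pixels are skipped for simplicity. */
__global__ void box3x3(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1)
        return;

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            sum += in[(y + dy) * width + (x + dx)];
    out[y * width + x] = sum / 9.0f;
}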
What difference can a GPGPU make? An Intel processor operating on its own can be expected to deliver peak performance of around 16 GFLOPS; at a power consumption of 60W, that equates to 0.27 GFLOPS/W. Add a GPGPU (unlike FPGA technology, a GPGPU will always be hosted by a general-purpose processor) and peak performance rises to 391 GFLOPS, or 3.76 GFLOPS/W, implying a combined power budget of roughly 104W. Doubling the number of GPGPUs delivers peak performance of 766 GFLOPS, equating to 5.18 GFLOPS/W at roughly 148W. While power consumption rises by about 2.5x, performance per watt rises by a factor of nearly 20. In comparison, a Virtex-5 FPGA is rated at 192 GFLOPS by Xilinx. Note that the "sweet spot" in configuring GPGPU-based systems is held to be one GPU per CPU core: thus, a dual core Intel processor would ideally be paired with two GPUs for optimum performance.
As a technology, GPGPU is highly scalable across many different form factors: from 19" rack high-performance computing (HPC) supercomputers using Nvidia's Tesla GPUs and the company's upcoming Fermi products, through workstations using Nvidia Quadro technology, to rugged 3U and 6U VPX implementations using GeForce.
The real attraction of GPGPU technology for embedded applications lies in the fact that sophisticated, very high-performance applications can be deployed in a fraction of the platform size and weight, and with substantially less power consumption and heat dissipation, than a "traditional" solution would require. It is not unreasonable to believe that this reduction in size, weight and power (SWaP) could be by a factor of ten (Figure 4). The GPGPU, and specifically CUDA, is not the answer to every embedded computing need. For a number of applications, however, it represents a potentially high-return solution that not only delivers dramatic performance improvements, but does so in a way that allows users to develop and deploy systems more quickly and more affordably than ever before.
CUDA-based systems (right – 0.8 cubic feet, 10lbs, 200 watts, 574 GFLOPS peak) can reduce SWaP by a factor of ten compared with traditional solutions (left – 4 cubic feet, 105lbs, 2,000W, 576 GFLOPS peak).
GE Intelligent Platforms
Santa Clara, CA.