Accelerated Processors: A New Class of Devices?

Family of Accelerated Processors Opens the Doors to New Embedded Possibilities

A new family combines dual- and quad-core x86 processing with high-end graphics capability that can also be used for numerically intensive applications.



The concept of combining multicore x86 functionality with a powerful parallel-architecture graphics processing engine on the same piece of silicon has come of age and could significantly transform the areas and applications for embedded processors. Following on its groundbreaking G-Series of what it calls accelerated processing units (APUs), Advanced Micro Devices is announcing a new series of APUs named the Embedded R-Series. The R-Series combines more x86 cores and advanced features with more powerful GPUs to offer a wider choice of performance, power and cost points for the growing set of embedded applications that rely on high-performance graphics as well as numerically intensive computation.

Where an earlier x86-based design with integrated graphics processing relied on the x86 CPU to interface via a North Bridge connection with a discrete graphics processor, the APU integrates both on the same chip. The GPU can, of course, be used for demanding graphics tasks, but it can also offload work such as DSP operations from within the same program if needed. In the past, such operations would involve the x86 CPU sending calls to a DSP or discrete GPU to invoke code running on the coprocessor, which would then send results back to the CPU, with all the latency and overhead that would necessarily be involved. With both architectures on the same die, the application can be written as one program using OpenCL, vastly reducing both latency and overhead.
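To make the "one program" idea concrete, here is a minimal sketch in plain Python that only mimics the data-parallel model; it is not OpenCL and does not touch a GPU. The names `fir_work_item`, `run_ndrange` and `process_frame`, and the filter coefficients, are hypothetical and chosen for illustration. In a real design the kernel body would be written in OpenCL C and launched through the OpenCL host API.

```python
# Plain-Python stand-in for the data-parallel model; not real OpenCL.
# fir_work_item plays the role of a kernel body, and run_ndrange plays the
# role of the runtime that launches one work-item per output sample.

def fir_work_item(gid, samples, taps):
    """Compute output sample `gid` of a FIR filter (one work-item)."""
    acc = 0.0
    for k, tap in enumerate(taps):
        idx = gid - k
        if idx >= 0:  # samples before the start of the buffer count as zero
            acc += tap * samples[idx]
    return acc

def run_ndrange(kernel, global_size, *args):
    """Stand-in for a kernel launch; a GPU would run these in parallel."""
    return [kernel(gid, *args) for gid in range(global_size)]

def process_frame(samples):
    """Host-side control flow and the offloaded DSP step in one program."""
    taps = [0.25, 0.5, 0.25]  # hypothetical low-pass coefficients
    filtered = run_ndrange(fir_work_item, len(samples), samples, taps)
    peak = max(filtered)  # scalar decision logic stays on the CPU side
    return filtered, peak
```

Each output sample depends only on its own index and the shared input buffer, which is what makes this kind of DSP step a natural candidate for the GPU's parallel units while the surrounding control logic remains ordinary sequential code.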

The new AMD R-Series builds on the earlier G-Series but uses a more advanced multiple-pipeline x86 architecture and is offered as a selection of dual- or quad-core devices with AMD's DirectX 11-capable Radeon 7000 Series graphics engines with up to 384 parallel processing units. This integrated architecture allows a combination of dedicated and shared resources. The two or four x86 cores, each of which has four execution pipelines, a dedicated thread scheduler and a dedicated Level 1 cache, also share instruction fetch and decode, a Level 2 cache and two 128-bit floating point MACs, which can be combined into a 256-bit floating point unit (Figure 1).

In addition, the single instruction multiple data (SIMD) parallel processing GPU shares the memory controller with the x86 cores, both for fast access to memory and for fast communication between the GPU and CPU cores. And that is before we even get to the floating point capabilities of the parallel GPU. Depending on the number of parallel processing units in a given device’s GPU, single precision floating point performance can range from 178 to 578 GFLOPS (Figure 2).

Figure 2
The R-Series builds on the performance and power consumption of the G-Series, whose floating point performance is at the bottom of the graph. Using a GPU with up to 384 parallel units, the high-end device can hit 578 GFLOPS.
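As a rough sanity check on the headline number, peak throughput of a parallel engine is commonly estimated as units times operations per cycle times clock rate. The sketch below applies that rule of thumb. It assumes, and the article does not state, that each parallel unit retires one multiply-add (two floating point operations) per cycle and that the 578 GFLOPS figure comes from the GPU alone. The function names are hypothetical.

```python
def peak_gflops(units, clock_ghz, flops_per_cycle=2):
    """Estimated peak GFLOPS; flops_per_cycle=2 assumes one multiply-add per cycle."""
    return units * flops_per_cycle * clock_ghz

def implied_clock_ghz(gflops, units, flops_per_cycle=2):
    """Clock rate (GHz) that the rule of thumb implies for a quoted peak figure."""
    return gflops / (units * flops_per_cycle)

# 384 units and 578 GFLOPS are the figures quoted for the high-end device.
clock = implied_clock_ghz(578, 384)
print(f"implied clock under these assumptions: {clock:.3f} GHz")
```

If the quoted figure also counts the floating point units of the x86 cores, the GPU clock implied by this estimate would be correspondingly lower.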

In addition, it is possible to combine the graphics performance of an APU with that of an AMD discrete GPU for even more graphics performance in terms of raw output or the number of displays that can be driven. For example, combining the performance of the high-end quad-core R-464L with that of the AMD E6760 GPU would yield 20% more graphics performance. That could be harnessed for raw graphical or numeric compute power, or to drive more displays than the four that can be directly driven from the device.

Interfaces available directly on the chip include HDMI, DisplayPort and DVI. In addition to the four independent display ports, a x16 PCI Express port can be configured as up to four more outputs for either graphics or other I/O. This can be configured as up to four DVI interfaces to directly drive up to four more displays. By using a discrete GPU on a Windows 7-based board, it is possible to drive up to a total of ten displays. A selection of controller hubs will be available to provide additional I/O including SATA, VGA, USB 2.0, USB 3.0 and PCIe 4x1. There is even a hub available that supports the legacy PCI bus (Figure 3).

Figure 3
The cores and the GPU (SIMD Engine Array) access memory via the shared controller, which can support 1.5V, 1.35V and 1.25V DDR3 memory. Several high-end graphics interfaces are directly available from the device with other I/O available via the controller hub.

Additional Support Features

The ability to support such high-end graphical and compute capabilities on a single chip suggests a range of applications, such as video conferencing, surveillance and other kinds of distributed systems, that can benefit from additional features. For example, the ability to manage distributed nodes independent of operating system state is supported by AMD’s DASH 1.0 implementation of the Desktop and mobile Architecture for System Hardware (DASH) management scheme. It allows remote operators to go into systems and reset or power down and restart them and perform remote BIOS updates, among other things.

A dedicated video compression engine offers hardware support for the encoding and compression needed for distributed applications such as video conferencing and surveillance where high-definition video must be rapidly transmitted across wired or wireless networks at constrained bandwidths. There are also performance enhancements for secure asset management in terms of encryption and decryption for sharing and rendering protected video content.

A broad range of decode support is also provided for low-power rendering of video content. This includes H.264, MPEG-2, VC-1, MPC, DivX, MPEG-2 IDCT+ MotionComp, Dual HD Decode (1080p+1080i) and MVC for Blu-ray Stereo 3D. Thus, in addition to the raw power of control, numeric and graphical computing, there is a range of interfaces and built-in services that address a wide and growing range of application needs on a single chip. All of it can be run by a single set of code written in a single language, namely OpenCL. Therefore one development discipline can be used to exploit the capabilities built into something like an APU.

Where from Here?

Now why, exactly, this detailed description of an admittedly major new product line? For one thing, the implications are potentially enormous. AMD is currently leading the market in this particular arena, but it is sure to attract competition. What happens then is anyone’s guess, but we can take a shot at predicting. First, there are a good many identified applications for which such a device would be a distinct advantage. These include digital signage, where the ability to control multiple displays is a must and the capability for remote management is very desirable. There are increased possibilities in security and surveillance, where there is a need to manage multiple video feeds. Teleconferencing, high-end casino gaming and advanced medical imaging all come immediately to mind. But along with such increased capabilities there arise possibilities we may not have thought of yet, and these deserve exploration as well.

It is tempting to compare the significance of the emergence of the APU as a new class of devices with that of the applications services platform (ASP), which combines a general-purpose CPU (in that case an ARM architecture) on the same die with an FPGA fabric. Such devices have recently been introduced by Xilinx, Microsemi and Altera. While ASPs may address a whole different set of potential applications and are a huge step forward, they need to overcome the fact that they combine two different development disciplines, that of the programmer and that of the FPGA developer, which are not often mastered by the same person.

Admittedly, graphics development is also a more specialized discipline than general programming, so there will be hurdles. Here, however, the same code base can access and allocate the resources dynamically, such as by dedicating parts of the parallel engine to numeric computation and others to graphics rendering. One simple example that everyone can relate to is the now notorious game, “Angry Birds.” Anyone who has played the game has no doubt noticed that in addition to the display of flying birds and the various things they destroy to thwart the evil pigs, there is also physics at work. The boards, rocks and other objects sway and either fall or don’t fall, smash or don’t smash, according to the force and angle with which they are struck. Rendering the scene and simulating the physics are two different types of calculations.
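The two kinds of calculation can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: bodies are point masses stored as `(x, y, vx, vy)` tuples, the physics uses a basic explicit Euler step, and the screen mapping uses a made-up scale and window height. None of it comes from the game or from AMD's tools, and all names are hypothetical.

```python
GRAVITY = -9.81  # m/s^2, acting along the y axis

def physics_step(body, dt):
    """Numeric work: advance one body by one explicit Euler time step."""
    x, y, vx, vy = body
    return (x + vx * dt, y + vy * dt, vx, vy + GRAVITY * dt)

def to_pixel(body, scale=10, height=480):
    """Graphics work: map world coordinates to pixels (screen y grows downward)."""
    x, y, _, _ = body
    return (round(x * scale), height - round(y * scale))

def simulate_frame(bodies, dt):
    """Both passes are independent per body, so each could run data-parallel."""
    moved = [physics_step(b, dt) for b in bodies]  # numeric pass
    pixels = [to_pixel(b) for b in moved]          # rendering pass
    return moved, pixels
```

Because both passes work on each body independently, a unified code base could in principle assign some of the parallel units to the numeric pass and others to the rendering pass, which is the kind of dynamic allocation described above.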

The same can be said of more practical applications such as computational fluid dynamics, seismic computation and display, or many other sophisticated problem-solving applications. The ability to combine a CPU architecture that is capable of general-purpose programming, including interrupt-driven and real-time control, with a graphics processor that is also capable of intense numerical computation opens a vast number of possibilities.

Consider as only one example the combination of machine vision with motion control. Video data captured by a machine’s “eyes” can be rapidly processed in the GPU’s parallel units, features or other clues extracted, and the results then used to direct the CPU to move arms, wheels or other aspects of a vision-directed application, all on one main device under the control of a unified set of code. We leave the reader to imagine further.
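As a starting point for that imagining, the vision-to-motion loop might look something like the following plain-Python sketch. It assumes a grayscale frame given as a list of rows of 0 to 255 values, all the same length and wider than one pixel, and uses the centroid of bright pixels as a stand-in for real feature extraction. The function names, threshold and gain are hypothetical.

```python
def bright_centroid_x(frame, threshold=128):
    """Feature extraction: mean column of pixels above threshold, or None."""
    total = 0
    count = 0
    for row in frame:  # per-pixel work, the part suited to the parallel units
        for col, value in enumerate(row):
            if value > threshold:
                total += col
                count += 1
    return total / count if count else None

def steering_command(frame, gain=1.0):
    """Control step: steer toward the feature; 0.0 means straight ahead."""
    cx = bright_centroid_x(frame)
    if cx is None:
        return 0.0  # nothing detected, hold course
    half = (len(frame[0]) - 1) / 2  # assumes frames wider than one pixel
    return gain * (cx - half) / half  # -gain at far left, +gain at far right
```

The per-pixel loop is the portion that would map onto the GPU, while the final steering decision is the kind of sequential, real-time control logic that stays on the x86 cores.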

Advanced Micro Devices
Sunnyvale, CA.
(408) 749-4000