DSP vs. Multicore Systems
DSP and x86 – Getting Past the Hype in Processor Architecture
Matching processors and applications has become increasingly challenging because of the constant change in the types and capabilities of processors available. Recently the x86 general-purpose processor has been shown to outperform some DSPs for particular algorithmic solutions. But is that the whole story?
BRIAN PEEBLES, DIALOGIC
More and more people now argue that x86-based general-purpose processors are appropriate for applications traditionally reserved for Digital Signal Processors (DSPs). Several reasons are advanced for this assertion. First, the need to get to market quickly often causes developers to favor the programming environment provided by the x86 architecture. Second, recent advances in x86 architectures have greatly improved the general-purpose processors’ performance-per-watt metric. Third, an increasing number of applications are able to use the x86 platform. For example, host media processing, software-defined radio (SDR), and financial analysis now make heavy use of the x86 general-purpose CPU. Others disagree and feel that x86-based general-purpose processors will never be equivalent to modern DSPs. Which side is correct?
We can start by asking how x86-based general-purpose processors perform compared to modern DSPs, and what is an optimal design for using x86-based processors. Media processing will serve as a good context for comparison. Deciding what kind of processor to use for media processing is especially important because of the complexity, volatility and density of media processing algorithms and the dependency of those algorithms on the external environment.
There are several classes of media processing algorithms, each of which has its own performance-related issues that require optimization to provide acceptable service. In some cases, the optimizations required for acceptable performance are so great that the entire function is usually placed in silicon dedicated to the task, such as an ASIC. This level of optimization is found in call progress tone detection and echo cancellation, where dedicated silicon can perform many channels of these functions and be placed close to the line interfaces where time-critical response is needed.
Audio and video play and record are a different matter. They do not require time-critical response, but they do require access to a mass storage device that can support the file streaming on which these functions depend. DSPs are not particularly efficient for file access: they lack native interfaces such as SCSI and SATA, and mounting a remote file system adds overhead. A general-purpose CPU is therefore preferable for servicing play and record media functions.
Speech recognition and speaker identification previously required the computational performance of a DSP; however, the libraries required to support these algorithms have grown to the point where large amounts of memory are necessary, and such memory capacity is best served by a general-purpose CPU. Audio and video compression and decompression require the large amount of processing of which DSPs are capable, with minimal memory footprint and mass storage dependency (unless we are transcoding streaming files to and from a hard disk). So we see that while some functions can be optimized for dedicated silicon and run optimally on DSPs, others are best run on general-purpose CPUs.
Figure 1 illustrates the relative efficiencies of various silicon technologies used in computational processing as a function of their “usability.” The term usability refers to a device’s overall flexibility in terms of the applications that can be implemented on it as well as the development environment (debug tools, compilers, profilers, etc.) that enable designers to implement those algorithms. The comparisons in Figure 1 are illustrated over time to show how the relative efficiency of the x86 CPU is improving.
Key to many recent efficiency improvements is the trend toward multiple processing cores. When array processors emerged in the early part of this decade, they had ten times the performance of traditional DSPs. This efficiency gap was created by the array processor’s architecture, which had many simpler Arithmetic Logic Units (ALUs) and high-performance internal fabrics. These architectures are much more flexible than those of their traditional DSP counterparts, both in terms of the types of algorithms they can support and the number of algorithmic instances they can provide. However, the array processor’s usability was limited by the lack of tools needed to make the multitude of processing cores perform. As array processor tools improve, usability is improving as well. At the same time, both the DSP and the x86 CPU have gained processing cores and become more competitive.
In the x86 CPU, mathematical processing is limited to a few integer ALUs and a floating-point engine. These functions are connected via dispatch ports to schedulers, register files and instruction queues in a tightly coupled fashion. Currently, the only way to increase the number of ALUs or floating-point engines is to replicate the entire core. However, this is likely to change for some variations of this kind of processor in the near future.
The architecture commonly used for multiple-core DSPs is currently much more efficient than that of the x86 CPU. In DSPs, a separate general-purpose unit (such as an ARM core) is often added as a front end to multiple execution engines. The entire general-purpose structure therefore does not have to be replicated whenever an additional execution engine is added. The general-purpose core is responsible for load balancing, scheduling, overhead processing and other management tasks. However, this architecture has its limitations: only a few execution engines can be associated with a single general-purpose core before that core becomes a bottleneck.
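The bottleneck described above can be illustrated with a toy model. This is only a sketch: the function names and the rates (100 jobs per second per execution engine, 450 jobs per second of scheduling capacity in the control core) are hypothetical values chosen for illustration, not measurements of any real device.

```python
# Toy model: execution engines add throughput until the single
# general-purpose control core can no longer schedule work fast enough.
# All rates are hypothetical and chosen only for illustration.

def system_throughput(engines, jobs_per_engine=100.0, control_capacity=450.0):
    """Jobs/s sustained by `engines` execution engines behind one control core."""
    return min(engines * jobs_per_engine, control_capacity)


def saturation_point(jobs_per_engine=100.0, control_capacity=450.0, max_engines=16):
    """Smallest engine count at which one more engine adds no throughput."""
    for n in range(1, max_engines + 1):
        current = system_throughput(n, jobs_per_engine, control_capacity)
        with_one_more = system_throughput(n + 1, jobs_per_engine, control_capacity)
        if with_one_more <= current:
            return n
    return max_engines


if __name__ == "__main__":
    for n in (1, 2, 4, 5, 8):
        print(n, "engines ->", system_throughput(n), "jobs/s")
    print("control core saturates at", saturation_point(), "engines")
```

Under these assumed numbers, throughput grows linearly up to the point where the control core saturates and is flat afterwards, which is the behavior the paragraph describes.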
Array processors have a more uniform architecture, but many of them still require an external control processor in order to manage their overall operation. Most array processor designs are also somewhat deficient in floating-point arithmetic, so while they are adaptable to many different algorithms and applications (and many can be configured to do floating point), they do not do everything optimally. Despite the control processor, the array processor generally must perform its scheduling and load balancing based upon the configuration selected, and the array must process all overhead (such as protocols), which can keep it from attaining its peak efficiency.
Figure 1 also shows that, in terms of size and power performance, x86 processors are not as efficient as DSPs. The x86 processor is designed for many different types of solutions, ranging from embedded and laptop to desktop and server versions. Higher-performance versions, which could compete with DSPs, require from 35W to 90W of power and often need a chipset that doubles the size of the design and requires another 15W to 20W. A high-end DSP or array processor requires from 2W to 10W and does not require an additional chipset. This means that DSPs have an inherent efficiency edge of 2x in terms of size and 5x to 10x in terms of power. Even if the x86 CPU were to outperform the DSP or array processor in algorithm performance, it would still be far less efficient.
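The power comparison can be checked with simple arithmetic on the wattages quoted above. The article does not state how the 5x to 10x figure was derived; the sketch below assumes one plausible reading, in which the combined x86 CPU and chipset power is compared against the 10W upper end of the DSP range. The variable and function names are illustrative only.

```python
# Wattage ranges quoted in the text, as (low, high) pairs.
X86_CPU_W = (35, 90)
X86_CHIPSET_W = (15, 20)
DSP_W = (2, 10)


def total_range(*ranges):
    """Add (low, high) ranges element-wise."""
    return (sum(r[0] for r in ranges), sum(r[1] for r in ranges))


# CPU plus chipset: 50W to 110W.
x86_total = total_range(X86_CPU_W, X86_CHIPSET_W)


def ratio_vs_dsp(dsp_watts):
    """x86 platform power divided by a given DSP power, for both ends of the x86 range."""
    return tuple(w / dsp_watts for w in x86_total)


# Assumption: compare against the 10W high-end DSP figure.
high_end = ratio_vs_dsp(DSP_W[1])
# For contrast: compare against the 2W low end of the DSP range.
low_end = ratio_vs_dsp(DSP_W[0])

if __name__ == "__main__":
    print("x86 CPU + chipset:", x86_total, "W")
    print("ratio vs 10W DSP:", high_end)
    print("ratio vs 2W DSP:", low_end)
```

Measured against a 10W DSP, the ratio comes out at roughly 5x to 11x, in line with the stated 5x to 10x. Measured against a 2W part, the gap would be considerably larger.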
Four key design principles are important when determining which technology to use in media product architectures: scalability, versatility, density and programmability.
In telecommunications, the media product architecture may need to support a wide range of applications from entry level (a few dozen channels) to high density (several thousand channels). If a separate hardware design is optimized for each of several ranges, the best price/performance can often be achieved, but at the expense of maintaining numerous designs, perhaps with significantly different components and, worst of all, different code bases. The challenge is to establish a common code base and, if possible, a common, modular hardware design. A modular design typically leads to replicating a processor and placing from 1 to N of the same processor on some extensible fabric; however, the other principles must be considered before rushing into this kind of design.
We conclude then that the x86 general-purpose processor is simply not competitive with the DSP in terms of efficiency. As a result, no x86-based architecture can produce the same number of channels for a given algorithm as a DSP can in the same space and with the same electrical power. Since the maximum number of channels determines the overall dynamic range of a product offering, a design must attain this maximum density to achieve the lowest cost, size and power consumption and lead the industry.
But the only constant is change. New variations on old algorithms, new algorithms, and varying demand on algorithmic instances (what we refer to as “algorithm volatility”) all require a versatile platform. A versatile platform enables a designer to maintain a market leadership position by quickly introducing new algorithms or new features that differentiate the product line. While DSPs have made some strides in this area, they lack the overall versatility of the general-purpose CPU. Designs that attain longevity achieve it through versatility and therefore require some level of general-purpose functionality.
Changing the code on any processor is always problematic, and writing the code initially is even more of an issue. This is why many DSP manufacturers now provide algorithmic solutions with optimized code that a designer can license for a fee. The advantage of this approach is that it shortens time-to-market considerably and greatly reduces program risk. The disadvantage is that it eliminates the designer’s ability to build key differentiating value into products and makes the rollout of new product features dependent on the DSP manufacturer. To add timely, differentiating value to a product, a designer must program the device. To reduce time and risk, a solid set of tools must be made available to programmers in a development environment with which they are familiar.
Virtually every programmer is familiar with the x86 programming environment. It is, in fact, the basis for DSP development. However, optimized code can be produced on the x86 far more easily than it can be developed for most DSPs.
Figure 2 illustrates the applicability of the technologies discussed to algorithmic demands. The best solution appears to be a compromise solution, one that requires a mixture of dedicated silicon (ASICs) and DSP-accelerated x86 CPUs. Functions such as tone processing and echo cancellation can run on dedicated silicon. Algorithms that are stable and optimized, but require high density, can be provided by DSP acceleration modules connected to the x86 CPU. Algorithms that require more versatility for system interaction (memory, mass storage) or are more volatile (many new enhancements over a short period of time) are best run in the x86 CPU.
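The partitioning described above can be sketched as a simple placement rule. This is a minimal illustration, not a design tool: the `Algorithm` fields, the `place` function and the order in which the rules are checked are assumptions made for this example, and the sample algorithms are classified according to the discussion earlier in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Algorithm:
    """Hypothetical description of a media processing function."""
    name: str
    time_critical_line_function: bool = False  # e.g. tone processing, echo cancellation
    stable: bool = False                       # stable, already optimized code
    high_density: bool = False                 # many channels required
    needs_system_resources: bool = False       # large memory or mass storage
    volatile: bool = False                     # frequent enhancements


def place(algo: Algorithm) -> str:
    """Return 'ASIC', 'DSP' or 'x86' following the article's rule of thumb.

    The rule order is an assumption: dedicated silicon first, then the
    versatility needs that favor x86, then stable high-density work on DSP.
    """
    if algo.time_critical_line_function:
        return "ASIC"
    if algo.needs_system_resources or algo.volatile:
        return "x86"
    if algo.stable and algo.high_density:
        return "DSP"
    return "x86"  # default to the most versatile platform


EXAMPLES = [
    Algorithm("echo cancellation", time_critical_line_function=True),
    Algorithm("audio/video codec", stable=True, high_density=True),
    Algorithm("play and record", needs_system_resources=True),
    Algorithm("speech recognition", needs_system_resources=True),
]

if __name__ == "__main__":
    for algo in EXAMPLES:
        print(f"{algo.name}: {place(algo)}")
```

A real design would weigh these attributes against cost, licensing and latency rather than apply fixed rules, but the sketch shows how the three classes of algorithm map onto the three kinds of silicon.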
The challenge that remains is determining the optimal mixture of these components and configuring them in a system solution that maximizes the principles discussed here. In order to determine the best system solution, the designer must take into account a wide variety of issues including overall cost of the design, the licensing of algorithms (which depends on where they run), intercommunications latency, the overhead processing required for partitioning the architecture, and the overall efficiency of the design in terms of device utilization.