TECHNOLOGY IN CONTEXT
The Taming of the Multicore
Using multicore CPUs to build asymmetric multiprocessing systems cuts costs and improves responsiveness in embedded systems. It even lets you mix DSP and general-purpose applications on a single multicore processor. Just ask Shakespeare.
PAUL FISCHER, TENASYS
In the popular Shakespeare comedy, The Taming of the Shrew, Lucentio travels to Padua to attend school at the local university. Upon his arrival he meets the beautiful Bianca and, smitten with love, his priorities change instantly from attending to his studies to winning her hand. Unfortunately for Lucentio there are a few roadblocks: Bianca already has many suitors, and her father, the wealthy old Baptista, has made it known that no one may court Bianca until her older sister, the ill-tempered Katharina (aka the Shrew), is married first.
The comedy continues with Lucentio and two other suitors disguising themselves with the intent to trick Baptista and win Bianca’s heart. The “Katharina problem” is solved when Petruchio arrives in Padua from Verona to find himself a rich wife. In the end, three couples are married, but only one with “predictable results” that you can “bank on.”
So what does Shakespeare have in common with multicore and real-time? I know it’s a stretch–had Lucentio not been able to “change his priorities instantly” he might have lost the race for Bianca’s hand; he had to make decisions in “real-time” and respond to actions by the other suitors in a “timely fashion.” The appearance of Petruchio and others on the scene allowed for additional plots to be played out “in parallel” with the plot to win Bianca’s heart. Without this “parallel processing” the comedy would have lost its pace and become nothing but a forgotten story.
The same is true for embedded systems. Without the ability to respond to inputs, control priorities and perform tasks in parallel, our protagonist, the embedded system designer, could not meet his goal of a system with “predictable results.” Each suitor in the comedy played a different role, but all of them contributed to the final outcome. Modern embedded systems also have many roles to play: a GUI and an enterprise network interface interact with the outside world, a real-time control loop works with machine-level I/O in a time-deterministic manner, and analysis modules requiring numeric-intensive operations implement complex system functions. Taking on all of these roles in an inexpensive, small form factor platform is the challenge faced by today’s embedded system developer.
Asymmetric Multiprocessing Meets the Challenge
Just like the multiple plots that operate in parallel in Shakespeare’s comedy, embedded developers can use the multiple cores available in today’s low-cost processors to implement parallel tasks. But rather than relying on a single operating system (OS) to arbitrarily assign those tasks to the processor cores based on a symmetric multiprocessing (SMP) scheduling algorithm internal to the OS, you can assign the cores to specific tasks using multiple operating systems and an asymmetric multiprocessing (AMP) solution.
Partitioning resources in an asymmetric manner is a common design practice. For example, a DSP board or embedded processor card might be used for real-time data collection, processing and control, while a separate user-interface computer interacts with the outside world. In this case, expensive communication links are needed to coordinate the disparate real-time hardware with the high-level supervisory control computer.
Rather than allocating time-critical software to an expensive stand-alone processor or DSP board, you can run an RTOS on a dedicated processor core and use it to provide the same functions as an entire CPU board. The remaining core(s) can host your general-purpose operating system (GPOS), such as Windows. This lets you optimize the cost of your hardware and your software engineering resources. Using shared memory to communicate between the RTOS and the GPOS (each OS resides on distinct cores of the same machine) provides for very low-cost and efficient transactions without the expense and complexity of extra hardware (Figure 1).
Such a system is an asymmetric multiprocessing system because the processor cores are not being load-shared across a single OS but are dedicated to specific tasks, in this case simultaneously running multiple operating systems. You have ultimate control over the priorities and applications that run on each RTOS core, unlike the situation with a GPOS.
Real-time application developers work with low-level hardware and need guaranteed access to I/O, timers, RAM and CPU cycles. User-interface, database and networking programmers must deal with high-level APIs and complex data exchange protocols. By giving each discipline the best environment for the job—a real-time operating system (RTOS) for the former and a GPOS for the latter—and hosting both on a low-cost multicore platform, you can minimize your cost-of-goods, decrease your time-to-market, and maximize the use of your engineering resources.
Most modern general-purpose processors include Single Instruction, Multiple Data (SIMD) instructions designed to perform vector and matrix arithmetic. On Intel Architecture (IA) processors they are known as the MMX and SSE instructions. MMX instructions are limited to integer operands; SSE instructions accommodate floating-point operands and vector arithmetic. These SIMD instructions are ideal for implementing digital filters, digital control loops, pattern recognition algorithms, and video streaming and mixing applications.
SIMD instructions have been part of the Intel Architecture since the Pentium III, and they operate much like those found on a DSP. In the latest IA processors the number of SIMD instructions available, with all their variations, is over 200. The instructions perform a variety of packed arithmetic, move, compare, conversion and logical operations, all designed to address the needs of digital signal processing algorithms.
Given the large number and complexity of the SIMD instructions, the notion of using SIMD to achieve DSP-equivalent functionality might feel overwhelming. Fortunately, the Intel Integrated Performance Primitives library provides a relatively easy way to take advantage of SIMD instructions without having to be an SIMD expert.
The Intel Integrated Performance Primitives (the IPP library) is a collection of functions optimized for Intel Architecture SIMD instructions. The library takes full advantage of the advanced MMX and SSE instructions found on x86 processors without requiring that you be an expert at using the SIMD instruction set.
The library is divided into a number of functional groups, or application domains. These domains encompass a broad range of digital signal processing functions for handling tasks like matrix arithmetic, digital filters, audio, image and video encoding and decoding, and string processing. These application domains are summarized in Table 1.
For each application domain, the library provides function primitives that implement key algorithm and performance-sensitive operations. These primitives perform single, specialized operations with minimal overhead and include the ability to control numeric precision and error handling. Variants of each primitive are optimized to specific data operand size and precision requirements.
When you read the literature for the Intel IPP library you will find that it is designed and sold only for use with the Windows, OS X and Linux operating systems. No variants of the library are targeted for use with an RTOS. So how can an embedded developer take advantage of this tool to easily apply SIMD instructions to real-time applications on a dedicated core that is masquerading as a DSP replacement engine?
Fortunately, the IPP library is OS-agnostic, meaning it does not require an OS-specific API. And because the INtime RTOS uses the Microsoft Visual Studio compiler and integrated development environment (IDE) as its build and debug platform, real-time applications built for the INtime RTOS use the same calling conventions and library and object code file formats as Windows applications. Thus, you can link the static edition of the IPP library with your real-time application and use the SIMD-optimized IPP primitives within it.
The IPP library is a very time-efficient way to apply x86 SIMD instructions to a real-time application. It gives you the means to quickly and easily substitute a real-time application on a dedicated CPU core for an expensive DSP board, with the added benefit of not having to write your own library of SIMD functions in assembly language.
Raw Performance and Determinism
The raw performance gains possible with the SIMD instructions are very appealing. Depending on the specific operations, speed improvements of more than 10x are possible, compared to the equivalent function performed using general-purpose processor instructions. Your actual performance gains, of course, depend on the nature of the application and the mix of SIMD operations (or IPP functions) used.
If you compare the raw performance of an SIMD application on an idle machine running a GPOS, such as Windows, to that of the same application running on an RTOS, the numbers are virtually identical. But an idle machine is not a good proxy for a real-world system that must perform many tasks in parallel. The difference between running an SIMD application on an RTOS and running it on a GPOS is not performance, but determinism. To successfully replace a DSP with a CPU core on a multicore processor, you need both raw performance and determinism.
To illustrate this difference in determinism between a GPOS and an RTOS on a multicore machine, we measured the real-world variations in execution time of a JPEG test application. The test machine was “loaded” with multiple simultaneous system-intensive activities: a string search through all files on the local hard drive, a small video playing in a continuous loop in a third-party video player, and MP3 files playing continuously in Windows Media Player while it generated a “visualization” display of the music. The results are shown in Figure 2.
The measurements shown in the chart tabulate the results of encoding seven uncompressed bitmaps into equivalent compressed JPEG files (1–7). Each red and green cluster contains six bars representing six successive encodings of an identical bitmap. The gray bars overlaying each cluster show the standard deviation of the bars within that cluster, expressed as a percent of the cluster’s mean execution time. In other words, the red and green clusters over x-axis label “1” represent the times to encode bitmap “1” over six successive runs: one set of six on the Windows OS and one set of six on the INtime RTOS, and so on.
Windows and INtime share a single multicore machine; therefore, activity on the Windows OS represents potential interference to real-time applications on the INtime RTOS. However, INtime processes always have precedence over Windows processes, assuring control of priorities and real-time determinism for the real-time applications. This is illustrated by the green bars (the INtime measurements), which are consistently faster and show far less variation in execution time than the red bars (the Windows measurements).
In this example, on a busy system, a variation of 3% or less was measured when running the JPEG encoding application as a real-time process on a dedicated core, whereas the same application running as a Windows process took longer and showed variations in mean run time between 6% and 26%.
Consolidating two or more operating systems on a single platform requires a new form of inter-OS communication. Shared memory is the logical choice, since each OS runs on one (or more) cores of a single processor. Message signaling between cores can use the Inter-Processor Interrupt (IPI) mechanism built into the CPU. A shared-memory interface is capable of providing very high-performance communication (Figure 3). More complex protocols can then be built on top of this base.
Virtual devices can be used as the interface for inter-OS protocols, especially for integration with existing legacy applications; for example, a virtual Ethernet or a shared-memory PCI device. In the case of a shared-memory PCI device, each guest operating system contains a virtual PCI device driver configured to point to the shared memory in which applications can post common data. After a virtual PCI device updates the shared data structure it signals the other virtual devices, using the IPI, to indicate that a data update has occurred.
The net gains from applying multicore processor platforms to real-time embedded applications are the elimination of redundant computer and communication hardware, faster communication and coordination between RTOS and GPOS subsystems, improved reliability and robustness, reuse of proven legacy applications, and simplified development and debugging. Significant cost savings can be achieved by condensing systems composed of separate GPOS, DSP and real-time hardware subsystems onto a single multicore hardware platform.