A Multicore Approach to More Efficient Embedded System Performance
Re-evaluating the “faster-is-better” mentality in favor of a “divide-and-conquer” approach offers new options in terms of both throughput and performance-per-watt.
ROBERT KÜFFNER, MEN MICRO
For years, commercial off-the-shelf (COTS) board designers have relied on faster processors to deliver enhanced performance, designing increasingly powerful single board computers (SBCs) that industrial control and automation application developers could adopt as a foundation for unique embedded control systems. And naturally, we have complemented those more powerful processors with more sophisticated cache, memory, mass storage and I/O capabilities to satisfy increasingly demanding application software.
However, with continued increases in processor speeds, I/O demands and the use of graphics, embedded systems users are discovering that the “faster-is-better” strategy alone can lead to new problems related to heat build-up, less efficient use of power at the highest processor speeds and increased latency in real-time applications. That has prompted the realization that more powerful hardware components alone are not the total answer to keeping pace with today’s escalating demand for graphics-intensive applications or the added pressure to integrate greater functionality into smaller packages. As a result, we investigated how striking a new balance between a multicore processor configuration and software structures can result in higher functionality, greater throughput and fewer conflicts in compute-intensive applications.
Systematically addressing those related concerns through a combination of new multicore CPUs, independently run multiple operating systems, larger shared cache and adapted legacy application software has yielded performance gains that can benefit any embedded system developer facing a variety of issues beyond processor speed alone. The resulting design (Figure 1) has proved to be particularly advantageous for multimedia and other graphics or real-time applications.
Addressing Multiple Concerns with Multiple Cores
In satisfying the need for processing power, semiconductor technologies providing increasingly higher clock speeds–upwards of 4 GHz–typically require additional transistors, each of which leaks a small amount of current. Cumulatively, those power losses lead to greater power consumption and heat generation as well as lower performance-per-watt. And despite increased clock speeds, multitasking applications often compound latency issues when a high volume of routine application functions creates queuing conflicts with high-priority real-time functions.
Contrasting those inherent physical and practical limitations of high-speed processors with the more complex requirements of today’s applications led to the realization that something fundamental had to change. One logical conclusion was to explore more parallel solutions–from both the hardware and software perspectives. As a result, combining newer multicore processor technology with increased L2 cache (4 Mbyte) and system memory (4 Gbyte SDRAM) in COTS single board computers has enabled us to document performance of up to 24,178 MIPS and 16,525 MFLOPS in industry-standard performance testing.
The board layout shown in Figure 1 integrates a dual-core 64-bit Intel Core 2 Duo processor (ranging from 1.06 GHz to 2.60 GHz) in a versatile 4HP/3U single-slot single-size format. It is supported with a Mobile Intel 965 GM Express chip for memory and graphics control, plus multiple mass storage capabilities and a variety of I/O ports. This design offers a compact solution for embedded systems demanding high computing performance with comparably low power consumption.
A 533/667/800 MHz frontside bus, 500 MHz 256-bit graphics core and 10.6 Gbyte/s memory bandwidth provide support for CAD tools, 2D/3D modeling, video and rendering in multimedia or other graphics applications. They also aid in other compute-intensive applications for test and measurement or vision and control systems in industrial automation or robotic applications. But that overall hardware configuration is only half of this enhanced embedded system solution.
Restructuring Software for Multiple Cores
To take full advantage of the multicore processor, we must also re-evaluate and implement appropriate software strategies–at the operating system level as well as at the application level–to complement the upgraded hardware design. At the board level, Intel Virtualization Technology (Intel VT) allows one physical “machine” (or board or processor) to function as multiple “virtual” machines. This is true whether that machine or SBC uses a single-core or multicore processor design. By implementing a layer of system software called a Virtual Machine Monitor (VMM) with the multicore processor, we are able to have multiple operating systems sharing one physical hardware platform in a way that is fully transparent to both the operating systems and the applications (Figure 2).
In the case of the dual-core processor SBC described here, Intel VT allows us to devote one of the processor cores to a real-time operating system (RTOS) and one to a general-purpose operating system (GPOS). This division of functions reduces interrupt latency by increasing the likelihood that the RTOS will be available for high-priority time-critical functions, while the GPOS handles less critical functions–such as managing a graphical user interface (GUI). This arrangement also ensures that RTOS operations do not need to be interrupted if the GPOS needs to reboot after an unforeseen crash.
By using an Intel VT-enhanced hardware platform like the one in the Intel Core 2 Duo processor, we also eliminate the need for the problematic software workarounds required by previous software-only VMM solutions. This approach allows us to run multiple guest operating systems at the Ring 0 level where they are normally expected to run. Each OS runs in a less-privileged, VMX non-root mode while the VMM system software runs in a more privileged, VMX root mode (Figure 3a). Doing this avoids the complex and costly workarounds needed to compensate for earlier versions of VMM software that ran on Ring 0 and forced the virtualized operating systems to run on Rings 1-3 (Figure 3b). Those workarounds were necessary because the original 386 architecture, with its ring structure and memory management, was fundamentally designed to run a single operating system at Ring 0 and application software at Ring 3.
Adapting the Application Software
Once we have established the ability to run multiple operating systems simultaneously on the multicore processor, it is important to make sure that application software will be able to take advantage of the extra core(s). There are several options for making existing application code more parallel. Each involves different degrees of effort and reward, depending on the nature of the application (Figure 4).
Perhaps the easiest way to accomplish the transition of legacy software to multicore applications is to implement multitasking by using the built-in capability of an operating system to assign specific processes to run on specific cores. Multitasking permits operating systems that are enabled for symmetric multiprocessing to free Core 1 for Application A by scheduling other tasks on other cores. This approach does not scale easily to more cores, but it is often attractive as a short-term approach because of its simplicity.
In installations with long single-thread applications, such as media servers where there is a lot of old legacy code, distributed processing can be used to split the application in two and run each half on different cores. While this is not a perfect method, distributed processing provides coarse-grained distribution of heavyweight processes onto all cores and does offer some flexibility to split the code to optimize load balancing and performance.
The best way to take advantage of upgrading to multicore processing typically comes from application threading. This approach provides fine-grained distribution of lightweight processes onto all cores, offering the best load balancing and scaling. Application threading is particularly advantageous for repetitive processes–like running SSL transactions or virus checking bit streams. Initiating a new thread for these repetitive processes runs faster than starting a new process because the memory context often does not need to change and the needed instructions may still be in the instruction cache. When done properly, threading makes the best use of cache and runs repetitive algorithms faster.
Maintaining Application-Specific Flexibility
In addition to the multicore hardware and software framework approach taken in the COTS board design mapped out above, it was necessary to integrate a variety of I/O capabilities in anticipation of the diverse needs of custom embedded system designers across a wide range of potential applications.
This is accomplished with a northbridge circuit using an Intel 965 GM Express chip–capable of supporting up to 4 Gbyte SDRAM system memory as well as VGA and two serial digital video outputs–and with a southbridge circuit using an Intel ICH8-M DH chip supporting numerous I/O options. Those chips support a total of eight USB 2.0 ports, two 10/100/1000Base-T Ethernet channels, a high-definition audio port, both SATA and PATA mass storage devices and a CompactFlash option. The ICH8 I/O controller hub also provides support for Intel Matrix Storage Technology, providing both Advanced Host Controller Interface (AHCI) and integrated RAID functionality. Matrix RAID support is provided to allow multiple RAID levels to be combined on a single set of hard drives.
In order to provide capability for both new and existing embedded industrial automation and control applications, the design includes compatibility with both CompactPCI (CPCI) and CPCI Express bus standards that allows the SBC to function between a PLC control bus and an IT infrastructure. This enables interaction and control with existing sensors, actuators and drives while it logs, analyzes or interacts with data in the IT environment.
In addition to demanding an architecture to accommodate I/O-intensive and compute-intensive applications, this dual-core processor SBC design also requires additional performance capabilities beyond those of typical IT computer room equipment in order to tolerate the anticipated operating conditions of heavy-duty industrial environments.
End-User Flexibility without Custom Development
Harnessing the throughput advantages of multicore technology in an industrial-grade board supported by standard software development tools creates new options for embedded system designers who have neither the time nor the inclination to upgrade their industrial control and automation systems from the board level up. Board support packages for Windows, Linux and VxWorks complement a large percentage of applications already used in those environments.
And through the use of side cards that offer a variety of I/O combinations with standard hardware, systems can easily be expanded depending on the necessary functionality. For example, the MEN Micro F6xx series offers a wide selection of I/O options including USB, UART, FireWire, DVI and audio, depending on the card selected. Regardless, each card provides a slot for an onboard 2.5” SATA hard disk.
The non-proprietary components and software in this SBC offer end users broad latitude in adapting it to their particular needs, comparable to their implementation of existing single-core processor systems. While it enables their applications to harness the immediate benefits of dual-core processing, it still allows them to work with familiar software, bus interfaces and input signals. Equally important, hardware compatibility provides an easy migration path to next-generation processors without software or system adaptations. This lays the foundation for future potential to be gained from even broader applications of multicore processing.