INDUSTRY INSIGHT
From Multiprocessor to Multicore
Moving from Multiprocessors to Multiple Cores
The advent of powerful multicore architectures like the Cell Broadband Engine can significantly enhance applications that were already boosted by multiprocessor approaches. The trick lies in knowing how to optimize the newly available resources.
WILLIAM LUNDGREN, KERRY BARNES AND JAMES STEED, GEDAE
The use of multiple processing elements has become essential to software development. A variety of multicore and DSP processors are available, and while each processing core can perform a variety of tasks, some processing elements are better suited to some tasks than others. With traditional development methods, the processor for each task must be chosen at the beginning of development. Making this choice early allows the partitioning and mapping of work to processors to be planned before coding starts, which minimizes risk to the project. However, this preplanning demands considerable technical experience and insight into both the type of problem and the capabilities of the processors. The sense of experimentation that draws most engineers and programmers to science is restrained by the structure needed to improve the chances of bringing an expensive project to fruition.
Other options exist. Software development tools are available that automate the implementation of distributed software. Working from a model of the software that can be constructed on a single workstation, such a tool generates the separate threads and executables that make up the parallel implementation, and many types of processors can be supported with the same infrastructure. With these tools, the distribution of work to processors, and even the choice of processors themselves, can be delayed until the final stages of software development. Through experimentation and analysis, engineers can find the optimum implementation. This not only enables the search for better software but also reduces risk to the project, because implementation parameters that used to be set in stone before coding can now be altered iteratively.
An example of the benefits of this approach is the work recently done to move a synthetic aperture radar (SAR) benchmark from a quad PowerPC DSP system to the Cell Broadband Engine (Cell/B.E.) processor. The SAR algorithm consists of three main components: range processing, a matrix transpose and azimuth processing. The range and azimuth processing involve many compute-intensive vector operations, including FFTs, inverse FFTs and vector multiplies. The work of the range and azimuth processing can be easily distributed to multiple processors, but doing so requires the matrix transpose to be distributed as well, an operation called a “corner turn.”
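The three-stage structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's code: the function names are hypothetical, a naive DFT stands in for an optimized FFT, and each stage is reduced to "FFT, multiply by a reference vector, inverse FFT" because the article does not give the actual kernels.

```python
import cmath

def dft(row):
    """Naive O(n^2) discrete Fourier transform, standing in for an FFT."""
    n = len(row)
    return [sum(row[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def idft(row):
    """Naive inverse DFT, standing in for an inverse FFT."""
    n = len(row)
    return [sum(row[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n)) / n
            for j in range(n)]

def filter_rows(image, reference):
    """Per row: FFT, vector multiply by a reference spectrum, inverse FFT.
    Each row is processed independently of every other row."""
    return [idft([a * b for a, b in zip(dft(row), reference)]) for row in image]

def transpose(image):
    """Matrix transpose; every output row depends on every input row."""
    return [list(col) for col in zip(*image)]

def sar_pipeline(image, range_ref, azimuth_ref):
    """Hypothetical stand-in for the SAR flow: range, transpose, azimuth."""
    ranged = filter_rows(image, range_ref)      # range processing, row by row
    turned = transpose(ranged)                  # the step that becomes a corner turn
    return filter_rows(turned, azimuth_ref)     # azimuth processing, row by row
```

Because `filter_rows` never looks at more than one row at a time, its rows can be handed to different processors freely. The transpose is the only stage in which data held by one processor is needed by all the others.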
The existing SAR benchmark was implemented in Gedae, a programming language and multithreading compiler that enables experimentation with many different processors and processor topologies. Gedae was used to generate an implementation for the quad PowerPC system, as shown in Figure 1. Each PowerPC in the system runs at 500 MHz and has 256 Mbytes of memory. While the 500 MHz processors are several years old, the ample memory allows the large SAR images to be processed one at a time. In other words, once distributed, one SAR image easily fits in the four memories. Because of this ample memory, the corner turn operation is implemented easily by sending the i-th section of the subimage on the j-th processor to the j-th section of the subimage on the i-th processor, which is a trivial implementation of a distributed matrix transpose. The quad PowerPC implementation achieves a frame rate of 3 Hz.
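The section exchange described above can be simulated on a single machine. The sketch below is an assumption-laden illustration rather than Gedae output: `split_rows` and `corner_turn` are hypothetical names, each "processor" is just a list of rows, the image dimensions are assumed to divide evenly by the processor count, and each section is transposed locally as it arrives.

```python
def split_rows(matrix, p):
    """Give each of p simulated processors a contiguous band of rows."""
    band = len(matrix) // p
    return [matrix[i * band:(i + 1) * band] for i in range(p)]

def corner_turn(local, p):
    """local[j] is the band of rows held by processor j.
    Returns the bands of the transposed matrix, one per processor."""
    rows = len(local[0])        # rows held by each processor
    cols = len(local[0][0])     # full width of the image
    sec = cols // p             # columns in one section
    result = []
    for i in range(p):          # destination processor i
        band = [[None] * (rows * p) for _ in range(sec)]
        for j in range(p):      # source processor j
            # Section i on processor j lands in section j on processor i,
            # transposed element by element on the way.
            for r in range(rows):
                for c in range(sec):
                    band[c][j * rows + r] = local[j][r][i * sec + c]
        result.append(band)
    return result
```

With four simulated processors and an 8 x 12 matrix, `corner_turn(split_rows(m, 4), 4)` yields four bands of three rows each that, concatenated, equal the transpose of `m`. On real hardware the inner copy would be a message or DMA transfer between memories, and the scheme relies on each processor having room for both its input band and its output band.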

Using traditional development techniques, re-implementing this application on the Cell/B.E. processor is a significant programming project. The Cell Broadband Engine Architecture is a heterogeneous multicore architecture developed through a collaboration among Sony, Toshiba and IBM. The current implementation of the Cell/B.E. processor combines one Power Processing Element (PPE) with eight identical Synergistic Processing Elements (SPEs), as shown in Figure 2. The PPE is a dual-threaded PowerPC core, and each SPE contains a high-speed processor with its own 256 Kbyte local store and DMA (Direct Memory Access) engine. Using the SPEs effectively is a key programming challenge when targeting the processor. While processing can be put on both PPE threads, the power of the processor is only unleashed when the SPEs are heavily utilized. Using the SPEs heavily means the software developer must overcome the hurdle of each SPE's 256 Kbyte local store.
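A rough sense of how tight the 256 Kbyte local store is can be had from simple arithmetic. The sketch below is illustrative only: the 8-byte sample size (complex single precision), the amount reserved for code and stack, and the number of buffers are all assumptions, and `samples_that_fit` is a hypothetical helper, not part of any Cell/B.E. toolkit.

```python
LOCAL_STORE_BYTES = 256 * 1024   # per-SPE local store, as stated in the article
BYTES_PER_SAMPLE = 8             # assumption: complex single-precision samples

def samples_that_fit(reserved_bytes=0, buffers=1):
    """Complex samples per buffer after setting aside reserved_bytes
    (for example code and stack) and dividing the rest among buffers."""
    usable = LOCAL_STORE_BYTES - reserved_bytes
    return usable // (buffers * BYTES_PER_SAMPLE)

# Upper bound: the whole local store used as a single data buffer.
upper_bound = samples_that_fit()

# Assumed budget: 64 Kbytes reserved, two buffers so that one can be
# filled by DMA while the other is being processed.
double_buffered = samples_that_fit(reserved_bytes=64 * 1024, buffers=2)
```

Under these assumptions the upper bound is 32,768 samples and the double-buffered budget is 12,288 samples per buffer. Each PowerPC in the earlier system had 256 Mbytes, 1,024 times as much, which is why an approach that keeps a whole subimage resident does not carry over directly to an SPE.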

