TECHNOLOGY IN CONTEXT

Developing for Multi-Core Processors

Optimizing Multicore Software for Embedded Processors

Multicore technology offers new performance potential. But to take advantage of that potential, software tools must be leveraged to implement parallelism across single and multiple cores.

BY STEPHEN BLAIR-CHAPPELL AND MAX DOMEIKA, INTEL


The use of multicore processors in embedded systems is increasing. These processors, which integrate two or more processor cores in one package, offer greater computational performance and better power utilization than single-core processors. Multicore processors are already employed in a number of embedded market segments such as telecommunications and digital surveillance, and their use is growing in embedded systems where low-power and small-form-factor constraints are paramount.

There are many types of multicore processors. Two examples, which are the focus here, are simultaneous multithreading (SMT) and homogeneous multicore. SMT is the ability of one physical processor core to mimic multiple logical processor cores: as far as the operating system and the applications executing on it are concerned, that one processor core appears to be a multicore processor. A homogeneous multicore processor is one where each physical processor core is identical, in contrast to heterogeneous multicore processors, whose cores implement different instruction set architectures (ISAs). Multicore processors improve performance by enabling the software developer to exploit parallelism in the application.

Another form of processor technology that enables developers to take advantage of parallelism in their applications is single instruction, multiple data (SIMD) instructions. SIMD instructions enable the same operation to be performed on multiple data items at the same time. Another term for SIMD is vector, because these data items can be logically grouped into vectors. Intel’s implementation of SIMD is Intel Streaming SIMD Extensions (SSE).

Typically, in order to fully take advantage of technologies such as SMT, homogeneous multicore processors and SIMD instructions, software changes must be made. Employing the proper software tools can go a long way toward efficient design, implementation, debugging and tuning of your application.

First, as an example, we share an overview of the Intel Atom processor with a focus on the hardware features that enable developers to take advantage of parallelism. Second, we discuss the software tools that lend themselves to enacting parallelism and helping developers at all phases of the development cycle. We show coding examples that help reinforce the discussion.

Intel Atom Processor Architecture

The Intel Atom processor architecture is an in-order processor capable of retiring two instructions per clock cycle. The clock speeds available in currently shipping products range from 800 MHz to 2.0 GHz. The processor supports Intel MMX, Intel SSE, Intel SSE2, Intel SSE3 and Intel SSSE3. These are all forms of SIMD instructions that enable vector computation on integers and on single-precision and double-precision floating point numbers packed into 128-bit registers. Software modifications are required to take advantage of these extensions. These modifications can be in the form of direct coding or employing tools and libraries that make use of the instructions.

SMT support in the Intel Atom processor is provided by Intel Hyper-Threading Technology. Internally, the processor allows instructions from two different threads to share the microarchitecture, so that if one thread stalls inside the pipeline, the second thread can take advantage of the idle processor resources. Multithreading is one means of taking advantage of the additional processing power available from SMT. There are performance issues specific to SMT to be aware of. Since processor resources are shared and two threads execute concurrently, the cache available per thread is effectively halved. This consideration should be part of any design effort to take advantage of SMT. In the worst case, two threads can cause each other to repeatedly miss in the cache.

The multicore versions of the Intel Atom processor currently contain two processor cores. Each core supports SMT, so a total of four threads can execute concurrently on a system based on the processor. As with SMT, multithreading is the mechanism for taking advantage of the extra processing power. During multithreaded application development and tuning, attention should be paid to how threads are executing on the processor cores. Performance issues concerning thread contention, workload balance and cache behavior must be remedied. Tools support can help in this endeavor.

Strategies for Parallelization

Parallelization is a “hot topic,” and programmers are increasingly facing up to the challenge of writing parallel code. When programming for multicore it is easy to slip into the mistaken belief that multithreading is the only route to better performance. In many cases, programs that take advantage of SIMD extensions see a performance benefit in excess of any boost that multithreading can bring. Three programming practices stand out as being hugely beneficial:

  • Accessing SIMD instructions using Intel SSE
  • Automatic vectorization
  • Parallelizing code

Most compilers support the use of SIMD instructions through inline assembler or compiler intrinsics. Using compiler intrinsics rather than assembler is much easier because the compiler takes care of low-level details such as register allocation. To make best use of these SIMD intrinsics, the programmer needs to think carefully about how the application’s algorithm can be modified to take full advantage of them. With a little effort it is possible to obtain huge speedups. Figure 1 shows some intrinsics that were used in a Sudoku solver; in this case they realized a 20x speedup.

Figure 1
Example of some compiler SIMD intrinsic instructions
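
The figure itself is not reproduced here; the following is a minimal sketch of the style of code involved, assuming a bitmask-style inner loop rather than the solver’s actual logic:

    #include <emmintrin.h>   // SSE2 integer intrinsics

    // Test eight 16-bit candidate masks against eight cell masks at
    // once: each intrinsic operates on all eight lanes of a 128-bit
    // register in a single instruction.
    void test_candidates(const short *cells, const short *cand, short *hit)
    {
        __m128i c = _mm_loadu_si128((const __m128i*)cells); // 8 cell masks
        __m128i k = _mm_loadu_si128((const __m128i*)cand);  // 8 candidates
        __m128i a = _mm_and_si128(c, k);                    // AND all lanes at once
        __m128i r = _mm_cmpeq_epi16(a, k);                  // lane-wise compare
        _mm_storeu_si128((__m128i*)hit, r);                 // 0xFFFF where the candidate fits
    }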

Auto-vectorization is a technique where the compiler automatically replaces traditional instructions with SIMD instructions. Calculations in loops are prime candidates for auto-vectorization. By using SIMD instructions, the compiler can reduce the number of iterations such a loop needs to execute. Usually little or no intervention is needed by the programmer other than to make sure this option is enabled in the build. Not all compilers support auto-vectorization.
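
As a sketch, a loop of the following shape is a typical auto-vectorization candidate; with SSE the compiler can compute four single-precision results per packed instruction:

    // Independent iterations with unit-stride access let the compiler
    // replace four scalar multiply-adds with one packed SSE operation.
    void saxpy(float *y, const float *x, float a, int n)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

With GCC, for example, -O3 enables the vectorizer, and the Intel compiler can emit a vectorization report confirming which loops were transformed.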

While talking about producing optimized code, it is worth repeating that our example processor, the Intel Atom, has an in-order execution engine. Compilers normally arrange instructions assuming out-of-order execution. When generating Intel Atom processor-specific code, it is better to use a compiler that can generate in-order code. Although this is not essential, it can further improve the performance of the code.
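
For example, a build line along the following lines asks the compiler to schedule for the Atom pipeline (the option spelling varies by compiler and version, so treat it as illustrative):

    g++ -O2 -march=atom -c kernel.cpp   # schedule instructions for the in-order pipeline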

Nearly all the effort in adopting parallelism focuses on how to make existing programs take best advantage of multicore. In a recent roundtable discussion between some key developers in the industry, porting of legacy code was the number one concern.

Figure 2 shows one of the most common strategies employed when turning a non-parallel program into a parallel one. The process is incremental, the cycle being repeated many times as different parts of the application are made parallel. This same strategy can be used for embedded applications.

Figure 2
This four-stage development cycle is widely accepted as the best way to introduce parallelism into an existing program.

At the analysis stage, the application is profiled to find the code that uses the most CPU time, the hotspots being potential candidates for parallelization. It is normal that code higher up in the calling hierarchy is parallelized rather than the hotspot itself. Any profiling tool such as GNU gprof, Intel VTune Analyzer, or Intel Parallel Amplifier can be used to determine the hotspots. 
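
As an illustration of this step with GNU gprof (file names are hypothetical):

    g++ -O2 -pg -o myapp main.cpp   # -pg compiles in profiling hooks
    ./myapp                         # a normal run writes gmon.out
    gprof ./myapp gmon.out          # the flat profile and call graph reveal the hotspots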

At the implementation stage, parts of the code are made parallel. Parallelism can be implemented in the code by a variety of means including native threads, language extensions such as OpenMP, or by using threaded libraries.  
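
As a minimal sketch of the native-threads route, assuming POSIX threads (the slice structure and the summing loop are hypothetical):

    #include <pthread.h>

    // Hypothetical example: each thread sums one half of an array.
    struct Slice { const float *data; int begin, end; float sum; };

    static void *worker(void *arg)
    {
        Slice *s = static_cast<Slice*>(arg);
        s->sum = 0.0f;
        for (int i = s->begin; i < s->end; ++i)
            s->sum += s->data[i];             // work private to this thread
        return 0;
    }

    float parallel_sum(const float *data, int n)
    {
        Slice a = { data, 0, n / 2, 0 }, b = { data, n / 2, n, 0 };
        pthread_t t1, t2;
        pthread_create(&t1, 0, worker, &a);   // first half on one thread
        pthread_create(&t2, 0, worker, &b);   // second half on another
        pthread_join(t1, 0);                  // wait for both to finish
        pthread_join(t2, 0);
        return a.sum + b.sum;                 // combine the partial results
    }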

At the validation stage, attention is directed to detecting parallel errors such as data races and deadlocks. These types of errors can be notoriously difficult to find; some programmers even choose not to make their code parallel out of anxiety about them. To check for parallel errors efficiently it is important to use tools dedicated to the purpose, such as Intel Thread Checker, Intel Parallel Inspector or Valgrind. Some checks can be accomplished by a careful code review or static analysis of the code, but it is advisable also to do some runtime checking, since not all errors can be captured by simple code inspection.
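
As an illustration, the fragment below contains exactly the kind of data race these tools catch (the counter is deliberately left unprotected):

    #include <pthread.h>

    long counter = 0;                    // shared, with no lock

    static void *bump(void *)
    {
        for (int i = 0; i < 100000; ++i)
            ++counter;                   // data race: this read-modify-write races the other thread
        return 0;
    }

    int main()
    {
        pthread_t t1, t2;
        pthread_create(&t1, 0, bump, 0);
        pthread_create(&t2, 0, bump, 0);
        pthread_join(t1, 0);
        pthread_join(t2, 0);
        return 0;                        // final counter value is unpredictable
    }

Running such a binary under valgrind --tool=helgrind flags the conflicting accesses, whereas a code review could easily miss them.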

In the tuning stage, the application is analyzed with respect to load balancing and threading overhead. Here is where you address such questions as, “Are all the threads performing an equal amount of work?” and, “Is my program scalable?” 
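
One concrete tuning knob, shown as a sketch: OpenMP’s schedule clause. When per-iteration cost varies, dynamic scheduling hands out small chunks on demand so no thread sits idle (Item and process are hypothetical):

    struct Item;                 // hypothetical work item
    void process(Item &item);    // hypothetical: cost varies widely per item

    void tune_example(Item *items, int n)
    {
        // Idle threads grab the next 16-iteration chunk, which evens
        // out the load when some items take far longer than others.
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < n; ++i)
            process(items[i]);
    }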

As previously mentioned, OpenMP can be used to implement parallelism and is a favorite of many programmers. OpenMP allows C/C++ and Fortran programmers to add parallelism to their code using a combination of compiler pragmas and calls to OpenMP library functions. Various work-sharing constructs can be used to distribute work among a pool of threads. Some programmers use OpenMP for rapid prototyping: once satisfied that there will be a performance gain, they re-implement the parallelism by other means. OpenMP is a well-established standard and is supported by all of the major compiler vendors.

Figure 3 shows three different ways of parallelizing code using OpenMP. In all three examples, a pool of threads is automatically created by the #pragma omp parallel statement. The number of threads created defaults to the number of hardware threads the platform can support.  

Figure 3
Three different ways of parallelizing code in OpenMP.

Figure 3(a) gives an example of parallelizing a loop. In this example the iterations of the loop are divided among the threads in the thread pool. For example, on a CPU that supports two hardware threads, the first 50,000 iterations will run on one thread while the remaining iterations run at the same time on a second thread.
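
Since the figure itself is not reproduced here, the following is a minimal sketch of the loop form described (the array names are illustrative):

    void scale(float *a, const float *b)
    {
        // The parallel directive creates the thread pool; the for
        // directive splits the 100,000 iterations across it, e.g.
        // 50,000 per thread on a two-hardware-thread CPU.
        #pragma omp parallel for
        for (int i = 0; i < 100000; ++i)
            a[i] = 2.0f * b[i];
    }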

Figure 3(b) shows how to get two functions to run in parallel. Each of the two section blocks runs in parallel with the other. Although the code is not scalable, this particular construct can be very useful in embedded programs, where one might want to separate two distinct activities into separate threads running in parallel.
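
A sketch of the construct, with two hypothetical stand-in functions:

    void decode_audio();     // hypothetical activity one
    void update_display();   // hypothetical activity two

    void run_both()
    {
        #pragma omp parallel sections
        {
            #pragma omp section
            decode_audio();      // executes on one thread...

            #pragma omp section
            update_display();    // ...while this executes on another
        }
    }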

Figure 3(c) gives an example of the OpenMP task construct. This construct can be used in non-loop-oriented code such as walking a linked list or making recursive calls; for simplicity the example here uses a loop. The for loop runs in a single thread and creates a number of OpenMP tasks. Tasks are available for execution the moment they are created, the OpenMP runtime being responsible for distributing them among the thread pool. Support for tasks was introduced in OpenMP 3.0, so not all compilers yet support this construct.
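
A sketch in the same spirit (the loop bound and work function are illustrative):

    void work(int i);    // hypothetical per-task function

    void make_tasks(int n)
    {
        #pragma omp parallel   // create the thread pool
        #pragma omp single     // one thread executes the loop...
        for (int i = 0; i < n; ++i)
        {
            #pragma omp task firstprivate(i)
            work(i);           // ...but tasks may run on any pool thread
        }
    }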

Embedded developers are sometimes nervous about using high-level constructs in their real-time applications, because most high-level implementations rely on a runtime library that has not been designed for hard real-time requirements. Figure 4 shows one approach to solving this problem. By partitioning the application between hard and soft real-time, low-level threading primitives can be used for the hard real-time code and high-level constructs for the soft real-time code. IntervalZero’s RTX is an example of exactly this approach.

Figure 4
Some embedded designers partition their embedded applications between hard and soft real-time.

Another popular solution for implementing parallelism is Intel Threading Building Blocks (TBB), an open source C++ template library. Although TBB is not suitable for hard real-time code, it is worth considering for programs that have a soft real-time requirement. In the near future there will be other language options for programming in parallel. Two examples from the Intel stable include Ct and Cilk.  
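
As a minimal sketch of the TBB style, using the lambda form of parallel_for (which requires a compiler with C++0x lambda support; the scaling loop is illustrative):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>

    void scale_all(float *a, size_t n)
    {
        // TBB splits the range into chunks and schedules them on its
        // worker threads; each chunk arrives as a blocked_range.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
            [=](const tbb::blocked_range<size_t> &r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    a[i] *= 2.0f;
            });
    }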

Intel 
Santa Clara, CA. 
(408) 765-8080. 
[www.intel.com].