Wringing Performance out of Multicore
SequenceL: An Elegant and Efficient Approach to Exploiting the Power of Parallelism
Parallelizing complex code efficiently across multiple processor cores is becoming a task beyond human ability. SequenceL is a high-level language whose compiler can automatically analyze code and output parallel C++ and OpenCL to run on a variety of today’s multicore processors.
DOUG NORTON, TEXAS MULTICORE TECHNOLOGIES; LARRY A. LAMBE, MULTIDISCIPLINARY SOFTWARE SYSTEMS RESEARCH; AND RICHARD LUCZAK, RL AERODYNAMICS
In 2004 CPU providers made a major shift; rather than increasing the clock speed to boost the performance of each new chip generation, they began to add more processor cores to the chip. Clock speeds had risen to about 4 GHz, and at that speed the resulting leakage current meant the chips used too much power and gave off far too much heat to be practical for many uses. This was particularly true for laptop and mobile devices, but “green” was increasingly becoming a factor in datacenter environments, particularly in dense cities such as New York. As one datacenter manager shared, “You don’t worry about power and cooling until you run out.”
That shift put the challenge on software developers to effectively use these cores. Initial efforts focused on simple partitioning for dual-core processors. Next came optimized libraries for compute-intensive functions that could run across multiple cores. As core counts increased, so did programming complexity, to the point where true parallel programming, a skill once reserved for the most elite programmers, has become necessary. The latest processors have a heterogeneous mix of cores, with specialized cores such as (GP)GPUs added to traditional CPU cores. This looks good on paper because GPUs deliver outstanding floating point performance per watt. Unfortunately, it once again makes the programming challenge much harder, since GPUs not only require parallelization but also a very different, and relatively low-level, programming language such as CUDA or OpenCL.
The simple fact is humans don’t write large-scale parallel code well, nor do they want to keep rewriting it every time a processor evolution occurs. Embedded system developers work in an environment with both time-to-market pressure and high software quality requirements; adding complicated parallel programming and testing complexity on top of that is typically untenable. Doing this faster and easier, without rewrites, requires a dramatically better approach. SequenceL provides one by allowing programmers to work at a high level, without regard to execution or performance, and letting compiler technology do the difficult, low-level work. Just as hardware engineers have moved up in abstraction to deal with complexity, from drawing gates to Verilog/VHDL and now to SystemC, it is time for software engineers to move up again. And just like good EDA tools that are “correct by construction,” the two computational laws that underpin SequenceL are provably race-free.
Challenges for Programming
An embarrassingly parallel (EP) task is one for which little or no effort is required to separate the whole task into a number of parallel subtasks. It is easy to both conceptualize and implement algorithms for EP tasks; one simply sends the separate subtasks to different CPUs (Cores) and combines the answers appropriately. Unfortunately, most real-world applications are not EP problems, so the challenge is making the majority of applications able to exploit today’s multicore architectures to maximize both performance and energy efficiency.
Current approaches such as pthreads, OpenMP and MPI, coupled with tools that analyze code for parallelisms, leave the hard work to the user, and it is complex and tricky work for the majority of programmers. These approaches also pose a major QA challenge: testing for race conditions and deadlocks. Clearly they have no hope of scaling to the even higher core counts and heterogeneous processors on vendors’ roadmaps.
An example of an EP task is any process q(n) that generates an array of length n with the property that for all 1 <= i <= n, the element q(i) does not depend on any other element q(j). In this case, for any chosen partition of 1, ..., n, e.g.,
1 = p[0] < p[1] < ... < p[k] = n,
one can calculate the array elements of q(n) by independently calculating the elements on each subinterval p[i-1] ... p[i] and assembling the results (in the proper order) into an array of length n.
Using Posix Threads (pthreads) on a Multicore Machine
Consider a simple special case of the above EP situation. Let f be a function that takes a non-negative integer as argument and returns a floating point number, in ANSI C notation:
double f(int n);
Given n, we would like to generate the array [f(1),...,f(n)]. More generally, we want a multicore program of the form
void seq_f(double *ret, int start, int length)
which returns the array [f(start), f(start+1),...,f(start+length)] in ret.
This problem is EP. The first step in programming it is to figure out how many threads to use; we want as many threads as there are available cores. For this we can require user input or read the number of available cores, num_cor, from an environment variable. Pseudo code with a C-like syntax will be used.
l1 = length / num_cor; /* most of the subsections */
l2 = l1 + length % num_cor; /* last segment with the remaining elements */
Define a structure to eventually hold the arguments for the ith subsequence:
typedef struct {
double *seq; /* a pointer to the whole array */
int sub_seq_start; /* the start index for this subsequence */
int sub_seq_len; /* the length of this subsequence */
} ArgVal;
Now a helper function is needed. That helper function will execute f with the appropriate input on the ith core. Here it is
void *doit(void *arg)
/* apply f to the appropriate fields in (ArgVal *)arg */
Of course, the user will have to allocate and instantiate the structure for the ith core using the correct input for this. We omit the simple arithmetic (involving l1, l2 above) here. An array arg of length num_cor is used for this.
The code then proceeds as
Allocate an array th of length num_cor of threads.
for (i = 0; i < num_cor; i++)
pthread_create(&th[i], NULL, doit, &arg[i]);
for (i = 0; i < num_cor; i++)
pthread_join(th[i], NULL);
This completes the outline of the pseudo code; the second loop waits for every worker thread to finish filling in its subsequence.
The SequenceL Language and Compiler
SequenceL may be new to most readers, but it has been in development for more than 20 years. Dr. Daniel Cooke, Dr. Nelson Rushton and Dr. Brad Nemanich developed it at Texas Tech University in collaboration with NASA over that period, with NASA and other agencies providing over $10 million in research grants.
NASA originally wanted to develop a specification language that was easy to read but could also be executed, to eliminate both ambiguity and the need for software prototyping and its associated costs. They discovered that SequenceL was so readable that programs required no additional documentation, and a plug-in can readily be added to the Eclipse IDE (Figure 1). It was during this development process that the self-parallelizing nature of SequenceL was discovered and enhanced. Texas Multicore was formed in 2009 to commercialize SequenceL as an auto-parallelizing software development environment for the multicore computing industry.
Example Screenshot Showing SequenceL Eclipse IDE Plug-In.
SequenceL is a Turing-complete, domain-independent, functional programming language. It allows engineers to express problems in engineering terms, working at a high level without regard to execution or performance, while compiler technology does the difficult, low-level work. A key focus from the beginning was to maintain a compact language (~15 grammar rules, compared to 150+ for Java). To that end, the inventors chose not to re-invent I/O, instead relying on C++, which is familiar to most programmers and widely supported across platforms. The output of the SequenceL compiler is robust, massively parallelized C++ and OpenCL, allowing it to easily drop into new and existing frameworks (Figure 2).
SequenceL products and development flow, showing industry-standard support: Eclipse and Visual Studio IDEs for input; parallelized C++ and OpenCL output.
In SequenceL, the program is quite concise:
seq_f(n) := f(1...n);
The more general program would simply be
seq_f(start,length) := f(start...start+length);
The three-dot operator ... in the above SequenceL code generates a list of integers ranging inclusively between its two operands. For example, “2...6” creates the list of integers [2,3,4,5,6].
It is easy to see that f(1...n) denotes a list of a given function f evaluated at every point of the list of integers ranging inclusively between 1 and n. For example, “f(1...4)” creates a list [f(1),f(2),f(3),f(4)].
The performance of the compiled and executed SequenceL code for the above task was tracked using the “top” program on a Linux machine with an eight core Intel multicore chip. The results were as shown in Table 1.
Results of an 8-core multicore SequenceL performance test.
The first column in Table 1 shows the CPU number (numbering starts at 0). The second column, a number followed by the suffix “us,” shows the percentage of time spent on user processes; for example, “94.7 us” indicates that 94.7% of total time on a given CPU is spent on the user process. The higher the number and the closer to 100, the better. The third column shows the percentage of time spent on system processes; for example, “4.3 sy” indicates that only 4.3% of total time on a given CPU is spent on system processes. The remaining columns show the percentage of time spent on other activities such as I/O completion or hardware and software interrupts; in most cases they are zero or close to zero. The SequenceL results differ insignificantly from those for the hand-coded pthreads code.
SequenceL and Problems Which Are Not EP
As noted earlier, SequenceL exposes all the parallelisms in code, many of which humans would not see or might not think worth the trouble. We were recently given the task of demonstrating the power of SequenceL on just a two-core processor. We searched for an algorithm deemed not parallelizable and chose the Barnes-Hut N-body simulation, used to model galaxies, in which every body exerts physical forces on every other body. Unlike the embarrassingly parallel, brute-force n² approach, the conclusion in the academic white papers was that Barnes-Hut was not parallelizable, since each step relies on the previous state. Yet when the simulation was written and compiled in SequenceL and run on the two-core chip, we achieved 2X the performance once it reached 2,000 bodies!
When analyzing how this could be, we saw that SequenceL automatically found all the fine-grained parallelisms at each step, and they clearly added up. We were only mildly surprised, since we have seen many similar results where SequenceL code runs faster than expected. At the low end, on a single-core system, we have seen SequenceL code run faster than the C++ reference code. For a large industrial process control application, the SequenceL implementation achieved a 36X speedup on a 16-core server.
Writing parallel code for problems that are not EP is much more difficult and fraught with errors, but not in SequenceL. In each of these non-EP cases, the SequenceL code was written in a fraction of the time of the single-threaded code and worked correctly. In a recent WirelessHART (IEC 62591) customer project, we worked with the customer’s team to implement a new mesh networking algorithm in SequenceL to scale performance to the thousands of nodes needed for real-world environments. Their internal development to that point had been in Java and had taken five months. The SequenceL implementation took just two weeks and ran five times faster on a two-core embedded processor. More surprising was that the SequenceL code produced different results: it turned out the SequenceL code had been working correctly from the very first run, and it was used to debug the Java code. The speed of iteration in SequenceL proved even more powerful. We then worked with the customer right in the conference room, making very fast iterations in SequenceL to further improve the algorithms. The same effort in a traditional language would have taken many more weeks and still not achieved the performance of SequenceL on a multicore system.
Texas Multicore Technologies
Fort Worth, TX.
Multidisciplinary Software Systems Research