TECHNOLOGY IN SYSTEMS
Java Tackles Multi-Core Complexity
Multicore solutions offer superior processing capacity at lower power consumption, but they force software developers to address a bewildering assortment of new issues. Fortunately, the Java language was designed to simplify management of these issues.
KELVIN NILSEN, ATEGO SYSTEMS
As VLSI transistor densities continue to rise, microprocessor vendors are increasingly turning toward multicore architectures as the most effective utilization of available transistors. This trend places a large burden on software engineers who must structure software so that it runs efficiently on multiple processors. Besides dividing a program into independent smaller tasks, each of which can be performed by a distinct processor core, software engineers must carefully craft the implementation of each independent task to (a) minimize its dependencies on other tasks, (b) reliably communicate information that must be shared with other tasks, and (c) arrange for certain tasks to wait at appropriate times for coordination with other tasks.
On top of this, the system must be designed to achieve high throughput and low latency, and all code must be structured so that it is easily maintained in the face of anticipated functional and platform evolution. The Java language syntax and its standard libraries were designed from the ground up to support multiprocessing. This is a big improvement over legacy languages such as C and C++, in which mastering concurrency requires a perplexing mix of compiler optimization choices, non-portable invocations of operating system services, and non-portable assembly language.
The Thread type, part of Java’s standard libraries, provides a portable and convenient notation to allow programmers to state that certain activities are performed concurrently, in a separate thread of control. The Java programmer simply extends the Thread class and overrides the run() method with the code to be executed by the independent thread. Spawning this thread is as easy as performing a new operation to instantiate it and then invoking the thread’s inherited start() method to cause it to begin executing.
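As a minimal sketch of this pattern (the class name Worker and its summation task are illustrative, not from the article), a subclass of Thread overrides run(), and the caller instantiates it with new and invokes start():

```java
// Sketch: spawning a concurrent activity by subclassing Thread.
public class Worker extends Thread {
    private final int n;
    private volatile long result;   // written by the worker thread, read after join()

    public Worker(int n) { this.n = n; }

    @Override
    public void run() {
        // This code executes in the newly spawned thread.
        long sum = 0;
        for (int i = 1; i <= n; i++) sum += i;
        result = sum;
    }

    public long result() { return result; }

    public static void main(String[] args) throws InterruptedException {
        Worker w = new Worker(100);     // instantiate the thread object
        w.start();                      // spawn: run() now executes concurrently
        w.join();                       // wait for the thread to finish
        System.out.println(w.result()); // prints 5050
    }
}
```

Note that join() is used here so the spawning thread can safely read the result after the worker has finished.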
Once multiple threads are running within shared common memory, it is occasionally necessary to enforce mutual exclusion for certain activities, ensuring that only one thread at a time is involved in those activities. Consider, for an illustrative example, a situation where one thread updates a record representing the name and phone number of a person while many other threads are consulting this same record to look up the person’s name and phone number.
Synchronization between threads is necessary to ensure that the consulted information is always coherent. If one thread fetches the person’s name while the name is being modified, it might receive an inconsistent mix of the old and new names. Another inconsistency arises if the updating thread is preempted between overwriting the name and overwriting the phone number. A reading thread might obtain the new name but the old phone number.
Addressing this challenge in Java is a simple matter of adding the synchronized keyword to the methods that overwrite and fetch the name and phone number. The benefits of having this keyword built into the language are significant. First, it is easy for a Java programmer to read and understand the intended concurrency aspects of this code. Second, implementation of the concurrency intentions is provided by the Java virtual machine in a portable and efficient way. Third, synchronization provided by the Java virtual machine is more reliable and less error prone than synchronization implemented in handwritten assembly language by software engineers.
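A sketch of the person-record example might look as follows (the class and method names are ours, not from the article). Because only one thread at a time may execute any synchronized method of a given object, readers can never observe a half-updated record:

```java
// Sketch: synchronized methods keep the name/phone pair coherent.
public class ContactRecord {
    private String name;
    private String phone;

    // The updating thread holds the object's lock for the whole update,
    // so both fields change as one atomic step from a reader's viewpoint.
    public synchronized void update(String name, String phone) {
        this.name = name;
        this.phone = phone;
    }

    // Readers acquire the same lock, so they see either the old pair
    // or the new pair, never a mix of the two.
    public synchronized String[] lookup() {
        return new String[] { name, phone };
    }
}
```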
Within a synchronized method, Java programs have access to implicit condition variables associated with the object’s synchronization lock. The typical Java thread acquires an object’s mutual exclusion lock by entering a synchronized method, checks for a certain condition and, if that condition is not satisfied, blocks itself by invoking the wait() service. When a thread executes wait(), it is placed in a blocked state on a special queue associated with the enclosing object. While on this queue, the thread temporarily releases its mutual exclusion lock.
Other threads operating on the same object will presumably make the condition true. Before making the condition true, the other thread must acquire the object’s mutual exclusion lock. After making the condition true, but while still holding the mutual exclusion lock, the thread performs a notify() (or notifyAll()) operation. This has the effect of unblocking one (or all) of the threads on the object’s condition queue. Each unblocked thread automatically reacquires the mutual exclusion lock before it executes within the synchronized block that was running when it invoked wait(). The mutual exclusion lock will not be available until the notifying thread releases the lock by leaving the synchronized body of code.
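The wait()/notifyAll() protocol just described can be sketched as a one-slot mailbox (an illustrative example of ours, not the article's code). Note the standard idiom of calling wait() inside a loop so the condition is re-checked after every wakeup:

```java
// Sketch: wait()/notifyAll() coordination on an object's implicit condition queue.
public class Mailbox {
    private Object item;   // null means the mailbox is empty

    public synchronized void put(Object o) throws InterruptedException {
        while (item != null)   // condition not satisfied: block
            wait();            // releases this object's lock while blocked
        item = o;
        notifyAll();           // wake any thread blocked in take()
    }

    public synchronized Object take() throws InterruptedException {
        while (item == null)
            wait();
        Object o = item;
        item = null;
        notifyAll();           // wake any thread blocked in put()
        return o;
    }
}
```

Each awakened thread reacquires the lock before returning from wait(), exactly as described above.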
A syntax shared with C and C++ is the volatile qualifier, which can be added to the declaration of certain fields (variables). As with C and C++, variables declared volatile are implemented such that their values cannot be cached; every read or write access must go all the way to memory. The semantics of the volatile keyword in Java are stronger than in C or C++. The Java compiler is prohibited from reordering subsequent read operations (even non-volatile reads) to precede a volatile read, and from reordering preceding write operations (even non-volatile writes) to follow a volatile write. In C and C++, the reordering prohibition applies only to other volatile reads and writes. This subtle difference makes it much easier to write reliable and efficient multicore code in Java.
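The practical consequence is the safe-publication idiom sketched below (class and field names are ours). Because the flag is volatile, the earlier write to the plain payload field cannot be reordered to follow the write of the flag, so a reader that sees the flag set is guaranteed to see the payload:

```java
// Sketch: a volatile flag safely publishes a plain (non-volatile) field.
public class Publisher {
    private int payload;             // ordinary field, no volatile qualifier
    private volatile boolean ready;  // volatile flag guards the payload

    public void publish() {
        payload = 42;   // plain write...
        ready = true;   // ...cannot be reordered after this volatile write
    }

    public int readIfReady() {
        if (ready)           // volatile read: the payload read below
            return payload;  // cannot be hoisted above it
        return -1;
    }
}
```

In C or C++, the same code with a volatile flag would not be guaranteed to publish the payload correctly, for the reason given above.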
Another strength of Java for multicore programming is the java.util.concurrent package that was introduced with the Java 5.0 release in 2005. These libraries provide implementations of thread-safe collections, semaphores, prioritized blocking queues, atomically modifiable data abstractions and a reentrant lock data type. The reentrant lock data type offers services not available with synchronized statements, such as an API to conditionally acquire a lock only if the lock is not currently held by some other thread.
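The conditional-acquire service can be sketched with ReentrantLock.tryLock(), which returns immediately rather than blocking when another thread holds the lock (the Counter class here is an illustrative example of ours):

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: conditional lock acquisition, a service synchronized statements lack.
public class Counter {
    private final ReentrantLock lock = new ReentrantLock();
    private long count;

    // Returns true if the increment was performed, false if the lock
    // was busy and the caller should retry or do other useful work.
    public boolean tryIncrement() {
        if (!lock.tryLock())
            return false;      // lock held by another thread: don't block
        try {
            count++;
            return true;
        } finally {
            lock.unlock();     // always release in a finally block
        }
    }

    public long count() { return count; }
}
```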
Restructuring Code to Maximize Multicore Efficiency
To exploit the full bandwidth capabilities of a multiprocessor system, it is necessary to divide the total effort associated with a particular application among multiple, largely independent threads, with at least as many threads as there are processor cores. Ideally, the application would be divided into independent tasks, each only rarely needing to coordinate with the others. For efficient parallel operation, an independent task would execute hundreds of thousands of instructions before synchronizing with other tasks, and each synchronization activity would be simple and short, consuming no more than a thousand instructions at a time. Software engineers should seek to divide the application into as many independent tasks as possible, while maximizing the ratio between independent processing and synchronization activity in each task. To the extent that these objectives can be satisfied, the restructured system will scale more effectively to larger numbers of processors.
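One common way to match thread count to core count is a fixed-size pool from java.util.concurrent, sized by querying the runtime. The sketch below (class name and trivial tasks are ours) submits independent tasks and collects their results:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: one pool thread per available core, running independent tasks.
public class PoolSizing {
    public static int sumOfSquares(int tasks) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        List<Future<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            final int id = i;
            // Each submitted task is independent of the others;
            // id * id stands in for a real unit of work.
            futures.add(pool.submit(() -> id * id));
        }
        int sum = 0;
        for (Future<Integer> f : futures)
            sum += f.get();    // wait for and combine each result
        pool.shutdown();
        return sum;
    }
}
```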
Serial code is code that multiple tasks cannot execute in parallel. As software engineers work to divide an application into independent tasks, they must be careful not to introduce serial behavior among their “independent tasks” because of contention for shared hardware resources. Suppose, for example, that a legacy application has a single thread that reads values from 1,000 input sensors. A programmer might try to parallelize this code by dividing the original task into ten, each one reading values from 100 sensors. Unfortunately, the new program will probably experience very little speedup because the original thread was most likely I/O bound. There is a limit on how fast sensor values can be fetched over the I/O bus, and this limit is almost certainly much slower than the speed of a modern processor. Asking ten processors to do the work previously done by one will result in ineffective use of all ten processors rather than just one. Similar I/O bottlenecks might be encountered with file subsystem operations, network communication, interaction with a database server, or even access to a shared memory bus.
Explore pipelining as a mechanism to introduce parallel behavior into applications that might appear to be highly sequential. Figure 1 shows an example of such an application. This same algorithm can be effectively scheduled on two processors as shown in Figure 2. Pipeline execution does not reduce the time required to process any particular set of input values. However, it allows an increased frequency of processing inputs. In the original sequential version of this code, the control loop ran once every 20 ms. In the two-processor version, the control loop runs every 10 ms.
Figure 1. Soft PLC Implementation of Proportional Integral Derivative (PID) Control.
Figure 2. Two-Processor Pipelined Implementation of Soft PLC.
This particular example is motivated by discussions with an actual customer who was finding it difficult to maintain a particular update frequency with a sequential version of their programmable logic controller (PLC) software. In the two-processor version of this code, one processor repeatedly reads all input values and then calculates new outputs, while the other processor repeatedly logs all input and output values and then writes all output values. This discussion helped the customer understand how multiprocessing could improve update frequencies even though the algorithm itself was inherently sequential. With two processors working together on a pipelined solution, the system is able to perform, on average, a complete actuator update every 10 ms, with each processor doing roughly half of the computation that would have been required to achieve a 10 ms update frequency on a single-processor system.
The two-processor pipeline was suggested based on an assumption that the same I/O bus is fully utilized by the input and output activities. If inputs are channeled through a different bus than outputs, or if a common I/O bus has capacity to handle all inputs and outputs within a 5 ms time slice, then the same algorithm can run with a four-processor pipeline, yielding a 5 ms update frequency, as shown in Figure 3.
Remember that one goal of parallel restructuring is to allow each thread to execute independently, with minimal coordination between threads. Note that the independent threads are required to communicate intermediate computational results (the 1,000 input values and 1,000 output values) between the pipeline stages. A naive design might pass each value as an independent Java object, requiring thousands of synchronization activities between each pipeline stage. A much more efficient design would save all of the entries within a large array, and synchronize only once at the end of each stage to pass this array of values to the next stage.
Relevant code for implementing the pipeline in Figure 3 is shown in the sidebar “Partial Java Implementation of Four-Stage Pipeline.” The Main class provides the main() startup method. This main() program instantiates two FIFO buffers to connect stages of the pipeline, plus four Stage objects. Each FIFO is associated with three buffers, which allows one stage to fill one buffer while the second and third stages each read previously stored data from distinct buffers. Code is not shown for the FIFO or Buffer data types.
Figure 3. Four-Processor Pipelined Implementation of Soft PLC.
Code is shown for Stage1 and Stage2. The code for the other stages is very similar. Each Stage extends Thread, and all four threads are started from the main() program. Note that the constructors for each Stage identify the desired start time. Each pipeline stage requires 5 ms. During the first 5 ms, only the first stage can run because there is no data ready for the subsequent stages. During the second 5 ms, only the first two stages can run, and so on.
Refer to the run() methods in the Stage1 and Stage2 classes to see the code that is executed by each stage. The implementations of the FIFO and Buffer classes are not shown. However, it is important to recognize that no synchronization is required for any Buffer operations. That’s because each Buffer is visible to only one thread at a time. Synchronization is required in the implementations of the FIFO operations. That’s because each FIFO represents a communication channel for coordination between two neighboring pipeline stages. In order for Stage1 to receive an empty buffer, Stage2 must first drain a previously filled buffer. Thus, the Stage1 loop performs only two synchronizations every 5 ms, and the Stage2 loop performs four synchronizations every 5 ms.
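Since the FIFO implementation is not shown in the sidebar, here is one plausible sketch of such a type, under the assumption described above: only the handoff of a whole Buffer reference is synchronized, while the buffer's contents need no locking because only one stage holds a given buffer at a time. The generic parameter and method names are ours, not the sidebar's:

```java
import java.util.ArrayDeque;

// Hedged sketch of a FIFO channel between neighboring pipeline stages.
// Synchronization happens once per buffer handoff, not once per value.
public class Fifo<T> {
    private final ArrayDeque<T> queue = new ArrayDeque<>();

    public synchronized void put(T buffer) {
        queue.addLast(buffer);
        notifyAll();                 // wake a stage blocked in get()
    }

    public synchronized T get() throws InterruptedException {
        while (queue.isEmpty())
            wait();                  // release the lock until a buffer arrives
        return queue.removeFirst();
    }
}
```

Passing a filled array of 1,000 values through one put()/get() pair costs two synchronizations, versus the thousands a per-value design would incur.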
Multicore hardware and system design is an inevitable progression of technology that enables more processing to be done with less power consumption and cost, but these gains can’t be fully realized without incurring associated increases in software complexity. Java enables developers to more efficiently address these complexities than is possible with other languages.
Atego Systems, San Diego, CA.