Machine-to-Machine Systems: Autonomy under Control

Accelerated Processing Units and Multicore Programming Innovation at the Cusp of Machine-to-Machine

Implementing parallelism in powerful accelerated processors can be a complex task. Software parallelization can ease the process and quickly help manage growing and complex M2M wireless industrial networks.



Heterogeneous system architectures promise to transform many applications and the usability of machine-to-machine (M2M) solutions in these applications, enabling reliable and real-time communication. To support these solutions, multicore programming solutions are required to help software application developers optimize for parallel processing environments. Through the availability of both of these advanced technologies, developers will be able to create intelligent, ultra-responsive M2M enabled systems that, while operating independently, can efficiently be centrally monitored and controlled in real time.

Replacing traditional wired communication networks with wireless M2M communication in applications such as manufacturing process control offers the benefit of being able to quickly and easily install new equipment or sensors without the need to run difficult-to-install or costly communication wiring. But implementing an M2M solution in this type of application, which may include hundreds or thousands of field devices (nodes), introduces new challenges related to the reliability, security and throughput of the network. Management devices are therefore necessary within the network to take care of access control and encryption, device identification and configuration, wireless channel control and communication paths, message handling and protocol translation, and, finally but most importantly, to maintain a database of I/O states and report I/O status. All of this needs to be done with a small, cost-effective, robust and reliable device, referred to as a wireless gateway (Figure 1). The biggest limitation of these types of networks is the ability of the wireless gateway solution to adapt and manage its topology to ensure that a robust connection exists to all of the intelligent devices distributed throughout the plant. Testing of current implementations of these wireless gateway solutions shows that they take three to five minutes to recalculate the network topology of a 50-node network, placing a practical limit of 50-75 nodes on these solutions.

Figure 1
Wireless industrial M2M networks face challenges in terms of connectivity and security as well as the ability of the wireless gateway to recalculate and manage the network topology as the number of nodes continues to grow.

With the advent of new heterogeneous processing solutions, the silicon-level integration of general-purpose, programmable scalar and vector processor cores for high-speed parallel processing establishes a new level of processing performance for embedded systems, at previously unobtainable performance-per-watt ratios. This merging of low-power x86 CPU cores with the parallel processing performance of a discrete-level general-purpose graphics processing unit (GPGPU) in a single device, also known as an Accelerated Processing Unit (APU), enables the high-speed processing required to handle intensive numerical computations such as those required of a plant-floor wireless gateway tasked with calculating the network topology of a 10,000-node network in under one minute (Figure 2).

Figure 2
An example of an accelerated processing unit (APU), the AMD Embedded G-Series platform has a dual-core x86 processor along with a graphics processing unit that can also perform general-purpose parallel computation, plus multiple display interface options and a full complement of I/O.


APU-Caliber Computational Performance and Power Efficiency

The general-purpose vector processor engines within an APU are able to deliver from 10s to 100s of GFLOPs of performance, far outstripping the computational capabilities of the traditional processors used for these applications. And with this APU architecture, direct access to the unified memory shared between the CPU and GPU enables a true zero-copy data transfer path and the lowest possible latency, yielding a fully optimized data pipeline.

But computational performance and real-time device/network awareness are only part of the equation. It is essential that the solution be available in a small form factor and reliable platform, so power consumption naturally remains a critical concern for implementing an APU in a wireless gateway. The performance-per-watt gains enabled by APUs assure greater power efficiency and lower heat dissipation, which in turn can preclude the need for a cooling fan and therefore facilitate the use of compact and/or vent-less system enclosures distributed throughout a harsh factory floor environment.


Making the Most of Parallel Processing

While many systems today run on multicore CPUs, many software applications have yet to take full advantage of the parallel processing potential of these systems. Indeed, parallel programming limitations remain a lingering but wholly surmountable barrier to exploiting the full performance benefits of multicore architectures, including massively multicore APUs. The fundamental challenge with sequential programming languages is that the sheer complexity of the code obscures opportunities to employ parallel processing, and overcoming this obstacle via manual analysis and coding processes can be very difficult.

Using traditional software development methodologies originally developed for sequential programming, developers must work through three well-defined steps to produce a parallel program.

1. Identify parallelisms: Analyze the problem to identify tasks that can execute in parallel.

2. Expose parallelisms: Restructure the problem so that its parallel tasks can be effectively exploited. This often requires finding the dependencies between tasks and organizing the source code to manage them effectively.

3. Express parallelisms: Express the parallel algorithm in source code using a parallel programming notation.
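As a concrete illustration (not from the article), the three steps applied to a trivially data-parallel workload might look like the following Python sketch, in which the per-element computation is identified as independent (step 1), the reduction is kept sequential because it depends on every task (step 2), and the parallel map is expressed with a thread pool (step 3):

```python
from concurrent.futures import ThreadPoolExecutor

def cost(x):
    # Step 1 (identify): each element's computation is independent of the others.
    return x * x

def total_cost(values):
    # Step 2 (expose): only the map is restructured for parallel execution;
    # the final sum depends on all tasks, so the reduction stays sequential.
    # Step 3 (express): the parallelism is written down with a thread pool.
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(cost, values))

print(total_cost(range(10)))  # 285
```

The function and pool sizing here are illustrative assumptions; a real workload would also weigh task granularity against the overhead of dispatching work to the pool.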

The expertise, time and effort required to manually develop parallel programs using these traditional techniques can be significant. And it isn’t enough to simply understand how an application can be split into parallel threads. The second barrier to parallel programming is the need for the application developer to understand how the multicore system will process the multiple threads. Among the details of the computing system that the application developer may need to understand to optimize the application are such things as load balancing, memory access and processor communications.

Load balancing means keeping all the processors fed with work at the right times. Though the target system might have multiple cores, some of those cores will be engaged by the operating system or by other critical applications. The developer must understand how to distribute the application across available resources so that the resources are used efficiently.
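A minimal sketch of that static load-balancing idea in Python (the chunking scheme and worker count are illustrative assumptions, not from the article): the work is split into as many roughly equal chunks as there are workers, with one core left free for the operating system and other critical applications:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def chunk(items, n):
    """Split items into n roughly equal chunks so every worker stays busy."""
    k, r = divmod(len(items), n)
    out, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        out.append(items[start:end])
        start = end
    return [c for c in out if c]

# Leave a core for the OS and other applications (hypothetical policy).
workers = max(1, (os.cpu_count() or 1) - 1)
work = list(range(100))
with ThreadPoolExecutor(workers) as pool:
    partials = pool.map(sum, chunk(work, workers))
print(sum(partials))  # 4950
```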

As processes are distributed, the data accesses may also be distributed across processors. The user must take into account the available memory and how the processors communicate with memory to avoid processors sitting idle waiting for data, creating memory conflicts, or worse, corrupting data. At the same time, application developers must understand multicore bus protocols to ensure the inter-processor communications don’t create bottlenecks that slow the operation of the application, making the parallel code slower than the sequential implementation.

To optimize an application for parallel processing, then, the application developer must understand the target multicore system in great detail, manipulating processes that are normally taken for granted in sequential programming. It is for this reason that software applications developed for a specific multicore architecture typically need to be rewritten to accommodate new and different multicore architectures, and this, of course, requires significant investment in labor and support resources. As a result, software applications that previously gained performance with every new processor release have become static, impairing the ability of developers to add new features and functionality and/or pivot to a new processing architecture.


Taking the Burden off of Parallelizing Code

The recent emergence of auto-parallelizing software compilation technologies promises to help neutralize the aforementioned challenges and ensure that application developers can take full advantage of new heterogeneous system architectures. Conventional parallelizing compilers allow developers to hand off application segments they have already parallelized manually, but developers must still work through the three defined steps and still need at least some understanding of the target multicore system.

Auto-parallelizing programming technology on the other hand introduces a simple declarative language for expressing algorithm intent—without the need to specify parallelisms or implementation structure—and employs compiler mechanisms that automatically identify, expose and express the parallelisms, enabling application developers to quickly generate optimized threaded application code from simple algorithmic descriptions of the code functionality. This approach avoids the need for developers to become experts in heterogeneous processor design, since the technology distributes the application across the available processors and optimizes memory access and communications for parallel implementation. Parallelizing programming technology actually becomes the abstract layer that allows the developer to focus on what the function is without concern about how it will be computed—leading to dramatic improvements in innovation and productivity.

Auto-parallelizing software compilers automatically extract parallelism from the application code, enabling code compilation to any parallel architecture from common source code. As a result, the auto-parallelized code remains independent of computing system architecture, eliminating the need to re-write application code for new multicore processor releases. Applications can be ported to new multicore processors, new machine configurations and new operating systems without incurring the costs, time and effort associated with traditional programming methodologies.


Putting It to the Test

In recent proof-of-concept trials with a major process automation supplier, the combination of an APU and auto-parallelizing software compilation technologies was applied by Texas Multicore Technologies to reduce the time required to calculate the network topology of a hundred-node wireless network to under one minute. Early tests significantly exceeded that goal by demonstrating the ability to recalculate the topology of a 250-node network in less than 14 seconds, promising to significantly improve the performance of the supplier's wireless gateway solution for its large-scale manufacturing customers.

TMT’s SequenceL was chosen for the test because it provides developers with a declarative, functional language that self-parallelizes: express the algorithm in SequenceL, and the SequenceL compiler generates massively parallelized C++ code that is provably race- and deadlock-free. It will also generate GPU code, if desired, to support parallel processing on a GPU or APU. SequenceL was also chosen for its simplicity of coding, its ability to leverage the multicore and parallel processing nature of the AMD G-Series APU used for the test, and the ease of embedding the C++ output of the compiler into other code bases.

The code in Code Block 1 illustrates some of the implementation details of the SRDR algorithms under SequenceL, using the example of “Constructing Sequential Reliable Downlink Routes” published in Reliable and Real-Time Communication in Industrial Wireless Mesh Networks by Chen, Nixon, Han, Zhu and Mok.

Code Block 1

The algorithm in Code Block 1 refers to three constraint conditions, defined as follows: 


C1: v has at least two parents u1 and u2, and they form a cycle.

C2: u1 is u2’s parent in u2’s local downlink graph.

C3: u2 (or u1) has at least one parent from the cycle in Gu1 (or Gu2).

As an example, we will look first at the evaluation of C1.

In this case, the problem is to find the parents P of a vertex v, and see if any two of them form a cycle (i.e., there are 2 edges connecting a pair of nodes p1 and p2, one in each direction, where p1 and p2 are both elements of P).

In a procedural language, you’d construct an algorithm for this by either asking v for its parents (if it kept track of them), or looping through the entire graph looking for nodes that have v as a child (if it didn’t). Thus armed with the list of parents P, you’d construct another loop that would either generate all of the (|P| choose 2) parent pairs and then analyze each for cycles, or traverse P looking for parents P′ that connect to other parents P″, and then analyze which of those might connect back to P′.
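For concreteness, that looping approach might read as follows in Python (the tuple edge representation and function name are assumptions for illustration, not part of the trial code):

```python
# Assumed representation: the graph is a list of (source, destination) tuples.
def cyclic_parent_pairs(v, edges):
    # First loop: scan the whole graph for nodes that have v as a child.
    P = []
    for (src, dst) in edges:
        if dst == v:
            P.append(src)
    # Second loop: generate all (|P| choose 2) parent pairs and test each
    # pair for a cycle, i.e., an edge in each direction between them.
    pairs = []
    for i in range(len(P)):
        for j in range(i + 1, len(P)):
            p1, p2 = P[i], P[j]
            if (p1, p2) in edges and (p2, p1) in edges:
                pairs.append((p1, p2))
    return pairs

edges = [(1, 3), (2, 3), (1, 2), (2, 1)]  # 1 and 2 are cyclic parents of 3
print(cyclic_parent_pairs(3, edges))      # [(1, 2)]
```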

In SequenceL, we use the definitions themselves to write the code. For instance, when finding the parents of a vertex, a vertex v’s parents are defined as those vertices P that are connected to v by an edge p→v (where p is an element of P).  

Or more specifically, those vertices that are the source of an edge, whose destination is v.  

Or, in actual SequenceL:

parent(node, edge) :=
    edge.source when edge.destination = node;


The above function, given a single node and a single edge, will return the source of the edge (i.e., the parent of the node), if and only if the destination of the edge is the node. In other words, it will return the source of the edge, if that source is the node’s parent. 

But SequenceL also has the unique ability to “normalize” input data, so if you pass it the entire list of edges in the graph, instead of a single edge, it will return all of the sources that have a destination of v. In other words, all of the parents.

So the code snippet above is all that’s required to generate the full list of parents for any node—just by using the definition of a parent: it’s the source of an edge whose destination is the child.
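For readers without a SequenceL toolchain, the same definition-driven behavior can be approximated in Python (the tuple edge representation is an assumption), with a list comprehension standing in for SequenceL’s automatic normalization over the whole edge list:

```python
# Assumed representation: edges as (source, destination) tuples.
def parent(node, edges):
    """A node's parents are the sources of edges whose destination is the node."""
    return [src for (src, dst) in edges if dst == node]

edges = [(1, 3), (2, 3), (1, 2), (2, 1)]
print(parent(3, edges))  # [1, 2]
print(parent(2, edges))  # [1]
```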

Likewise, if a pair of parents p1 and p2 were to form a cycle, then by definition, there must be two edges somewhere in the graph that connect p1→p2, and p2→p1. That is, two vertices p1 and p2 form a cycle if [p1→p2 ,p2→p1] is a subset of the set of edges in the graph. Or, in SequenceL, just as shown in Code Block 2. No looping, traversing or analyzing involved—just the definition of a cycle.  
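The same definition-as-code idea can be sketched in Python (illustrative, not the contents of Code Block 2): two parents form a cycle exactly when both directed edges between them appear in the graph:

```python
# Assumed representation: edges as (source, destination) tuples.
def forms_cycle(p1, p2, edges):
    # By definition: a cycle exists iff both p1->p2 and p2->p1 are edges.
    return (p1, p2) in edges and (p2, p1) in edges

edges = [(1, 2), (2, 1), (1, 3)]
print(forms_cycle(1, 2, edges))  # True
print(forms_cycle(1, 3, edges))  # False
```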

Code Block 2

This returns true or false. If we want it to operate in the same way that parent() does, acting on entire sets of data, we might want to modify it to return the parents themselves, and to eliminate potential duplicates by forcing an ordering on the parents. We can do this just by inserting the line in Code Block 3.

Code Block 3

Now, to generate a list of all cyclic parents of a node, we just use the functions in combination, by using a helper function to call them and concatenate the results as shown in Code Block 4.

Code Block 4

The second helper function cp_helper() serves merely to normalize the input, so that we can pass the entire list of parents to cyclicNodes(), instead of just two nodes at a time.  
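Combining the pieces in the same spirit, a Python approximation of cyclicParents() might look like the following (the names and edge representation are assumptions, not the article’s Code Block 4; the p1 < p2 ordering mirrors the duplicate elimination inserted in Code Block 3):

```python
# Assumed representation: edges as (source, destination) tuples.
def parent(node, edges):
    return [src for (src, dst) in edges if dst == node]

def cyclic_parents(node, edges):
    # Keep every pair of the node's parents that has an edge in each
    # direction between them; p1 < p2 removes duplicate pairs.
    ps = parent(node, edges)
    es = set(edges)
    return [(p1, p2) for p1 in ps for p2 in ps
            if p1 < p2 and (p1, p2) in es and (p2, p1) in es]

edges = [(1, 3), (2, 3), (1, 2), (2, 1)]
print(cyclic_parents(3, edges))  # [(1, 2)]
```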

Given that, with just those four one- or two-line functions, we can easily generate the cyclic parents of any node, saying, for example:


cyclicParents(6, edgeList)

to see all of the parents of node 6 that form cycles.

As an illustration, assigning the sample topology shown in Figure 3 to “edges” (and representing the Access Points AP1 and AP2 as nodes -1 and -2, respectively), we can just ask for the parents:

Figure 3
Original network topology.


=> parent(2, edges)



=> parent(3, edges)



and the cyclic parents:


=> cyclicParents(2, edges)



=> cyclicParents(3, edges)



Note that while node 2 has more parents than node 3, it only has the Access Points as its cyclic parents, while node 3 has the Access Points, as well as the (AP1, node 2) pair. Using SequenceL, then, to write the algorithm, the main body can be expressed as in Code Block 5.

Code Block 5

In this case, the main body is really the three lines that are boxed at the end, after the “in” statement. All of the “let” statements are just helpers, providing arguments to those three lines. Other functions outside of the main() function implement other helpers, such as intersection(), head(), minHops(), etc., but they are similar to the previous examples in simplicity. Note the comments in the code that match specific lines of SequenceL to specific lines in the algorithm.

As these trials proceed toward the aforementioned goal of a 10,000-node calculation in one minute, the supplier will be well positioned to optimize its wireless network solution to adapt and manage its topology to the intelligent devices distributed throughout a plant with unprecedented speed, enabling reliable and ultra-responsive monitoring and/or control of these systems.


Advanced Micro Devices

Sunnyvale, CA.

(408) 749-4000


Texas Multicore Technologies

Austin, TX.