NVIDIA: New Milestones in Performance

Parallel Processing Platform Opens Bridge to High Performance Embedded Systems

The Compute Unified Device Architecture from NVIDIA has enabled supercomputing levels of graphics and numeric processing on desktop and rack-mount systems. Now it is moving to a new class of mobile processors, bringing them an entirely new level of performance.




Of all the vague expressions bandied about lately, perhaps none is less clear than the phrase “high-performance computing.” First, it is tied to time: a 2 GHz x86 would have been a fantastic dream 15 years ago. That also makes it relative, since “high performance” is simply higher than whatever counts as “normal performance” at a given moment. Still, it is justifiable to characterize “high performance” by the tasks a system can perform. It takes a higher class of performance to tackle robotic vision, say, than to run a vending machine.

Despite all this, we can definitely say that embedded computing is entering a phase of “high performance,” where it is gaining the ability to take on tasks that have never before been possible on a small, mobile embedded system. As with the trend in multicore processors, performance is gained not so much from clock speed as from parallelism—the development of processing units and supporting software that can perform compute-intensive operations on massively parallel platforms. One of the leaders of this charge is NVIDIA with its recently introduced Tegra K1 processor (Figure 1). 

Figure 1
Tegra K1 delivers higher CPU performance and power efficiency with its 4-Plus-1 quad-core ARM Cortex-A15 CPU.


NVIDIA is well known for its graphics processing as well as for its high-end GPU compute technology. In fact, 18,688 of its Tesla K20X GPUs paired with 18,688 16-core AMD Opteron CPUs are incorporated in Titan, the fastest supercomputer in the U.S., built by Cray for Oak Ridge National Laboratory. Each Tesla K20X has 2,688 CUDA cores, the units of parallel architecture at the heart of NVIDIA’s technology.

The Compute Unified Device Architecture (CUDA) is the platform that gives programmers writing in C and C++ access to the parallel processing capabilities of the GPU. And it is CUDA that forms the bridge between the high-end NVIDIA processors and the Tegra K1. While NVIDIA processors are technically referred to as GPUs, CUDA also enables highly parallel numeric processing for a vast array of applications. The architecture has advanced through generations named Tesla, Fermi and Kepler; the Tegra K1 has 192 Kepler-generation CUDA cores.

NVIDIA has actually developed two pin-compatible versions of the Tegra K1, a 32-bit and a 64-bit version, both based on the ARM instruction set. The 64-bit version appears to be scheduled for later release and is a dual Super Core CPU based on the ARMv8 architecture. The 32-bit version uses the 4-Plus-1 quad-core ARM Cortex-A15 CPU first used in the Tegra 4. This arrangement saves power through variable symmetric multiprocessing (vSMP): performance-intensive tasks run on the quad-core complex, while the processor switches to the (plus-1) “battery saver” A15 core for lower-performance tasks. NVIDIA states that it has optimized the 4-Plus-1 architecture to use half the power for the same CPU performance as the earlier Tegra 4, and to deliver almost 40% more performance at the same power consumption.

Unlike previous Tegra processors, which were very popular in tablets and are used in the Tesla Model S electric automobile, the Tegra K1 is compatible with CUDA. That means that CUDA-based software, especially libraries, tools and many applications, can be moved to the K1. Granted, 192 cores are not 2,688, so the two are not in the same performance class. But at approximately 327 GFLOPS, the performance in the embedded space definitely qualifies as “high.” In addition, other APIs supported by NVIDIA are available on the Tegra K1. These include OpenGL, originally developed by Silicon Graphics, which was designed to interact with graphics processing units (GPUs) and is widely used in CAD, scientific simulation and video games. There are also APIs targeted at vision, such as OpenCV and NVIDIA’s VisionWorks. Then, of course, there is CUDA.

CUDA offers a general-purpose interface for programming on the GPU, but is not specific to graphics. It does require some extra effort on the part of the programmer to determine which parts of the code are inherently parallelizable. Those parts are marked with CUDA’s language extensions, which tell the compiler what to run on the GPU and what to run on the quad-core CPU, while runtime calls move the data to GPU memory and copy the results back to main memory (Figure 2). The ability to do high-intensity parallel numeric processing as well as high-end graphics is important in a wide range of applications. For example, global climate simulation done on a full-blown supercomputer entails a huge amount of computing before the results are ever displayed graphically. The same can be said for gaming, where the trajectories of flying robots, vehicles and debris must be computed before each step in which they are rendered in their new positions on the screen.
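This processing flow can be sketched in CUDA C. The kernel name, array size and scaling operation here are illustrative, but the copy-launch-copy pattern and the `__global__` qualifier are standard CUDA:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel: each thread handles one element in parallel.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1024;
    float host[1024];
    for (int i = 0; i < n; i++)
        host[i] = (float)i;

    // Step 1: copy data from main memory to GPU memory.
    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Step 2: the CPU launches the kernel; the GPU runs it in parallel
    // across its CUDA cores (here, 4 blocks of 256 threads).
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);

    // Step 3: copy the result from GPU memory back to main memory.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[3] = %g\n", host[3]);
    return 0;
}
```

Compiled with nvcc, the same source runs on any CUDA-capable part, from a Tesla K20X down to the Tegra K1's 192 cores; only the degree of parallelism differs.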

Figure 2
The CUDA processing flow copies data from main memory to GPU memory, the CPU launches the process on the GPU, the GPU executes it in parallel across its cores, and the result is copied from GPU memory back to main memory.

Development Direction and Support

NVIDIA, of course, has ideas about where the new Tegra K1 will be applied, and these include computer vision for robotics, surveillance and security, defense and image processing for portable medical devices. However, NVIDIA’s product manager for Jetson Tegra K1, Jesse Clayton, readily admits, “We think we know what Tegra K1 will be useful for, but we hope and expect to be surprised at what people will do with it.” Clayton also reports that as of the date of our interview, “We’ve blown out order predictions for the development kit.” And they haven’t even tried offering it via Amazon yet.

To that end, NVIDIA has released its Jetson Tegra K1 development kit at a price of $192, which makes it easily available to everyone from the research scientist to the home hobbyist (Figure 3). The board features 2 Gbytes of memory plus 16 Gbytes of eMMC along with Gigabit Ethernet, USB 3.0, SATA, miniPCIe and RS-232, as well as a set of expansion ports for displays, GPIO and high-bandwidth camera interfaces.

Figure 3
The Jetson TK1 Development Kit will put powerful GPU development tools into the hands of anyone from engineers to hobbyists.

On the software side, the kit comes with the CUDA platform, OpenGL 4.4 and the NVIDIA VisionWorks toolkit. It also supports Linux for Tegra release 19, which comes with a Tegra Linux driver package that includes a kernel image, boot loader, NVIDIA drivers and flashing utilities. Development is recommended on a host PC running Ubuntu Linux.

While power consumption is a key issue for embedded and mobile development, Clayton notes that they have not yet done extensive experiments to determine the maximum power draw; so far they have seen it in the range of 6 to 7 watts. For power management, the clock can be locked at a certain level to experiment with optimizing power. There is also a clock governor in the Linux version that uses heuristics to determine when an increased workload requires cranking the clock up and when a drop in the workload allows turning it down to conserve power. “But sometimes,” Clayton says, “the user will want to lock the clock at a certain level and can also do that.”

In addition, there is the Jetson Pro kit, which is primarily oriented toward the development of automotive systems such as in-vehicle infotainment, advanced driver assistance systems (ADAS) and collision avoidance. The kit includes the Jetson main board with Tegra K1 processor, a breakout board with many connectivity options, a discrete CUDA-capable GPU, Wi-Fi, Bluetooth and GPS antennas, and touchscreen capability. The Jetson Pro kit also supports Linux and Android and is compliant with the in-vehicle open source development platform supported by the GENIVI Alliance. It is available to approved developers, Tier 1 suppliers and automobile manufacturers.

One of the advantages of the CUDA “bridge” that has already been demonstrated is in the cooperative development of a vehicle vision system with Audi. Object detection and recognition learning is being carried out on a multi-GPU system running neural networks. The results can then be ported to a Tegra K1-based vision system that fits in a small enclosure in the vehicle. Research and development in this area is ongoing.

So we can definitely state that “high-performance computing” is entering the world of mobile and embedded systems—because it breaks the barriers for tasks that were previously not feasible. Stand by for that already high bar to be raised even higher in the future.

Santa Clara, CA
(408) 486-2000