NVIDIA: New Milestones in Performance

NVIDIA’s Tegra K1: A Game-Changer for Rugged Embedded Computing

The migration of a powerful parallel GPU architecture, along with a compatible software platform and computing model, into a low-power ARM multicore SoC promises to bring to the mobile and embedded arena a range of capabilities that have so far not been possible.


Military and aerospace embedded computing customers are increasingly embracing GPU technology for applications—such as radar, sonar and ISR (intelligence, surveillance, reconnaissance)—that can readily benefit from the very high degree of parallelism offered by graphics processors. This approach is widely known as GPGPU—general purpose computing on graphics processing units.

Historically, as any PC gamer will confirm, this extensive compute capability has come with challenges in power consumption and heat dissipation. These are of increasing concern to military and aerospace customers, who are looking to deploy highly capable embedded computing solutions on smaller, lighter platforms that do not have significant power at their disposal and that are problematic to cool. Size, weight and power (SWaP) has become, in many such environments, a more important decision criterion than, for example, peak performance.

192 Cores, 327 GFLOPS: Less Than 10W of Power

NVIDIA’s Tegra K1 (TK1) is the first ARM system-on-chip (SoC) with integrated support for the Compute Unified Device Architecture (CUDA). With 192 Kepler GPU cores and four ARM Cortex-A15 cores delivering a total of 327 GFLOPS of compute performance (Figure 1), TK1 has the capacity to process large volumes of data with CUDA while typically drawing less than 6W of power, including the SoC and DRAM. CUDA is a parallel computing platform and programming model developed by NVIDIA that gives developers direct access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs, allowing them to be used for general purpose processing.

Figure 1
Simple TK1 block diagram.

To appreciate the scale of what NVIDIA has achieved, consider that a 6U VPX multiprocessor single board computer featuring earlier GPU technology consumes approximately 100 watts to deliver 645 GFLOPS of performance. In other words, that platform offers roughly twice the compute horsepower of TK1, but at more than ten times the power consumption. Such platforms, with their raw compute capability and flexibility, will continue to be at the heart of many leading-edge embedded computing deployments. There are, however, a growing number of applications envisaged for the future for which “only” half the processing performance is more than adequate if power consumption and cooling requirements can be reduced so substantially.

The Tegra K1, then, brings game-changing performance to low-SWaP and small form factor (SFF) applications in the sub-10W domain, all while supporting a developer-friendly Ubuntu Linux software environment that feels more like a desktop than an embedded SoC. Tegra K1 is plug-and-play and can stream high-bandwidth peripherals, sensors and network interfaces via built-in USB 3.0 and PCIe Gen2 x4/x1 ports. TK1 is geared for sensor processing and offers additional hardware-accelerated functionality that runs asynchronously to CUDA, including H.264 encoding and decoding engines, dual MIPI CSI-2 camera interfaces and image signal processors (ISPs). There are many exciting embedded applications for TK1 that leverage its natural ability as a media processor and its strength as a low-power platform for quickly integrating devices and sensors.

As GPU acceleration is particularly well-suited to data-parallel tasks like imaging, signal processing, autonomy and machine learning, Tegra extends these capabilities into the sub-10W domain. Code portability is now maintained from NVIDIA’s high-end Tesla HPC accelerators and the GeForce and Quadro discrete GPUs all the way down to the low-power K1 (Figure 2). A full build of CUDA toolkit 6 is available for TK1 that includes samples and math libraries such as the CUDA Fast Fourier Transform library (CUFFT), the CUDA Basic Linear Algebra Subroutines library (CUBLAS) and NVIDIA Performance Primitives (NPP), as well as NVIDIA’s NVCC compiler. Developers can compile CUDA code natively on TK1 or cross-compile from a Linux development machine. The availability of the CUDA libraries and development tools makes it straightforward to move CUDA applications between discrete GPUs and Tegra. OpenCV4Tegra is available as well, as is NVIDIA’s VisionWorks toolkit. Additionally, the Ubuntu 14.04 repository is rich in pre-built packages for the ARM architecture, minimizing time spent tracking down and building dependencies. In many instances applications can simply be recompiled for ARM with little modification, as long as the source is available and doesn’t explicitly call out x86-specific instructions such as SSE, AVX or x86 assembly; such code can instead target NEON, ARM’s SIMD extension for Cortex-A series CPUs.

So what are the practical applications for Tegra K1? Let’s consider a couple of case studies that highlight TK1’s ability to easily integrate sensors and support high-bandwidth streaming.

Figure 2
The launch of Tegra K1 sees the possibility of a range of GPU-based solutions with substantially different capabilities – but with code compatibility enabling one development for multiple deployments.

Case Study #1: Robotics/Unmanned Vehicle Platform

Embedded applications commonly require elements of video processing, digital signal processing (DSP), command and control, and so on. In this example architecture with Tegra K1, CUDA is used to process imagery from high-definition GigE Vision gigabit Ethernet cameras while simultaneously performing world mapping and obstacle detection on data from a 180° light detection and ranging (LIDAR) scanning rangefinder, a remote sensing device that measures distance by illuminating a target with a laser and analyzing the reflected light (Figure 3). Additionally, devices such as GPS receivers, inertial measurement units (IMUs), motor controllers and other sensors are integrated to demonstrate using TK1 for autonomous navigation and motion control of a mobile platform such as a robot or unmanned vehicle.

Figure 3
Sensor processing pipeline implemented using Tegra K1 for autonomous navigation.

Teleoperation capability is provided by applying Tegra’s hardware-accelerated H.264 compression to the video and streaming it over Wi-Fi, 3G/4G or satellite downlink to a remote ground station or other networked platform. This architecture provides an example framework for perception modeling and unmanned autonomy using TK1 as the system’s central processor and sensor interface. It is Tegra’s low power consumption and minimal heat dissipation that make it an attractive processor for confined environments such as robots or small unmanned vehicles, giving them a local processing capability that would previously have been unthinkable.

The scanning LIDAR produces range samples every 0.5 degree over 180 degrees. These are grouped into clusters using mean shift and tracked when motion is detected. CUDA is used to process all range samples simultaneously and perform change detection against the octree-partitioned 3D point cloud built from previous georeferenced LIDAR scans, producing a list of static and moving obstacles that is refreshed in real time for collision detection and avoidance. A radar-like plan position indicator (PPI) is then rendered on the OpenGL side (Figure 4). This particular LIDAR was connected via RS-232 to a serial port; other LIDARs support Gigabit Ethernet as well. The open-source SICK Toolbox library, which compiles and runs out of the box on TK1, was used for connecting to the sensor. Easy access to LIDAR sensors gives TK1 millimeter-resolution range readings to exploit with CUDA for real-time 3D environment mapping and parallel path planning.

Figure 4
LIDAR-driven PPI display visualizes static and moving obstacles in the platform’s environment.

On the imaging side, Tegra K1 has a number of interfaces for streaming high-definition video, such as CSI-2, USB 3.0 and Gigabit Ethernet. Frame grabbers for other media such as HD-SDI, Camera Link and LVDS can be integrated with TK1 via its PCIe Gen2 x4 port. For this case study, testing was carried out with multiple Gigabit Ethernet cameras from GigE Vision-compliant vendors. These had resolutions ranging from 1920 x 1080 up to 2448 x 2048, and the testing found that a single ARM CPU core per Gigabit Ethernet port was sufficient for handling network protocols and packetization using the sockets API. Using the cudaMallocManaged() feature new to CUDA 6, the video stream is depacketized by the CPU into a buffer shared with the GPU, requiring zero copies to get the video “into GPU memory”: on TK1, it is physically all the same memory.

Using freely available libraries like OpenCV, NVIDIA NPP and VisionWorks, users can run a wide range of CUDA-accelerated video filters on the fly, including optical flow, SLAM, stereo disparity, robust feature extraction and matching, mosaicking, and multiple moving target indication (MMTI). Trainable pedestrian and vehicle detectors can run in real time on TK1 using available Histogram of Oriented Gradients (HoG) implementations. Many existing CUDA codebases that previously ran on discrete GPUs can now be deployed on Tegra.

In addition to LIDAR devices and cameras, TK1 supports navigational sensors such as GPS and IMU for improved autonomy. These are commonly available as USB or serial devices and can easily be integrated with TK1. One quick way to make a GPS-enabled application is to use libgps/gpsd, which provides a common software interface and GPS datagram for a wide class of National Marine Electronics Association (NMEA)-compliant devices. Meanwhile, IMU sensors are integrated to provide accelerometer, gyro and magnetometer readings at refresh rates of 100 Hz or more. TK1 fuses the rapid IMU and GPS data using high-quality Kalman filtering to deliver real-time interpolated platform positions in 3-space, and then uses these interpolations to further refine visual odometry from optical flow. While less standardized than the NMEA-abiding GPS units, IMU devices commonly ship with vendor-supplied C/C++ code intended to link with libusb, a standard userspace driver interface for accessing USB devices on Linux. Such userspace drivers leveraging libusb require little effort to migrate from x86 to ARM and enable developers to quickly integrate various devices with TK1. Examples include MOSFET or PWM motor controllers for driving servos and actuators, voltage and current sensors for monitoring battery life, gas/atmospheric sensors, ADCs/DACs and more, depending on the application at hand. Tegra K1 also features six GPIO ports for driving discrete signals, which are useful for connecting switches, buttons, relays and LEDs.

This case study accounts for common sensory and computing aspects typically found in robotics, machine vision, remote sensing and so on. TK1 provides a developer-friendly environment that takes the legwork out of integration and makes deploying embedded CUDA applications easy while delivering superior performance.

Case Study #2: Tiled Tegra

Some applications may require multiple Tegras working in tandem to meet their requirements. Clusters of Tegra K1 SoCs can be tiled and interconnected with PCIe or Ethernet. The size, weight and power advantages gained from such a tiled architecture are substantial and extend the applicability of TK1 into the datacenter and high-performance computing (HPC). Densely distributed topologies with 4, 6, 8 or more K1 SoCs tiled per board are possible and provide scalability beneficial to embedded applications and HPC alike. Consider an example, based on an existing embedded system, that employs six Tegra K1s (Figure 5).

Figure 5
SWaP-optimized tiled architecture: six Tegra K1s interconnected with non-transparent PCIe switching and RDMA.

The six TK1s are interconnected via PCIe Gen2 x4 and a 32-lane PCIe switch with non-transparent (NT) bridging and DMA offload engines. This, along with a lightweight userspace RDMA library, provides low-overhead inter-processor communication between the TK1s. Meanwhile, sensor interfaces are provided by a Gigabit Ethernet NIC/PHY connected to each Tegra’s PCIe Gen2 x1 port. A spare PCIe x8 expansion port is also brought out from the PCIe switch, providing up to 4 Gbyte/s of off-board connectivity to user-determined I/O interfaces.

A tiled solution like this is capable of nearly 2 TFLOPS of compute performance while drawing less than 50W, and represents a large increase in the efficiency of low-power clustered SoCs over architectures that utilize higher-power discrete components. The decrease in power enables the placement and routing of all components on board, resulting in connectorless intercommunication with improved signal integrity and ruggedization. Useful for big data analytics, multi-channel video and signal processing, and machine learning, distributed architectures with TK1 offer substantial performance gains for those truly resource-intensive applications that require computational density while minimizing SWaP.  

The ground-breaking computational performance of Tegra K1, driven by NVIDIA’s low-power optimizations and the introduction of integrated CUDA, leads a new generation of embedded devices and platforms that leverage TK1’s SWaP density to deliver advanced features and capabilities. NVIDIA and GE have partnered to bring rugged SFF modules and systems powered by TK1 to the embedded space. Applications in robotics, medical and man-wearable devices, software-defined radio, security, surveillance and others are prime candidates for acceleration with Tegra K1. Beyond this, TK1’s ease of use promotes scalable, portable embedded systems with shortened development cycles, aided further by the wealth of existing CUDA libraries and software that now run on Tegra.

The powerful GPU-based multiprocessing platforms of today will continue to be favored in deployments in which the maximum possible pure processing capability is an absolute requirement. There can be little doubt, though, that Tegra K1 offers significant opportunity to bring powerful, rugged embedded computing to places and applications where it was previously impossible.

GE Intelligent Platforms