TECHNOLOGY CONNECTED

PCI Over Cable vs. Ethernet

PCI Express over Cable Middleware Maximizes Application Performance

New software and flexible PCI Express over cable solutions simplify interconnect requirements in high-performance embedded solutions.

BY HERMAN PARAISON, DOLPHIN INTERCONNECT SOLUTIONS

System architects increasingly look to PCI Express over cable as a means of improving real-time system performance. The ability to partition systems with PCI Express over cable makes development of real-time control systems easier and faster for embedded applications. Yet PCI Express has largely been confined to the I/O portion of system architectures, with little emphasis placed on inter-process communication such as shared memory, RDMA, replication or accelerated networking. Instead, embedded system designers introduce a separate interconnect for multiprocessing, which increases complexity and cost and limits flexibility. The cost, functionality and performance characteristics of PCI Express over cable suit multiprocessing applications, but it has been slow to gain traction there.

PCI Express over cable provides numerous advantages. In an asymmetric multiprocessing architecture, which partitions resources into discrete, known entities, PCI Express over cable enables system architects to partition systems between processors, I/O and memory. Several vendors offer adapter cards, backplanes, cables and switch solutions to accomplish this task. The resulting architecture features reduced system latency and increased data throughput. For inter-process communication, PCI Express supports both CPU-driven programmed I/O (PIO) and Direct Memory Access (DMA) as transports through non-transparent bridging (NTB).

So why are many within the embedded community reluctant to use PCI Express for inter-process communication, choosing instead a dual-interconnect solution that adds Ethernet or another interconnect? The hesitation results from a lack of middleware that addresses multiprocessing architectures. Acceptance of PCI Express as an inter-process solution requires new, easy-to-use functional drivers and middleware. Such middleware takes advantage of the unique performance attributes of PCI Express while exposing standard interfaces to higher-level applications, including Ethernet-based applications.

This middleware is now available and implements a software transport interface between applications and the PCI Express data transfer layer. The middleware software interfaces with applications through shared memory libraries, Berkeley sockets and TCP/UDP functionality. It utilizes the data transfer layer to move data between processors, memories and I/O devices through the NTB functions of PCI Express. In addition, by using direct remote memory reads and writes and standard DMA operations (RDMA), the middleware implements a reliable and very efficient data transport for applications. These architecture layers seamlessly integrate to take advantage of the performance attributes of PCI Express and create a simplified environment for software development and support.

Software Overview

The middleware includes two key components: a Berkeley-compliant sockets interface and an optimized shared memory API. Figure 1 shows the components of the software architecture, based on Dolphin Interconnect Solutions' shared memory programming API and middleware for PCI Express. The Berkeley-compliant sockets library reduces the number of system resources and interrupts needed for data transfer in order to optimize data throughput and latency.

Figure 1
The middleware is organized for simultaneous communication. The PCIe driver is used for setup, DMA and plug and play management. PIO transfers originate directly from the application.

The optimized shared memory API allows applications to safely map chunks of remote memory and supports data transfers based on PIO and DMA. It also allows triggering remote interrupts and managing events generated by the data transfer layer.

Within the sockets interface, standard mechanisms such as the Windows Layered Service Provider API and Linux Sockets Direct enable standard applications to run over PCI Express without modification. The PCI Express middleware takes advantage of these techniques to deliver a sockets interface that significantly accelerates application performance compared to traditional 1 Gbit and 10 Gbit Ethernet hardware. Figure 2 illustrates the latency results of PCI Express sockets vs. 10 Gbit Ethernet. The sockets interface deploys differently on Linux and Windows.

Figure 2
Application-to-application socket latency is as low as 1.75 µs, far lower than 10 Gbit Ethernet.

To implement this interface on Linux, the address family changes from the regular sockets family. The PCI Express sockets interface implements an AF_INET-compliant socket transport called AF_SSOCK. The Linux LD_PRELOAD mechanism loads a special sockets library ahead of the standard socket library; this library intercepts the socket() call and replaces the AF_INET address family with AF_SSOCK. All other socket calls follow the usual code path, and the PCI Express sockets module accelerates them if the destination address falls within the PCI Express network.

For Windows, most applications use dynamic link libraries provided by the operating system (WS2_32.DLL and KERNEL32.DLL) for socket communication. Windows sockets are regular handles, so normal handle operations, such as closure, duplication and inheritance when a new child process is created, apply to them. The PCI Express sockets middleware intercepts the required WinSock2 API calls. At runtime, all socket calls route through a sockets switch.

Basic configuration files control the communication channel. These files identify systems in the processor cluster and their associated node ID in the network. Sockets configured for a PCI Express end point route through the low latency sockets library. All other connections route back through the standard WinSock2 or AF_INET transport library and regular Ethernet. Measurements show that the redirection is virtually instantaneous and adds no overhead to system calls. An optional configuration file can explicitly enable or disable individual PCI Express sockets communication.
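The interception and routing steps described above can be modeled in a few lines of Python. This is only a sketch of the decision logic, not the middleware's code: the configuration format (one `hostname node_id [disabled]` entry per line), the function names and the route labels are all assumptions made for illustration, and the address families are represented by name only, since the numeric value of AF_SSOCK is not given here.

```python
def rewrite_family(requested):
    """Model of the preloaded socket(): AF_INET becomes AF_SSOCK, others pass through."""
    return "AF_SSOCK" if requested == "AF_INET" else requested


def parse_cluster_config(text):
    """Parse a hypothetical config: one 'hostname node_id [disabled]' entry per line."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        fields = line.split()
        host, node_id = fields[0], int(fields[1])
        enabled = len(fields) < 3 or fields[2] != "disabled"
        table[host] = (node_id, enabled)
    return table


def route(host, table):
    """Sockets switch: known, enabled PCI Express end points take the fast path."""
    entry = table.get(host)
    if entry is not None and entry[1]:
        return "pci-express-sockets"
    return "standard-stack"


EXAMPLE_CONFIG = """
# hostname  node_id  [disabled]
nodeA 4
nodeB 8 disabled
"""
```

With this example configuration, `route("nodeA", parse_cluster_config(EXAMPLE_CONFIG))` selects the PCI Express path, while `nodeB` (explicitly disabled) and any host absent from the file fall back to the standard stack.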

The sockets library implements PIO transfers for small messages and engages DMA engines to transmit large messages. Since the sockets interface uses standard calls, it is application transparent. The combination leads to low latency and high throughput without application changes or tuning. Figure 3 shows the results of a LINBIT DRBD benchmark using the sockets interface. The PCI Express socket library outperforms other interconnects in write performance when using DRBD.

Figure 3
DRBD random write test on Intel 10Gbit Ethernet controller, Mellanox 40Gbit InfiniBand controller and Dolphin IX PCI Express controller.

In addition, the sockets middleware comes with built-in high availability. If the PCI Express network is unavailable, socket communication reverts to the regular network stack. The Linux version comes with an instant fail-over and fail-forward mechanism that transparently switches between PCI Express and regular networking if PCI Express communication is disabled (for example, due to a disconnected cable).
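The fail-over and fail-forward behavior can be illustrated with a small Python sketch. It assumes the PCI Express path reports a broken link by raising `ConnectionError`; the `FlakyLink` class, the `send_with_failover` helper and the returned labels are hypothetical names used only for this illustration and say nothing about how the middleware detects link state internally.

```python
class FlakyLink:
    """Hypothetical stand-in for the PCI Express path; `up` models the cable state."""

    def __init__(self):
        self.up = True
        self.delivered = []

    def send(self, payload):
        if not self.up:
            raise ConnectionError("PCI Express link down")
        self.delivered.append(payload)


def send_with_failover(payload, primary, fallback):
    """Try the fast path first; on a link error, deliver over the regular network."""
    try:
        primary(payload)
        return "pci-express"
    except ConnectionError:
        fallback(payload)  # fail-over: the caller never sees the error
        return "regular-network"
```

Because every call tries the primary path first, traffic returns to PCI Express on the first send after the link comes back, which is the fail-forward half of the mechanism.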

Shared Memory API

Setup and configuration of an NTB environment requires knowledge of chipsets and programming as well as PCI Express experience. Vendors of PCI Express chipsets provide example code to simplify this process, but reaching an optimal and flexible solution requires a significant amount of work. The PCI Express shared memory API simplifies the development of NTB applications for those seeking maximum performance. The shared memory API includes drivers that allow developers to allocate memory segments on the local node and make this memory available to other nodes. The local node then connects to memory segments on remote nodes.
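The allocate, publish and connect pattern can be tried on a single host with Python's standard `multiprocessing.shared_memory` module. This is an analogy only: it does not use the Dolphin shared memory API, the segment never crosses a PCI Express link, and the helper names `export_segment` and `connect_segment` are invented for this sketch.

```python
from multiprocessing import shared_memory


def export_segment(size):
    """'Local node': allocate a segment; its generated name is the handle peers use."""
    return shared_memory.SharedMemory(create=True, size=size)


def connect_segment(name):
    """'Remote node': attach to an existing segment by name."""
    return shared_memory.SharedMemory(name=name)


def demo():
    local = export_segment(64)
    peer = connect_segment(local.name)
    try:
        peer.buf[:5] = b"hello"        # a plain store through the mapping
        return bytes(local.buf[:5])    # the owner sees the data without any copy call
    finally:
        peer.close()
        local.close()
        local.unlink()                 # free the segment once both sides are done
```

The point of the sketch is the programming model: once a segment is connected, moving data is an ordinary memory write rather than a send or receive call.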

Once available, a memory segment is accessed in one of two ways: it is either mapped into the address space of the process and accessed like normal memory, such as via pointer operations, or transferred by the DMA engine in the PCI Express chipset. Mapping the remote address space and using PIO may be appropriate for control messages and data transfers up to roughly 1 Kbyte, since the processor moves the data with very low latency. PIO optimizes small write transfers by requiring no memory lock-down and using a write-posted store instruction, since data may already exist in the cache and the actual transfer is a single CPU instruction. A large transfer with this PIO methodology creates processor overhead, so DMA is often preferred.
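A size-based choice between the two access methods might look like the following Python sketch. The 1 Kbyte limit is the rough figure quoted above, not a measured crossover, and the function and constant names are assumptions; a real application would tune the limit for its platform.

```python
PIO_LIMIT_BYTES = 1024  # rough guideline from the text; the real crossover is platform dependent


def choose_transfer(nbytes, pio_limit=PIO_LIMIT_BYTES):
    """Return 'PIO' for small transfers and 'DMA' for large ones."""
    if nbytes < 0:
        raise ValueError("transfer size cannot be negative")
    return "PIO" if nbytes <= pio_limit else "DMA"
```

A 64-byte control message would therefore go out as CPU stores, while a 1 Mbyte buffer would be handed to the DMA engine.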

The DMA approach keeps the CPU out of the data movement. Latencies usually increase slightly because of the time required to lock down memory, set up the DMA engine and handle the completion interrupt. However, joining several data transfers and sending them together, in order, to the PCI Express switch amortizes the overhead. The shared memory API sets up the DMA queue and passes one or more data transfer specifications to the DMA engine.
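The amortization argument can be made concrete with a simple cost model in Python. The model assumes one fixed setup charge per queued batch plus a cost proportional to the bytes moved; the `TransferSpec` fields, the function names and every number used with them are illustrative assumptions, not measurements of any PCI Express hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferSpec:
    """One entry in a hypothetical DMA queue."""
    local_offset: int
    remote_offset: int
    length: int  # bytes


def batch_cost_us(specs, setup_us, us_per_kbyte):
    """One setup charge for the whole queue plus a size-proportional cost."""
    moved = sum(spec.length for spec in specs)
    return setup_us + us_per_kbyte * moved / 1024


def per_transfer_cost_us(specs, setup_us, us_per_kbyte):
    """Average cost of each transfer when the batch shares a single setup."""
    return batch_cost_us(specs, setup_us, us_per_kbyte) / len(specs)
```

With an assumed 8 µs setup and 0.5 µs per Kbyte, a single 4 Kbyte transfer costs 10 µs in this model, while four such transfers queued together cost 16 µs in total, or 4 µs each.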

The shared memory API also manages interrupts. The API includes management support for local and remote interrupts, along with other advanced features such as caching, managing data transfer errors, event generation and a callback mechanism.

Simulator applications distributed over several systems illustrate a shared memory API implementation. The PCI Express over cable solution delivers direct system communication with the lowest possible latency and data delivery jitter using uni- or multi-cast communication. Data written to remote nodes typically arrives in remote memory in less than 0.74 microseconds, as shown in Figure 4. The data is stored in cacheable main system memory. This gives a significant performance and cost benefit over interconnect solutions that rely on device memory for communication.

Figure 4
Half round trip latency is under 1 µs with the optimized shared API library.

Remote interrupts or polling signal the arrival of data from a remote node. Since the memory segments are normal cacheable main memory, polling is very fast and consumes no memory bandwidth. The CPU polls for changes in its local cache. When new data arrives from the remote node, the I/O system automatically invalidates the cache and caches the new value.
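The polling pattern can be sketched in Python with a thread standing in for the remote node. This is a single-process model under stated assumptions: a `bytearray` plays the role of the local memory segment, the last byte is used as an arrival flag, and the names `remote_write` and `poll_for_data` are hypothetical. It shows the ordering (data first, flag last) rather than any cache behavior.

```python
import threading
import time


def remote_write(segment, flag_index, payload):
    """Stand-in for the remote node: write the data first, then raise the flag."""
    if len(payload) != flag_index:
        raise ValueError("payload must fill the data area exactly")
    segment[:flag_index] = payload
    segment[flag_index] = 1  # flag last, so the reader never sees a partial message


def poll_for_data(segment, flag_index, timeout_s=2.0):
    """Spin on the local flag byte until data arrives or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if segment[flag_index] != 0:
            return bytes(segment[:flag_index])
        time.sleep(0)  # yield so the writer thread can run
    raise TimeoutError("no data arrived")


def demo():
    segment = bytearray(9)  # 8 data bytes plus 1 flag byte
    writer = threading.Timer(0.01, remote_write, args=(segment, 8, b"PCIeDATA"))
    writer.start()
    try:
        return poll_for_data(segment, 8)
    finally:
        writer.join()
```

The reader only ever touches its own memory; the writer's store is what changes the value it sees, which mirrors the description of polling on local, cacheable memory.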

Overall, the addition of PCI Express middleware enables powerful new PCI Express over cable solutions for embedded systems and commercial applications. This tool set enables PCI Express to tackle applications traditionally served by Ethernet or proprietary interconnects. This is the next step in the PCI Express evolution.

Dolphin Interconnect Solutions
Woodsville, NH.
(603) 747-4101
www.dolphinics.com