Advances in Industrial Networking

Discover Problems in Your Distributed System Before It’s Too Late

In a distributed system, components running on different OSs and applications written in different languages all work together as a single reliable system. As the complexity of these systems increases, it becomes more important to gain visibility and expose potential problems before they jeopardize the production system.



Distributed systems are composed of many subcomponents, often built separately by different teams and implemented and deployed in multiple phases. There are many ways to facilitate communication between components, but designing a truly loosely coupled system, where you can plug in additional components without affecting applications that are already deployed, remains a very difficult problem. Once the system is deployed, the need for visibility into what is happening will certainly arise, and without a flexible architecture that lets you insert additional monitoring tools, gaining that visibility becomes nearly impossible.

The term “distributed systems” covers myriad variations. One example is the set of technologies in a passenger car that assist driver safety. Radars and sensors located around the car monitor the surroundings, and their data must be transmitted in real time to other components for analysis, which in turn triggers actions in yet other components to help the driver avoid collisions. Other examples of distributed systems include military systems inside navy ships and unmanned vehicles, medical imaging systems, asset tracking systems and air traffic control systems. These systems may look very different from a high level, but under the hood they share many characteristics. All of them are large-scale systems composed of many separate processors and embedded devices, and it is critical that all components work together in real time and be highly reliable.

Getting the system up and running properly is only the first problem. Administering and maintaining the system to make sure it behaves as you expect, without interrupting its components, brings further challenges. Far too often the need for extra tools and components is considered only as an afterthought, and by the time you realize you need them, adding them to the deployed system requires major and costly changes. For instance, between two communicating nodes, how do you intercept and visualize the live data flow without interrupting either node? How do you verify that nodes that are meant to communicate have compatible parameters? How do you keep a record of all the data on the wire, and possibly replay all or a subset of that data for testing or debugging? And with components that are designed to be loosely coupled and can run independently, how can you manage and administer them all in one place?

Building on Top of a Proven Architecture

One of the key criteria for a flexible architecture is the ability to plug new modules into the existing system without modifications to any deployed components. Data Distribution Service (DDS) is an open and growing standard maintained by the Object Management Group (OMG) that is designed specifically to address the challenges of a loosely coupled, flexible, yet seamlessly integrated distributed system.

When applications need to send data and messages to each other in a distributed environment, one way is the “message-centric” approach, where the infrastructure typically has very little information about the message. The infrastructure’s job is to deliver these indistinguishable messages, treating them all alike. The alternative is “data-centric” middleware, where the infrastructure understands the data format. It is similar to a relational database, where the database is fully aware of the schema of the data. It knows the definition of all the tables, all the columns, primary keys and foreign keys, and you can define triggers so that, when a certain event occurs, the system automatically reacts by taking a corresponding action. In a data-centric approach, the data model is likewise explicit and well defined, so that all applications fully understand the meaning of the data.
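The difference can be made concrete with a small sketch. The following Python fragment is only an illustration of the data-centric idea, not DDS or any vendor API: the `TrackUpdate` type, its fields and the `DataSpace` class are hypothetical names invented for this example. Because the infrastructure knows the schema, it can keep the latest value per key and filter on field contents, much as a database would.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class TrackUpdate:
    """Hypothetical, explicitly defined data type shared by all applications."""
    track_id: int            # key field, comparable to a primary key
    range_m: float           # distance to the tracked object, in meters
    closing_speed_mps: float


class DataSpace:
    """Toy data-centric store: it understands the schema of what it holds."""

    def __init__(self) -> None:
        self._latest: Dict[int, TrackUpdate] = {}

    def write(self, sample: TrackUpdate) -> None:
        # A newer sample replaces the previous one for the same key.
        self._latest[sample.track_id] = sample

    def query(self, predicate: Callable[[TrackUpdate], bool]) -> List[TrackUpdate]:
        # Filtering on field values is possible only because the schema is known.
        return [s for s in self._latest.values() if predicate(s)]


space = DataSpace()
space.write(TrackUpdate(track_id=1, range_m=40.0, closing_speed_mps=2.0))
space.write(TrackUpdate(track_id=2, range_m=8.0, closing_speed_mps=5.0))
space.write(TrackUpdate(track_id=1, range_m=35.0, closing_speed_mps=2.5))

# Only track 2 is closer than 10 meters.
close_tracks = space.query(lambda s: s.range_m < 10.0)
```

A message-centric infrastructure, by contrast, would see three opaque payloads and could neither replace the older update for track 1 nor filter by range.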

The open DDS standard specifies both the API and the wire protocol. This allows applications written in different languages and running on different platforms to interoperate. DDS implementations from different vendors can also interoperate, as long as each implementation follows the standard specification. Various vendors, including RTI, have provided their own implementations of DDS and participated in demos to show interoperability across vendors.

Automatic System Analysis

With Data Distribution Service, the middleware infrastructure takes care of the discovery and communications between all available applications. While a lot of this work is done automatically, there are many ways to customize the behavior of the communication through Quality of Service (QoS) values of each data publisher and subscriber. There are QoS values for configuring reliability, durability and filtering, among many other behaviors that you can configure. Configuration is done by simply specifying the appropriate value through an XML file or through code in your application. Having these QoS values significantly simplifies development, improves efficiency, and also makes your system very flexible. You can change many behaviors in the communication by simply updating these QoS values without changing your application code. 

To illustrate with a specific example, one of the many QoS values available is the Deadline. The Deadline QoS specifies the maximum time between data samples. You can set it on a data writer, which declares that a new data sample will be published at least every x seconds. You can also set it on a reader, which specifies that the reader expects to receive a data sample at least every y seconds. Consider a sensor application with a data writer that sets the deadline value to 1 second. This lets other components know that it agrees to send a new data sample at least once every second. It can send much more frequently than that, but after it sends one sample, it promises to send at least one more before a full second has elapsed.
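As a rough illustration of what the writer's promise means, the sketch below checks a list of publication times against a deadline. It is plain Python rather than a DDS API; the function name `missed_deadlines` and the sample timestamps are made up for this example.

```python
from typing import List


def missed_deadlines(timestamps: List[float], deadline_s: float) -> List[float]:
    """Return each gap between consecutive samples that exceeds the deadline."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return [gap for gap in gaps if gap > deadline_s]


# Hypothetical publication times, in seconds, for a writer with a 1-second deadline.
sample_times = [0.0, 0.4, 1.3, 2.9, 3.5]
late_gaps = missed_deadlines(sample_times, deadline_s=1.0)
```

With these timestamps, only the gap between the samples at 1.3 s and 2.9 s is longer than one second, so that is the single point at which the writer broke its promise.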

For the control application that is listening for data from the sensor application, you can specify a deadline on the reader of, say, 10 seconds. This means that the reader requires a new data sample at least every 10 seconds. The QoS value of one application can therefore affect another application, and the values need to be compatible for the two to communicate. In the current example, the writer promises to write a sample at least every second and the reader requires one only every 10 seconds, so the values are compatible: the writer publishes data more frequently than the reader requires. Since the middleware is aware of these QoS values, it does all the work of matching readers and writers that have compatible QoS values. The opposite also holds: if the values were incompatible, the middleware would know not to attempt sending data between these applications, which would only waste bandwidth.
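The compatibility rule for the Deadline QoS reduces to a single comparison: the period the writer offers must be no longer than the period the reader requests. A minimal Python sketch of that rule follows; the function name is hypothetical, and representing an unset deadline as infinity is an assumption of this sketch.

```python
import math

# Assumption for this sketch: an unset deadline is modeled as an infinite period.
INFINITE = math.inf


def deadlines_compatible(offered_s: float, requested_s: float) -> bool:
    """A writer and reader match when the writer's offered period is not
    longer than the period the reader requests."""
    return offered_s <= requested_s


# The example above: writer offers 1 s, reader requests 10 s.
sensor_to_control = deadlines_compatible(offered_s=1.0, requested_s=10.0)

# Reversed values: writer offers only 10 s, reader requires 1 s.
reversed_case = deadlines_compatible(offered_s=10.0, requested_s=1.0)
```

Here `sensor_to_control` is `True` and `reversed_case` is `False`, which mirrors the two situations described above.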

Leveraging Appropriate Tools 

RTI’s comprehensive tool suite is used in all phases of development. In initial phases, a common issue that prevents communication is incompatible settings between components. To quickly see all the participants and whether the QoS values are compatible, the Analyzer offers the QoS Match Analysis. Analyzer will discover all participating entities on the network and generate an automatic report that highlights the number of readers and writers that are reading or writing a particular data type and whether they are matched or mismatched to each other. The example in Figure 1 discovered four mismatches and you can drill down into the mismatched entities to see why.

Figure 1
Screenshot of Analyzer checking for entities that have incompatible settings.

Double-clicking on the entity reveals the exact QoS that didn’t match. In Figure 2, the Analyzer highlights the Deadline QoS values that are incompatible. The writer has the default value of Infinite, whereas the data reader requires a data sample every 100 ms. Since the values are incompatible, no data will flow between these data writers and readers.

Figure 2
Analyzer displaying detailed explanations for mismatched entities.

There are many other QoS values that can be tuned, and it is not easy to keep track of all the values without having the right tools. Using the Analyzer tool to run a quick report makes it much easier to identify the problematic values. 
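To give a feel for the kind of check such a report performs, here is a small Python sketch that pairs up writers and readers on the same topic and counts matches and mismatches based on the Deadline QoS alone. It is not how the Analyzer is implemented; the `Endpoint` type, the endpoint names and the `Temperature` topic are hypothetical, and a real report would compare many more QoS values.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Endpoint:
    name: str
    topic: str
    deadline_s: float  # math.inf stands for an Infinite (unset) deadline


def match_report(writers: List[Endpoint], readers: List[Endpoint]) -> Dict[str, Counter]:
    """Count matched and mismatched writer/reader pairs for each topic."""
    report: Dict[str, Counter] = {}
    for writer in writers:
        for reader in readers:
            if writer.topic != reader.topic:
                continue
            counts = report.setdefault(writer.topic, Counter())
            # Deadline rule: the offered period must not exceed the requested one.
            ok = writer.deadline_s <= reader.deadline_s
            counts["matched" if ok else "mismatched"] += 1
    return report


writers = [
    Endpoint("default_writer", "Temperature", math.inf),  # default: Infinite
    Endpoint("fast_writer", "Temperature", 0.05),          # every 50 ms
]
readers = [Endpoint("control_reader", "Temperature", 0.1)]  # requires 100 ms

report = match_report(writers, readers)
```

In this made-up setup the writer left at the Infinite default fails to match the reader that requires a sample every 100 ms, while the 50 ms writer matches, so the `Temperature` entry shows one matched and one mismatched pair.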

Wire Tapping 

Since the data model is explicitly defined, the architecture is well suited to letting additional applications or tools tap into the wire. The tools simply subscribe to the same data topics, and no changes are required in the data publisher or subscriber. One use case for wiretapping is visualizing the data that is on the wire. RTI offers a plugin for Microsoft Excel that brings data from the Connext Data bus directly into a spreadsheet. The plugin detects all the available topics. You select the data topic and data fields that you are interested in, and then specify the cell location in an Excel worksheet where the data should be displayed.

Once you have set up a cell to receive data, its value is updated automatically as new data arrives, and you can use the data like that in any other cell in a spreadsheet. For example, you can plot the values in a bar or pie chart, and the chart will also be updated dynamically based on the latest values in the cells.
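The property that makes this kind of tapping possible, a new subscriber joining a topic without any change to the publisher or the existing subscriber, can be shown with a toy in-process bus. The Python sketch below is an illustration only: `TopicBus` and the `Temperature` topic are hypothetical, and real DDS middleware does this across the network with automatic discovery.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class TopicBus:
    """Toy publish-subscribe bus keyed by topic name."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, sample: Any) -> None:
        # The publisher never learns who, or how many, are listening.
        for callback in list(self._subscribers[topic]):
            callback(sample)


bus = TopicBus()
control_inbox: List[Any] = []
tap_inbox: List[Any] = []

# The original system: one subscriber on the topic.
bus.subscribe("Temperature", control_inbox.append)
bus.publish("Temperature", {"sensor_id": 1, "celsius": 21.5})

# Later, a monitoring tool taps the same topic; the publishing code is unchanged.
bus.subscribe("Temperature", tap_inbox.append)
bus.publish("Temperature", {"sensor_id": 1, "celsius": 22.0})
```

After the second publish, the control application has received both samples and the tap has received the one sent after it joined.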

Wiretapping is also necessary for advanced debugging. The popular network analyzer Wireshark supports the Real-Time Publish-Subscribe (RTPS) protocol that DDS uses, and you can use it to see low-level details of each data packet flowing on the network (Figure 3).

Figure 3
Wireshark analyzing data packets sent by an RTI Connext application.

Deep Monitoring

It is important to monitor your system to make sure it is running without problems. Beyond the data that you can retrieve by tapping into the wire, further insight requires deeper monitoring and instrumentation. RTI provides an optional monitoring library that, once enabled, collects various statistics from readers, writers and other DDS entities, and then publishes these data using DDS. The companion RTI Monitor application displays the data collected by the monitoring library. It offers many different views and panels to help you understand your system.

On the left panel in Figure 4, you can see a system tree displaying the hosts in your system, the processes, the topics, the data writers and the data readers. On the right panel you see a system overview. Near the bottom is a diagram where you get a graphical view of what your system looks like. In this example you can see there’s one host with two processes, and you can see the number of writers and readers in this process. The view is also color coded to highlight errors or warnings. 

Figure 4
Monitor’s system overview displaying a summary of the entities on the network, and highlighting components that have warnings or errors.

Centralized Administration and Logging

If the application is running entirely on one host or in some sort of server, traditional log files serve as a convenient way to track critical information during run time. However, in a distributed system, gathering and collecting log files from all the components becomes a big hassle. To solve this problem, RTI has created a Distributed Logger API, which allows the application to publish log messages remotely. The API is very simple. You can log messages of different error severity, and they can be published on the network so that you can view them in a centralized location. To view live updates of these distributed log messages in one place, you can use the Administration Console. The different color-coded messages reflect each message’s error severity (Figure 5).

Figure 5
Administration Console displaying log messages from an application.
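The idea of publishing log messages instead of writing them to a local file can be sketched with Python's standard `logging` module. This is not RTI's Distributed Logger API; `PublishingHandler`, the host name and the list standing in for the centralized view are assumptions made for illustration. In a real system the `publish` callable would write each record to a topic on the data bus.

```python
import logging
from typing import Callable, Dict, List


class PublishingHandler(logging.Handler):
    """Forward each log record to a publish callable rather than a local file."""

    def __init__(self, publish: Callable[[Dict[str, str]], None], host: str) -> None:
        super().__init__()
        self._publish = publish
        self._host = host

    def emit(self, record: logging.LogRecord) -> None:
        self._publish({
            "host": self._host,
            "severity": record.levelname,
            "message": record.getMessage(),
        })


# A plain list stands in for the centralized view of published log messages.
central_view: List[Dict[str, str]] = []

logger = logging.getLogger("sensor_app")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep this example's records out of the root logger
logger.addHandler(PublishingHandler(central_view.append, host="host-a"))

logger.warning("temperature reading delayed")
logger.error("sensor offline")
```

Each entry in `central_view` carries its originating host and severity, which is the information a centralized viewer needs to group and color-code messages.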

In addition to log information, it is also difficult to see system information from each host in one place without proper tools that collect live statistics from these hosts. The Administration Console displays the CPU and memory usage of each remote host. In Figure 6 you can see one host’s CPU is at 14%, and the other one is at 0%. You can also see the amount of free and used memory of each host summarized in one single table. The information is also color coded to help easily spot systems that may need attention.

Figure 6
Administration Console displaying the status from various hosts.

As the name suggests, the Administration Console serves as the centralized administration tool for components running remotely. It can monitor and manage the state of other services. One of these is the routing service, which lets you route data traffic between different networks and different transports, and also lets you apply transformations to the data. The Administration Console provides a dashboard of all applications and running services. Note that all of the services also use the distributed logger, so whenever any service or application that uses the distributed logger logs a message, it appears in this single view. The Administration Console summarizes the number of applications and services that are having problems and marks them with corresponding error or warning icons, so you can easily spot them and drill down into the detailed log messages to find the cause.

To design a distributed system that is reliable, flexible and easy to maintain, the key is to start with the right architecture. Systems inevitably need to evolve, and flexibility in the underlying architecture lets you add tools that help monitor your system and discover problems in it.

Real-Time Innovations
Sunnyvale, CA.
(408) 990-7400.