Embedded Data Integrity
Building Fault Tolerance into Embedded Data Management
Maintaining data integrity in embedded applications while also ensuring 24/7 operation is a complex challenge, especially when the constraints of real-time performance are added.
DUNCAN BATES, BIRDSTEP TECHNOLOGY
Small, resource-constrained applications have grown so complex that the requirements of 10 years ago barely compare with what we see today. Nonetheless, there has been a long-standing trend of implementing homegrown data management solutions for both volatile and persistent storage media. Building fault tolerance into these devices was not much of a requirement back then, but end customers increasingly demand it now. Today we expect embedded applications to recover from any power failure and to operate 24/7 without system downtime.
No matter the fault tolerance requirement, building transactional or replication capabilities into a data management solution is a complex task and should only be attempted if you have time to spare, money to waste, and can live with an inferior application. The fact of the matter is that data management gets complicated. You may want to manage parts of your data on disk and other parts in memory, add an efficient data indexing subsystem, manage concurrent access to the data across multiple threads and applications, and implement an elaborate data caching system to avoid I/O overhead and increase performance, to mention a few. Add the fault tolerance scenarios that embedded applications face on top of these other data requirements, and this amounts to a sophisticated piece of software.
The first piece of the fault tolerance puzzle is the ability to recover from any application or power failure. The first requirement is usually to make sure that the data is not corrupted and that any data loss is bounded. Data management solutions implement this by supporting the Atomic, Consistent, Isolated and Durable (ACID) transaction model. As a familiar example of the “Atomic” concept, consider what happens when you enter the bank and instruct the teller to move money from your checking account to savings. The transfer breaks down into two operations, a deduction from one account and an addition to another, and it is important to both parties that either both happen or neither does. This is what is meant by Atomic, and it is normally implemented through a data journaling system.
Figure 1 shows the current state of the database. Some of the information is in the consistent database image and the rest is in the transactional journal. Note also that the atomic operations are wrapped in Begin and Commit marks. The “A” in ACID is guaranteed once the Commit mark has been flushed to persistent storage.
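The recovery rule behind those Begin and Commit marks can be sketched in a few lines of C. The names and record layout below are hypothetical (a real engine journals to persistent storage, not an array), but the logic is the same: on recovery, a transaction is applied only if its Commit mark made it into the journal, so a crash mid-transaction leaves the database exactly as it was.

```c
#include <assert.h>

/* Hypothetical in-memory journal: each record is an opcode plus payload.
 * Only transactions bracketed by BEGIN..COMMIT are replayed on recovery. */
enum op { OP_BEGIN, OP_WRITE, OP_COMMIT };

struct record { enum op op; int account; long delta; };

/* Replay the journal into the account array, discarding any trailing
 * transaction that never reached its COMMIT mark (a crash mid-write). */
static void recover(const struct record *j, int n, long *accounts)
{
    for (int i = 0; i < n; i++) {
        if (j[i].op != OP_BEGIN)
            continue;
        /* Scan forward for the matching COMMIT before applying anything. */
        int end = -1;
        for (int k = i + 1; k < n; k++)
            if (j[k].op == OP_COMMIT) { end = k; break; }
        if (end < 0)
            return;                     /* uncommitted tail: discard it */
        for (int k = i + 1; k < end; k++)
            accounts[j[k].account] += j[k].delta;
        i = end;
    }
}
```

With a journal holding one committed transfer followed by a half-written one, `recover` applies the first and silently drops the second, which is exactly the all-or-nothing behavior the bank example demands.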
The property of consistency is ensured by the data engine aborting transactions that violate any defined rules. Say you define a rule that the checking account cannot drop below $0. If the money transfer above violates this rule, the system automatically invalidates the transaction and reverts to the previous state of the database through the journal.
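A minimal sketch of such a rule check, with a hypothetical `try_commit` helper: the transaction is applied to a scratch copy, the no-negative-balance rule is evaluated there, and the committed balances are overwritten only if every operation passes. On a violation the caller sees an abort and the committed state is untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_ACCOUNTS 8

/* Apply a transaction (parallel arrays of account indices and deltas) to a
 * scratch copy, enforce the rule that no balance drops below zero, and
 * publish the result only on success. Returns false on abort. */
static bool try_commit(long *balances, int n,
                       const int *accounts, const long *deltas, int nops)
{
    long scratch[MAX_ACCOUNTS];
    memcpy(scratch, balances, (size_t)n * sizeof *balances);

    for (int i = 0; i < nops; i++) {
        scratch[accounts[i]] += deltas[i];
        if (scratch[accounts[i]] < 0)
            return false;       /* rule violated: abort, balances untouched */
    }
    memcpy(balances, scratch, (size_t)n * sizeof *balances);
    return true;
}
```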
Now add your spouse to the equation, tapping into an ATM to view the balance of your accounts just as you instruct the teller to move the money. We need to make sure that only the state of the two accounts before your request, or the state after it, is ever visible, never a half-completed transfer. This is called transaction isolation, and it is likewise achieved by offering only the state of the database derived from the image and the committed transactions.
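One simple way to get this all-or-nothing visibility, sketched below with hypothetical names, is to keep two versions of the affected state: readers always follow a committed index, and the writer prepares the new balances off to the side, publishing both halves of the transfer at once by flipping that index. In a real multithreaded engine the flip would have to be an atomic store and readers would pin the index for the duration of the read; this single-threaded sketch only shows the versioning idea.

```c
#include <assert.h>

/* Two-version scheme: readers see the last committed snapshot; the writer
 * builds the next snapshot and makes it visible with one index flip. */
struct versioned {
    long balance[2][2];   /* [version][account] */
    int committed;        /* index readers use */
};

static long read_balance(const struct versioned *v, int account)
{
    return v->balance[v->committed][account];
}

static void commit_transfer(struct versioned *v, int from, int to, long amount)
{
    int next = 1 - v->committed;
    v->balance[next][from] = v->balance[v->committed][from] - amount;
    v->balance[next][to]   = v->balance[v->committed][to] + amount;
    v->committed = next;  /* the flip publishes both updates at once */
}
```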
Lastly, the durability property means that once the money transfer has been accepted, it must survive a power loss or an application crash at an inconvenient time. There are different ways to accomplish this, but simply put, we must at all times retain enough information either to replay committed transactions or to reinstate the old state of the database image. This process relies on quite a bit of disk I/O and disk cache flushing to guarantee crash recovery.
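The cache-flushing half of this can be sketched in C, assuming a POSIX environment. The hypothetical `durable_append` below refuses to report success until the journal record has been pushed through both the C library buffer (`fflush`) and the operating system's page cache (`fsync`); only then is it safe to acknowledge the commit to the application.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Append a journal record and force it to stable storage before
 * acknowledging. Returns false if any stage of the write fails. */
static bool durable_append(FILE *journal, const void *rec, size_t len)
{
    if (fwrite(rec, 1, len, journal) != len)
        return false;
    if (fflush(journal) != 0)           /* drain the stdio buffer */
        return false;
    if (fsync(fileno(journal)) != 0)    /* force the OS cache to media */
        return false;
    return true;                        /* now safe to acknowledge */
}
```

On some platforms even `fsync` is not the last word (disk write caches may need a separate barrier), which is one more reason this logic is tricky to get right in-house.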
Implementing the four properties is complex in itself. You will additionally want the flexibility to relax some of them for some operations in order to trade safety for speed, adding even more complexity to the transaction system that must be in place to build fault tolerance into your application.
No Errors from Downtime
The second piece of the puzzle is 24/7 operation. An identical secondary copy of the system, including the data, needs to be available so that operation can be handed over to a failover card in case of unplanned or planned downtime of the primary system. From a data management standpoint, a few tricky functions need to be available. First, we need the ability to move transactions over to the secondary system in real time in case of downtime. Second, when a planned upgrade is made, the system needs to handle migrating transactions between two different versions of the database image. The latter is referred to as a hitless upgrade and is important for ensuring 24/7 operation even when the system is not encountering a fault but rather a controlled system upgrade.
Figure 2 illustrates real-time replication of an ACID transaction. Real-time transactional replication can be implemented in two ways, synchronously or asynchronously. If the application needs to guarantee that both database copies are in sync, you will need a synchronous replication solution. The effect is that the time to commit a transaction becomes the time to apply it on the master, plus the time to move it over to the slave, plus the time to update the slave. In a synchronous replication environment the application does not regain control until both systems have been updated, which in most cases is not acceptable given other performance requirements. An asynchronous model, which is the most common, queues the transaction at the master when it is committed and then returns control to the application. The queued transactions are later moved over to the slave, followed by a slave update. The drawback is obvious: in an asynchronous model the slave may lag by a number of transactions, and in case of failure the handover lands on a system that is missing those transactions.
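The asynchronous model can be sketched as a simple queue between master and slave. Everything here is illustrative (a single replicated counter stands in for the database, and overflow handling is omitted); the point is that the hypothetical `commit_async` returns before the slave is touched, so the replication lag at any moment is exactly what remains queued, and that is what a failover would lose.

```c
#include <assert.h>

#define QCAP 16

/* Master applies a transaction locally, queues it, and returns; a
 * background step later drains one queued transaction into the slave. */
struct replicator {
    long master, slave;        /* the replicated state, one counter each */
    long queue[QCAP];          /* deltas awaiting shipment to the slave */
    int head, tail;
};

static void commit_async(struct replicator *r, long delta)
{
    r->master += delta;                  /* apply locally */
    r->queue[r->tail++ % QCAP] = delta;  /* queue for the slave */
    /* control returns here, before the slave is updated */
}

static void drain_one(struct replicator *r)
{
    if (r->head != r->tail)
        r->slave += r->queue[r->head++ % QCAP];
}

/* Number of committed transactions the slave has not yet seen. */
static int lag(const struct replicator *r) { return r->tail - r->head; }
```

A synchronous variant would simply perform the drain inside the commit path before returning, which is precisely where the extra latency described above comes from.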
This rather high-level discussion should trigger thinking about all the details of fault tolerance and what they mean for your data management. It rapidly becomes apparent that this is a complex task, especially combined with the other data management capabilities the application requires. The solution, of course, is to look at third-party data management solutions. Solutions to these problems have been around for the last three decades, along with a few recent startups offering data management systems and libraries that handle the problems outlined here. Basing your application on an established third-party product will increase overall quality at a fraction of the cost of in-house development, and it will get you to market much sooner than implementing the solution yourself. We highly recommend consulting with embedded database specialists when faced with the challenges described in this article.