TECHNOLOGY IN CONTEXT
Managing Network Systems
Mandates for Power Efficiency Push Telecom Providers Toward Software Optimization
New hardware investments offer a range of features for power savings, but those savings can only be fully realized with the proper software techniques. Together, hardware and software can attain up to 32 percent power savings.
CARTER EDMONDS, KONTRON
Power efficiency has emerged as one of the key areas for long-term improvement in telecom applications. Reduced energy usage means lower costs and diminished environmental impact, and the potential savings for carriers are significant when evaluated against the “always on” central office or data center. Hardware is commonly the starting point when evaluating telecom power efficiencies, given current silicon advances that provide capabilities for effectively managing a server’s power consumption. Software is routinely overlooked in the quest for power savings, yet dramatic energy savings can be achieved by focusing attention on the operating system, its configuration and the application itself. Software optimization techniques add significant value to hardware investments, and can contribute up to a 32 percent reduction in power consumption under workloads common to the data center or central office.
Industry-wide focus on energy savings, driven by Verizon’s initiative targeting an aggressive 20 percent annual power reduction on deployed systems, illustrates the urgency carriers are placing on power efficiency policies. Energy cost management is extensive, and includes not only the initial cost to supply energy but also the expense of removing it again as heat. The resulting thermal management requirements can double the cost of the energy usage alone. Moreover, waste heat limits equipment density, consuming valuable space and restricting service capacity, especially in well-established central offices with fixed building outlines.
Verizon’s initial poll of telecom vendors and manufacturers indicated confidence in achieving a 10 to 15 percent reduction in power consumption for new equipment; the resulting initiative was intended to push that envelope by setting a 20 percent goal. The initiative is based on formulas designed to test the power consumption of equipment in various operating conditions, and includes a specific measurement process and series of Telecommunications Equipment Energy Efficiency Ratings.
Opportunities to Find Power Savings
Telecom equipment is typically provisioned for expected peak traffic plus headroom. As a result, portions of the system remain partly idle and it rarely operates at peak load. For telcos, this creates a unique opportunity to increase power savings by effectively matching power consumption to server workload. Applying software techniques to control CPU power usage, for example, creates different levels of usage by defining a performance cycle and a sleep cycle.
P-states represent levels of CPU performance, i.e. particular CPU frequencies: how fast the CPU and its various cores process data, along with the corresponding power requirement. C-states represent sleep states achieved when portions of the processor are directed to remain inactive. Deeper sleep states consume less power but require more time to return to work.
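On Linux, the P-state table is exposed through the cpufreq sysfs interface. The sketch below is a minimal illustration rather than production code; the sysfs path assumes a stock cpufreq driver, and the helper names are our own:

```python
from pathlib import Path

# Path assumes a stock Linux cpufreq driver exposing cpu0's
# frequency table; helper names below are our own.
CPUFREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq")

def available_frequencies(text: str) -> list[int]:
    """Parse the space-separated kHz list in scaling_available_frequencies."""
    return sorted(int(f) for f in text.split())

def read_pstates() -> list[int]:
    """Return cpu0's P-state frequency table in kHz, lowest first."""
    return available_frequencies(
        (CPUFREQ / "scaling_available_frequencies").read_text())
```

Each entry corresponds to one P-state; writing one of these values back through the `userspace` governor pins the core at that frequency.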
Since higher speeds consume more power, system architects would logically assume that reducing processing speeds will save power. However, occasionally P-states and C-states work against each other, requiring deeper knowledge of the application itself. For example, applying C-states may be a particularly prudent option given the high number of cores that can be found in enterprise servers or data center systems. A server may be implemented with eight cores but only require one to complete a particular task.
An installed operating system makes some of these decisions by default; however, system expertise is often required to define the ideal settings for performance and power, which are then typically locked in for long-term operation. Optimization techniques address this conflict, matching the workload to the best hardware management scheme and evaluating P-states and C-states for ideal performance. Self-tuning power/performance policies are anticipated in the future; today, however, system architects must not only evaluate power/performance schemes up front but also understand how the application itself affects the chosen software techniques and options.
Not long ago, servers were largely unaware of power as a resource to be managed. Servers were always on, or at best turned on and off to match usage patterns, and idle servers used as much power as servers under load. Recent hardware generations include power reduction circuitry that cooperates with software enhancements to reduce power consumption both at idle and under load. The savings are built into the hardware; however, they are only realized if the software implements power-saving algorithms.
Unused parts of the chip can be turned off automatically through hardware and software, akin to turning off the lights as you walk through the house. Unlike power management schemes that turn entire servers on and off, these power transitions take milliseconds instead of minutes and the OS remains alive and operational during the process. Note that the power readings presented here are not intended as benchmarks. Rather, they describe techniques for optimizing hardware and workload. Tests used a Kontron CG2100, commercially available Linux distribution, and a modified version of the open-source eBizzy workload generator.
Coupling new hardware with a recent OS is a great step forward for many telco systems. In Linux, for example, more recent kernels have an improved scheduler that makes better use of the hardware’s power and sleep states. While all recent Linux kernels contain some support for sleep states, 2.6.21 introduced the “tickless” kernel. The tickless kernel leverages the High Precision Event Timer (HPET) found on today’s chipsets to schedule events; processors sleep longer, conserve significant power and no longer require a CPU to wake up in order to increment a counter. A demonstration of the effects of using the tickless kernel is shown in Figure 1. This simple advantage is not necessarily common to every OS distributor; each has a different policy for releasing new kernels, and several major distributors in the server space do not yet include the tickless kernel.
Figure 1: A sample workload run on two popular releases of the Linux kernel, 2.6.27 and 2.6.18; 2.6.27 includes the tickless kernel and 2.6.18 does not. With the less sophisticated timing mechanism of the earlier kernel, the idle machine consumed 163W versus 133W with the tickless kernel, an 18 percent savings in power. Even under a significant workload, the more sophisticated timing feature of the current Linux OS saved better than 12 percent.
Servers need a strategy for how fast to process data and how often to sleep, i.e. controlling the P-states and C-states to achieve the largest energy and performance advantage. Policies such as these are implemented in the Linux power governors, and often start by asking some tough questions. Since processors consume more power at higher frequencies and minimal power while sleeping, is it better to finish a task quickly and sleep more or is it better to sleep less but consume less power while awake?
For some workloads, system administrators may determine that it is ideal to have the processor run as fast as possible: although it consumes greater power, it completes its task quickly and returns to a C-state. Other workloads, however, achieve better power results by letting the CPU run as slowly as possible. Even though a particular core is kept awake longer, it consumes less power over the course of the task.
The answer is workload dependent and requires tradeoffs between throughput, latency and power consumption. Three different types of workloads must be considered: CPU-bound, memory-bound and I/O-bound. For example, some workloads are CPU-bound for brief spikes of activity, such as when new packets come in to be processed. In these cases, the processors run at high frequencies to complete their work quickly and then immediately return to sleep until the next spike, maximizing sleep time and minimizing power consumption. The Linux “ondemand” governor implements this particular policy and is the default in most distributions.
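The essence of such a spike-driven policy can be sketched in a few lines. This is not the kernel’s actual ondemand implementation, only an illustration of the idea; the threshold and the kHz frequency table in the example are invented:

```python
def ondemand_next_freq(utilization: float, freqs: list[int],
                       up_threshold: float = 0.80) -> int:
    """Pick the next CPU frequency, ondemand-style (illustrative only).

    freqs is the P-state table in kHz, sorted lowest first. Above the
    threshold, jump straight to the top frequency so the task finishes
    quickly and the core can sleep; otherwise choose the slowest
    frequency that still covers the measured load.
    """
    if utilization >= up_threshold:
        return freqs[-1]
    target = utilization * freqs[-1]   # capacity the load actually needs
    for f in freqs:
        if f >= target:
            return f
    return freqs[-1]
```

With a table of 800 MHz, 1.6 GHz and 2.4 GHz, a 90 percent busy core jumps to 2.4 GHz, while a 10 percent busy core settles at 800 MHz and spends the rest of its time asleep.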
Figure 2 illustrates a memory-bound application and shows the power savings achieved by choosing a lower power state for this particular workload. Changes in processor frequency affected throughput only slightly, but increasing cache size improved it greatly. As a result, the sample workload showed the greatest power savings when run at the lowest frequency, because much of the processor’s time was spent waiting for data to return from the memory controller.
Figure 2: The power savings achieved by choosing a lower power state for this particular workload. The top two curves are copied from Figure 1, which used only the “ondemand” governor. The bottom curve shows the power savings achieved by taking the tickless kernel (2.6.27) and applying the “userspace” governor to place the processors in the lowest power state (i.e. lowest frequency).
Telecom applications driven by I/O present an interesting challenge. If a thread begins with the arrival of a packet, the best strategy depends on what is happening to that packet. A packet compared against an in-memory lookup table might benefit from a lower processor speed, since execution is gated by memory throughput, whereas a mathematical operation on the packet might benefit from a higher processor speed. Further, none of this considers cache locality. In all cases, the answer can only be known by characterizing the workload on a real machine or suitable simulator. Moreover, power efficiency is only one goal and must be weighed against quality of service and its metrics of throughput and latency.
Interrupt handlers present telcos with tradeoff options between power and performance. Dispersing hardware interrupts as widely as possible may maximize throughput, however, at less than peak load this merely wakes processors that could otherwise sleep. Consider a packet forwarding application that receives incoming packets on multiple network interfaces. At peak load, it often makes sense to assign each interrupt handler to a separate core. At less than peak load, it is possible to achieve the requested throughput and latency while consolidating interrupt handlers on a smaller number of cores.
The OS makes no attempt to optimize this sequence for ideal power usage. Achieving power reduction here requires continual re-balancing of the interrupts based on quality of service measurements such as throughput and latency. A software daemon would consolidate or disperse interrupt handlers to achieve the desired balance. For consistency, these tests pushed all interrupts to a single core. A real-world telecom application would need to spread interrupt handlers more widely when quality of service required better throughput or latency.
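On Linux, such a daemon would steer interrupts by writing CPU bitmasks to /proc/irq/&lt;n&gt;/smp_affinity. Below is a minimal sketch of the mask arithmetic; the helper names are our own, and a real daemon would consult throughput and latency counters before rebalancing:

```python
def affinity_mask(cores: list[int]) -> str:
    """Hex CPU bitmask in the format /proc/irq/<n>/smp_affinity expects."""
    mask = 0
    for c in cores:
        mask |= 1 << c
    return format(mask, "x")

def consolidate(irqs: list[int], cores: list[int]) -> dict[int, str]:
    """Round-robin the given IRQs over a (possibly reduced) set of cores.

    A daemon would apply each entry with something like:
        Path(f"/proc/irq/{irq}/smp_affinity").write_text(mask)
    """
    return {irq: affinity_mask([cores[i % len(cores)]])
            for i, irq in enumerate(irqs)}
```

At light load the daemon might call `consolidate(nic_irqs, [0])` to park every handler on core 0 and let the rest sleep; as load rises it passes a wider core list to disperse the handlers again.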
One non-power-aware application in the mix can spoil overall power savings. For example in a packet inspection application, worker threads might be dispatched to perform the actual decoding, analysis and lookup as new packets arrive. Without optimization for power awareness, the application could let the worker threads sit in a polling loop while waiting for new work items to appear in the queue. The processor handling the thread would be fully awake, consuming full power. A power-aware application would allow these threads to block, returning to the scheduler while waiting for the new event. In this instance, the processor would sleep until needed, again saving significant power.
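The difference can be illustrated with a blocking work queue. In this hypothetical worker, `get()` parks the thread in the scheduler while the queue is empty, so the core can sleep; a loop spinning on `get_nowait()` would keep the core in C0 at full power:

```python
import queue
import threading

def blocking_worker(work: queue.Queue, results: list) -> None:
    """Power-aware worker: blocks in get() instead of busy-polling.

    While the queue is empty the thread sleeps inside the scheduler,
    letting its core enter a C-state until an item actually arrives.
    """
    while True:
        item = work.get()          # blocks; no polling loop
        if item is None:           # sentinel: shut down
            break
        results.append(item * 2)   # stand-in for decode/analyze/lookup

work: queue.Queue = queue.Queue()
results: list = []
t = threading.Thread(target=blocking_worker, args=(work, results))
t.start()
for pkt in (1, 2, 3):
    work.put(pkt)
work.put(None)                     # tell the worker to exit
t.join()
```

The worker does exactly the same amount of useful work either way; the only difference is what the core does while waiting.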
Core selection is another power vs. performance tradeoff that can be controlled by the user. Properly managed, idle cores consume minimal power. As threads are assigned to cores, however, performance tradeoffs may arise because certain resources are shared: hyperthreaded core siblings share most of the same CPU resources, and cores within a single CPU share input/output (I/O) and cache. By default, the OS scheduler dispatches threads as widely as possible, although this can be adjusted through CPU affinity. If threads share data, cache locality suggests keeping them as close together as possible, for example on cores in the same package behind the same cache. In contrast, many applications benefit from sharing as little hardware as possible; in a dual-processor server, bringing the second package online also doubles the amount of cache, a real benefit to performance in most cases. Figure 3 shows some of the gains that can be achieved by changing the number of threads and active cores.
Figure 3: Changes in the data points resulting from varying the number of threads and the number of active cores. Data points below the original three curves indicate parameter combinations that outperformed the built-in options, implemented with an optimized power governor.
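Thread placement of this kind can be scripted. The sketch below computes “packed” versus “spread” assignments; the topology parameters and helper names are our own, it assumes cores are numbered package-major (0 to cores_per_package-1 on the first package, and so on), and `os.sched_setaffinity` is Linux-only:

```python
import os

def packed_cores(n_threads: int, cores_per_package: int) -> list[int]:
    """Fill one package first so threads sit behind the same cache
    (good when threads share data)."""
    return list(range(n_threads))

def spread_cores(n_threads: int, cores_per_package: int,
                 packages: int = 2) -> list[int]:
    """Alternate packages so threads share as little hardware as
    possible, bringing the second package's cache online early."""
    return [(i % packages) * cores_per_package + i // packages
            for i in range(n_threads)]

def pin_current_thread(core: int) -> None:
    """Linux-only: restrict the calling thread to a single core."""
    os.sched_setaffinity(0, {core})
```

On a two-package, four-cores-per-package box, four threads pack onto cores 0-3 of one package, or spread as 0, 4, 1, 5 across both.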
Putting It All Together
The most competitive power efficiencies result from a well-written application running on the latest hardware and software. By adding greater levels of software optimization, power savings are advanced even further with a daemon that adaptively adjusts CPU affinity, interrupt handlers and CPU frequencies or power states. By using a workload generator and tuning each system, a dramatic 18 to 32 percent power savings was realized at various workload levels when compared to the original power/performance curve with the out-of-the-box (2.6.18) kernel. A truly adaptive policy would monitor incoming requests and quality of service metrics to determine if additional hardware resources would benefit the workload presented at any given time.
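The control loop of such an adaptive daemon might look like the following skeleton, where `read_qos`, `scale_up` and `scale_down` are placeholders for measuring quality of service and trading power for performance; all names and thresholds are illustrative:

```python
import time

def adaptive_loop(read_qos, scale_up, scale_down,
                  latency_sla_ms, interval_s=1.0, iterations=None):
    """Skeleton of an adaptive power daemon (illustrative only).

    read_qos() returns (throughput, latency_ms); scale_up() and
    scale_down() stand in for waking cores, dispersing interrupt
    handlers or raising CPU frequency, and the reverse.
    iterations=None runs forever.
    """
    n = 0
    while iterations is None or n < iterations:
        _throughput, latency_ms = read_qos()
        if latency_ms > latency_sla_ms:
            scale_up()        # QoS at risk: buy performance with power
        elif latency_ms < 0.5 * latency_sla_ms:
            scale_down()      # ample headroom: give the power back
        time.sleep(interval_s)
        n += 1
```

Hysteresis (scaling down only at half the SLA) keeps the daemon from oscillating between states on every sample.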
Telcos are challenged to meet new energy protocols, and future goals will likely set an even higher bar, intended to continually improve energy savings on a global basis. As a result, system architects must understand the range of hardware and software options for meeting and exceeding today’s energy efficiency standards. Software optimization can deliver significantly greater results than hardware alone, ideally driving telcos to leverage both for the right combination of bandwidth, performance and reliability within the most competitive power threshold. Blending hardware development know-how with extensive software expertise provides the fine tuning that distinguishes a merely efficient system from one optimized for long-term, application-specific power awareness.