Energy Efficient and Resilient Computer Systems

As computing devices become more complex, the components within them continue to shrink. One side effect of this continued scaling is that these devices no longer behave like the precise machines of the past; they are becoming unpredictable, exhibiting varying degrees of hardware "variability". Hardware designers cope by adding guard bands to their designs, which hides the variability from software but often costs performance and energy efficiency. As part of the NSF Variability Expeditions, we have been investigating techniques to make the software stack not only more resilient but also adaptive, so that it can leverage these underlying differences between devices. We began with a detailed characterization of power variability across multiple classes of processors and of its impact on power modeling. We are now investigating the right abstractions for exposing this variability to systems software, and methods by which it can be leveraged to build more reliable and/or energy-efficient systems. Since one manifestation of hardware variability is reduced reliability, we are also looking at making programs more resilient by decomposing them into sections that can tolerate errors (run under relaxed hardware guarantees) and sections that cannot tolerate any errors (run under strict hardware guarantees). While a fluid hardware-software interface can not only mitigate but also leverage variability for increased robustness and energy efficiency, many research challenges remain!