TY - GEN
T1 - What is system hang and how to handle it
AU - Zhu, Yian
AU - Li, Yue
AU - Xue, Jingling
AU - Tan, Tian
AU - Shi, Jialong
AU - Shen, Yang
AU - Ma, Chunyan
PY - 2012
Y1 - 2012
AB - Almost every computer user has encountered an unresponsive system failure or system hang, which leaves the user no choice but to power off the computer. In this paper, the causes of such failures are analyzed in detail and one empirical hypothesis for detecting system hang is proposed. This hypothesis exploits a small set of system performance metrics provided by the OS itself, thereby avoiding modifying the OS kernel and introducing additional cost (e.g., hardware modules). Under this hypothesis, we propose SHFH, a self-healing framework to handle system hang, which can be deployed on OS dynamically. One unique feature of SHFH is that its "light-heavy" detection strategy is designed to make intelligent tradeoffs between the performance overhead and the false positive rate induced by system hang detection. Another feature is that its diagnosis-based recovery strategy offers a better granularity to recover from system hang. Our experimental results show that SHFH can cover 95.34% of system hang scenarios, with a false positive rate of 0.58% and 0.6% performance overhead, validating the effectiveness of our empirical hypothesis.
UR - http://www.scopus.com/inward/record.url?scp=84876400446&partnerID=8YFLogxK
U2 - 10.1109/ISSRE.2012.12
DO - 10.1109/ISSRE.2012.12
M3 - Conference contribution
AN - SCOPUS:84876400446
SN - 9780769548883
T3 - Proceedings - International Symposium on Software Reliability Engineering, ISSRE
SP - 141
EP - 150
BT - Proceedings - 2012 IEEE 23rd International Symposium on Software Reliability Engineering, ISSRE 2012
T2 - 2012 IEEE 23rd International Symposium on Software Reliability Engineering, ISSRE 2012
Y2 - 27 November 2012 through 30 November 2012
ER -