Critical event prediction for proactive management in large-scale computer clusters

R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, A. Sivasubramaniam

Research output: Contribution to conference (Paper)

181 Citations (Scopus)

Abstract

As the complexity of distributed computing systems increases, systems management tasks require significantly higher levels of automation; examples include diagnosis and prediction based on real-time streams of computer events, setting alarms, and performing continuous monitoring. The core of autonomic computing, a recently proposed initiative towards next-generation IT-systems capable of 'self-healing', is the ability to analyze data in real-time and to predict potential problems. The goal is to avoid catastrophic failures through prompt execution of remedial actions.

This paper describes an attempt to build a proactive prediction and control system for large clusters. We collected event logs containing various system reliability, availability and serviceability (RAS) events, and system activity reports (SARs) from a 350-node cluster system for a period of one year. The 'raw' system health measurements contain a great deal of redundant event data, which is either repetitive in nature or misaligned with respect to time. We applied a filtering technique and modeled the data into a set of primary and derived variables. These variables used probabilistic networks for establishing event correlations through prediction algorithms. We also evaluated the role of time-series methods, rule-based classification algorithms and Bayesian network models in event prediction.

Based on historical data, our results suggest that it is feasible to predict system performance parameters (SARs) with a high degree of accuracy using time-series models. Rule-based classification techniques can be used to extract machine-event signatures to predict critical events with up to 70% accuracy.
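The abstract mentions filtering repetitive event data but does not say how the filter works. As a rough illustration only, the sketch below collapses bursts of the same event on the same node into a single record. The `Event` fields, the `filter_redundant` name, and the 300-second window are all assumptions made for this example and are not taken from the paper.

```python
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    """Hypothetical event record; the field names are illustrative."""
    timestamp: float  # seconds since an arbitrary epoch
    node: str         # identifier of the reporting node
    kind: str         # event type, e.g. "disk" or "network"


def filter_redundant(events: Iterable[Event], window: float = 300.0) -> List[Event]:
    """Keep an event only if the previous event with the same
    (node, kind) occurred more than `window` seconds earlier.

    The window value is an assumed tuning parameter.
    """
    last_seen = {}  # (node, kind) -> timestamp of most recent occurrence
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.node, ev.kind)
        prev = last_seen.get(key)
        if prev is None or ev.timestamp - prev > window:
            kept.append(ev)
        # Always update, so a continuous burst collapses to its first event.
        last_seen[key] = ev.timestamp
    return kept


if __name__ == "__main__":
    sample = [
        Event(0.0, "n1", "disk"),
        Event(10.0, "n1", "disk"),
        Event(20.0, "n1", "disk"),
        Event(5.0, "n2", "disk"),
        Event(1000.0, "n1", "disk"),
    ]
    for ev in filter_redundant(sample):
        print(ev)
```

Updating `last_seen` on every occurrence means a long, uninterrupted burst is reported once, at its start; a real filter might also need to align timestamps across nodes, which this sketch does not attempt.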

Original language: English (US)
Pages: 426-435
Number of pages: 10
DOI: https://doi.org/10.1145/956750.956799
State: Published - Dec 1 2003
Event: 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03 - Washington, DC, United States
Duration: Aug 24 2003 to Aug 27 2003

Other

Other: 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03
Country: United States
City: Washington, DC
Period: 8/24/03 to 8/27/03

Fingerprint

Time series
Distributed computer systems
Bayesian networks
Computer systems
Automation
Health
Availability
Control systems
Monitoring

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems

Cite this

Sahoo, R. K., Oliner, A. J., Rish, I., Gupta, M., Moreira, J. E., Ma, S., ... Sivasubramaniam, A. (2003). Critical event prediction for proactive management in large-scale computer clusters. 426-435. Paper presented at 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, Washington, DC, United States. https://doi.org/10.1145/956750.956799