Filtering failure logs for a BlueGene/L prototype

Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Ramendra K. Sahoo, Jose Moreira, Manish Gupta

Research output: Contribution to conference, Paper

93 Citations (Scopus)

Abstract

The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. In this paper, we present our experiences in collecting and filtering error event logs from an 8192-processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list. We analyze the logs collected from this machine over a period of 84 days starting from August 26, 2004. We apply a three-step filtering algorithm to these logs: extracting and categorizing failure events; temporal filtering to remove duplicate reports from the same location; and finally coalescing failure reports of the same error across different locations. Using this approach, we can substantially compress these logs, removing over 99.96% of the 828,387 original entries, and more accurately portray the failure occurrences on this system.
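The three steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Event` fields, the severity labels, the use of the message text as the error category, and the 300-second windows are all assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since the start of the log (assumed unit)
    location: str     # hypothetical location identifier, e.g. a node name
    severity: str     # hypothetical severity label
    message: str      # error description, used here as the error category


# Assumed set of severities that count as failures.
FAILURE_SEVERITIES = {"FATAL", "FAILURE"}


def extract_failures(events):
    """Step 1: keep only failure events, ordered by time."""
    failures = [e for e in events if e.severity in FAILURE_SEVERITIES]
    return sorted(failures, key=lambda e: e.timestamp)


def temporal_filter(events, window):
    """Step 2: drop repeats of the same error from the same location
    that arrive within `window` seconds of the previous report."""
    last_seen = {}
    kept = []
    for e in events:
        key = (e.location, e.message)
        prev = last_seen.get(key)
        if prev is None or e.timestamp - prev > window:
            kept.append(e)
        last_seen[key] = e.timestamp
    return kept


def spatial_filter(events, window):
    """Step 3: coalesce reports of the same error from different
    locations that arrive within `window` seconds of each other."""
    last_seen = {}
    kept = []
    for e in events:
        prev = last_seen.get(e.message)
        if prev is None or e.timestamp - prev > window:
            kept.append(e)
        last_seen[e.message] = e.timestamp
    return kept


def filter_log(events, temporal_window=300.0, spatial_window=300.0):
    """Run the three steps in sequence on an iterable of events."""
    failures = extract_failures(events)
    deduplicated = temporal_filter(failures, temporal_window)
    return spatial_filter(deduplicated, spatial_window)
```

In this sketch each window is measured from the most recent report of the same key, so a long burst of repeated reports collapses to a single entry. Whether the paper uses this rule or a fixed window from the first report is not stated in the abstract.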

Original language: English (US)
Pages: 476-485
Number of pages: 10
State: Published - Nov 9 2005
Event: 2005 International Conference on Dependable Systems and Networks - Yokohama, Japan
Duration: Jun 28 2005 to Jul 1 2005

Other

Other: 2005 International Conference on Dependable Systems and Networks
Country: Japan
City: Yokohama
Period: 6/28/05 to 7/1/05

All Science Journal Classification (ASJC) codes

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Liang, Y., Zhang, Y., Sivasubramaniam, A., Sahoo, R. K., Moreira, J., & Gupta, M. (2005). Filtering failure logs for a BlueGene/L prototype. 476-485. Paper presented at 2005 International Conference on Dependable Systems and Networks, Yokohama, Japan.