Abstract
The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. In this paper, we present our experiences in collecting and filtering error event logs from an 8192-processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list. We analyze the logs collected from this machine over a period of 84 days starting from August 26, 2004. We apply a three-step filtering algorithm to these logs: extracting and categorizing failure events; temporal filtering to remove duplicate reports from the same location; and finally coalescing failure reports of the same error across different locations. Using this approach, we can substantially compress these logs, removing over 99.96% of the 828,387 original entries, and more accurately portray the failure occurrences on this system.
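The three filtering steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Event` fields, the severity labels, the choice of grouping keys, and the 300-second windows are all assumptions made for illustration, since the abstract does not specify the record format or the thresholds used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical log record; field names are assumptions, not the paper's schema."""
    time: float      # seconds since the start of the log
    location: str    # where the event was reported
    category: str    # failure category assigned in step 1
    severity: str    # assumed labels such as "INFO" or "FATAL"
    message: str     # error description


def extract_failures(events, failure_levels=("FATAL", "FAILURE")):
    """Step 1: keep only entries whose severity marks them as failures."""
    return [e for e in events if e.severity in failure_levels]


def _drop_repeats(events, key, window):
    """Keep an event only if no event with the same key occurred within
    `window` seconds before it; a burst of repeats collapses to its first entry."""
    last_seen = {}
    kept = []
    for e in sorted(events, key=lambda ev: ev.time):
        k = key(e)
        prev = last_seen.get(k)
        if prev is None or e.time - prev > window:
            kept.append(e)
        last_seen[k] = e.time  # repeats extend the burst
    return kept


def temporal_filter(events, window):
    """Step 2: remove duplicate reports from the same location."""
    return _drop_repeats(events, lambda e: (e.location, e.category), window)


def spatial_filter(events, window):
    """Step 3: coalesce reports of the same error across different locations."""
    return _drop_repeats(events, lambda e: e.message, window)


def filter_log(events, temporal_window=300.0, spatial_window=300.0):
    """Run the three steps in order; window values are illustrative only."""
    failures = extract_failures(events)
    return spatial_filter(temporal_filter(failures, temporal_window), spatial_window)
```

Applied to a log in which one error is reported twice by one location and once more by a neighbouring location shortly afterwards, the sketch keeps a single entry for that error. For scale, removing over 99.96% of the 828,387 original entries leaves fewer than 332 entries (828,387 × 0.0004 ≈ 331.35).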
Original language | English (US) |
---|---|
Pages | 476-485 |
Number of pages | 10 |
State | Published - Nov 9 2005 |
Event | 2005 International Conference on Dependable Systems and Networks - Yokohama, Japan; Duration: Jun 28 2005 → Jul 1 2005 |
Other
Other | 2005 International Conference on Dependable Systems and Networks |
---|---|
Country | Japan |
City | Yokohama |
Period | 6/28/05 → 7/1/05 |
All Science Journal Classification (ASJC) codes
- Software
- Hardware and Architecture
- Computer Networks and Communications
Cite this
Filtering failure logs for a BlueGene/L prototype. / Liang, Yinglung; Zhang, Yanyong; Sivasubramaniam, Anand; Sahoo, Ramendra K.; Moreira, Jose; Gupta, Manish.
2005. 476-485. Paper presented at 2005 International Conference on Dependable Systems and Networks, Yokohama, Japan.
Research output: Contribution to conference › Paper
TY - CONF
T1 - Filtering failure logs for a BlueGene/L prototype
AU - Liang, Yinglung
AU - Zhang, Yanyong
AU - Sivasubramaniam, Anand
AU - Sahoo, Ramendra K.
AU - Moreira, Jose
AU - Gupta, Manish
PY - 2005/11/9
Y1 - 2005/11/9
UR - http://www.scopus.com/inward/record.url?scp=27544497222&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=27544497222&partnerID=8YFLogxK
M3 - Paper
AN - SCOPUS:27544497222
SP - 476
EP - 485
ER -