Fault-aware job scheduling for BlueGene/L systems

A. J. Oliner, R. K. Sahoo, J. E. Moreira, M. Gupta, Anand Sivasubramaniam

Research output: Chapter in Book/Report/Conference proceedingConference contribution

52 Scopus citations

Abstract

Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. In this paper evaluate the effectiveness of a previously developed job scheduling algorithm for BlueGene/L in the presence of faults. We have developed two new job-scheduling algorithms considering failures while scheduling the jobs. We have also evaluated the impact of these algorithms on average bounded slowdown, overage response time and system utilization, considering different levels of proactive failure prediction and prevention techniques reported in the literature. Our simulation studies show that the use of these new algorithms with even trivial fault prediction confidence or accuracy levels (as low as 10%) can significantly improve the performance of the BlueGene/L system.

Original languageEnglish (US)
Title of host publicationProceedings - 18th International Parallel and Distributed Processing Symposium, IPDPS 2004 (Abstracts and CD-ROM)
Pages899-908
Number of pages10
StatePublished - Dec 1 2004
EventProceedings - 18th International Parallel and Distributed Processing Symposium, IPDPS 2004 (Abstracts and CD-ROM) - Santa Fe, NM, United States
Duration: Apr 26 2004Apr 30 2004

Publication series

NameProceedings - International Parallel and Distributed Processing Symposium, IPDPS 2004 (Abstracts and CD-ROM)
Volume18

Other

OtherProceedings - 18th International Parallel and Distributed Processing Symposium, IPDPS 2004 (Abstracts and CD-ROM)
CountryUnited States
CitySanta Fe, NM
Period4/26/044/30/04

All Science Journal Classification (ASJC) codes

  • Engineering(all)

Fingerprint Dive into the research topics of 'Fault-aware job scheduling for BlueGene/L systems'. Together they form a unique fingerprint.

  • Cite this

    Oliner, A. J., Sahoo, R. K., Moreira, J. E., Gupta, M., & Sivasubramaniam, A. (2004). Fault-aware job scheduling for BlueGene/L systems. In Proceedings - 18th International Parallel and Distributed Processing Symposium, IPDPS 2004 (Abstracts and CD-ROM) (pp. 899-908). (Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2004 (Abstracts and CD-ROM); Vol. 18).