Rain or Shine? - Making Sense of Cloudy Reliability Data

Iyswarya Narayanan, Bikash Sharma, Di Wang, Sriram Govindan, Laura Caulfield, Anand Sivasubramaniam, Aman Kansal, Jie Liu, Badriddine Khessib, Kushagra Vaid

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Cloud datacenters must ensure high availability for the hosted applications and failures can be the bane of datacenter operators. Understanding the what, when and why of failures can help tremendously to mitigate their occurrence and impact. Failures can, however, depend on numerous spatial and temporal factors spanning hardware, workloads, support facilities, and even the environment. One has to rely on failure data from the field to quantify the influence of these factors on failures. Towards this goal, we collect failures data along with many parameters that might influence failures from two large production datacenters with very diverse characteristics. We show that multiple factors simultaneously affect failures, and these factors may interact in non-trivial ways. This makes conventional approaches that study aggregate characteristics or single parameter influences, rather inaccurate. Instead, we build a multi-factor analysis framework to systematically identify influencing factors, quantify their relative impact, and help in more accurate decision making for failure mitigation. We demonstrate this approach for three important decisions: spare capacity provisioning, comparing the reliability of hardware for vendor selection, and quantifying flexibility in datacenter climate control for cost-reliability trade-offs.

Original language: English (US)
Title of host publication: Proceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017
Editors: Kisung Lee, Ling Liu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 218-229
Number of pages: 12
ISBN (Electronic): 9781538617915
DOIs: https://doi.org/10.1109/ICDCS.2017.103
State: Published - Jul 13 2017
Event: 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017 - Atlanta, United States
Duration: Jun 5 2017 to Jun 8 2017

Publication series

Name: Proceedings - International Conference on Distributed Computing Systems

Other

Other: 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017
Country: United States
City: Atlanta
Period: 6/5/17 to 6/8/17

Fingerprint

Rain
Climate control
Hardware
Factor analysis
Decision making
Availability
Costs

All Science Journal Classification (ASJC) codes

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Narayanan, I., Sharma, B., Wang, D., Govindan, S., Caulfield, L., Sivasubramaniam, A., ... Vaid, K. (2017). Rain or Shine? - Making Sense of Cloudy Reliability Data. In K. Lee, & L. Liu (Eds.), Proceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017 (pp. 218-229). [7979969] (Proceedings - International Conference on Distributed Computing Systems). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICDCS.2017.103
Narayanan, Iyswarya ; Sharma, Bikash ; Wang, Di ; Govindan, Sriram ; Caulfield, Laura ; Sivasubramaniam, Anand ; Kansal, Aman ; Liu, Jie ; Khessib, Badriddine ; Vaid, Kushagra. / Rain or Shine? - Making Sense of Cloudy Reliability Data. Proceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017. editor / Kisung Lee ; Ling Liu. Institute of Electrical and Electronics Engineers Inc., 2017. pp. 218-229 (Proceedings - International Conference on Distributed Computing Systems).
@inproceedings{ad45bd5577d64742a0c0912550fafe21,
title = "Rain or Shine? - Making Sense of Cloudy Reliability Data",
abstract = "Cloud datacenters must ensure high availability for the hosted applications and failures can be the bane of datacenter operators. Understanding the what, when and why of failures can help tremendously to mitigate their occurrence and impact. Failures can, however, depend on numerous spatial and temporal factors spanning hardware, workloads, support facilities, and even the environment. One has to rely on failure data from the field to quantify the influence of these factors on failures. Towards this goal, we collect failures data along with many parameters that might influence failures from two large production datacenters with very diverse characteristics. We show that multiple factors simultaneously affect failures, and these factors may interact in non-trivial ways. This makes conventional approaches that study aggregate characteristics or single parameter influences, rather inaccurate. Instead, we build a multi-factor analysis framework to systematically identify influencing factors, quantify their relative impact, and help in more accurate decision making for failure mitigation. We demonstrate this approach for three important decisions: spare capacity provisioning, comparing the reliability of hardware for vendor selection, and quantifying flexibility in datacenter climate control for cost-reliability trade-offs.",
author = "Iyswarya Narayanan and Bikash Sharma and Di Wang and Sriram Govindan and Laura Caulfield and Anand Sivasubramaniam and Aman Kansal and Jie Liu and Badriddine Khessib and Kushagra Vaid",
year = "2017",
month = "7",
day = "13",
doi = "10.1109/ICDCS.2017.103",
language = "English (US)",
series = "Proceedings - International Conference on Distributed Computing Systems",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "218--229",
editor = "Kisung Lee and Ling Liu",
booktitle = "Proceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017",
address = "United States",
}

Narayanan, I, Sharma, B, Wang, D, Govindan, S, Caulfield, L, Sivasubramaniam, A, Kansal, A, Liu, J, Khessib, B & Vaid, K 2017, Rain or Shine? - Making Sense of Cloudy Reliability Data. in K Lee & L Liu (eds), Proceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017, 7979969, Proceedings - International Conference on Distributed Computing Systems, Institute of Electrical and Electronics Engineers Inc., pp. 218-229, 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017, Atlanta, United States, 6/5/17. https://doi.org/10.1109/ICDCS.2017.103

Rain or Shine? - Making Sense of Cloudy Reliability Data. / Narayanan, Iyswarya; Sharma, Bikash; Wang, Di; Govindan, Sriram; Caulfield, Laura; Sivasubramaniam, Anand; Kansal, Aman; Liu, Jie; Khessib, Badriddine; Vaid, Kushagra.

Proceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017. ed. / Kisung Lee; Ling Liu. Institute of Electrical and Electronics Engineers Inc., 2017. p. 218-229 7979969 (Proceedings - International Conference on Distributed Computing Systems).

TY - GEN

T1 - Rain or Shine? - Making Sense of Cloudy Reliability Data

AU - Narayanan, Iyswarya

AU - Sharma, Bikash

AU - Wang, Di

AU - Govindan, Sriram

AU - Caulfield, Laura

AU - Sivasubramaniam, Anand

AU - Kansal, Aman

AU - Liu, Jie

AU - Khessib, Badriddine

AU - Vaid, Kushagra

PY - 2017/7/13

Y1 - 2017/7/13

N2 - Cloud datacenters must ensure high availability for the hosted applications and failures can be the bane of datacenter operators. Understanding the what, when and why of failures can help tremendously to mitigate their occurrence and impact. Failures can, however, depend on numerous spatial and temporal factors spanning hardware, workloads, support facilities, and even the environment. One has to rely on failure data from the field to quantify the influence of these factors on failures. Towards this goal, we collect failures data along with many parameters that might influence failures from two large production datacenters with very diverse characteristics. We show that multiple factors simultaneously affect failures, and these factors may interact in non-trivial ways. This makes conventional approaches that study aggregate characteristics or single parameter influences, rather inaccurate. Instead, we build a multi-factor analysis framework to systematically identify influencing factors, quantify their relative impact, and help in more accurate decision making for failure mitigation. We demonstrate this approach for three important decisions: spare capacity provisioning, comparing the reliability of hardware for vendor selection, and quantifying flexibility in datacenter climate control for cost-reliability trade-offs.

UR - http://www.scopus.com/inward/record.url?scp=85027264757&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85027264757&partnerID=8YFLogxK

U2 - 10.1109/ICDCS.2017.103

DO - 10.1109/ICDCS.2017.103

M3 - Conference contribution

T3 - Proceedings - International Conference on Distributed Computing Systems

SP - 218

EP - 229

BT - Proceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017

A2 - Lee, Kisung

A2 - Liu, Ling

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Narayanan I, Sharma B, Wang D, Govindan S, Caulfield L, Sivasubramaniam A et al. Rain or Shine? - Making Sense of Cloudy Reliability Data. In Lee K, Liu L, editors, Proceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017. Institute of Electrical and Electronics Engineers Inc. 2017. p. 218-229. 7979969. (Proceedings - International Conference on Distributed Computing Systems). https://doi.org/10.1109/ICDCS.2017.103