Metastable failures in distributed systems

Nathan Bronson, Abutalib Aghayev, Aleksey Charapko, Timothy Zhu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We describe metastable failures, a failure pattern in distributed systems. Currently, metastable failures manifest themselves as black swan events: they are outliers because nothing in the past points to their possibility, they have a severe impact, and they are much easier to explain in hindsight than to predict. Although instances of metastable failures can look different on the surface, deeper analysis shows that they can be understood within the same framework. We introduce a framework for thinking about metastable failures, apply it to examples observed during years of operating distributed systems at scale, and survey ad hoc techniques developed post factum for making systems resilient to known metastable failures. A systematic approach for building systems that are robust against unknown metastable failures remains an open problem.

Original language: English (US)
Title of host publication: HotOS 2021 - Proceedings of the 2021 Workshop on Hot Topics in Operating Systems
Publisher: Association for Computing Machinery, Inc
Pages: 221-227
Number of pages: 7
ISBN (Electronic): 9781450384384
State: Published - Jun 1 2021
Event: 18th Workshop on Hot Topics in Operating Systems, HotOS 2021 - Virtual, Online, United States
Duration: Jun 1 2021 to Jun 3 2021

Publication series

Name: HotOS 2021 - Proceedings of the 2021 Workshop on Hot Topics in Operating Systems

Conference

Conference: 18th Workshop on Hot Topics in Operating Systems, HotOS 2021
Country/Territory: United States
City: Virtual, Online
Period: 6/1/21 to 6/3/21

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Computer Networks and Communications
  • Hardware and Architecture
