netCSI: A generic fault diagnosis algorithm for large-scale failures in computer networks

Srikar Tati, Scott Rager, Bong Jun Ko, Guohong Cao, Ananthram Swami, Thomas La Porta

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

5 Scopus citations

Abstract

In this paper, we present a framework and a set of algorithms for determining faults in networks when large-scale outages occur. The design principles of our algorithm, netCSI, are motivated by the fact that failures are geographically clustered in such cases. We address the challenge of determining faults with incomplete symptom information due to a limited number of reporting nodes in the network. netCSI consists of two parts: a hypothesis generation algorithm and a ranking algorithm. When constructing the hypothesis list of potential causes, we make novel use of the positive and negative symptoms to improve the precision of the results. The ranking algorithm is based on conditional failure probability models that account for the geographic correlation of the network objects in clustered failures. We evaluate the performance of netCSI for networks with both random and realistic topologies. We compare the performance of netCSI with an existing fault diagnosis algorithm, MAX-COVERAGE, and achieve an average gain of 128% in accuracy for realistic topologies.
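
The two-stage structure described in the abstract (generate hypotheses from symptoms, then rank them with geography in mind) can be illustrated with a small sketch. This is not the authors' implementation. It assumes that symptoms arrive as sets of links on failed and working paths, that a working path exonerates every link on it, and that each link has a 2-D coordinate. The paper's conditional failure probability models are replaced here by a much simpler stand-in: hypotheses whose links lie close together are ranked first. All function and variable names are hypothetical.

```python
from itertools import combinations
from math import dist


def candidate_links(failed_paths, working_paths):
    """Links on some failed path that no working path exonerates (assumed rule)."""
    suspects = set().union(*failed_paths)
    healthy = set().union(*working_paths)
    return suspects - healthy


def generate_hypotheses(failed_paths, working_paths, max_size=2):
    """Enumerate small candidate sets that intersect every failed path."""
    candidates = sorted(candidate_links(failed_paths, working_paths))
    hypotheses = []
    for size in range(1, max_size + 1):
        for combo in combinations(candidates, size):
            chosen = set(combo)
            if all(chosen & path for path in failed_paths):
                hypotheses.append(frozenset(chosen))
    return hypotheses


def spread(hypothesis, location):
    """Largest pairwise distance between link locations (0.0 for one link)."""
    points = [location[link] for link in hypothesis]
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)


def rank_hypotheses(hypotheses, location):
    """Stand-in ranking: geographically tight hypotheses first, then smaller ones."""
    return sorted(hypotheses, key=lambda h: (spread(h, location), len(h), sorted(h)))


# Illustrative data only: two failed paths, one working path, made-up coordinates.
failed = [{"a", "b"}, {"c", "d"}]
working = [{"b", "e"}]
location = {"a": (0, 0), "c": (1, 0), "d": (10, 0)}

ranked = rank_hypotheses(generate_hypotheses(failed, working), location)
for hypothesis in ranked:
    print(sorted(hypothesis), spread(hypothesis, location))
```

In this toy case the working path clears link b, no single link explains both failed paths, and the pair of nearby links is ranked ahead of the pair of distant ones, which mirrors the clustered-failure intuition in the abstract without reproducing its probability models.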

Original language: English (US)
Title of host publication: Proceedings - 2011 30th IEEE International Symposium on Reliable Distributed Systems, SRDS 2011
Pages: 167-176
Number of pages: 10
State: Published - 2011
Event: 2011 30th IEEE International Symposium on Reliable Distributed Systems, SRDS 2011 - Madrid, Spain
Duration: Oct 4 2011 - Oct 7 2011

Publication series

Name: Proceedings of the IEEE Symposium on Reliable Distributed Systems
ISSN (Print): 1060-9857

Other

Other: 2011 30th IEEE International Symposium on Reliable Distributed Systems, SRDS 2011
Country: Spain
City: Madrid
Period: 10/4/11 - 10/7/11

All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computer Networks and Communications
