netCSI: A generic fault diagnosis algorithm for large-scale failures in computer networks

Srikar Tati, Scott Rager, Bong Jun Ko, Guohong Cao, Ananthram Swami, Thomas F. La Porta

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

In this paper we present a framework and a set of algorithms for determining faults in networks when large-scale outages occur. The design principles of our algorithm, netCSI, are motivated by the fact that failures are geographically clustered in such cases. We address the challenge of determining faults with incomplete symptom information due to a limited number of reporting nodes in the network. netCSI consists of two parts: a hypothesis generation algorithm and a ranking algorithm. When constructing the hypothesis list of potential causes, we make novel use of positive and negative symptoms to improve the precision of the results. The ranking algorithm is based on conditional failure probability models that account for the geographic correlation of the network objects in clustered failures. We evaluate the performance of netCSI on networks with both random and realistic topologies. We compare netCSI with an existing fault diagnosis algorithm, MAX-COVERAGE, and achieve an average gain of 128% in accuracy for realistic topologies.
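
The two-part structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a negative symptom is a monitored path observed as failed and a positive symptom is a path observed as working, that each path is given as a set of network objects, and that every object has a known 2-D position. The function names, the `max_size` bound, and the distance-based `spread` score are hypothetical; in particular, `spread` is only a crude stand-in for the paper's conditional failure probability models, which the abstract does not specify.

```python
from itertools import combinations
from math import dist


def generate_hypotheses(failed_paths, working_paths, max_size=3):
    """Return minimal object sets that explain every failed path.

    Assumption: every object on a working path is up, so positive
    symptoms exonerate objects and shrink the candidate set.
    """
    known_up = set().union(*working_paths)
    candidates = sorted(set().union(*failed_paths) - known_up)
    hypotheses = []
    for size in range(1, max_size + 1):
        for combo in combinations(candidates, size):
            h = frozenset(combo)
            # Keep only minimal hypotheses: skip supersets of earlier ones.
            if any(prev <= h for prev in hypotheses):
                continue
            # A hypothesis must contain at least one object of each failed path.
            if all(h & path for path in failed_paths):
                hypotheses.append(h)
    return hypotheses


def spread(hypothesis, positions):
    """Largest pairwise distance between the objects in a hypothesis."""
    points = [positions[obj] for obj in hypothesis]
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)


def rank_hypotheses(hypotheses, positions):
    """Order hypotheses so geographically tighter (then smaller) sets come first."""
    return sorted(hypotheses, key=lambda h: (spread(h, positions), len(h)))


# Hypothetical example data: five links, two failed paths, one working path.
failed_paths = [{"l1", "l2"}, {"l3", "l4"}]
working_paths = [{"l2", "l5"}]
positions = {"l1": (0, 0), "l2": (0, 1), "l3": (1, 0), "l4": (8, 0), "l5": (5, 5)}

ranked = rank_hypotheses(generate_hypotheses(failed_paths, working_paths), positions)
for h in ranked:
    print(sorted(h), spread(h, positions))
```

In this toy input the working path exonerates `l2`, so `l1` must be part of any explanation of the first failed path. The sketch then prefers the pair `l1`, `l3` over `l1`, `l4` because the former is geographically tighter, mirroring the clustering intuition in the abstract.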

Original language: English (US)
Title of host publication: Proceedings - 2011 30th IEEE International Symposium on Reliable Distributed Systems, SRDS 2011
Pages: 167-176
Number of pages: 10
DOI: https://doi.org/10.1109/SRDS.2011.28
State: Published - Dec 14 2011
Event: 2011 30th IEEE International Symposium on Reliable Distributed Systems, SRDS 2011 - Madrid, Spain
Duration: Oct 4 2011 to Oct 7 2011

Other

Other: 2011 30th IEEE International Symposium on Reliable Distributed Systems, SRDS 2011
Country: Spain
City: Madrid
Period: 10/4/11 to 10/7/11

Fingerprint

Computer Networks
Computer networks
Fault Diagnosis
Failure analysis
Ranking
Fault
Topology
Failure Probability
Probability Model
Incomplete Information
Conditional probability
Outages
Evaluate
Vertex of a graph

All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Tati, S., Rager, S., Ko, B. J., Cao, G., Swami, A., & La Porta, T. F. (2011). netCSI: A generic fault diagnosis algorithm for large-scale failures in computer networks. In Proceedings - 2011 30th IEEE International Symposium on Reliable Distributed Systems, SRDS 2011 (pp. 167-176). [6076774] https://doi.org/10.1109/SRDS.2011.28
@inproceedings{e096e23e3a6f4bb18b38905626759005,
title = "netCSI: A generic fault diagnosis algorithm for large-scale failures in computer networks",
abstract = "In this paper we present a framework and a set of algorithms for determining faults in networks when large scale outages occur. The design principles of our algorithm, netCSI, are motivated by the fact that failures are geographically clustered in such cases. We address the challenge of determining faults with incomplete symptom information due to a limited number of reporting nodes in the network. netCSI consists of two parts: hypotheses generation algorithm, and ranking algorithm. When constructing the hypotheses list of potential causes, we make novel use of the positive and negative symptoms to improve the precision of the results. The ranking algorithm is based on conditional failure probability models that account for the geographic correlation of the network objects in clustered failures. We evaluate the performance of netCSI for networks with both random and realistic topologies. We compare the performance of netCSI with an existing fault diagnosis algorithm, MAX-COVERAGE, and achieve an average gain of 128{\%} in accuracy for realistic topologies.",
author = "Srikar Tati and Scott Rager and Ko, {Bong Jun} and Guohong Cao and Ananthram Swami and {La Porta}, {Thomas F.}",
year = "2011",
month = "12",
day = "14",
doi = "10.1109/SRDS.2011.28",
language = "English (US)",
isbn = "9780769544502",
pages = "167--176",
booktitle = "Proceedings - 2011 30th IEEE International Symposium on Reliable Distributed Systems, SRDS 2011",

}

TY - GEN

T1 - netCSI

T2 - A generic fault diagnosis algorithm for large-scale failures in computer networks

AU - Tati, Srikar

AU - Rager, Scott

AU - Ko, Bong Jun

AU - Cao, Guohong

AU - Swami, Ananthram

AU - La Porta, Thomas F.

PY - 2011/12/14

Y1 - 2011/12/14

N2 - In this paper we present a framework and a set of algorithms for determining faults in networks when large scale outages occur. The design principles of our algorithm, netCSI, are motivated by the fact that failures are geographically clustered in such cases. We address the challenge of determining faults with incomplete symptom information due to a limited number of reporting nodes in the network. netCSI consists of two parts: hypotheses generation algorithm, and ranking algorithm. When constructing the hypotheses list of potential causes, we make novel use of the positive and negative symptoms to improve the precision of the results. The ranking algorithm is based on conditional failure probability models that account for the geographic correlation of the network objects in clustered failures. We evaluate the performance of netCSI for networks with both random and realistic topologies. We compare the performance of netCSI with an existing fault diagnosis algorithm, MAX-COVERAGE, and achieve an average gain of 128% in accuracy for realistic topologies.

UR - http://www.scopus.com/inward/record.url?scp=83155184620&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=83155184620&partnerID=8YFLogxK

U2 - 10.1109/SRDS.2011.28

DO - 10.1109/SRDS.2011.28

M3 - Conference contribution

AN - SCOPUS:83155184620

SN - 9780769544502

SP - 167

EP - 176

BT - Proceedings - 2011 30th IEEE International Symposium on Reliable Distributed Systems, SRDS 2011

ER -