Analyzing the soft error resilience of linear solvers on multicore multiprocessors

Konrad Malkowski, Padma Raghavan, Mahmut Kandemir

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

20 Citations (Scopus)

Abstract

As chip transistor densities continue to increase, soft errors (bit flips) are becoming a significant concern in networked multiprocessors with multicore nodes. Large cache structures in multicore processors are especially susceptible to soft errors as they occupy a significant portion of the chip area. In this paper, we consider the impacts of soft errors in caches on the resilience and energy efficiency of sparse linear solvers. In particular, we focus on two widely used sparse iterative solvers, namely Conjugate Gradient (CG) and Generalized Minimum Residuals (GMRES). We propose two adaptive schemes, (i) a Write Eviction Hybrid ECC (WEH-ECC) scheme for the L1 cache and (ii) a Prefetcher Based Adaptive ECC (PBA-ECC) scheme for the L2 cache, and evaluate the energy and reliability trade-offs they bring in the context of GMRES and CG solvers. Our evaluations indicate that WEH-ECC reduces the CG and GMRES soft error vulnerability by a factor of 18 to 220 in L1 cache, relative to an unprotected L1 cache, and energy consumption by 16%, relative to a cache with strong protection. The PBA-ECC scheme reduces the CG and GMRES soft error vulnerability by a factor of 9 × 10^3 to 8.6 × 10^9, relative to an unprotected L2 cache, and reduces the energy consumption by 8.5%, relative to a cache with strong ECC protection. Our energy overheads over unprotected L1 and L2 caches are 5% and 14% respectively.
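As a minimal illustration of what the abstract means by a soft error (bit flip), the sketch below inverts a single bit in the IEEE-754 binary64 encoding of a floating-point value, the kind of datum a CG or GMRES solver keeps in cache. The helper names `flip_bit` and `parity` are hypothetical and do not come from the paper, and the single parity bit shown here is only a generic detection mechanism, not the WEH-ECC or PBA-ECC schemes, whose details the abstract does not give.

```python
import struct


def _to_bits(value: float) -> int:
    """Return the 64-bit IEEE-754 binary64 encoding of `value` as an integer."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    return raw


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its binary64 encoding inverted.

    Bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
    This models a single-bit soft error in a stored double.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    corrupted = _to_bits(value) ^ (1 << bit)
    (result,) = struct.unpack("<d", struct.pack("<Q", corrupted))
    return result


def parity(value: float) -> int:
    """Return the even-parity bit (0 or 1) of the 64-bit encoding of `value`.

    A stored parity bit detects any single-bit flip but cannot correct it
    (illustrative only; not the protection scheme evaluated in the paper).
    """
    return bin(_to_bits(value)).count("1") & 1


if __name__ == "__main__":
    x = 1.0
    for bit in (0, 52, 62, 63):
        print(bit, flip_bit(x, bit), parity(x) != parity(flip_bit(x, bit)))
```

The impact of a flip depends strongly on its position: a flip in a low mantissa bit perturbs the value by roughly machine precision, while a flip in a high exponent bit or the sign bit changes it drastically, which is one reason the vulnerability of an iterative solver to unprotected cache contents is worth quantifying.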

Original language: English (US)
Title of host publication: Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010
DOIs: https://doi.org/10.1109/IPDPS.2010.5470411
State: Published - Jul 1 2010
Event: 24th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2010 - Atlanta, GA, United States
Duration: Apr 19 2010 to Apr 23 2010

Other

Other: 24th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2010
Country: United States
City: Atlanta, GA
Period: 4/19/10 to 4/23/10

Fingerprint

Error Resilience
Soft Error
Multiprocessor
Cache
Conjugate Gradient
Energy utilization
Vulnerability
Energy Consumption
Energy efficiency
Transistors
Chip
Iterative Solvers
Multi-core Processor
Resilience
Flip
Energy
Continue
Trade-offs

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Software
  • Theoretical Computer Science

Cite this

Malkowski, K., Raghavan, P., & Kandemir, M. (2010). Analyzing the soft error resilience of linear solvers on multicore multiprocessors. In Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010 [5470411] https://doi.org/10.1109/IPDPS.2010.5470411
@inproceedings{71ef7c57bdf844278de9ac818262466c,
title = "Analyzing the soft error resilience of linear solvers on multicore multiprocessors",
abstract = "As chip transistor densities continue to increase, soft errors (bit flips) are becoming a significant concern in networked multiprocessors with multicore nodes. Large cache structures in multicore processors are especially susceptible to soft errors as they occupy a significant portion of the chip area. In this paper, we consider the impacts of soft errors in caches on the resilience and energy efficiency of sparse linear solvers. In particular, we focus on two widely used sparse iterative solvers, namely Conjugate Gradient (CG) and Generalized Minimum Residuals (GMRES). We propose two adaptive schemes, (i) a Write Eviction Hybrid ECC (WEH-ECC) scheme for the L1 cache and (ii) a Prefetcher Based Adaptive ECC (PBA-ECC) scheme for the L2 cache, and evaluate the energy and reliability trade-offs they bring in the context of GMRES and CG solvers. Our evaluations indicate that WEH-ECC reduces the CG and GMRES soft error vulnerability by a factor of 18 to 220 in L1 cache, relative to an unprotected L1 cache, and energy consumption by 16{\%}, relative to a cache with strong protection. The PBA-ECC scheme reduces the CG and GMRES soft error vulnerability by a factor of $9 \times 10^{3}$ to $8.6 \times 10^{9}$, relative to an unprotected L2 cache, and reduces the energy consumption by 8.5{\%}, relative to a cache with strong ECC protection. Our energy overheads over unprotected L1 and L2 caches are 5{\%} and 14{\%} respectively.",
author = "Konrad Malkowski and Padma Raghavan and Mahmut Kandemir",
year = "2010",
month = "7",
day = "1",
doi = "10.1109/IPDPS.2010.5470411",
language = "English (US)",
isbn = "9781424464432",
booktitle = "Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010",

}

TY - GEN
T1 - Analyzing the soft error resilience of linear solvers on multicore multiprocessors
AU - Malkowski, Konrad
AU - Raghavan, Padma
AU - Kandemir, Mahmut
PY - 2010/7/1
Y1 - 2010/7/1
AB - As chip transistor densities continue to increase, soft errors (bit flips) are becoming a significant concern in networked multiprocessors with multicore nodes. Large cache structures in multicore processors are especially susceptible to soft errors as they occupy a significant portion of the chip area. In this paper, we consider the impacts of soft errors in caches on the resilience and energy efficiency of sparse linear solvers. In particular, we focus on two widely used sparse iterative solvers, namely Conjugate Gradient (CG) and Generalized Minimum Residuals (GMRES). We propose two adaptive schemes, (i) a Write Eviction Hybrid ECC (WEH-ECC) scheme for the L1 cache and (ii) a Prefetcher Based Adaptive ECC (PBA-ECC) scheme for the L2 cache, and evaluate the energy and reliability trade-offs they bring in the context of GMRES and CG solvers. Our evaluations indicate that WEH-ECC reduces the CG and GMRES soft error vulnerability by a factor of 18 to 220 in L1 cache, relative to an unprotected L1 cache, and energy consumption by 16%, relative to a cache with strong protection. The PBA-ECC scheme reduces the CG and GMRES soft error vulnerability by a factor of 9 × 10^3 to 8.6 × 10^9, relative to an unprotected L2 cache, and reduces the energy consumption by 8.5%, relative to a cache with strong ECC protection. Our energy overheads over unprotected L1 and L2 caches are 5% and 14% respectively.
UR - http://www.scopus.com/inward/record.url?scp=77953987800&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77953987800&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2010.5470411
DO - 10.1109/IPDPS.2010.5470411
M3 - Conference contribution
SN - 9781424464432
BT - Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010
ER -
