Exploiting Staleness for Approximating Loads on CMPs

Prasanna Venkatesh Rengasamy, Anand Sivasubramaniam, Mahmut Kandemir, Chitaranjan Das

Research output: Contribution to journal › Conference article

3 Citations (Scopus)

Abstract

Coherence misses are an important factor in limiting the scalability of multi-threaded shared memory applications on chip multiprocessors (CMPs) that are envisaged to contain dozens of cores in the imminent future. This paper proposes a novel approach to tackling this problem by leveraging the growingly important paradigm of approximate computing. Many applications are either tolerant to slight errors in the output or if stringent, have in-built resiliency to tolerate some errors in the execution. The approximate computing paradigm suggests breaking conventional barriers of mandating stringent correctness on the hardware, allowing more flexibility in the performance-power-reliability design space. Taking the multi-threaded applications in the SPLASH-2 benchmark suite, we note that nearly all these applications have such inherent resiliency and/or tolerance to slight errors in the output. Based on this observation, we propose to approximate coherence-related load misses by returning stale values, i.e., the version at the time of the invalidation. We show that returning such values from the invalidated lines already present in d-L1 offers only limited scope for improvement since those lines get evicted fairly soon due to the high pressure on d-L1. Instead, we propose a very small (8 lines) Stale Victim Cache (SVC), to hold such lines upon d-L1 eviction. While this does offer significant improvement, there is the possibility of data getting very stale in such a structure, making it highly sensitive to the choice of what data to keep, and for how long. To address these concerns, we propose to time-out these lines from the SVC to limit their staleness in a mechanism called SVC+TB. We show that SVC+TB provides as much as 28.6% speedup in some SPLASH-2 applications, with an average speedup between 10-15% across the entire suite, becoming comparable to an ideal execution that does not incur coherence misses. Further, the consequent approximations have little impact on the correctness, allowing all of them to complete. There were no errors, because of inherent application resilience, in eleven applications, and the maximum error was at most 0.08% across the entire suite.
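
The time-out idea described in the abstract can be illustrated with a toy software model. The sketch below is not the authors' hardware design: the class name `StaleVictimCache`, the FIFO replacement policy, the cycle-based `timeout` value, and the choice to start the staleness clock when a line enters the buffer are all assumptions made for illustration. Only the 8-line capacity comes from the abstract.

```python
from collections import OrderedDict


class StaleVictimCache:
    """Toy model of a small buffer holding invalidated lines evicted from d-L1."""

    def __init__(self, capacity=8, timeout=1000):
        self.capacity = capacity    # 8 lines, as stated in the abstract
        self.timeout = timeout      # hypothetical staleness bound, in cycles
        self.lines = OrderedDict()  # address -> (stale_value, insertion_cycle)

    def insert(self, addr, stale_value, now):
        """Keep an invalidated line when d-L1 evicts it (FIFO replacement assumed)."""
        if addr in self.lines:
            del self.lines[addr]            # refresh an existing entry
        elif len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # drop the oldest entry
        self.lines[addr] = (stale_value, now)

    def lookup(self, addr, now):
        """Return a stale value for a coherence load miss, or None if unavailable."""
        entry = self.lines.get(addr)
        if entry is None:
            return None
        value, inserted_at = entry
        if now - inserted_at > self.timeout:
            del self.lines[addr]            # timed out: considered too stale to use
            return None
        return value


svc = StaleVictimCache(capacity=8, timeout=100)
svc.insert(0x40, 7, now=0)
print(svc.lookup(0x40, now=50))   # 7: still within the time-out
print(svc.lookup(0x40, now=200))  # None: the entry has timed out
```

A `None` result stands for the case where the load would have to wait for the regular coherent fetch instead of using an approximate value.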

Original language: English (US)
Article number: 7429318
Pages (from-to): 343-354
Number of pages: 12
Journal: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
DOIs: 10.1109/PACT.2015.27
State: Published - Jan 1 2015
Event: 24th International Conference on Parallel Architecture and Compilation, PACT 2015 - San Francisco, United States
Duration: Oct 18 2015 to Oct 21 2015

Fingerprint

Chip multiprocessors
Line
Resiliency
Correctness
Speedup
Paradigm
Entire
Shared-memory multiprocessors
Computing
Output
Resilience
Tolerance
Scalability
Flexibility
Benchmark
Data storage equipment
Approximation

All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture

Cite this

@article{94dc2333438e4f5197335b20cd33d2ab,
title = "Exploiting Staleness for Approximating Loads on CMPs",
abstract = "Coherence misses are an important factor in limiting the scalability of multi-threaded shared memory applications on chip multiprocessors (CMPs) that are envisaged to contain dozens of cores in the imminent future. This paper proposes a novel approach to tackling this problem by leveraging the growingly important paradigm of approximate computing. Many applications are either tolerant to slight errors in the output or if stringent, have in-built resiliency to tolerate some errors in the execution. The approximate computing paradigm suggests breaking conventional barriers of mandating stringent correctness on the hardware, allowing more flexibility in the performance-power-reliability design space. Taking the multi-threaded applications in the SPLASH-2 benchmark suite, we note that nearly all these applications have such inherent resiliency and/or tolerance to slight errors in the output. Based on this observation, we propose to approximate coherence-related load misses by returning stale values, i.e., the version at the time of the invalidation. We show that returning such values from the invalidated lines already present in d-L1 offers only limited scope for improvement since those lines get evicted fairly soon due to the high pressure on d-L1. Instead, we propose a very small (8 lines) Stale Victim Cache (SVC), to hold such lines upon d-L1 eviction. While this does offer significant improvement, there is the possibility of data getting very stale in such a structure, making it highly sensitive to the choice of what data to keep, and for how long. To address these concerns, we propose to time-out these lines from the SVC to limit their staleness in a mechanism called SVC+TB. We show that SVC+TB provides as much as 28.6{\%} speedup in some SPLASH-2 applications, with an average speedup between 10-15{\%} across the entire suite, becoming comparable to an ideal execution that does not incur coherence misses. Further, the consequent approximations have little impact on the correctness, allowing all of them to complete. There were no errors, because of inherent application resilience, in eleven applications, and the maximum error was at most 0.08{\%} across the entire suite.",
author = "Rengasamy, {Prasanna Venkatesh} and Anand Sivasubramaniam and Mahmut Kandemir and Chitaranjan Das",
year = "2015",
month = "1",
day = "1",
doi = "10.1109/PACT.2015.27",
language = "English (US)",
pages = "343--354",
journal = "Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT",
issn = "1089-795X",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Exploiting Staleness for Approximating Loads on CMPs

AU - Rengasamy, Prasanna Venkatesh

AU - Sivasubramaniam, Anand

AU - Kandemir, Mahmut

AU - Das, Chitaranjan

PY - 2015/1/1

Y1 - 2015/1/1

N2 - Coherence misses are an important factor in limiting the scalability of multi-threaded shared memory applications on chip multiprocessors (CMPs) that are envisaged to contain dozens of cores in the imminent future. This paper proposes a novel approach to tackling this problem by leveraging the growingly important paradigm of approximate computing. Many applications are either tolerant to slight errors in the output or if stringent, have in-built resiliency to tolerate some errors in the execution. The approximate computing paradigm suggests breaking conventional barriers of mandating stringent correctness on the hardware, allowing more flexibility in the performance-power-reliability design space. Taking the multi-threaded applications in the SPLASH-2 benchmark suite, we note that nearly all these applications have such inherent resiliency and/or tolerance to slight errors in the output. Based on this observation, we propose to approximate coherence-related load misses by returning stale values, i.e., the version at the time of the invalidation. We show that returning such values from the invalidated lines already present in d-L1 offers only limited scope for improvement since those lines get evicted fairly soon due to the high pressure on d-L1. Instead, we propose a very small (8 lines) Stale Victim Cache (SVC), to hold such lines upon d-L1 eviction. While this does offer significant improvement, there is the possibility of data getting very stale in such a structure, making it highly sensitive to the choice of what data to keep, and for how long. To address these concerns, we propose to time-out these lines from the SVC to limit their staleness in a mechanism called SVC+TB. We show that SVC+TB provides as much as 28.6% speedup in some SPLASH-2 applications, with an average speedup between 10-15% across the entire suite, becoming comparable to an ideal execution that does not incur coherence misses. Further, the consequent approximations have little impact on the correctness, allowing all of them to complete. There were no errors, because of inherent application resilience, in eleven applications, and the maximum error was at most 0.08% across the entire suite.

AB - Coherence misses are an important factor in limiting the scalability of multi-threaded shared memory applications on chip multiprocessors (CMPs) that are envisaged to contain dozens of cores in the imminent future. This paper proposes a novel approach to tackling this problem by leveraging the growingly important paradigm of approximate computing. Many applications are either tolerant to slight errors in the output or if stringent, have in-built resiliency to tolerate some errors in the execution. The approximate computing paradigm suggests breaking conventional barriers of mandating stringent correctness on the hardware, allowing more flexibility in the performance-power-reliability design space. Taking the multi-threaded applications in the SPLASH-2 benchmark suite, we note that nearly all these applications have such inherent resiliency and/or tolerance to slight errors in the output. Based on this observation, we propose to approximate coherence-related load misses by returning stale values, i.e., the version at the time of the invalidation. We show that returning such values from the invalidated lines already present in d-L1 offers only limited scope for improvement since those lines get evicted fairly soon due to the high pressure on d-L1. Instead, we propose a very small (8 lines) Stale Victim Cache (SVC), to hold such lines upon d-L1 eviction. While this does offer significant improvement, there is the possibility of data getting very stale in such a structure, making it highly sensitive to the choice of what data to keep, and for how long. To address these concerns, we propose to time-out these lines from the SVC to limit their staleness in a mechanism called SVC+TB. We show that SVC+TB provides as much as 28.6% speedup in some SPLASH-2 applications, with an average speedup between 10-15% across the entire suite, becoming comparable to an ideal execution that does not incur coherence misses. Further, the consequent approximations have little impact on the correctness, allowing all of them to complete. There were no errors, because of inherent application resilience, in eleven applications, and the maximum error was at most 0.08% across the entire suite.

UR - http://www.scopus.com/inward/record.url?scp=84975498879&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84975498879&partnerID=8YFLogxK

U2 - 10.1109/PACT.2015.27

DO - 10.1109/PACT.2015.27

M3 - Conference article

SP - 343

EP - 354

JO - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT

JF - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT

SN - 1089-795X

M1 - 7429318

ER -