Quantifying and optimizing the impact of victim cache line selection in manycore systems

Mahmut Kandemir, Wei Ding, Diana Guttman

Research output: Contribution to journal › Conference article

Abstract

In both architecture and software, the main goal of data locality-oriented optimizations has always been 'minimizing the number of cache misses' (especially, costly last-level cache misses). However, this paper shows that other metrics, such as the distance between the last-level cache and the memory controller as well as the memory queuing latency, can play an equally important role as far as application performance is concerned. Focusing on a large set of multithreaded applications, we first show that last-level cache 'write backs' (memory writes due to displacement of a victim block from the last-level cache) can exhibit significant latencies as well as variances, and then make a case for 'relaxing' the strict LRU policy to save (write-back) cycles in both the on-chip network and memory queues. Specifically, we explore novel architecture-level schemes that optimize the on-chip network latency, the memory queuing latency, or both, of write-back messages by carefully selecting the victim block to write back at the time of cache replacement. Our extensive experimental evaluations using 15 multithreaded applications and a cycle-accurate simulation infrastructure clearly demonstrate that this tradeoff (between cache hit rate and on-chip network/memory queuing latency) pays off in most cases, leading to about 12.2% execution time improvement and 14.9% energy savings in our default 64-core system with 6 memory controllers.
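The core idea of the abstract, relaxing strict LRU so that the evicted victim is chosen partly by its write-back cost, can be illustrated with a minimal sketch. This is not the paper's implementation: the names (`Line`, `writeback_cost`, `pick_victim`), the candidate-window size, and the cycle constants are all illustrative assumptions; the paper's actual schemes and cost models are defined in the full text.

```python
# Illustrative sketch of cost-aware victim selection (relaxed LRU).
# All names and constants below are assumptions for illustration,
# not the paper's actual scheme.
from dataclasses import dataclass

@dataclass
class Line:
    tag: int
    dirty: bool
    lru_rank: int  # 0 = least recently used

def writeback_cost(hops: int, queue_depth: int,
                   hop_cycles: int = 2, service_cycles: int = 50) -> int:
    """Rough write-back cost: on-chip network traversal to the memory
    controller plus expected wait behind queued requests."""
    return hops * hop_cycles + queue_depth * service_cycles

def pick_victim(ways, hops_to_mc, mc_queue_depth, window: int = 4):
    """Instead of always evicting the strict-LRU line, consider the
    `window` least-recently-used ways and evict the one whose
    write-back would be cheapest (a clean line needs no write-back)."""
    candidates = sorted(ways, key=lambda ln: ln.lru_rank)[:window]
    def cost(ln):
        if not ln.dirty:
            return 0  # clean victim: no write-back message at all
        return writeback_cost(hops_to_mc[ln.tag], mc_queue_depth[ln.tag])
    return min(candidates, key=cost)
```

For example, if the strict-LRU line is dirty and maps to a distant, congested memory controller while a slightly more recent line is clean, `pick_victim` evicts the clean line instead, trading a small hit-rate loss for saved network and queuing cycles, which is the tradeoff the paper quantifies.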

Original language: English (US)
Article number: 7033676
Pages (from-to): 385-394
Number of pages: 10
Journal: Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS
Volume: 2015-February
Issue number: February
DOI: 10.1109/MASCOTS.2014.54
State: Published - Feb 5 2015
Event: 2014 22nd Annual IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2014 - Paris, France
Duration: Sep 9 2014 – Sep 11 2014


All Science Journal Classification (ASJC) codes

  • Electrical and Electronic Engineering
  • Computer Networks and Communications
  • Software
  • Modeling and Simulation

Cite this

@article{0f977d6b0ff04d28a1800a6dc96e5f94,
title = "Quantifying and optimizing the impact of victim cache line selection in manycore systems",
abstract = "In both architecture and software, the main goal of data locality-oriented optimizations has always been 'minimizing the number of cache misses' (especially, costly last-level cache misses). However, this paper shows that other metrics, such as the distance between the last-level cache and the memory controller as well as the memory queuing latency, can play an equally important role as far as application performance is concerned. Focusing on a large set of multithreaded applications, we first show that last-level cache 'write backs' (memory writes due to displacement of a victim block from the last-level cache) can exhibit significant latencies as well as variances, and then make a case for 'relaxing' the strict LRU policy to save (write-back) cycles in both the on-chip network and memory queues. Specifically, we explore novel architecture-level schemes that optimize the on-chip network latency, the memory queuing latency, or both, of write-back messages by carefully selecting the victim block to write back at the time of cache replacement. Our extensive experimental evaluations using 15 multithreaded applications and a cycle-accurate simulation infrastructure clearly demonstrate that this tradeoff (between cache hit rate and on-chip network/memory queuing latency) pays off in most cases, leading to about 12.2{\%} execution time improvement and 14.9{\%} energy savings in our default 64-core system with 6 memory controllers.",
author = "Mahmut Kandemir and Wei Ding and Diana Guttman",
year = "2015",
month = "2",
day = "5",
doi = "10.1109/MASCOTS.2014.54",
language = "English (US)",
volume = "2015-February",
pages = "385--394",
journal = "Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS",
issn = "1526-7539",
number = "February"
}

TY - JOUR

T1 - Quantifying and optimizing the impact of victim cache line selection in manycore systems

AU - Kandemir, Mahmut

AU - Ding, Wei

AU - Guttman, Diana

PY - 2015/2/5

Y1 - 2015/2/5

N2 - In both architecture and software, the main goal of data locality-oriented optimizations has always been 'minimizing the number of cache misses' (especially, costly last-level cache misses). However, this paper shows that other metrics, such as the distance between the last-level cache and the memory controller as well as the memory queuing latency, can play an equally important role as far as application performance is concerned. Focusing on a large set of multithreaded applications, we first show that last-level cache 'write backs' (memory writes due to displacement of a victim block from the last-level cache) can exhibit significant latencies as well as variances, and then make a case for 'relaxing' the strict LRU policy to save (write-back) cycles in both the on-chip network and memory queues. Specifically, we explore novel architecture-level schemes that optimize the on-chip network latency, the memory queuing latency, or both, of write-back messages by carefully selecting the victim block to write back at the time of cache replacement. Our extensive experimental evaluations using 15 multithreaded applications and a cycle-accurate simulation infrastructure clearly demonstrate that this tradeoff (between cache hit rate and on-chip network/memory queuing latency) pays off in most cases, leading to about 12.2% execution time improvement and 14.9% energy savings in our default 64-core system with 6 memory controllers.

AB - In both architecture and software, the main goal of data locality-oriented optimizations has always been 'minimizing the number of cache misses' (especially, costly last-level cache misses). However, this paper shows that other metrics, such as the distance between the last-level cache and the memory controller as well as the memory queuing latency, can play an equally important role as far as application performance is concerned. Focusing on a large set of multithreaded applications, we first show that last-level cache 'write backs' (memory writes due to displacement of a victim block from the last-level cache) can exhibit significant latencies as well as variances, and then make a case for 'relaxing' the strict LRU policy to save (write-back) cycles in both the on-chip network and memory queues. Specifically, we explore novel architecture-level schemes that optimize the on-chip network latency, the memory queuing latency, or both, of write-back messages by carefully selecting the victim block to write back at the time of cache replacement. Our extensive experimental evaluations using 15 multithreaded applications and a cycle-accurate simulation infrastructure clearly demonstrate that this tradeoff (between cache hit rate and on-chip network/memory queuing latency) pays off in most cases, leading to about 12.2% execution time improvement and 14.9% energy savings in our default 64-core system with 6 memory controllers.

UR - http://www.scopus.com/inward/record.url?scp=84937955703&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84937955703&partnerID=8YFLogxK

U2 - 10.1109/MASCOTS.2014.54

DO - 10.1109/MASCOTS.2014.54

M3 - Conference article

VL - 2015-February

SP - 385

EP - 394

JO - Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS

JF - Proceedings - IEEE Computer Society's Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS

SN - 1526-7539

IS - February

M1 - 7033676

ER -