Compiler-directed data prefetching scheme for chip multiprocessors

Seung Woo Son, Mahmut Kandemir, Mustafa Karakoy, Dhruva Chakrabarti

Research output: Contribution to journalArticle

2 Citations (Scopus)

Abstract

Data prefetching has been widely used in the past as a technique for hiding memory access latencies. However, data prefetching in multi-threaded applications running on chip multiprocessors (CMPs) can be problematic when multiple cores compete for a shared on-chip cache (L2 or L3). In this paper, we (i) quantify the impact of conventional data prefetching on shared caches in CMPs. The experimental data collected using multi-threaded applications indicates that, while data prefetching improves performance in small number of cores, its benefits reduce significantly as the number of cores is increased, that is, it is not scalable; (ii) identify harmful prefetches as one of the main contributors for degraded performance with a large number of cores; and (iii) propose and evaluate a compiler-directed data prefetching scheme for shared on-chip cache based CMPs. The proposed scheme first identifies program phases using static compiler analysis, and then divides the threads into groups within each phase and assigns a customized prefetcher thread (helper thread) to each group of threads. This helps to reduce the total number of prefetches issued, prefetch overheads, and negative interactions on the shared cache space due to data prefetches, and more importantly, makes compiler-directed prefetching a scalable optimization for CMPs. Our experiments with the applications from the SPEC OMP benchmark suite indicate that the proposed scheme improves overall parallel execution latency by 18.3% over the no-prefetch case and 6.4% over the conventional data prefetching scheme (where each core prefetches its data independently), on average, when 12 cores are used. The corresponding average performance improvements with 24 cores are 16.4% (over the no-prefetch case) and 11.7% (over the conventional prefetching case). We also demonstrate that the proposed scheme is robust under a wide range of values of our major simulation parameters, and the improvements it achieves come very close to those that can be achieved using an optimal scheme.

Original languageEnglish (US)
Pages (from-to)209-218
Number of pages10
JournalACM SIGPLAN Notices
Volume44
Issue number4
StatePublished - Nov 6 2009

Fingerprint

Data storage equipment
Experiments

All Science Journal Classification (ASJC) codes

  • Computer Science(all)

Cite this

Son, S. W., Kandemir, M., Karakoy, M., & Chakrabarti, D. (2009). Compiler-directed data prefetching scheme for chip multiprocessors. ACM SIGPLAN Notices, 44(4), 209-218.
Son, Seung Woo ; Kandemir, Mahmut ; Karakoy, Mustafa ; Chakrabarti, Dhruva. / Compiler-directed data prefetching scheme for chip multiprocessors. In: ACM SIGPLAN Notices. 2009 ; Vol. 44, No. 4. pp. 209-218.
@article{51bcdb5d154e4db4accd198cd17aa693,
title = "Compiler-directed data prefetching scheme for chip multiprocessors",
abstract = "Data prefetching has been widely used in the past as a technique for hiding memory access latencies. However, data prefetching in multi-threaded applications running on chip multiprocessors (CMPs) can be problematic when multiple cores compete for a shared on-chip cache (L2 or L3). In this paper, we (i) quantify the impact of conventional data prefetching on shared caches in CMPs. The experimental data collected using multi-threaded applications indicates that, while data prefetching improves performance in small number of cores, its benefits reduce significantly as the number of cores is increased, that is, it is not scalable; (ii) identify harmful prefetches as one of the main contributors for degraded performance with a large number of cores; and (iii) propose and evaluate a compiler-directed data prefetching scheme for shared on-chip cache based CMPs. The proposed scheme first identifies program phases using static compiler analysis, and then divides the threads into groups within each phase and assigns a customized prefetcher thread (helper thread) to each group of threads. This helps to reduce the total number of prefetches issued, prefetch overheads, and negative interactions on the shared cache space due to data prefetches, and more importantly, makes compiler-directed prefetching a scalable optimization for CMPs. Our experiments with the applications from the SPEC OMP benchmark suite indicate that the proposed scheme improves overall parallel execution latency by 18.3{\%} over the no-prefetch case and 6.4{\%} over the conventional data prefetching scheme (where each core prefetches its data independently), on average, when 12 cores are used. The corresponding average performance improvements with 24 cores are 16.4{\%} (over the no-prefetch case) and 11.7{\%} (over the conventional prefetching case). We also demonstrate that the proposed scheme is robust under a wide range of values of our major simulation parameters, and the improvements it achieves come very close to those that can be achieved using an optimal scheme.",
author = "Son, {Seung Woo} and Mahmut Kandemir and Mustafa Karakoy and Dhruva Chakrabarti",
year = "2009",
month = "11",
day = "6",
language = "English (US)",
volume = "44",
pages = "209--218",
journal = "ACM SIGPLAN Notices",
issn = "1523-2867",
publisher = "Association for Computing Machinery (ACM)",
number = "4",

}

Son, SW, Kandemir, M, Karakoy, M & Chakrabarti, D 2009, 'Compiler-directed data prefetching scheme for chip multiprocessors', ACM SIGPLAN Notices, vol. 44, no. 4, pp. 209-218.

Compiler-directed data prefetching scheme for chip multiprocessors. / Son, Seung Woo; Kandemir, Mahmut; Karakoy, Mustafa; Chakrabarti, Dhruva.

In: ACM SIGPLAN Notices, Vol. 44, No. 4, 06.11.2009, p. 209-218.

Research output: Contribution to journalArticle

TY - JOUR

T1 - Compiler-directed data prefetching scheme for chip multiprocessors

AU - Son, Seung Woo

AU - Kandemir, Mahmut

AU - Karakoy, Mustafa

AU - Chakrabarti, Dhruva

PY - 2009/11/6

Y1 - 2009/11/6

N2 - Data prefetching has been widely used in the past as a technique for hiding memory access latencies. However, data prefetching in multi-threaded applications running on chip multiprocessors (CMPs) can be problematic when multiple cores compete for a shared on-chip cache (L2 or L3). In this paper, we (i) quantify the impact of conventional data prefetching on shared caches in CMPs. The experimental data collected using multi-threaded applications indicates that, while data prefetching improves performance in small number of cores, its benefits reduce significantly as the number of cores is increased, that is, it is not scalable; (ii) identify harmful prefetches as one of the main contributors for degraded performance with a large number of cores; and (iii) propose and evaluate a compiler-directed data prefetching scheme for shared on-chip cache based CMPs. The proposed scheme first identifies program phases using static compiler analysis, and then divides the threads into groups within each phase and assigns a customized prefetcher thread (helper thread) to each group of threads. This helps to reduce the total number of prefetches issued, prefetch overheads, and negative interactions on the shared cache space due to data prefetches, and more importantly, makes compiler-directed prefetching a scalable optimization for CMPs. Our experiments with the applications from the SPEC OMP benchmark suite indicate that the proposed scheme improves overall parallel execution latency by 18.3% over the no-prefetch case and 6.4% over the conventional data prefetching scheme (where each core prefetches its data independently), on average, when 12 cores are used. The corresponding average performance improvements with 24 cores are 16.4% (over the no-prefetch case) and 11.7% (over the conventional prefetching case). We also demonstrate that the proposed scheme is robust under a wide range of values of our major simulation parameters, and the improvements it achieves come very close to those that can be achieved using an optimal scheme.

AB - Data prefetching has been widely used in the past as a technique for hiding memory access latencies. However, data prefetching in multi-threaded applications running on chip multiprocessors (CMPs) can be problematic when multiple cores compete for a shared on-chip cache (L2 or L3). In this paper, we (i) quantify the impact of conventional data prefetching on shared caches in CMPs. The experimental data collected using multi-threaded applications indicates that, while data prefetching improves performance in small number of cores, its benefits reduce significantly as the number of cores is increased, that is, it is not scalable; (ii) identify harmful prefetches as one of the main contributors for degraded performance with a large number of cores; and (iii) propose and evaluate a compiler-directed data prefetching scheme for shared on-chip cache based CMPs. The proposed scheme first identifies program phases using static compiler analysis, and then divides the threads into groups within each phase and assigns a customized prefetcher thread (helper thread) to each group of threads. This helps to reduce the total number of prefetches issued, prefetch overheads, and negative interactions on the shared cache space due to data prefetches, and more importantly, makes compiler-directed prefetching a scalable optimization for CMPs. Our experiments with the applications from the SPEC OMP benchmark suite indicate that the proposed scheme improves overall parallel execution latency by 18.3% over the no-prefetch case and 6.4% over the conventional data prefetching scheme (where each core prefetches its data independently), on average, when 12 cores are used. The corresponding average performance improvements with 24 cores are 16.4% (over the no-prefetch case) and 11.7% (over the conventional prefetching case). We also demonstrate that the proposed scheme is robust under a wide range of values of our major simulation parameters, and the improvements it achieves come very close to those that can be achieved using an optimal scheme.

UR - http://www.scopus.com/inward/record.url?scp=70350594602&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70350594602&partnerID=8YFLogxK

M3 - Article

VL - 44

SP - 209

EP - 218

JO - ACM SIGPLAN Notices

JF - ACM SIGPLAN Notices

SN - 1523-2867

IS - 4

ER -