COMP: Compiler Optimizations for Manycore Processors

Linhai Song, Min Feng, Nishkam Ravi, Yi Yang, Srimat Chakradhar

Research output: Contribution to journal › Conference article

4 Citations (Scopus)

Abstract

Applications executing on multicore processors can now easily offload computations to manycore processors, such as Intel Xeon Phi coprocessors. However, tuning such offloaded applications for high-performance execution requires considerable expertise and effort. Previous efforts have focused on optimizing the execution of offloaded computations on manycore processors. However, we observe that the data transfer overhead between multicore and manycore processors, and the limited device memories of manycore processors, often constrain the performance gains that are possible by offloading computations. In this paper, we present three source-to-source compiler optimizations that can significantly improve the performance of applications that offload computations to manycore processors. The first optimization automatically transforms offloaded code to enable data streaming, which overlaps data transfer between multicore and manycore processors with computations on these processors to hide data transfer overhead. This optimization is also designed to minimize memory usage on manycore processors while achieving optimal performance. The second compiler optimization reorders computations to regularize irregular memory accesses. It enables data streaming and factorization on manycore processors, even when the memory access patterns in the original source code are irregular. Finally, our new shared memory mechanism provides efficient support for transferring large pointer-based data structures between hosts and manycore processors. Our evaluation shows that the proposed compiler optimizations benefit 9 out of 12 benchmarks. Compared with simply offloading the original parallel implementations of these benchmarks, we achieve speedups of 1.16x-52.21x.
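To make the data-streaming idea concrete, the sketch below is a minimal, hypothetical illustration of the pattern the abstract describes; it is not the COMP transformation itself, which targets Intel's offload pragmas for Xeon Phi. Here the same idea is rendered in plain C with OpenMP 4.5 target offload: a single bulk offload is split into fixed-size chunks, and each chunk is issued as an asynchronous target task so the runtime can overlap one chunk's host-to-device transfer with the computation of previously issued chunks, while only a chunk-sized slice needs to reside in device memory at a time. The problem size N, chunk size CHUNK, and the element-wise kernel are made-up placeholders.

/* Hypothetical sketch of chunked "data streaming" offload.
 * Assumptions: OpenMP 4.5+ target offload is available; N, CHUNK,
 * and the kernel are placeholders, not values from the COMP paper. */
#include <stdio.h>
#include <stdlib.h>

#define N     (1L << 22)   /* total number of elements (assumed)  */
#define CHUNK (1L << 18)   /* streaming granularity (assumed)     */

int main(void) {
    float *in  = malloc(N * sizeof *in);
    float *out = malloc(N * sizeof *out);
    for (long i = 0; i < N; i++) in[i] = (float)i;

    for (long base = 0; base < N; base += CHUNK) {
        long   len = (N - base < CHUNK) ? (N - base) : CHUNK;
        float *src = in  + base;   /* slice sent to the device        */
        float *dst = out + base;   /* slice brought back to the host  */

        /* Offload one chunk asynchronously: only this slice is mapped,
         * and 'nowait' lets the runtime overlap its transfer with the
         * computation of chunks issued earlier.                       */
        #pragma omp target map(to: src[0:len]) map(from: dst[0:len]) nowait
        #pragma omp parallel for
        for (long i = 0; i < len; i++)
            dst[i] = 2.0f * src[i] + 1.0f;   /* placeholder kernel */
    }
    #pragma omp taskwait   /* wait for all outstanding chunk tasks */

    printf("out[42] = %f\n", out[42]);
    free(in);
    free(out);
    return 0;
}

On a machine without an offload device the target regions simply fall back to the host, so the sketch remains runnable; the chunk size would normally be tuned against the device's memory capacity and transfer bandwidth.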

Original language: English (US)
Article number: 7011425
Pages (from-to): 659-671
Number of pages: 13
Journal: Proceedings of the Annual International Symposium on Microarchitecture, MICRO
ISSN: 1072-4451
Volume: 2015-January
Issue number: January
DOI: 10.1109/MICRO.2014.30
State: Published - Jan 15 2015
Event: 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014 - Cambridge, United Kingdom
Duration: Dec 13 2014 - Dec 17 2014

Fingerprint

  • Data storage equipment
  • Data transfer
  • Factorization
  • Data structures

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture

Cite this

Song, Linhai; Feng, Min; Ravi, Nishkam; Yang, Yi; Chakradhar, Srimat. COMP: Compiler Optimizations for Manycore Processors. In: Proceedings of the Annual International Symposium on Microarchitecture, MICRO. 2015; Vol. 2015-January, No. January, pp. 659-671.