Data layout optimization for GPGPU architectures

Jun Liu, Wei Ding, Ohyoung Jang, Mahmut Kandemir

Research output: Contribution to journal › Article

Abstract

GPUs are widely used to accelerate general-purpose applications, leading to the emergence of GPGPU architectures. New programming models, e.g., Compute Unified Device Architecture (CUDA), have been proposed to facilitate programming general-purpose computations on GPGPUs. However, writing high-performance CUDA code manually is still tedious and difficult. In particular, the organization of data in the memory space can greatly affect performance due to the unique features of the GPGPU memory hierarchy. In this work, we propose an automatic data layout transformation framework that addresses the key issues associated with a GPGPU memory hierarchy (i.e., channel skewing, data coalescing, and bank conflicts). Our approach employs a widely applicable strategy based on a novel concept called data localization. Specifically, we optimize the layout of the arrays accessed in affine loop nests, for both device memory and shared memory, at both coarse-grain and fine-grain parallelization levels. We performed an experimental evaluation of our data layout optimization strategy using 15 benchmarks on an NVIDIA CUDA GPU device. The results show that the proposed data transformation approach brings around a 4.3X speedup on average.
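To make the coalescing issue concrete, consider a minimal CUDA sketch, written for this summary rather than taken from the paper; the Particle type, the kernel names, and the size N are hypothetical. In an array-of-structures (AoS) layout, consecutive threads in a warp load words strided by the struct size, so their accesses span many memory segments; after a layout transformation to structure-of-arrays (SoA), the same warp touches consecutive addresses, which the hardware coalesces into a few wide transactions.

#include <cuda_runtime.h>

#define N (1 << 20)

// AoS layout: thread i reads p[i].x, so a warp's 32 loads are strided by
// sizeof(Particle) bytes and span many memory segments (poorly coalesced).
struct Particle { float x, y, z, w; };

__global__ void scaleAoS(Particle *p, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) p[i].x *= s;   // strided access
}

// SoA layout: thread i reads x[i], so a warp's loads fall in a few
// contiguous 128-byte segments and coalesce into wide transactions.
__global__ void scaleSoA(float *x, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) x[i] *= s;     // unit-stride access
}

int main() {
    Particle *p; float *x;
    cudaMalloc((void **)&p, N * sizeof(Particle));
    cudaMalloc((void **)&x, N * sizeof(float));
    dim3 block(256), grid((N + block.x - 1) / block.x);
    scaleAoS<<<grid, block>>>(p, 2.0f);   // slower: strided loads/stores
    scaleSoA<<<grid, block>>>(x, 2.0f);   // faster: coalesced loads/stores
    cudaDeviceSynchronize();
    cudaFree(p); cudaFree(x);
    return 0;
}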
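The shared-memory bank conflicts mentioned above are likewise a layout problem. The classic illustration, again a hypothetical sketch and not the paper's framework, is the padded transpose tile: on devices with 32 four-byte banks, every thread of a warp reading one column of a 32x32 float tile hits the same bank (a 32-way conflict), while declaring each row one element wider shifts successive rows onto different banks and makes the column read conflict-free.

#define TILE 32   // tile width; matches the 32 shared-memory banks

// Transposes an n x n matrix through a padded shared-memory tile. The
// "+ 1" column of padding is the data layout change that removes the
// bank conflicts; without it, the column-wise read below would serialize.
__global__ void transposeTile(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;   // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free read
}

A launch such as dim3 block(TILE, TILE), grid(n/TILE, n/TILE) covers the matrix one tile per thread block.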

Original language: English (US)
Pages (from-to): 283-284
Number of pages: 2
Journal: ACM SIGPLAN Notices
Volume: 48
Issue number: 8
DOI: 10.1145/2517327.2442546
State: Published - Aug 1 2013

Fingerprint

  • Data storage equipment
  • Graphics processing unit

All Science Journal Classification (ASJC) codes

  • Computer Science (all)

Cite this

Liu, Jun; Ding, Wei; Jang, Ohyoung; Kandemir, Mahmut. Data layout optimization for GPGPU architectures. In: ACM SIGPLAN Notices. 2013; Vol. 48, No. 8, pp. 283-284.
@article{512c57f549354e7e9e5e00a1cfde8067,
title = "Data layout optimization for GPGPU architectures",
abstract = "GPUs are being widely used in accelerating general-purpose applications, leading to the emergence of GPGPU architectures. New programming models, e.g., Compute Unified Device Architecture (CUDA), have been proposed to facilitate programming general-purpose computations in GPGPUs. However, writing highperformance CUDA codes manually is still tedious and difficult. In particular, the organization of the data in the memory space can greatly affect the performance due to the unique features of a custom GPGPU memory hierarchy. In this work, we propose an automatic data layout transformation framework to solve the key issues associated with a GPGPU memory hierarchy (i.e., channel skewing, data coalescing, and bank conflicts). Our approach employs a widely applicable strategy based on a novel concept called data localization. Specifically, we try to optimize the layout of the arrays accessed in affine loop nests, for both the device memory and shared memory, at both coarse grain and fine grain parallelization levels. We performed an experimental evaluation of our data layout optimization strategy using 15 benchmarks on an NVIDIA CUDA GPU device. The results show that the proposed data transformation approach brings around 4.3X speedup on average.",
author = "Jun Liu and Wei Ding and Ohyoung Jang and Mahmut Kandemir",
year = "2013",
month = "8",
day = "1",
doi = "10.1145/2517327.2442546",
language = "English (US)",
volume = "48",
pages = "283--284",
journal = "ACM SIGPLAN Notices",
issn = "1523-2867",
publisher = "Association for Computing Machinery (ACM)",
number = "8",

}
