Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities

Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Chita R. Das

Research output: Contribution to journal › Conference article

53 Citations (Scopus)

Abstract

Processing data in or near memory (PIM), as opposed to in conventional computational units in a processor, can greatly alleviate the performance and energy penalties of data transfers from/to main memory. Graphics Processing Unit (GPU) architectures and applications, where main memory bandwidth is a critical bottleneck, can benefit from the use of PIM. To this end, an application should be properly partitioned and scheduled to execute on either the main, powerful GPU cores that are far away from memory or the auxiliary, simple GPU cores that are close to memory (e.g., in the logic layer of 3D-stacked DRAM). This paper investigates two key code scheduling issues in such a GPU architecture that has PIM capabilities, to maximize performance and energy efficiency: (1) how to automatically identify the code segments, or kernels, to be offloaded to the cores in memory, and (2) how to concurrently schedule multiple kernels on the main GPU cores and the auxiliary GPU cores in memory. We develop two new runtime techniques: (1) a regression-based affinity prediction model and mechanism that accurately identifies which kernels would benefit from PIM and offloads them to GPU cores in memory, and (2) a concurrent kernel management mechanism that uses the affinity prediction model, a new kernel execution time prediction model, and kernel dependency information to decide which kernels to schedule concurrently on main GPU cores and the GPU cores in memory. Our experimental evaluations across 25 GPU applications demonstrate that these two techniques can significantly improve both application performance (by 25% and 42%, respectively, on average) and energy efficiency (by 28% and 27%).
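The abstract's two runtime techniques lend themselves to a compact illustration. The Python sketch below shows one plausible shape for them: a linear, regression-style affinity score that decides whether a kernel should be offloaded to the GPU cores in memory, and a greedy scheduler that combines that score with per-engine execution-time predictions and kernel dependency information. Every feature name, weight, and policy detail here is an illustrative assumption, not the paper's actual model.

from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class KernelProfile:
    # Hypothetical per-kernel features and runtime predictions.
    name: str
    memory_intensity: float   # memory ops per instruction (assumed feature)
    compute_intensity: float  # arithmetic ops per instruction (assumed feature)
    est_gpu_time: float       # predicted execution time on the main GPU cores
    est_pim_time: float       # predicted execution time on the in-memory cores
    deps: Set[str] = field(default_factory=set)  # kernels that must finish first

def pim_affinity(k: KernelProfile) -> float:
    # Regression-based affinity: a positive score predicts the kernel will
    # benefit from PIM offloading. These weights stand in for trained
    # coefficients and are purely illustrative.
    w_mem, w_comp, bias = 1.8, -1.2, -0.3
    return w_mem * k.memory_intensity + w_comp * k.compute_intensity + bias

def schedule(kernels: List[KernelProfile]) -> List[Tuple[str, str]]:
    # Concurrent kernel management: repeatedly pick a kernel whose
    # dependencies are satisfied and place it on the engine its affinity
    # prefers, unless the execution-time predictions say the other engine
    # would finish it sooner. Assumes the dependency graph is acyclic.
    done: Set[str] = set()
    pending = list(kernels)
    gpu_free = pim_free = 0.0  # time at which each engine next becomes idle
    placement: List[Tuple[str, str]] = []
    while pending:
        ready = next(k for k in pending if k.deps <= done)
        gpu_finish = gpu_free + ready.est_gpu_time
        pim_finish = pim_free + ready.est_pim_time
        if pim_affinity(ready) > 0.0 and pim_finish <= gpu_finish:
            pim_free = pim_finish
            placement.append((ready.name, "PIM"))
        else:
            gpu_free = gpu_finish
            placement.append((ready.name, "GPU"))
        done.add(ready.name)
        pending.remove(ready)
    return placement

if __name__ == "__main__":
    ks = [
        KernelProfile("k0", memory_intensity=0.9, compute_intensity=0.2,
                      est_gpu_time=5.0, est_pim_time=3.0),
        KernelProfile("k1", memory_intensity=0.1, compute_intensity=0.9,
                      est_gpu_time=2.0, est_pim_time=6.0, deps={"k0"}),
    ]
    print(schedule(ks))  # [('k0', 'PIM'), ('k1', 'GPU')]

In the paper's actual system, both the affinity predictor and the execution-time predictor are regression models built from kernel characteristics; the sketch collapses them into fixed fields and weights for brevity.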

Original language: English (US)
Pages (from-to): 31-44
Number of pages: 14
Journal: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
DOI: 10.1145/2967938.2967940
State: Published - Jan 1 2016
Event: 25th International Conference on Parallel Architectures and Compilation Techniques, PACT 2016 - Haifa, Israel
Duration: Sep 11 2016 - Sep 15 2016


All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture

Cite this


Pattnaik, A., Tang, X., Jog, A., Kayiran, O., Mishra, A. K., Kandemir, M. T., Mutlu, O., & Das, C. R. (2016). Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities. Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT, 31-44. https://doi.org/10.1145/2967938.2967940
