Exploiting core criticality for enhanced GPU performance

Adwait Jog, Onur Kayiran, Ashutosh Pattnaik, Mahmut Kandemir, Onur Mutlu, Ravishankar Iyer, Chitaranjan Das

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

21 Citations (Scopus)

Abstract

Modern memory access schedulers employed in GPUs typically optimize for memory throughput. They implicitly assume that all requests from different cores are equally important. However, we show that during the execution of a subset of CUDA applications, different cores can have different amounts of tolerance to latency. In particular, cores with a larger fraction of warps waiting for data to come back from DRAM are less likely to tolerate the latency of an outstanding memory request. Requests from such cores are more critical than requests from others. Based on this observation, this paper introduces a new memory scheduler, called (C)ritica(L)ity (A)ware (M)emory (S)cheduler (CLAMS), which takes into account the latency tolerance of the cores that generate memory requests. The key idea is to use the fraction of critical requests in the memory request buffer to switch between scheduling policies optimized for criticality and locality. If this fraction is below a threshold, CLAMS prioritizes critical requests to ensure cores that cannot tolerate latency are serviced faster. Otherwise, CLAMS optimizes for locality, anticipating that there are too many critical requests and prioritizing one over another would not significantly benefit performance.

We first present a core-criticality estimation mechanism for determining critical cores and requests, and then discuss issues related to finding a balance between criticality and locality in the memory scheduler. We progressively devise three variants of CLAMS, and show that the Dynamic CLAMS provides significantly higher performance, across a variety of workloads, than the commonly employed GPU memory schedulers optimized solely for locality. The results indicate that a GPU memory system that considers both core criticality and DRAM access locality can provide significant improvement in performance.
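The abstract's switching rule can be sketched in software as follows. This is a minimal illustrative model, not the paper's hardware mechanism: the class and function names (`Request`, `is_core_critical`, `schedule`) and both threshold values are assumptions chosen for the sketch, and the real scheduler operates on memory-controller state rather than Python lists.

```python
# Illustrative sketch of the CLAMS idea: a core whose warps are mostly stalled
# on DRAM is "critical"; the scheduler serves critical requests first only when
# they are a small fraction of the request buffer, and otherwise falls back to
# a locality-oriented (row-buffer-hit-first) policy.
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    core_id: int
    row: int        # DRAM row targeted by this request
    critical: bool  # issuing core has little remaining latency tolerance

def is_core_critical(waiting_warps: int, total_warps: int,
                     warp_threshold: float = 0.5) -> bool:
    """Treat a core as critical when a large fraction of its warps are
    waiting for data to come back from DRAM (threshold is illustrative)."""
    return total_warps > 0 and waiting_warps / total_warps >= warp_threshold

def schedule(buffer: List[Request], open_row: int,
             crit_threshold: float = 0.5) -> Request:
    """Pick the next request to service from the memory request buffer."""
    critical = [r for r in buffer if r.critical]
    frac_critical = len(critical) / len(buffer)
    if critical and frac_critical < crit_threshold:
        # Few critical requests: prioritize them so latency-intolerant
        # cores are serviced faster.
        pool = critical
    else:
        # Too many (or no) critical requests: prioritizing among them would
        # not help much, so optimize for locality over the whole buffer.
        pool = buffer
    # Within the chosen pool, prefer a row-buffer hit; else oldest-first.
    for r in pool:
        if r.row == open_row:
            return r
    return pool[0]
```

The two-mode structure is the point: when almost every request is critical, criticality carries no ranking information, so the scheduler reverts to the throughput-friendly locality policy.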

Original language: English (US)
Title of host publication: SIGMETRICS/Performance 2016 - Proceedings of the SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Science
Publisher: Association for Computing Machinery, Inc
Pages: 351-363
Number of pages: 13
ISBN (Electronic): 9781450342667
DOIs: 10.1145/2896377.2901468
State: Published - Jun 14 2016
Event: 13th Joint International Conference on Measurement and Modeling of Computer Systems, ACM SIGMETRICS / IFIP Performance 2016 - Antibes Juan-les-Pins, France
Duration: Jun 14 2016 - Jun 18 2016

Publication series

Name: SIGMETRICS/Performance 2016 - Proceedings of the SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Science

Other

Other: 13th Joint International Conference on Measurement and Modeling of Computer Systems, ACM SIGMETRICS / IFIP Performance 2016
Country: France
City: Antibes Juan-les-Pins
Period: 6/14/16 - 6/18/16

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Computational Theory and Mathematics
  • Hardware and Architecture

Cite this

Jog, A., Kayiran, O., Pattnaik, A., Kandemir, M., Mutlu, O., Iyer, R., & Das, C. (2016). Exploiting core criticality for enhanced GPU performance. In SIGMETRICS/Performance 2016 - Proceedings of the SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Science (pp. 351-363). Association for Computing Machinery, Inc. https://doi.org/10.1145/2896377.2901468