FUSE: Fusing STT-MRAM into GPUs to alleviate off-chip memory access overheads

Jie Zhang, Myoungsoo Jung, Mahmut Kandemir

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this work, we propose FUSE, a novel GPU cache system that integrates spin-transfer torque magnetic random-access memory (STT-MRAM) into the on-chip L1D cache. FUSE can minimize the number of outgoing memory accesses over the interconnection network of GPU's multiprocessors, which in turn can considerably improve the level of massive computing parallelism in GPUs. Specifically, FUSE predicts a read-level of GPU memory accesses by extracting GPU runtime information and places write-once-read-multiple (WORM) data blocks into the STT-MRAM, while accommodating write-multiple data blocks over a small portion of SRAM in the L1D cache. To further reduce the off-chip memory accesses, FUSE also allows WORM data blocks to be allocated anywhere in the STT-MRAM by approximating the associativity with the limited number of tag comparators and I/O peripherals. Our evaluation results show that, in comparison to a traditional GPU cache, our proposed heterogeneous cache reduces the number of outgoing memory references by 32% across the interconnection network, thereby improving the overall performance by 217% and reducing energy cost by 53%.
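The placement policy the abstract describes — predicting a block's read/write behavior and steering write-once-read-multiple (WORM) blocks to STT-MRAM while keeping write-heavy blocks in a small SRAM partition — can be sketched as follows. This is a minimal illustration only, not the authors' implementation; the class name, the write-count predictor, and the threshold are all assumptions made for the example.

```python
# Hypothetical sketch of a FUSE-style hybrid L1D placement policy:
# blocks predicted to be write-once-read-multiple (WORM) are cached in
# STT-MRAM, while write-heavy blocks stay in a small SRAM partition.
# All names, counters, and thresholds here are illustrative assumptions,
# not the paper's actual design.

class HybridL1DSketch:
    def __init__(self, write_threshold=1):
        # Blocks written at most `write_threshold` times are predicted WORM.
        self.write_threshold = write_threshold
        self.write_counts = {}   # per-block observed write count
        self.sttmram = set()     # large partition for WORM blocks
        self.sram = set()        # small partition for write-heavy blocks

    def access(self, block, is_write):
        """Record an access and (re)place the block in the right partition."""
        if is_write:
            self.write_counts[block] = self.write_counts.get(block, 0) + 1
        self._place(block)

    def _place(self, block):
        writes = self.write_counts.get(block, 0)
        if writes <= self.write_threshold:
            # Predicted WORM: cache in STT-MRAM, drop any stale SRAM copy.
            self.sram.discard(block)
            self.sttmram.add(block)
        else:
            # Write-heavy: keep in SRAM to avoid costly STT-MRAM writes.
            self.sttmram.discard(block)
            self.sram.add(block)


cache = HybridL1DSketch()
cache.access(0x40, is_write=True)    # one write: still predicted WORM
cache.access(0x40, is_write=False)   # subsequent reads hit in STT-MRAM
cache.access(0x80, is_write=True)
cache.access(0x80, is_write=True)    # second write: reclassified to SRAM
```

In this toy model, a block that exceeds the write threshold migrates from the STT-MRAM partition to SRAM, mirroring the abstract's split between WORM and write-multiple data; the real design additionally approximates full associativity in the STT-MRAM with a limited number of tag comparators, which this sketch does not model.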

Original language: English (US)
Title of host publication: Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 426-439
Number of pages: 14
ISBN (Electronic): 9781728114446
DOI: 10.1109/HPCA.2019.00055
State: Published - Mar 26, 2019
Event: 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019 - Washington, United States
Duration: Feb 16, 2019 - Feb 20, 2019

Publication series

Name: Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019

Conference

Conference: 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019
Country: United States
City: Washington
Period: 2/16/19 - 2/20/19


All Science Journal Classification (ASJC) codes

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Zhang, J., Jung, M., & Kandemir, M. (2019). FUSE: Fusing STT-MRAM into GPUs to alleviate off-chip memory access overheads. In Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019 (pp. 426-439). Article 8675216. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/HPCA.2019.00055