Diagnosis, tuning, and redesign for multicore performance

A case study of the fast multipole method

Aparna Chandramowlishwaran, Kamesh Madduri, Richard Vuduc

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

23 Citations (Scopus)

Abstract

Given a program and a multisocket, multicore system, what is the process by which one understands and improves its performance and scalability? We describe an approach in the context of improving within-node scalability of the fast multipole method (FMM). Our process consists of a systematic sequence of modeling, analysis, and tuning steps, beginning with simple models, and gradually increasing their complexity in the quest for deeper performance understanding and better scalability. For the FMM, we significantly improve within-node scalability; for example, on a quad-socket Intel Nehalem-EX system, we show speedups of 1.7× over the previous best multithreaded implementation, 19.3× over a sequential but highly tuned (e.g., SIMD-vectorized) code, and match or outperform a state-of-the-art GPGPU implementation. Our study sheds new light on the form of a more general performance analysis and tuning process that other multicore/manycore tuning practitioners (end-user programmers) and automated performance analysis and tuning tools could themselves apply.

Original language: English (US)
Title of host publication: 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010
DOIs: https://doi.org/10.1109/SC.2010.19
State: Published - Dec 1 2010
Event: 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010 - New Orleans, LA, United States
Duration: Nov 13 2010 to Nov 19 2010

Publication series

Name: 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010

Other

Other: 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010
Country: United States
City: New Orleans, LA
Period: 11/13/10 to 11/19/10

Fingerprint

Scalability
Tuning

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Hardware and Architecture

Cite this

Chandramowlishwaran, A., Madduri, K., & Vuduc, R. (2010). Diagnosis, tuning, and redesign for multicore performance: A case study of the fast multipole method. In 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010 [5644891] (2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010). https://doi.org/10.1109/SC.2010.19
Chandramowlishwaran, Aparna; Madduri, Kamesh; Vuduc, Richard. / Diagnosis, tuning, and redesign for multicore performance: A case study of the fast multipole method. 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010. 2010. (2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010).
@inproceedings{485477a3e9a041749b2c8bb9136cb9fe,
title = "Diagnosis, tuning, and redesign for multicore performance: A case study of the fast multipole method",
abstract = "Given a program and a multisocket, multicore system, what is the process by which one understands and improves its performance and scalability? We describe an approach in the context of improving within-node scalability of the fast multipole method (FMM). Our process consists of a systematic sequence of modeling, analysis, and tuning steps, beginning with simple models, and gradually increasing their complexity in the quest for deeper performance understanding and better scalability. For the FMM, we significantly improve within-node scalability; for example, on a quad-socket Intel Nehalem-EX system, we show speedups of 1.7× over the previous best multithreaded implementation, 19.3× over a sequential but highly tuned (e.g., SIMD-vectorized) code, and match or outperform a state-of-the-art GPGPU implementation. Our study sheds new light on the form of a more general performance analysis and tuning process that other multicore/manycore tuning practitioners (end-user programmers) and automated performance analysis and tuning tools could themselves apply.",
author = "Aparna Chandramowlishwaran and Kamesh Madduri and Richard Vuduc",
year = "2010",
month = "12",
day = "1",
doi = "10.1109/SC.2010.19",
language = "English (US)",
isbn = "9781424475575",
series = "2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010",
booktitle = "2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010",

}

Chandramowlishwaran, A, Madduri, K & Vuduc, R 2010, Diagnosis, tuning, and redesign for multicore performance: A case study of the fast multipole method. in 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010., 5644891, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010, New Orleans, LA, United States, 11/13/10. https://doi.org/10.1109/SC.2010.19

Diagnosis, tuning, and redesign for multicore performance: A case study of the fast multipole method. / Chandramowlishwaran, Aparna; Madduri, Kamesh; Vuduc, Richard.

2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010. 2010. 5644891 (2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010).


TY - GEN

T1 - Diagnosis, tuning, and redesign for multicore performance

T2 - A case study of the fast multipole method

AU - Chandramowlishwaran, Aparna

AU - Madduri, Kamesh

AU - Vuduc, Richard

PY - 2010/12/1

Y1 - 2010/12/1

N2 - Given a program and a multisocket, multicore system, what is the process by which one understands and improves its performance and scalability? We describe an approach in the context of improving within-node scalability of the fast multipole method (FMM). Our process consists of a systematic sequence of modeling, analysis, and tuning steps, beginning with simple models, and gradually increasing their complexity in the quest for deeper performance understanding and better scalability. For the FMM, we significantly improve within-node scalability; for example, on a quad-socket Intel Nehalem-EX system, we show speedups of 1.7× over the previous best multithreaded implementation, 19.3× over a sequential but highly tuned (e.g., SIMD-vectorized) code, and match or outperform a state-of-the-art GPGPU implementation. Our study sheds new light on the form of a more general performance analysis and tuning process that other multicore/manycore tuning practitioners (end-user programmers) and automated performance analysis and tuning tools could themselves apply.

AB - Given a program and a multisocket, multicore system, what is the process by which one understands and improves its performance and scalability? We describe an approach in the context of improving within-node scalability of the fast multipole method (FMM). Our process consists of a systematic sequence of modeling, analysis, and tuning steps, beginning with simple models, and gradually increasing their complexity in the quest for deeper performance understanding and better scalability. For the FMM, we significantly improve within-node scalability; for example, on a quad-socket Intel Nehalem-EX system, we show speedups of 1.7× over the previous best multithreaded implementation, 19.3× over a sequential but highly tuned (e.g., SIMD-vectorized) code, and match or outperform a state-of-the-art GPGPU implementation. Our study sheds new light on the form of a more general performance analysis and tuning process that other multicore/manycore tuning practitioners (end-user programmers) and automated performance analysis and tuning tools could themselves apply.

UR - http://www.scopus.com/inward/record.url?scp=78650822594&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78650822594&partnerID=8YFLogxK

U2 - 10.1109/SC.2010.19

DO - 10.1109/SC.2010.19

M3 - Conference contribution

SN - 9781424475575

T3 - 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010

BT - 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010

ER -

Chandramowlishwaran A, Madduri K, Vuduc R. Diagnosis, tuning, and redesign for multicore performance: A case study of the fast multipole method. In 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010. 2010. 5644891. (2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010). https://doi.org/10.1109/SC.2010.19