TY - GEN
T1 - Diagnosis, tuning, and redesign for multicore performance
T2 - 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010
AU - Chandramowlishwaran, Aparna
AU - Madduri, Kamesh
AU - Vuduc, Richard
PY - 2010
Y1 - 2010
N2 - Given a program and a multisocket, multicore system, what is the process by which one understands and improves its performance and scalability? We describe an approach in the context of improving within-node scalability of the fast multipole method (FMM). Our process consists of a systematic sequence of modeling, analysis, and tuning steps, beginning with simple models, and gradually increasing their complexity in the quest for deeper performance understanding and better scalability. For the FMM, we significantly improve within-node scalability; for example, on a quad-socket Intel Nehalem-EX system, we show speedups of 1.7× over the previous best multithreaded implementation, 19.3× over a sequential but highly tuned (e.g., SIMD-vectorized) code, and match or outperform a state-of-the-art GPGPU implementation. Our study sheds new light on the form of a more general performance analysis and tuning process that other multicore/manycore tuning practitioners (end-user programmers) and automated performance analysis and tuning tools could themselves apply.
AB - Given a program and a multisocket, multicore system, what is the process by which one understands and improves its performance and scalability? We describe an approach in the context of improving within-node scalability of the fast multipole method (FMM). Our process consists of a systematic sequence of modeling, analysis, and tuning steps, beginning with simple models, and gradually increasing their complexity in the quest for deeper performance understanding and better scalability. For the FMM, we significantly improve within-node scalability; for example, on a quad-socket Intel Nehalem-EX system, we show speedups of 1.7× over the previous best multithreaded implementation, 19.3× over a sequential but highly tuned (e.g., SIMD-vectorized) code, and match or outperform a state-of-the-art GPGPU implementation. Our study sheds new light on the form of a more general performance analysis and tuning process that other multicore/manycore tuning practitioners (end-user programmers) and automated performance analysis and tuning tools could themselves apply.
UR - http://www.scopus.com/inward/record.url?scp=78650822594&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78650822594&partnerID=8YFLogxK
U2 - 10.1109/SC.2010.19
DO - 10.1109/SC.2010.19
M3 - Conference contribution
AN - SCOPUS:78650822594
SN - 9781424475575
T3 - 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010
BT - 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010
Y2 - 13 November 2010 through 19 November 2010
ER -