TY - GEN
T1 - ALADDIN: Asymmetric Centralized Training for Distributed Deep Learning
T2 - 30th ACM International Conference on Information and Knowledge Management, CIKM 2021
AU - Ko, Yunyong
AU - Choi, Kibong
AU - Jei, Hyunseung
AU - Lee, Dongwon
AU - Kim, Sang-Wook
N1 - Funding Information:
The work of Sang-Wook Kim was supported by the National Research Foundation of Korea (NRF) under Project Number NRF-2020R1A2B5B03001960, and by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under Project Number 2020-0-01373 (Artificial Intelligence Graduate School Program, Hanyang University). The work of Dongwon Lee was supported by the NSF award #212114824.
Publisher Copyright:
© 2021 ACM.
PY - 2021/10/26
Y1 - 2021/10/26
N2 - To speed up the training of massive deep neural network (DNN) models, distributed training has been widely studied. In general, centralized training, a type of distributed training, suffers from the communication bottleneck between a parameter server (PS) and workers, whereas decentralized training suffers from increased parameter variance among workers, which slows model convergence. Addressing this dilemma, in this work we propose a novel centralized training algorithm, ALADDIN, employing "asymmetric" communication between the PS and workers to resolve the PS bottleneck problem, together with novel updating strategies for both local and global parameters to mitigate the increased-variance problem. Through a convergence analysis, we show that the convergence rate of ALADDIN is O(1/√(nk)) on the non-convex problem, where n is the number of workers and k is the number of training iterations. The empirical evaluation using ResNet-50 and VGG-16 models demonstrates that (1) ALADDIN achieves significantly better training throughput, with up to 191% and 34% improvement over a synchronous algorithm and the state-of-the-art decentralized algorithm, respectively, (2) models trained by ALADDIN converge to accuracies comparable to those of the synchronous algorithm within the shortest time, and (3) the convergence of ALADDIN is robust under various heterogeneous environments.
AB - To speed up the training of massive deep neural network (DNN) models, distributed training has been widely studied. In general, centralized training, a type of distributed training, suffers from the communication bottleneck between a parameter server (PS) and workers, whereas decentralized training suffers from increased parameter variance among workers, which slows model convergence. Addressing this dilemma, in this work we propose a novel centralized training algorithm, ALADDIN, employing "asymmetric" communication between the PS and workers to resolve the PS bottleneck problem, together with novel updating strategies for both local and global parameters to mitigate the increased-variance problem. Through a convergence analysis, we show that the convergence rate of ALADDIN is O(1/√(nk)) on the non-convex problem, where n is the number of workers and k is the number of training iterations. The empirical evaluation using ResNet-50 and VGG-16 models demonstrates that (1) ALADDIN achieves significantly better training throughput, with up to 191% and 34% improvement over a synchronous algorithm and the state-of-the-art decentralized algorithm, respectively, (2) models trained by ALADDIN converge to accuracies comparable to those of the synchronous algorithm within the shortest time, and (3) the convergence of ALADDIN is robust under various heterogeneous environments.
UR - http://www.scopus.com/inward/record.url?scp=85119205605&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119205605&partnerID=8YFLogxK
U2 - 10.1145/3459637.3482412
DO - 10.1145/3459637.3482412
M3 - Conference contribution
AN - SCOPUS:85119205605
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 863
EP - 872
BT - CIKM 2021 - Proceedings of the 30th ACM International Conference on Information and Knowledge Management
PB - Association for Computing Machinery
Y2 - 1 November 2021 through 5 November 2021
ER -