Abstract
The rapid growth in model sizes and training datasets has led researchers to focus on distributed deep learning to accelerate the training process. Bulk Synchronous Parallel (BSP) and Asynchronous Parallel (ASP) are two fundamental synchronization paradigms employed in distributed training. BSP allows workers to iterate synchronously but is prone to the straggler problem. In contrast, ASP enables asynchronous iteration, but training with stale gradients can reduce statistical efficiency. This paper introduces a cluster-based, hierarchical, and hybrid synchronization scheme designed to mitigate the straggler effect and enhance resource utilization in heterogeneous training workloads. We define performance metrics for the communication and computation capabilities of workers, and then cluster the workers based on their performance scores. The clusters are placed on a hierarchical tree in which the slower clusters are placed at the deeper levels and the more performant clusters are positioned closer to the root. Workers within the same cluster adopt BSP utilizing ring allreduce, while inter-cluster communication is facilitated asynchronously through the master node in each cluster. This approach aims to minimize waiting times among workers and effectively overlap communication and computation. Experiments conducted on a toy CNN model and the Fashion MNIST dataset demonstrate that our method achieves convergence 1.76 and 1.93 times faster than BSP and ASP, respectively.
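The clustering and placement steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Worker` fields, the weighted `score` function, the equal-size split, and the chain-shaped hierarchy are all assumptions made for illustration, since the abstract does not specify the actual metrics, clustering method, or tree shape.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    compute: float  # hypothetical computation score, higher is faster
    comm: float     # hypothetical communication score, higher is faster


def score(w: Worker, alpha: float = 0.5) -> float:
    # Assumed weighted combination of the two capabilities; the paper's
    # actual performance metric is not given in the abstract.
    return alpha * w.compute + (1 - alpha) * w.comm


def cluster_by_score(workers, k):
    """Rank workers fastest-first and cut the ranking into k contiguous groups."""
    ranked = sorted(workers, key=score, reverse=True)
    size, extra = divmod(len(ranked), k)
    clusters, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        clusters.append(ranked[start:end])
        start = end
    return [c for c in clusters if c]  # drop empty groups when k > len(workers)


def build_hierarchy(clusters):
    """Place the fastest cluster at the root (level 0) and slower ones deeper.

    A simple chain is used here as a degenerate tree: each cluster's parent
    is the next-faster cluster. The first (fastest) member of each cluster is
    picked as its master, which is also an assumption.
    """
    return [
        {
            "level": i,
            "master": c[0].name,
            "members": [w.name for w in c],
            "parent": i - 1 if i > 0 else None,
        }
        for i, c in enumerate(clusters)
    ]


if __name__ == "__main__":
    workers = [
        Worker("a", 10, 10),
        Worker("b", 1, 1),
        Worker("c", 9, 9),
        Worker("d", 2, 2),
    ]
    for node in build_hierarchy(cluster_by_score(workers, k=2)):
        print(node)
```

In a scheme like the one the abstract describes, the members of each node would synchronize gradients with ring allreduce (BSP), while each master would exchange updates asynchronously with its parent's master. The sketch covers only the grouping and placement, not the communication.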
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 104-111 |
| Number of pages | 8 |
| ISBN (Electronic) | 9798331524937 |
| Publication status | Published - 2025 |
| Event | 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025 - Turin, Italy; Duration: 12 Mar 2025 → 14 Mar 2025 |
Publication series
| Name | Proceedings - 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025 |
|---|---|
Conference
| Conference | 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025 |
|---|---|
| Country/Territory | Italy |
| City | Turin |
| Period | 12/03/25 → 14/03/25 |
Bibliographical note
Publisher Copyright: © 2025 IEEE.
Keywords
- Data parallelism
- Distributed deep learning
- Straggler mitigation