Abstract
The rapid growth in model sizes and training datasets has led researchers to focus on distributed deep learning to accelerate the training process. Bulk Synchronous Parallel (BSP) and Asynchronous Parallel (ASP) are two fundamental synchronization paradigms employed in distributed training. BSP has workers iterate in lockstep, which makes it prone to the straggler problem: every worker waits for the slowest one. In contrast, ASP lets workers iterate asynchronously, but training with stale gradients can reduce statistical efficiency. This paper introduces a cluster-based, hierarchical, hybrid synchronization scheme designed to mitigate the straggler effect and improve resource utilization in heterogeneous training workloads. We define performance metrics for the communication and computation capabilities of workers, and then cluster the workers based on their performance scores. The clusters are arranged in a hierarchical tree, with slower clusters placed at the deeper levels and faster clusters positioned closer to the root. Workers within the same cluster follow BSP using ring allreduce, while inter-cluster communication is carried out asynchronously through the master node of each cluster. This approach aims to minimize waiting times among workers and to overlap communication and computation effectively. Experiments conducted on a toy CNN model and the Fashion MNIST dataset demonstrate that our method converges 1.76 and 1.93 times faster than BSP and ASP, respectively.
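The clustering and placement steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual metrics, scoring formula, or clustering algorithm, so everything here is an assumption made for illustration: the `Worker` fields, the weighted-sum `score`, the equal-size split of ranked workers, and the simplification of the tree to one cluster per level.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    compute: float    # hypothetical metric, e.g. samples processed per second
    bandwidth: float  # hypothetical metric, e.g. MB/s to peers


def score(w, max_compute, max_bandwidth, alpha=0.5):
    # Assumed scoring rule: weighted sum of the two normalized metrics.
    return alpha * w.compute / max_compute + (1 - alpha) * w.bandwidth / max_bandwidth


def cluster_by_score(workers, num_clusters, alpha=0.5):
    """Rank workers by score and split them into contiguous groups."""
    if not 1 <= num_clusters <= len(workers):
        raise ValueError("num_clusters must be between 1 and len(workers)")
    max_c = max(w.compute for w in workers)
    max_b = max(w.bandwidth for w in workers)
    ranked = sorted(workers, key=lambda w: score(w, max_c, max_b, alpha), reverse=True)
    size, extra = divmod(len(ranked), num_clusters)
    clusters, start = [], 0
    for i in range(num_clusters):
        end = start + size + (1 if i < extra else 0)
        clusters.append(ranked[start:end])
        start = end
    return clusters  # fastest cluster first


def build_hierarchy(clusters):
    """Place the fastest cluster at level 0 (root) and slower ones deeper.

    Simplification: one cluster per level, so the tree is a chain. The
    highest-scoring worker of each cluster is taken as its master node.
    """
    return [
        {"level": level, "master": c[0].name, "members": [w.name for w in c]}
        for level, c in enumerate(clusters)
    ]


workers = [
    Worker("a", 100, 10),
    Worker("b", 50, 5),
    Worker("c", 90, 9),
    Worker("d", 20, 2),
]
hierarchy = build_hierarchy(cluster_by_score(workers, num_clusters=2))
```

In this sketch, the members of each entry would synchronize among themselves with BSP (for example via ring allreduce), while each `master` would exchange updates asynchronously with the cluster one level above it.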
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 104-111 |
| Number of pages | 8 |
| ISBN (Electronic) | 9798331524937 |
| DOIs | |
| Publication status | Published - 2025 |
| Event | 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025 - Turin, Italy. Duration: 12 Mar 2025 → 14 Mar 2025 |
Publication series
| Name | Proceedings - 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025 |
|---|---|
Conference
| Conference | 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025 |
|---|---|
| Country/Territory | Italy |
| City | Turin |
| Period | 12/03/25 → 14/03/25 |
Bibliographic note
Publisher Copyright: © 2025 IEEE.
Paper title: Straggler Mitigation in Distributed Deep Learning: A Cluster-Based Hybrid Synchronization Approach