Straggler Mitigation in Distributed Deep Learning: A Cluster-Based Hybrid Synchronization Approach

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The rapid growth in model sizes and training datasets has led researchers to focus on distributed deep learning to accelerate the training process. Bulk Synchronous Parallel (BSP) and Asynchronous Parallel (ASP) are two fundamental synchronization paradigms employed in distributed training. BSP allows workers to iterate synchronously but is prone to the straggler problem. In contrast, ASP enables asynchronous iteration, but training with stale gradients can reduce statistical efficiency. This paper introduces a cluster-based, hierarchical, and hybrid synchronization scheme designed to mitigate the straggler effect and enhance resource utilization in heterogeneous training workloads. We define performance metrics for the communication and computation capabilities of workers, and then cluster the workers based on their performance scores. The clusters are placed on a hierarchical tree in which the slower clusters are placed at the deeper levels and the more performant clusters are positioned closer to the root. Workers within the same cluster adopt BSP using ring allreduce, while inter-cluster communication is handled asynchronously through the master node of each cluster. This approach aims to minimize waiting times among workers and to overlap communication and computation effectively. Experiments conducted on a toy CNN model and the Fashion MNIST dataset demonstrate that our method achieves convergence 1.76 and 1.93 times faster than BSP and ASP, respectively.
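The clustering and tree-placement steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the score formula, the clustering algorithm, or the tree shape, so everything here is an assumption made for illustration: the `Worker` fields, the normalized weighted-sum score in `performance_scores`, the split-at-largest-gaps rule in `cluster_by_score`, the level-order placement in `place_on_tree`, and the choice of the fastest member as the cluster master are all hypothetical and may differ from the paper's actual method.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    compute_rate: float  # hypothetical metric: samples processed per second
    bandwidth: float     # hypothetical metric: MB/s available for communication


def performance_scores(workers, alpha=0.5):
    """Assumed score: weighted sum of max-normalized compute and bandwidth."""
    max_c = max(w.compute_rate for w in workers)
    max_b = max(w.bandwidth for w in workers)
    return {
        w.name: alpha * w.compute_rate / max_c + (1 - alpha) * w.bandwidth / max_b
        for w in workers
    }


def cluster_by_score(scores, k):
    """Assumed rule: rank workers by score and cut at the k-1 largest gaps."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if k >= len(ranked):
        return [[name] for name in ranked]
    gaps = [(scores[ranked[i]] - scores[ranked[i + 1]], i)
            for i in range(len(ranked) - 1)]
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:k - 1])
    clusters, start = [], 0
    for cut in cuts:
        clusters.append(ranked[start:cut + 1])
        start = cut + 1
    clusters.append(ranked[start:])
    return clusters  # ordered from fastest cluster to slowest


def place_on_tree(clusters, fanout=2):
    """Place clusters in level order so faster clusters sit nearer the root."""
    tree = []
    for i, members in enumerate(clusters):
        parent = None if i == 0 else (i - 1) // fanout
        depth = 0 if parent is None else tree[parent]["depth"] + 1
        tree.append({
            "members": members,    # these workers would synchronize with BSP
            "master": members[0],  # assumed: fastest member talks to the parent
            "parent": parent,      # index of the parent cluster, None for root
            "depth": depth,
        })
    return tree


workers = [
    Worker("a", 100, 100), Worker("b", 95, 90),
    Worker("c", 50, 60), Worker("d", 45, 55),
    Worker("e", 10, 20), Worker("f", 12, 15),
]
tree = place_on_tree(cluster_by_score(performance_scores(workers), k=3))
for node in tree:
    print(node)
```

In this sketch, workers with similar scores end up in the same cluster, so a synchronous step inside a cluster would wait only on peers of comparable speed, while the slowest cluster sits at the deepest level of the tree. The actual gradient exchange (ring allreduce within a cluster, asynchronous updates between cluster masters) is not modeled here.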

Original language: English
Title of host publication: Proceedings - 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 104-111
Number of pages: 8
ISBN (Electronic): 9798331524937
Publication status: Published - 2025
Event: 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025 - Turin, Italy
Duration: 12 Mar 2025 to 14 Mar 2025

Publication series

Name: Proceedings - 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025

Conference

Conference: 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025
Country/Territory: Italy
City: Turin
Period: 12/03/25 to 14/03/25

Bibliographical note

Publisher Copyright:
© 2025 IEEE.
