Straggler Mitigation in Distributed Deep Learning: A Cluster-Based Hybrid Synchronization Approach

Mustafa Burak Senyigit*, Deniz Turgay Altilar*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The rapid growth in model sizes and training datasets has led researchers to focus on distributed deep learning to accelerate the training process. Bulk Synchronous Parallel (BSP) and Asynchronous Parallel (ASP) are two fundamental synchronization paradigms employed in distributed training. BSP allows workers to iterate synchronously but is prone to the straggler problem. In contrast, ASP enables asynchronous iteration, but training with stale gradients can reduce statistical efficiency. This paper introduces a cluster-based, hierarchical, and hybrid synchronization scheme designed to mitigate the straggler effect and enhance resource utilization in heterogeneous training workloads. We define performance metrics for the communication and computation capabilities of workers, and then cluster the workers based on their performance scores. The clusters are placed on a hierarchical tree in which the slower clusters are placed at the deeper levels and the more performant clusters are positioned closer to the root. Workers within the same cluster adopt BSP utilizing ring allreduce, while inter-cluster communication is facilitated asynchronously through the master node in each cluster. This approach aims to minimize waiting times among workers and effectively overlap communication and computation. Experiments conducted on a toy CNN model and the Fashion MNIST dataset demonstrate that our method achieves convergence 1.76 and 1.93 times faster than BSP and ASP, respectively.
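The grouping idea in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not specify the performance metrics or the clustering algorithm, so the single `step_time` score, the gap-based splitting in `cluster_by_gap`, and the one-cluster-per-level chain are all assumptions made for illustration. The sketch only shows why synchronizing workers of similar speed reduces the time lost waiting for stragglers; it does not model communication cost or the asynchronous inter-cluster exchange.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    name: str
    step_time: float  # hypothetical score: measured seconds per training iteration


def cluster_by_gap(workers, num_clusters):
    """Group workers of similar speed (assumed 1-D, gap-based clustering).

    Workers are sorted by step time and the sorted list is cut at the
    largest gaps between neighbours. The fastest cluster comes first.
    """
    ordered = sorted(workers, key=lambda w: w.step_time)
    if num_clusters <= 1 or len(ordered) <= 1:
        return [ordered]
    # Index i stands for the gap between ordered[i] and ordered[i + 1].
    gaps = sorted(
        range(len(ordered) - 1),
        key=lambda i: ordered[i + 1].step_time - ordered[i].step_time,
        reverse=True,
    )
    cuts = sorted(gaps[: num_clusters - 1])
    clusters, start = [], 0
    for cut in cuts:
        clusters.append(ordered[start : cut + 1])
        start = cut + 1
    clusters.append(ordered[start:])
    return clusters


def bsp_idle_time(group):
    """Time per iteration that a BSP group spends waiting for its slowest member."""
    slowest = max(w.step_time for w in group)
    return sum(slowest - w.step_time for w in group)


# Made-up heterogeneous workers: three fast, two medium, one straggler.
workers = [
    Worker("w0", 1.0), Worker("w1", 1.1), Worker("w2", 1.2),
    Worker("w3", 2.0), Worker("w4", 2.1),
    Worker("w5", 4.0),
]

clusters = cluster_by_gap(workers, num_clusters=3)

# Assumed simplification: one cluster per tree level, fastest nearest the root.
levels = {w.name: depth for depth, c in enumerate(clusters) for w in c}

global_idle = bsp_idle_time(workers)                      # one BSP group for everyone
clustered_idle = sum(bsp_idle_time(c) for c in clusters)  # BSP only inside each cluster

print("clusters:", [[w.name for w in c] for c in clusters])
print("tree depth per worker:", levels)
print(f"idle time per iteration, global BSP: {global_idle:.1f} s")
print(f"idle time per iteration, per-cluster BSP: {clustered_idle:.1f} s")
```

With these made-up timings, a single global BSP barrier wastes about 12.6 worker-seconds per iteration, while barriers confined to the three clusters waste about 0.4. The remaining cost in the scheme described above is the staleness introduced by asynchronous exchange between clusters, which this sketch leaves out.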

Original language: English
Title of host publication: Proceedings - 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 104-111
Number of pages: 8
ISBN (Electronic): 9798331524937
Publication status: Published - 2025
Event: 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025 - Turin, Italy
Duration: 12 Mar 2025 → 14 Mar 2025

Publication series

Name: Proceedings - 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025

Conference

Conference: 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2025
Country/Territory: Italy
City: Turin
Period: 12/03/25 → 14/03/25

Bibliographical note

Publisher Copyright:
© 2025 IEEE.

Keywords

  • Data parallelism
  • Distributed deep learning
  • Straggler mitigation
