---
_id: '7437'
abstract:
- lang: eng
  text: 'Most of today''s distributed machine learning systems assume reliable networks:
    whenever two machines exchange information (e.g., gradients or models), the network
    should guarantee the delivery of the message. At the same time, recent work
    has demonstrated the impressive tolerance of machine learning algorithms to
    errors or noise arising from relaxed communication or synchronization. In this
    paper, we connect these
    two trends, and consider the following question: Can we design machine learning
    systems that are tolerant to network unreliability during training? With this
    motivation, we focus on a theoretical problem of independent interest: given
    a standard distributed parameter-server architecture, if every communication
    between a worker and a server has a non-zero probability p of being dropped, does
    there exist an algorithm that still converges, and at what speed? The technical
    contribution of this paper is a novel theoretical analysis proving that distributed
    learning over an unreliable network can achieve a convergence rate comparable
    to that of centralized or distributed learning over reliable networks. Further,
    we prove that the influence of the packet drop rate diminishes as the number
    of parameter servers grows. We map this theoretical result onto a real-world
    scenario, training deep neural networks over an unreliable network layer, and
    conduct network simulations to validate the system-level benefits of allowing
    the network to be unreliable.'
article_processing_charge: No
arxiv: 1
author:
- first_name: Chen
  full_name: Yu, Chen
  last_name: Yu
- first_name: Hanlin
  full_name: Tang, Hanlin
  last_name: Tang
- first_name: Cedric
  full_name: Renggli, Cedric
  last_name: Renggli
- first_name: Simon
  full_name: Kassing, Simon
  last_name: Kassing
- first_name: Ankit
  full_name: Singla, Ankit
  last_name: Singla
- first_name: Dan-Adrian
  full_name: Alistarh, Dan-Adrian
  id: 4A899BFC-F248-11E8-B48F-1D18A9856A87
  last_name: Alistarh
  orcid: 0000-0003-3650-940X
- first_name: Ce
  full_name: Zhang, Ce
  last_name: Zhang
- first_name: Ji
  full_name: Liu, Ji
  last_name: Liu
citation:
  ama: 'Yu C, Tang H, Renggli C, et al. Distributed learning over unreliable networks.
    In: <i>36th International Conference on Machine Learning, ICML 2019</i>. Vol 2019-June.
    IMLS; 2019:12481-12512.'
  apa: 'Yu, C., Tang, H., Renggli, C., Kassing, S., Singla, A., Alistarh, D.-A., …
    Liu, J. (2019). Distributed learning over unreliable networks. In <i>36th International
    Conference on Machine Learning, ICML 2019</i> (Vol. 2019–June, pp. 12481–12512).
    Long Beach, CA, United States: IMLS.'
  chicago: Yu, Chen, Hanlin Tang, Cedric Renggli, Simon Kassing, Ankit Singla, Dan-Adrian
    Alistarh, Ce Zhang, and Ji Liu. “Distributed Learning over Unreliable Networks.”
    In <i>36th International Conference on Machine Learning, ICML 2019</i>, 2019–June:12481–512.
    IMLS, 2019.
  ieee: C. Yu <i>et al.</i>, “Distributed learning over unreliable networks,” in <i>36th
    International Conference on Machine Learning, ICML 2019</i>, Long Beach, CA, United
    States, 2019, vol. 2019–June, pp. 12481–12512.
  ista: 'Yu C, Tang H, Renggli C, Kassing S, Singla A, Alistarh D-A, Zhang C, Liu
    J. 2019. Distributed learning over unreliable networks. 36th International Conference
    on Machine Learning, ICML 2019. ICML: International Conference on Machine Learning
    vol. 2019–June, 12481–12512.'
  mla: Yu, Chen, et al. “Distributed Learning over Unreliable Networks.” <i>36th International
    Conference on Machine Learning, ICML 2019</i>, vol. 2019–June, IMLS, 2019, pp.
    12481–512.
  short: C. Yu, H. Tang, C. Renggli, S. Kassing, A. Singla, D.-A. Alistarh, C. Zhang,
    J. Liu, in:, 36th International Conference on Machine Learning, ICML 2019, IMLS,
    2019, pp. 12481–12512.
conference:
  end_date: 2019-06-15
  location: Long Beach, CA, United States
  name: 'ICML: International Conference on Machine Learning'
  start_date: 2019-06-10
date_created: 2020-02-02T23:01:06Z
date_published: 2019-06-01T00:00:00Z
date_updated: 2023-09-06T15:21:48Z
day: '01'
department:
- _id: DaAl
external_id:
  arxiv:
  - '1810.07766'
  isi:
  - '000684034307036'
isi: 1
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://arxiv.org/abs/1810.07766
month: '06'
oa: 1
oa_version: Preprint
page: 12481-12512
publication: 36th International Conference on Machine Learning, ICML 2019
publication_identifier:
  isbn:
  - '9781510886988'
publication_status: published
publisher: IMLS
quality_controlled: '1'
scopus_import: '1'
status: public
title: Distributed learning over unreliable networks
type: conference
user_id: c635000d-4b10-11ee-a964-aac5a93f6ac1
volume: 2019-June
year: '2019'
...
