CGX: Adaptive system support for communication-efficient deep learning
Markov I, Ramezanikebrya H, Alistarh D-A. 2022. CGX: Adaptive system support for communication-efficient deep learning. Proceedings of the 23rd ACM/IFIP International Middleware Conference. Middleware: International Middleware Conference, 241–254.
Download
Conference Paper
| Published
| English
Author
Department
Abstract
The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient point-to-point communication, and in particular via hardware bandwidth over-provisioning. Overprovisioning comes at a cost: there is an order of magnitude price difference between "cloud-grade" servers with such support, relative to their popular "consumer-grade" counterparts, although single server-grade and consumer-grade GPUs can have similar computational envelopes.
In this paper, we show that the costly hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for compressed communication in ML applications, for both multi-GPU single-node training, as well as larger-scale multi-node training. CGX is based on two technical advances: At the system level, it relies on a re-developed communication stack for ML frameworks, which provides flexible, highly-efficient support for compressed communication. At the application level, it provides seamless, parameter-free integration with popular frameworks, so that end-users do not have to modify training recipes, nor significant training code. This is complemented by a layer-wise adaptive compression technique which dynamically balances compression gains with accuracy preservation. CGX integrates with popular ML frameworks, providing up to 3X speedups for multi-GPU nodes based on commodity hardware, and order-of-magnitude improvements in the multi-node setting, with negligible impact on accuracy.
Publishing Year
Date Published
2022-11-01
Proceedings Title
Proceedings of the 23rd ACM/IFIP International Middleware Conference
Publisher
Association for Computing Machinery
Acknowledgement
The authors sincerely thank Nikoli Dryden, Tal Ben-Nun, Torsten Hoefler and Bapi Chatterjee for useful discussions throughout the development of this project.
Page
241-254
Conference
Middleware: International Middleware Conference
Conference Location
Quebec, QC, Canada
Conference Date
2022-11-07 – 2022-11-11
ISBN
IST-REx-ID
Cite this
Markov I, Ramezanikebrya H, Alistarh D-A. CGX: Adaptive system support for communication-efficient deep learning. In: Proceedings of the 23rd ACM/IFIP International Middleware Conference. Association for Computing Machinery; 2022:241-254. doi:10.1145/3528535.3565248
Markov, I., Ramezanikebrya, H., & Alistarh, D.-A. (2022). CGX: Adaptive system support for communication-efficient deep learning. In Proceedings of the 23rd ACM/IFIP International Middleware Conference (pp. 241–254). Quebec, QC, Canada: Association for Computing Machinery. https://doi.org/10.1145/3528535.3565248
Markov, Ilia, Hamidreza Ramezanikebrya, and Dan-Adrian Alistarh. “CGX: Adaptive System Support for Communication-Efficient Deep Learning.” In Proceedings of the 23rd ACM/IFIP International Middleware Conference, 241–54. Association for Computing Machinery, 2022. https://doi.org/10.1145/3528535.3565248.
I. Markov, H. Ramezanikebrya, and D.-A. Alistarh, “CGX: Adaptive system support for communication-efficient deep learning,” in Proceedings of the 23rd ACM/IFIP International Middleware Conference, Quebec, QC, Canada, 2022, pp. 241–254.
Markov I, Ramezanikebrya H, Alistarh D-A. 2022. CGX: Adaptive system support for communication-efficient deep learning. Proceedings of the 23rd ACM/IFIP International Middleware Conference. Middleware: International Middleware Conference, 241–254.
Markov, Ilia, et al. “CGX: Adaptive System Support for Communication-Efficient Deep Learning.” Proceedings of the 23rd ACM/IFIP International Middleware Conference, Association for Computing Machinery, 2022, pp. 241–54, doi:10.1145/3528535.3565248.
All files available under the following license(s):
Creative Commons Attribution 4.0 International Public License (CC-BY 4.0):
Main File(s)
File Name
2022_ACMMiddleware_Markov.pdf
1.51 MB
Access Level
Open Access
Date Uploaded
2023-04-03
MD5 Checksum
1a397746235f245da5468819247ff663
Export
Marked PublicationsOpen Data ISTA Research Explorer
Sources
arXiv 2111.08617