brickster.ai
News · Databricks · July 19, 2022

Tackling Challenges of Distributed Deep Learning with Open Source Solutions

Description

Deep learning has had an enormous impact in a variety of domains. However, with model and data sizes growing at a rapid pace, scaling out deep learning training has become essential for practical use. In this talk, you will learn about the challenges of distributed deep learning and various solutions to them. We will first cover some of the common patterns used to scale out deep learning training. We will then describe some of the challenges with distributed deep learning in practice:

- Infrastructure and hardware management. Spending too much time managing clusters, resources, and the scheduling/placement of jobs or processes.
- Developer iteration speed. Too much overhead to go from small-scale local ML development to large-scale training. Hard to run distributed training jobs in a notebook/interactive environment.
- Difficulty integrating with open source software. Scaling out training while still being able to leverage open source tools such as MLflow, PyTorch Lightning, and Hugging Face.
- Managing large-scale training data. Efficiently ingesting large amounts of training data into a distributed machine learning model.
- Cloud compute costs. Leveraging cheaper spot instances, without having to re

Description from YouTube. Full content on the video page.
