Talk title: Optimizing Placement and Scheduling of Distributed DNN Training in AI Clouds
Format: Tencent Meeting, ID: 633 837 619
Abstract: Deep learning models (e.g., DNNs) power a wide range of AI-driven applications. To train such models on large amounts of data, distributed training using data or model parallelism has been widely adopted, mostly over homogeneous resources (identical GPU models, symmetric network bandwidth). However, heterogeneous training environments are common in shared data centers and clouds, where GPUs of different models are purchased in different batches and network connections differ in available bandwidth (e.g., due to contention and different physical topologies). Classic data parallelism does not work well in a heterogeneous environment, while model-parallel training is hard to plan. We design algorithms and systems that enable highly efficient distributed training over heterogeneous available resources, from the perspectives of fine-grained placement and replication of operators in a DNN graph, construction of the parameter synchronisation topology, and execution scheduling. We compare against state-of-the-art designs and show substantial training speed-up with our algorithms and systems.
Chuan Wu received her B.Engr. and M.Engr. degrees in 2000 and 2002 from the Department of Computer Science and Technology, Tsinghua University, China, and her Ph.D. degree in 2008 from the Department of Electrical and Computer Engineering, University of Toronto, Canada. Between 2002 and 2004, she worked in the information technology industry in Singapore. Since September 2008, she has been with the Department of Computer Science at the University of Hong Kong, where she is currently a Professor. Her current research is in the areas of distributed machine learning systems and algorithms, and intelligent elderly care technologies. She is a senior member of IEEE, a member of ACM, and served as the Chair of the Interest Group on Multimedia services and applications over Emerging Networks (MEN) of the IEEE Multimedia Communication Technical Committee (MMTC) from 2012 to 2014. She is an associate editor of ACM/IEEE Transactions on Networking and IEEE Transactions on Cloud Computing. She was a co-recipient of the best paper awards of HotPOST 2012 and ACM e-Energy 2016.