Brief Summaries

One-page overview: problem, method, results, relevance

1. Megatron LM
2. hopper
3. ZeRO
4. Switch Transformers
5. PipeDream
6. nnScaler
7. p3
8. GPipe
9. BitNet LLM microsoft
10. AutoCCL
11. Efficient Schedule Construction for Distributed Execution of Large DNN Models
12. A3C Asynchronous Methods for Deep Reinforcement Learning
13. Demystifying NCCL
14. EMLIO
15. GPU Perf modeling LLM
16. Immediate Comm Dist tasks GPU
17. MSCCL++
18. GPU Initiated net NCCL
19. pensieve sigcomm17
20. CollCommConfigSurvey
21. NCCLX
22. CollCommPerfEval4DDLTraining
23. HiCCL
24. gZCCL
25. R2CCL
26. BigSendOff resilientAndPerformant
27. Survey CommEfficientDDL
28. Survey CommEfficientLargeScaleDDL
29. Survey Communication Optimization Algorithms for Distributed Deep Learning Systems A Survey
30. Survey Communication Optimization for Distributed Training Architecture Advances and Opportunities
31. Survey DistributedMachineLearning
32. 0030 Survey QuantitativeSurveyCommunicationOptimizationsInDDL
33. SCCL
34. TACCL
35. MSCCLang
36. GC3
37. TE CCL
38. SyCCL
39. Chakra
40. Crux
41. BytePS
42. Horovod
43. Poseidon
44. SparCML
45. 1Bit Adam
46. near optimal sparse allReduce
47. efficient largescale modelTraining
48. MegaScale
49. C4 enhancing LScale AI training
50. HPCA collectives
51. Unified Collective Communication UCC An Unified Library for CPU GPU and DPU Collectives
52. MVAPICH2 GDR
53. GPU2GPU communication