Brief Summaries
One-page overview: problem, method, results, relevance
1.Megatron-LM
2.Hopper
3.ZeRO
4.Switch Transformers
5.PipeDream
6.nnScaler
7.P3
8.GPipe
9.BitNet LLM (Microsoft)
10.AutoCCL
11.Efficient Schedule Construction for Distributed Execution of Large DNN Models
12.A3C Asynchronous Methods for Deep Reinforcement Learning
13.Demystifying NCCL
14.EMLIO
15.GPU Perf modeling LLM
16.Immediate Comm Dist tasks GPU
17.MSCCL++
18.GPU Initiated net NCCL
19.Pensieve (SIGCOMM '17)
20.CollCommConfigSurvey
21.NCCLX
22.CollCommPerfEval4DDLTraining
23.HiCCL
24.gZCCL
25.R2CCL
26.BigSendOff resilientAndPerformant
27.Survey CommEfficientDDL
28.Survey CommEfficientLargeScaleDDL
29.Survey Communication Optimization Algorithms for Distributed Deep Learning Systems: A Survey
30.Survey Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities
31.Survey DistributedMachineLearning
32.Survey QuantitativeSurveyCommunicationOptimizationsInDDL
33.SCCL
34.TACCL
35.MSCCLang
36.GC3
37.TE-CCL
38.SyCCL
39.Chakra
40.Crux
41.BytePS
42.Horovod
43.Poseidon
44.SparCML
45.1-bit Adam
46.Near-optimal sparse allreduce
47.Efficient large-scale model training
48.MegaScale
49.C4 enhancing LScale AI training
50.HPCA collectives
51.Unified Collective Communication (UCC): An Unified Library for CPU, GPU, and DPU Collectives
52.MVAPICH2-GDR
53.GPU2GPU communication