Block Diagrams
ASCII architecture diagrams, data-flow, design trade-offs, borrowings
1.Megatron LM
2.ZeRO
3.Switch Transformers
4.PipeDream
5.nnScaler
6.P3
7.GPipe
8.BitNet LLM Microsoft
9.AutoCCL
10.Efficient Schedule Construction for Distributed Execution of Large DNN Models
11.A3C Asynchronous Methods for Deep Reinforcement Learning
12.Demystifying NCCL
13.EMLIO
14.GPU Perf modeling LLM
15.Immediate .Comm Dist tasks GPU
16.MSCCL++
17.GPU Initiated net NCCL
18.pensieve sigcomm17
19.CollCommConfigSurvey
20.NCCLX
21.CollCommPerfEval4DDLTraining
22.HiCCL
23.gZCCL
24.R2CCL
25.BigSendOff resilientAndPerformant
26.Survey CommEfficientDDL
27.Survey CommEfficientLargeScaleDDL
28.Survey Communication Optimization Algorithms for Distributed Deep Learning Systems A Survey
29.Survey Communication Optimization for Distributed Training Architecture Advances and Opportunities
30.Survey DistributedMachineLearning
31.Survey QuantitativeSurveyCommunicationOptimizationsInDDL
32.SCCL
33.TACCL
34.MSCCLang
35.GC3
36.TE CCL
37.SyCCL
38.Chakra
39.Crux
40.BytePS
41.Horovod
42.Poseidon
43.SparCML
44.1Bit Adam
45.near optimal sparse allReduce
46.efficient largescale modelTraining
47.MegaScale
48.C4 enhancing LScale AI training
49.HPCA collectives
50.Unified Collective Communication UCC An Unified Library for CPU GPU and DPU Collectives
51.MVAPICH2 GDR
52.GPU2GPU communication