Current Projects
DynamICCL: Dynamic NCCL Optimization
Deep reinforcement learning for optimizing collective communication operations in distributed GPU training environments.
Overview:
Modern distributed deep learning relies heavily on collective communication
operations such as AllReduce. DynamICCL uses reinforcement learning to
dynamically select NCCL algorithms and protocols based on runtime conditions
such as message size and network congestion.
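As a sketch of the core idea (the state encoding, action set, and class names here are illustrative, not DynamICCL's actual design), one simple formulation is a per-state epsilon-greedy bandit: for each coarse runtime state, such as a message-size bucket plus a congestion level, learn which (algorithm, protocol) pair has the lowest measured latency.

```python
import random
from collections import defaultdict

# Illustrative action space: NCCL exposes algorithms (e.g. Ring, Tree)
# and protocols (e.g. LL, LL128, Simple); the real set is larger.
ACTIONS = [(a, p) for a in ("ring", "tree") for p in ("ll", "ll128", "simple")]

class BanditTuner:
    """Epsilon-greedy selection of an (algorithm, protocol) pair per state.

    A state is a coarse runtime descriptor, e.g. (message-size bucket,
    congestion level). Rewards are negative measured latencies, so the
    greedy choice is the historically fastest action.
    """

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # state -> action -> (sample count, running mean reward)
        self.stats = defaultdict(lambda: {a: (0, 0.0) for a in ACTIONS})

    def select(self, state):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)  # explore
        table = self.stats[state]
        # exploit: highest mean reward (i.e. lowest mean latency)
        return max(ACTIONS, key=lambda a: table[a][1])

    def update(self, state, action, latency_us):
        n, mean = self.stats[state][action]
        reward = -latency_us
        # incremental running-mean update
        self.stats[state][action] = (n + 1, mean + (reward - mean) / (n + 1))

def size_bucket(nbytes):
    """Coarse log2 bucket for message size."""
    return nbytes.bit_length()
```

Untried actions start with mean reward 0.0, which is higher than any negative latency reward, so the greedy path tries every action at least once before settling (optimistic initialization); a full RL treatment would replace the running means with a learned value function.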
Key Contributions:
- Custom NCCL tuner plugin using NVIDIA’s tuner_v4 API
- Performance analysis under varying network congestion levels
- Adaptive parameter selection for different message sizes
- Benchmarking on multi-node GPU clusters
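Recent NCCL tuner-plugin interfaces (tuner_v3/v4) pass the plugin a per-collective cost table of NCCL's own estimates that the plugin may edit to steer the selection; the exact C callback signatures vary by version, so the sketch below only models the flow in Python with hypothetical names: the plugin lowers the cost of its preferred (algorithm, protocol) entry, and NCCL then picks the cheapest valid entry.

```python
# Hypothetical Python model of the tuner cost-table flow; the real
# interface is a C struct of callbacks (init / getCollInfo / destroy)
# declared in NCCL's tuner header, and entry layout differs by version.

NCCL_ALGOS = ["ring", "tree"]           # illustrative subset
NCCL_PROTOS = ["ll", "ll128", "simple"]
DISABLED = float("inf")                 # marks combos NCCL ruled out

def get_coll_info(cost_table, preferred):
    """Plugin hook: bias the table toward a preferred (algo, proto).

    cost_table[i][j] is an estimated cost for algorithm i, protocol j.
    Zeroing the preferred entry makes it the cheapest valid choice,
    which is one way a tuner plugin can force a selection.
    """
    i = NCCL_ALGOS.index(preferred[0])
    j = NCCL_PROTOS.index(preferred[1])
    if cost_table[i][j] != DISABLED:    # never enable a disabled combo
        cost_table[i][j] = 0.0
    return cost_table

def pick_cheapest(cost_table):
    """Model of the post-tuner selection: minimum-cost valid entry."""
    best, best_cost = None, DISABLED
    for i, algo in enumerate(NCCL_ALGOS):
        for j, proto in enumerate(NCCL_PROTOS):
            if cost_table[i][j] < best_cost:
                best, best_cost = (algo, proto), cost_table[i][j]
    return best
```

Leaving disabled entries untouched matters in practice: a combination NCCL marked invalid (e.g. unsupported topology) must stay invalid regardless of what the tuner prefers.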
Technologies:
PyTorch, NCCL, OpenMPI, Python, C++, Chameleon Cloud, Google Cloud Platform
Status: Active research (PhD dissertation project)
Research Interests
My research spans several areas in high-performance computing and machine learning:
- Distributed Machine Learning Systems: optimizing training efficiency at scale
- Network Performance: understanding and improving communication patterns
- Reinforcement Learning for Systems: applying ML to systems optimization
- GPU Computing: leveraging accelerators for scientific computing
Experimental Infrastructure
I conduct experiments using:
- Multi-node GPU clusters with NVIDIA A100 GPUs
- Cloud platforms: Chameleon Cloud, Google Cloud Platform
- Network congestion simulation and measurement tools
- Custom benchmarking frameworks for NCCL operations
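A minimal shape for such a benchmarking harness (the timed workload below is a stand-in; real runs would time NCCL collectives, e.g. via nccl-tests or torch.distributed.all_reduce): sweep message sizes, discard warmup iterations, and report the median to damp congestion noise.

```python
import statistics
import time

def sweep(op, sizes_bytes, warmup=2, iters=5):
    """Time op(nbytes) across message sizes; return {size: median_seconds}.

    Warmup iterations are discarded so one-time setup costs (memory
    registration, connection establishment) do not skew the results.
    """
    results = {}
    for nbytes in sizes_bytes:
        for _ in range(warmup):
            op(nbytes)
        samples = []
        for _ in range(iters):
            t0 = time.perf_counter()
            op(nbytes)
            samples.append(time.perf_counter() - t0)
        results[nbytes] = statistics.median(samples)
    return results

# Stand-in workload; a real harness would launch an AllReduce here.
def fake_allreduce(nbytes):
    bytes(nbytes)  # touch nbytes bytes of memory

if __name__ == "__main__":
    sizes = [1 << s for s in range(10, 24, 4)]  # 1 KiB .. 4 MiB
    for size, sec in sweep(fake_allreduce, sizes).items():
        print(f"{size:>10} B  {sec * 1e6:10.1f} us")
```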