Current Projects
DynamICCL: Dynamic NCCL Optimization
Deep reinforcement learning for optimizing collective communication operations in distributed GPU training environments.
Overview:
Modern distributed deep learning relies heavily on collective communication
operations such as AllReduce. DynamICCL uses reinforcement learning to
dynamically select NCCL algorithms and protocols based on runtime conditions
such as message size and network congestion.
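As a sketch of the core idea (the state encoding, action set, and class names here are illustrative, not DynamICCL's actual design), one simple formulation is a per-state epsilon-greedy bandit: for each coarse runtime state, such as a message-size bucket plus a congestion level, learn which (algorithm, protocol) pair has the lowest measured latency.

```python
import random
from collections import defaultdict

# Illustrative action space: NCCL exposes algorithms (e.g. Ring, Tree)
# and protocols (e.g. LL, LL128, Simple); the real set is larger.
ACTIONS = [(a, p) for a in ("ring", "tree") for p in ("ll", "ll128", "simple")]

class BanditTuner:
    """Epsilon-greedy selection of an (algorithm, protocol) pair per state.

    A state is a coarse runtime descriptor, e.g. (message-size bucket,
    congestion level). Rewards are negative measured latencies, so the
    greedy choice is the historically fastest action.
    """

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # state -> action -> (sample count, running mean reward)
        self.stats = defaultdict(lambda: {a: (0, 0.0) for a in ACTIONS})

    def select(self, state):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)  # explore
        table = self.stats[state]
        # exploit: highest mean reward (i.e. lowest mean latency)
        return max(ACTIONS, key=lambda a: table[a][1])

    def update(self, state, action, latency_us):
        n, mean = self.stats[state][action]
        reward = -latency_us
        # incremental running-mean update
        self.stats[state][action] = (n + 1, mean + (reward - mean) / (n + 1))

def size_bucket(nbytes):
    """Coarse log2 bucket for message size."""
    return nbytes.bit_length()
```

Untried actions start with mean reward 0.0, which is higher than any negative latency reward, so the greedy path tries every action at least once before settling (optimistic initialization); a full RL treatment would replace the running means with a learned value function.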
Key Contributions:
- Custom NCCL tuner plugin using NVIDIA’s tuner_v4 API
- Performance analysis under varying network congestion levels
- Adaptive parameter selection for different message sizes
- Benchmarking on multi-node GPU clusters
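Recent NCCL tuner-plugin interfaces (tuner_v3/v4) pass the plugin a per-collective cost table of NCCL's own estimates that the plugin may edit to steer the selection; the exact C callback signatures vary by version, so the sketch below only models the flow in Python with hypothetical names: the plugin lowers the cost of its preferred (algorithm, protocol) entry, and NCCL then picks the cheapest valid entry.

```python
# Hypothetical Python model of the tuner cost-table flow; the real
# interface is a C struct of callbacks (init / getCollInfo / destroy)
# declared in NCCL's tuner header, and entry layout differs by version.

NCCL_ALGOS = ["ring", "tree"]           # illustrative subset
NCCL_PROTOS = ["ll", "ll128", "simple"]
DISABLED = float("inf")                 # marks combos NCCL ruled out

def get_coll_info(cost_table, preferred):
    """Plugin hook: bias the table toward a preferred (algo, proto).

    cost_table[i][j] is an estimated cost for algorithm i, protocol j.
    Zeroing the preferred entry makes it the cheapest valid choice,
    which is one way a tuner plugin can force a selection.
    """
    i = NCCL_ALGOS.index(preferred[0])
    j = NCCL_PROTOS.index(preferred[1])
    if cost_table[i][j] != DISABLED:    # never enable a disabled combo
        cost_table[i][j] = 0.0
    return cost_table

def pick_cheapest(cost_table):
    """Model of the post-tuner selection: minimum-cost valid entry."""
    best, best_cost = None, DISABLED
    for i, algo in enumerate(NCCL_ALGOS):
        for j, proto in enumerate(NCCL_PROTOS):
            if cost_table[i][j] < best_cost:
                best, best_cost = (algo, proto), cost_table[i][j]
    return best
```

Leaving disabled entries untouched matters in practice: a combination NCCL marked invalid (e.g. unsupported topology) must stay invalid regardless of what the tuner prefers.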
Technologies:
PyTorch, NCCL, OpenMPI, Python, C++, Chameleon Cloud, Google Cloud Platform
Status: Active research (PhD dissertation project)
Research Interests
My research spans several areas in high-performance computing and machine learning:
- Distributed Machine Learning Systems: optimizing training efficiency at scale
- Network Performance: understanding and improving communication patterns
- Reinforcement Learning for Systems: applying ML to systems optimization
- GPU Computing: leveraging accelerators for scientific computing
Experimental Infrastructure
I conduct experiments using:
- Multi-node GPU clusters with NVIDIA A100 GPUs
- Cloud platforms: Chameleon Cloud, Google Cloud Platform
- Network congestion simulation and measurement tools
- Custom benchmarking frameworks for NCCL operations
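A minimal shape for such a benchmarking harness (the timed workload below is a stand-in; real runs would time NCCL collectives, e.g. via nccl-tests or torch.distributed.all_reduce): sweep message sizes, discard warmup iterations, and report the median to damp congestion noise.

```python
import statistics
import time

def sweep(op, sizes_bytes, warmup=2, iters=5):
    """Time op(nbytes) across message sizes; return {size: median_seconds}.

    Warmup iterations are discarded so one-time setup costs (memory
    registration, connection establishment) do not skew the results.
    """
    results = {}
    for nbytes in sizes_bytes:
        for _ in range(warmup):
            op(nbytes)
        samples = []
        for _ in range(iters):
            t0 = time.perf_counter()
            op(nbytes)
            samples.append(time.perf_counter() - t0)
        results[nbytes] = statistics.median(samples)
    return results

# Stand-in workload; a real harness would launch an AllReduce here.
def fake_allreduce(nbytes):
    bytes(nbytes)  # touch nbytes bytes of memory

if __name__ == "__main__":
    sizes = [1 << s for s in range(10, 24, 4)]  # 1 KiB .. 4 MiB
    for size, sec in sweep(fake_allreduce, sizes).items():
        print(f"{size:>10} B  {sec * 1e6:10.1f} us")
```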