Current Projects

DynamICCL: Dynamic NCCL Optimization

Deep reinforcement learning for optimizing collective communication operations in distributed GPU training environments.

Overview:
Modern distributed deep learning relies heavily on collective communication operations such as AllReduce. NCCL offers several algorithms (e.g., Ring, Tree) and protocols (e.g., Simple, LL, LL128) for these collectives, and the best choice depends on runtime conditions such as message size, topology, and network contention. DynamICCL uses reinforcement learning to select NCCL algorithms and protocols dynamically rather than relying on a single static configuration.
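The dynamic-selection idea can be sketched as a bandit problem over NCCL (algorithm, protocol) pairs. Everything below is an illustrative assumption, not DynamICCL's actual method: the candidate arms, the simulated latency model standing in for real AllReduce timings, and the epsilon-greedy policy are all placeholders.

```python
import random

# Candidate NCCL configurations (algorithm, protocol). These are real NCCL
# options, but treating them as bandit arms here is an illustrative sketch.
ARMS = [("Ring", "Simple"), ("Ring", "LL128"),
        ("Tree", "Simple"), ("Tree", "LL128")]

def measure_latency(arm, msg_bytes):
    """Stand-in for timing an AllReduce; returns a simulated latency in ms.

    The coefficients are invented for illustration only.
    """
    algo_cost = {"Ring": 1.0, "Tree": 0.8}[arm[0]]
    proto_cost = {"Simple": 1.0, "LL128": 0.7}[arm[1]]
    return algo_cost * proto_cost * (1 + msg_bytes / 2**20) + random.uniform(0, 0.05)

def epsilon_greedy(steps=500, eps=0.1, msg_bytes=2**20, seed=0):
    """Pick the lowest-latency arm by trial: explore with prob. eps, else exploit."""
    random.seed(seed)
    counts = {a: 0 for a in ARMS}
    means = {a: 0.0 for a in ARMS}
    for _ in range(steps):
        # Explore until every arm has been sampled at least once, then mostly exploit.
        if random.random() < eps or 0 in counts.values():
            arm = random.choice(ARMS)
        else:
            arm = min(means, key=means.get)  # lowest observed mean latency
        lat = measure_latency(arm, msg_bytes)
        counts[arm] += 1
        means[arm] += (lat - means[arm]) / counts[arm]  # running mean update
    return min(means, key=means.get)

if __name__ == "__main__":
    print(epsilon_greedy())
```

A real system would replace measure_latency with wall-clock timings of actual collectives and would condition the choice on runtime state (message size, topology), which is what motivates the full reinforcement-learning formulation.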

Key Contributions:

Technologies:
PyTorch, NCCL, OpenMPI, Python, C++, Chameleon Cloud, Google Cloud Platform
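Among the NCCL knobs a system like this can act on are the algorithm and protocol environment variables. A minimal sketch of pinning them for a single run (NCCL_ALGO and NCCL_PROTO are real NCCL variables; the launch command and script name are placeholders):

```shell
# Pin NCCL's collective algorithm and protocol for one training run.
export NCCL_ALGO=Ring      # alternatives include Tree
export NCCL_PROTO=Simple   # alternatives include LL, LL128
echo "NCCL_ALGO=$NCCL_ALGO NCCL_PROTO=$NCCL_PROTO"

# Placeholder launch command; substitute the actual training script:
# torchrun --nproc_per_node=4 train.py
```

Because NCCL reads these variables when the communicator is initialized, a static export like this fixes one configuration per run; changing the choice at runtime is exactly the harder problem dynamic selection targets.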

Status: Active research (PhD dissertation project)


Research Interests

My research spans several areas in high-performance computing and machine learning:


Experimental Infrastructure

I conduct experiments using: