Brief Summary: EMLIO: Minimizing I/O Latency and Energy Consumption for Large-Scale AI Training

Citation: Hasibul Jamil, MD S Q Zulkar Nine, Tevfik Kosar. University at Buffalo (SUNY) / Tennessee Technological University. SC2025 (Sustainable Supercomputing Workshop), August 2025. arXiv:2508.11035v1.


Problem

Large-scale deep learning training is increasingly data-bound: GPUs process samples faster than data loaders can supply them, making I/O the performance bottleneck. As datasets grow to terabytes or petabytes on remote networked storage, I/O latency compounds with network round-trip time (RTT); at a 30 ms RTT (WAN conditions), the Read+Preprocess stage can account for 90% of per-epoch time. Existing solutions (NVIDIA DALI, PyTorch DataLoader, NoPFS, Lobster) focus on latency reduction but ignore energy consumption, a critical factor at scale given that training GPT-3 requires roughly 1,200 MWh. Each additional millisecond of RTT inflates both I/O latency and I/O energy overhead, yet no prior system jointly optimizes the two.

Core Insight

By moving data batching and serialization to a storage-side daemon and using out-of-order prefetching over parallel TCP/ZeroMQ streams, EMLIO completely hides network RTT from the compute-side training pipeline. Because compute nodes receive pre-batched, in-memory-ready data rather than issuing small random reads, both I/O latency and I/O energy consumption remain nearly constant regardless of network distance — while competing approaches (DALI, PyTorch) degrade by 3x-60x in energy and time as RTT increases.
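
The effect can be illustrated with a rough back-of-envelope model; the sample counts, RTT, and pipeline-depth assumption below are illustrative choices, not figures from the paper:

```python
# Illustrative back-of-envelope model (assumed numbers, not from the paper):
# compare per-epoch I/O stall when every sample pays a remote round trip
# versus when pre-batched payloads are streamed ahead of the consumer.

def per_sample_read_stall(num_samples: int, rtt_s: float) -> float:
    """Each small random read exposes at least one full RTT."""
    return num_samples * rtt_s

def prefetched_stream_stall(num_batches: int, rtt_s: float, pipeline_depth: int = 8) -> float:
    """With deep prefetching, roughly only the pipeline fill exposes the RTT;
    later batches arrive while the GPU is still busy with earlier ones."""
    return min(num_batches, pipeline_depth) * rtt_s

if __name__ == "__main__":
    samples, batch_size, rtt = 1_281_167, 256, 0.030  # ImageNet-scale epoch, 30 ms RTT
    print("per-sample remote reads:", per_sample_read_stall(samples, rtt), "s stalled")
    print("pre-batched streaming:  ", prefetched_stream_stall(samples // batch_size, rtt), "s stalled")
```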

Method

EMLIO is a service-based I/O framework with two main components, a storage-side daemon and a compute-side receiver, plus a distributed energy-monitoring layer:

Storage side: An EMLIO Daemon co-located on each storage node memory-maps TFRecord shards, serializes batches of B records into msgpack payloads, and pushes them over ZeroMQ (TCP) streams with backpressure (high-water marks). A global Planner assigns shard ranges to compute nodes per epoch for correct data-parallel semantics. All I/O stages (read, serialize, network send) run on separate threads in a pipeline.
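
A minimal sketch of the daemon's send path is shown below; the record framing is a simplified stand-in for real TFRecord framing, and the names (serve_shard, iter_records) are assumptions rather than EMLIO's own. EMLIO additionally pipelines read, serialize, and send on separate threads.

```python
# Hedged sketch of a storage-side send path (simplified assumptions, not
# EMLIO's actual code): memory-map a shard, group B records into a batch,
# serialize with msgpack, and push over a ZeroMQ PUSH socket whose
# high-water mark provides backpressure when receivers fall behind.
import mmap
import msgpack
import zmq

def iter_records(buf):
    """Placeholder framing: assumes a 4-byte little-endian length prefix per
    record (real TFRecord framing uses 8-byte lengths plus CRCs)."""
    off = 0
    while off + 4 <= len(buf):
        n = int.from_bytes(buf[off:off + 4], "little")
        off += 4
        yield bytes(buf[off:off + n])
        off += n

def serve_shard(path, endpoint="tcp://0.0.0.0:5555", batch_size=256, hwm=16):
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PUSH)
    sock.setsockopt(zmq.SNDHWM, hwm)   # backpressure: block sends when the queue is full
    sock.bind(endpoint)
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        batch = []
        for rec in iter_records(mm):
            batch.append(rec)
            if len(batch) == batch_size:
                sock.send(msgpack.packb(batch, use_bin_type=True))
                batch = []
        if batch:                       # flush the final partial batch
            sock.send(msgpack.packb(batch, use_bin_type=True))
```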

Compute side: An EMLIO Receiver pulls msgpack batches into a shared in-memory queue. A BatchProvider exposes these as DALI's external_source, feeding the GPU-accelerated preprocessing pipeline (decode, augment, normalize) directly into the PyTorch DDP training loop.
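
A matching compute-side sketch, again with hypothetical names (receiver, batch_provider) and a single receive thread for brevity; it shows how a ZeroMQ PULL socket, a bounded in-memory queue, and DALI's external_source could be wired together ahead of the DDP loop.

```python
# Hedged sketch of the compute-side receive path (names and structure are
# assumptions): a receiver thread pulls msgpack batches into a bounded
# queue, and a zero-argument callable feeds them to DALI as external_source.
import queue
import threading
import msgpack
import numpy as np
import zmq
from nvidia.dali import pipeline_def, fn, types

batch_queue = queue.Queue(maxsize=32)          # shared in-memory staging queue

def receiver(endpoint="tcp://storage-node:5555", hwm=16):
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PULL)
    sock.setsockopt(zmq.RCVHWM, hwm)
    sock.connect(endpoint)
    while True:
        records = msgpack.unpackb(sock.recv(), raw=False)
        # DALI expects one numpy array per encoded sample in the batch
        batch_queue.put([np.frombuffer(r, dtype=np.uint8) for r in records])

def batch_provider():
    """Callable used as DALI's external_source; blocks until a batch arrives."""
    return batch_queue.get()

@pipeline_def(batch_size=256, num_threads=4, device_id=0)
def train_pipeline():
    jpegs = fn.external_source(source=batch_provider, dtype=types.UINT8)
    images = fn.decoders.image(jpegs, device="mixed")     # GPU-accelerated decode
    images = fn.resize(images, resize_x=224, resize_y=224)
    return fn.crop_mirror_normalize(images, dtype=types.FLOAT, output_layout="CHW",
                                    mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                    std=[0.229 * 255, 0.224 * 255, 0.225 * 255])

threading.Thread(target=receiver, daemon=True).start()
pipe = train_pipeline()
pipe.build()   # typically wrapped with a DALI-PyTorch iterator for the DDP loop
```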

Energy monitoring: A distributed EnergyMonitor runs synchronized CPU/DRAM/GPU sampling threads at 100ms intervals using Linux perf stat and NVIDIA NVML, writing to InfluxDB for cross-node correlation.
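
A hedged approximation of such a sampler is sketched below; for brevity it reads CPU package energy from the RAPL powercap sysfs interface instead of driving perf stat, omits DRAM counters, and keeps samples in memory rather than writing them to InfluxDB.

```python
# Assumption-laden stand-in for EMLIO's EnergyMonitor: sample GPU power via
# NVML and CPU package energy via RAPL sysfs every 100 ms.
import time
import pynvml

RAPL_PKG = "/sys/class/powercap/intel-rapl:0/energy_uj"   # CPU package energy counter

def read_rapl_uj():
    with open(RAPL_PKG) as f:
        return int(f.read())

def sample_energy(duration_s=10.0, interval_s=0.1, gpu_index=0):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    samples, last_pkg = [], read_rapl_uj()
    end = time.time() + duration_s
    while time.time() < end:
        time.sleep(interval_s)
        now_pkg = read_rapl_uj()                          # note: counter wraps; ignored here
        gpu_mw = pynvml.nvmlDeviceGetPowerUsage(handle)   # instantaneous power, in mW
        samples.append({
            "t": time.time(),
            "cpu_pkg_j": (now_pkg - last_pkg) / 1e6,      # joules over the interval
            "gpu_w": gpu_mw / 1000.0,
        })
        last_pkg = now_pkg
    pynvml.nvmlShutdown()
    return samples
```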

Key Results

Evaluated on Chameleon Cloud (UC and TACC sites) training ResNet-50 and VGG-19 on ImageNet, COCO, and synthetic 2MB datasets:

Limitations


Relevance to DynamICCL

Low-to-moderate indirect relevance. EMLIO targets the data-loading I/O bottleneck; DynamICCL targets collective communication configuration. However, several connections exist:

  1. Chameleon Cloud shared infrastructure: EMLIO is evaluated on Chameleon Cloud (UC and TACC) — the same infrastructure where DynamICCL is deployed. The node specifications in EMLIO's Table 1 (Intel Xeon Gold 6126, RTX 6000, 10Gbps Ethernet) are directly comparable to DynamICCL's Chameleon deployment, providing useful baseline network characterization (10Gbps, 0.1-30ms RTT range).

  2. Network congestion context: EMLIO's evaluation shows that on 10Gbps Ethernet with 10ms RTT, data-loading I/O traffic from multiple nodes competes for network bandwidth. EMLIO's aggressive prefetching over TCP/ZeroMQ streams generates background traffic that can induce congestion on the same links carrying DynamICCL's NCCL collectives, so Agent-1's congestion detection must be robust to this cross-traffic.

  3. Energy-aware system design: EMLIO demonstrates that energy is a first-class optimization objective for HPC training systems. DynamICCL's reward function currently minimizes collective completion time; EMLIO's findings suggest that future DynamICCL versions could incorporate energy consumption as a secondary reward signal (a hedged sketch follows after this list).

  4. Tennessee Tech connection: MD S Q Zulkar Nine is affiliated with Tennessee Technological University (same institution as this research group), making EMLIO a relevant co-located research context.

  5. Prior RL work cited: Reference [20] in EMLIO cites Jamil et al. (arXiv:2503.13662) — a deep RL approach to data transfer performance and energy efficiency — which is closely related to DynamICCL's RL-for-systems-optimization approach and worth examining separately.
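
To make point 3 concrete, the sketch below shows one way a measured-energy term could fold into a completion-time reward; the function name, weights, and normalization constants are hypothetical and belong to neither DynamICCL nor EMLIO.

```python
# Hypothetical energy-augmented reward (illustrative only): completion time
# stays the primary signal, energy enters as a weighted secondary penalty.

def reward(completion_time_s, energy_j,
           time_ref_s=1.0, energy_ref_j=500.0, energy_weight=0.2):
    # Negative cost: faster, less energy-hungry configurations score higher.
    time_term = completion_time_s / time_ref_s
    energy_term = energy_j / energy_ref_j
    return -(1.0 - energy_weight) * time_term - energy_weight * energy_term
```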