Brief Summary: EMLIO: Minimizing I/O Latency and Energy Consumption for Large-Scale AI Training
Citation: Hasibul Jamil, MD S Q Zulkar Nine, Tevfik Kosar. University at Buffalo (SUNY) / Tennessee Technological University. SC2025 (Sustainable Supercomputing Workshop), August 2025. arXiv:2508.11035v1.
Problem
Large-scale deep learning training is increasingly data-bound: GPUs consume samples faster than data loaders can supply them, making I/O the performance bottleneck. As datasets grow to terabytes or petabytes on remote networked storage, I/O latency compounds with round-trip time (RTT): at 30 ms RTT (WAN), the Read+Preprocess stage can account for 90% of per-epoch time. Existing solutions (NVIDIA DALI, PyTorch DataLoader, NoPFS, Lobster) focus on latency reduction but ignore energy consumption, a critical factor at scale (training GPT-3 requires ~1,200 MWh). Each additional millisecond of RTT inflates both I/O latency and I/O energy overhead, yet no prior system optimizes the two jointly.
Core Insight
By moving data batching and serialization to a storage-side daemon and using out-of-order prefetching over parallel TCP/ZeroMQ streams, EMLIO completely hides network RTT from the compute-side training pipeline. Because compute nodes receive pre-batched, in-memory-ready data rather than issuing small random reads, both I/O latency and I/O energy consumption remain nearly constant regardless of network distance — while competing approaches (DALI, PyTorch) degrade by 3x-60x in energy and time as RTT increases.
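The RTT-independence claim follows from simple pipeline arithmetic: once enough fetches are in flight to cover the round trip, epoch time is set by the slower stage rather than by the sum of the two. A toy throughput model (illustrative numbers and function, not the paper's measurements or API):

```python
def epoch_time(n_batches, compute_s, fetch_s, prefetch):
    """Per-epoch wall time for a loader keeping `prefetch` fetches in flight.

    With enough overlap the pipeline runs at compute speed; without
    prefetching, fetch and compute serialize for every batch.
    """
    if prefetch and fetch_s <= compute_s * prefetch:
        # fetches fully overlap compute: pay one fetch, then compute-bound
        return fetch_s + n_batches * compute_s
    if prefetch:
        # fetch-bound: `prefetch` concurrent fetches set the throughput
        return n_batches * fetch_s / prefetch + compute_s
    # synchronous loader: fetch, then compute, for every batch
    return n_batches * (fetch_s + compute_s)

# 10 ms GPU compute per batch, 32 ms per fetch (30 ms WAN RTT + transfer)
sync = epoch_time(1000, 0.010, 0.032, prefetch=0)       # ~42 s
pipelined = epoch_time(1000, 0.010, 0.032, prefetch=8)  # ~10 s, RTT hidden
```

With 8 fetches in flight the epoch runs at GPU speed and the 30 ms RTT is paid only once, which is the shape of EMLIO's near-constant latency curves.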
Method
EMLIO is a service-based I/O framework with two components:
Storage side: An EMLIO Daemon co-located on each storage node memory-maps TFRecord shards, serializes batches of B records into msgpack payloads, and pushes them over ZeroMQ (TCP) streams with backpressure (high-water marks). A global Planner assigns shard ranges to compute nodes per epoch for correct data-parallel semantics. All I/O stages (read, serialize, network send) run on separate threads in a pipeline.
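The daemon's read/serialize/send stages can be pictured as threads linked by bounded queues. A minimal single-process sketch, with `pickle` standing in for msgpack and `queue.Queue(maxsize=...)` standing in for ZeroMQ's high-water-mark backpressure (all names here are illustrative, not EMLIO's API; the real daemon memory-maps TFRecord shards and pushes over TCP):

```python
import pickle
import queue
import threading

def reader(records, batch_size, out_q):
    """Read stage: slice the shard into batches of B records."""
    for i in range(0, len(records), batch_size):
        out_q.put(records[i:i + batch_size])  # blocks when the queue is full
    out_q.put(None)                           # end-of-shard sentinel

def serializer(in_q, out_q):
    """Serialize stage: msgpack in EMLIO, pickle in this sketch."""
    while (batch := in_q.get()) is not None:
        out_q.put(pickle.dumps(batch))
    out_q.put(None)

def run_pipeline(records, batch_size=4, hwm=2):
    raw_q = queue.Queue(maxsize=hwm)    # bounded: backpressure on the reader
    wire_q = queue.Queue(maxsize=hwm)   # bounded: backpressure on the serializer
    threads = [
        threading.Thread(target=reader, args=(records, batch_size, raw_q)),
        threading.Thread(target=serializer, args=(raw_q, wire_q)),
    ]
    for t in threads:
        t.start()
    payloads = []
    while (p := wire_q.get()) is not None:  # "send" stage: drain wire payloads
        payloads.append(p)
    for t in threads:
        t.join()
    return payloads

payloads = run_pipeline(list(range(10)), batch_size=4)
```

The bounded queues give the same property as ZeroMQ high-water marks: a slow downstream stage stalls the upstream one instead of letting memory grow without bound.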
Compute side: An EMLIO Receiver pulls msgpack batches into a shared in-memory queue. A BatchProvider exposes these as DALI's external_source, feeding the GPU-accelerated preprocessing pipeline (decode, augment, normalize) directly into the PyTorch DDP training loop.
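The compute-side hand-off is a receiver thread deserializing payloads into a shared queue, plus a generator in the shape DALI's `fn.external_source` accepts (an iterable yielding batches). A sketch with `pickle` standing in for msgpack and an in-memory list standing in for the ZeroMQ socket (illustrative names, not EMLIO's API):

```python
import pickle
import queue
import threading

def receiver(wire_payloads, batch_q):
    """In EMLIO this pulls from a ZeroMQ PULL socket; here, a list."""
    for payload in wire_payloads:
        batch_q.put(pickle.loads(payload))
    batch_q.put(None)  # end-of-stream sentinel

def batch_provider(batch_q):
    """Generator usable as a DALI external_source or DataLoader feed."""
    while (batch := batch_q.get()) is not None:
        yield batch

# Simulated wire traffic: three serialized two-record batches.
wire = [pickle.dumps([i, i + 1]) for i in range(0, 6, 2)]
q = queue.Queue(maxsize=4)
threading.Thread(target=receiver, args=(wire, q), daemon=True).start()
batches = list(batch_provider(q))
```

Because the generator only ever touches an in-memory queue, the training loop sees pre-batched data with no per-sample network reads, which is the mechanism behind the RTT-independence results.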
Energy monitoring: A distributed EnergyMonitor runs synchronized CPU/DRAM/GPU sampling threads at 100ms intervals using Linux perf stat and NVIDIA NVML, writing to InfluxDB for cross-node correlation.
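The monitor's core is a thread that samples instantaneous power on a fixed interval and integrates it into joules. A stand-in sketch: `read_power_w` here is a fake callable (the real EnergyMonitor reads perf stat for CPU/DRAM and NVML for GPU, and writes samples to InfluxDB), and the demo interval is shortened from the paper's 100 ms for speed:

```python
import threading
import time

class EnergyMonitor:
    """Samples a power source on a fixed interval, accumulating joules."""

    def __init__(self, read_power_w, interval_s=0.1):
        self.read_power_w = read_power_w
        self.interval_s = interval_s
        self.energy_j = 0.0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            # rectangle-rule integration: E += P * dt
            self.energy_j += self.read_power_w() * self.interval_s
            self._stop.wait(self.interval_s)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
        return self.energy_j

# Demo: a constant 250 W "GPU", sampled every 10 ms for ~50 ms.
mon = EnergyMonitor(read_power_w=lambda: 250.0, interval_s=0.01)
mon.start()
time.sleep(0.05)
joules = mon.stop()
```

Running one such sampler per node with synchronized clocks is what lets the paper attribute energy to pipeline stages across storage and compute nodes.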
Key Results
Evaluated on Chameleon Cloud (UC and TACC sites) training ResNet-50 and VGG-19 on ImageNet, COCO, and synthetic 2MB datasets:
- WAN (30 ms RTT): EMLIO completes in 156.2s vs. DALI 1699.3s and PyTorch 4232.4s (8.6x and 27x faster). EMLIO's energy stays at 10.0 kJ (CPU) and 26.2 kJ (GPU); DALI and PyTorch energies increase by up to 60x.
- LAN (10 ms RTT): EMLIO 156.6s vs. DALI 552.5s and PyTorch 1202.2s (3.5x and 7.7x faster). EMLIO CPU energy 9.9 kJ, GPU energy 26.2 kJ; DALI CPU energy 3–4x higher.
- Local (0.1 ms RTT): EMLIO is 2% slower than DALI (pipeline overhead through loopback), comparable energy.
- Training convergence: At 10ms RTT, EMLIO reaches loss 3.3 at 1000s; DALI reaches only loss 3.4 at 7500s — EMLIO converges 7.5x faster in wall-clock time.
- Sharded scenario: EMLIO 18–18.7x faster than DALI at 30ms RTT; 41% lower CPU energy, 32% lower GPU energy.
Limitations
- Small serialization overhead makes EMLIO ~2% slower than DALI at local storage (0.1ms RTT) for single-concurrency mode.
- Future work needed: co-scheduling data loading with DDP gradient synchronization for cross-layer energy optimization; heterogeneous transport support (RDMA, NVMe-over-Fabric); extending beyond TFRecord to audio/text/multimodal formats.
- Evaluated primarily on vision workloads (ResNet-50, VGG-19, ImageNet/COCO); LLM training I/O patterns (sequential text reads, tokenization) are structurally different and not evaluated.
- Does not address the energy cost of collective communication (AllReduce) — only data loading I/O.
Relevance to DynamICCL
Low-to-moderate indirect relevance. EMLIO targets the data-loading I/O bottleneck; DynamICCL targets collective communication configuration. However, several connections exist:
Chameleon Cloud shared infrastructure: EMLIO is evaluated on Chameleon Cloud (UC and TACC) — the same infrastructure where DynamICCL is deployed. The node specifications in EMLIO's Table 1 (Intel Xeon Gold 6126, RTX 6000, 10Gbps Ethernet) are directly comparable to DynamICCL's Chameleon deployment, providing useful baseline network characterization (10Gbps, 0.1-30ms RTT range).
Network congestion context: EMLIO's evaluation shows that at 10Gbps Ethernet with 10ms RTT, data-loading I/O traffic from multiple nodes competes for network bandwidth. When EMLIO aggressively prefetches over TCP/ZeroMQ streams, this creates background network traffic that can induce congestion on the same links used by DynamICCL's NCCL collectives. Agent-1's congestion detection must be robust to this cross-traffic.
Energy-aware system design: EMLIO demonstrates that energy is a first-class optimization objective for HPC training systems. DynamICCL's reward function currently minimizes collective completion time; EMLIO's findings suggest that future DynamICCL versions could incorporate energy consumption as a secondary reward signal.
Tennessee Tech connection: MD S Q Zulkar Nine is affiliated with Tennessee Technological University (same institution as this research group), making EMLIO a relevant co-located research context.
Prior RL work cited: Reference [20] in EMLIO cites Jamil et al. (arXiv:2503.13662) — a deep RL approach to data transfer performance and energy efficiency — which is closely related to DynamICCL's RL-for-systems-optimization approach and worth examining separately.