Detailed Summary: EMLIO: Minimizing I/O Latency and Energy Consumption for Large-Scale AI Training

Citation: Hasibul Jamil (University at Buffalo, SUNY), MD S Q Zulkar Nine (Tennessee Technological University, Cookeville, TN), Tevfik Kosar (University at Buffalo, SUNY). SC2025 Sustainable Supercomputing Workshop, St. Louis, MO, USA, November 2025. arXiv:2508.11035v1, August 14, 2025. DOI: 10.1145/nnnnnnn.nnnnnnn


Abstract (Paraphrased)

Large-scale deep learning suffers from I/O bottlenecks as datasets grow beyond local storage capacities and GPU compute outpaces network and disk latencies. While recent systems optimize data-loading time, they overlook I/O energy cost — a critical factor at scale. EMLIO (Efficient Machine Learning I/O) is a service-based framework that jointly minimizes end-to-end data-loading latency (T) and I/O energy consumption (E) across variable-latency networked storage. EMLIO deploys a lightweight storage-side daemon that serializes and batches raw samples, streams them over TCP with out-of-order prefetching, and integrates with GPU-accelerated (NVIDIA DALI) preprocessing on the compute side. In evaluations across local disk, LAN (0.1ms and 10ms RTT), and WAN (30ms RTT) environments, EMLIO delivers up to 8.6x faster I/O and 10.9x lower energy use compared to state-of-the-art data loaders, while maintaining competitive performance and energy profiles regardless of network distance.


Motivation

The I/O Bottleneck at Scale

Training a Deep Neural Network involves three processes: (i) computation (forward/backward passes), (ii) communication (NCCL collectives for gradient synchronization), and (iii) I/O (delivering data and labels to workers). Research has focused heavily on (i) and (ii), but I/O has become the bottleneck as datasets grow to hundreds of millions or billions of samples.

At scale, up to 85% of per-epoch runtime can be spent on I/O (ResNet-50 on ImageNet). Shared filesystem contention in distributed settings further degrades performance.

Why Existing Solutions Fail at High Latency

Existing approaches — FFCV, NVIDIA DALI (local optimized), NoPFS, Lobster, Cassandra-DALI (network-aware) — use limited lookahead, double-buffering, data sharding, prestaging, or in-memory caching. These techniques assume low-latency access to storage: their shallow prefetch windows cannot hide LAN/WAN round-trip times, so throughput collapses as RTT grows, and none of them measures or optimizes I/O energy.

The Energy Problem

No prior ML I/O system quantifies or minimizes the energy consumed by data loading.

Key observation (Fig. 1): At local storage, I/O accounts for only ~15% of energy and ~20% of per-epoch time. At 10ms RTT (LAN), Read+Preprocess dominates at 60% of both metrics. At 30ms RTT (WAN), it dominates at 90%. This makes joint latency-energy optimization mandatory for geo-distributed training.


Background

Training I/O Pipeline Stages

Three stages per epoch:

  1. Read + Preprocess (R+P): fetch raw samples and transform them into GPU-ready batches.
  2. Compute: forward and backward passes.
  3. Communicate: gradient synchronization (NCCL collectives).

At local storage: CPU and GPU energy are modest; duration is compute-dominated. At LAN/WAN: The R+P stage becomes the wall-clock bottleneck and energy dominator.

Why Stochastic I/O Is Hard

Stochastic gradient descent randomly accesses small, independent samples. The resulting random file access patterns thrash the disk/NFS cache, generate many small read syscalls, and defeat the sequential prefetching that standard loaders rely on.

EMLIO addresses this by sharding data into large TFRecord files, enabling sequential batch reads rather than random small reads.


System Design

High-Level Architecture

Storage Cluster                            Compute Cluster
+------------------------------------+     +-----------------------------------+
| TFRecord Shards (mmap-read)        |     | ZMQ PULL + Deserializer           |
|         |                          |     |         |                         |
| Planner (shard -> batch plan)      |     | Batch Loader (EMLIO Plugin)       |
|         |                          |     |         |                         |
| EMLIO Daemon  Node 1 ... Node N    |     | DALI Pipeline (GPU Preprocess)    |
|  +---------+  +---------+          |     |  (decode, augment, normalize)     |
|  |Serialize|  |Serialize|          |     |         |                         |
|  |& Batch  |  |& Batch  |          |     | Training Loop (DDP)               |
|  +---------+  +---------+          |     +-----------------------------------+
|       | ZeroMQ PUSH                |                    ^
+-------+----------------------------+                    |
        +---- TCP/ZMQ (parallel streams) -----------------+

Three Core Design Principles

1. Network-Pipeline Concurrency. All I/O stages (read, serialize, network send) run on separate threads in a pipeline: while one batch is being sent over the network, the next is already being read and prepared. This fully utilizes available network bandwidth and hides per-batch latency (a bounded-queue sketch follows this list).

2. Batch-Aligned Data-Parallel Planning. A centralized Planner computes, for each epoch and compute node, exactly which TFRecord shard ranges form each fixed-size batch. This ensures correct data-parallel semantics (each compute node sees the full dataset exactly once per epoch) without client-side shard scans or random small reads.

3. Linear Horizontal Scalability. Each storage node runs its own EMLIO Daemon with a pool of worker threads. By sharding TFRecords and parallelizing the read, serialize, and send pipelines, aggregate throughput grows linearly with the number of storage nodes.
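
A minimal sketch of principle 1's bounded-queue pattern, with hypothetical stage callables (read_range, serialize_batch, send_batch) and a batch_plan iterable standing in for EMLIO's internals:

```python
import queue
import threading

def run_pipeline(batch_plan, read_range, serialize_batch, send_batch, depth=16):
    """Overlap read, serialize, and send; all stage callables are hypothetical."""
    read_q = queue.Queue(maxsize=depth)   # bounded queues provide backpressure
    send_q = queue.Queue(maxsize=depth)

    def serializer():
        while (batch := read_q.get()) is not None:
            send_q.put(serialize_batch(batch))
        send_q.put(None)                  # propagate shutdown downstream

    def sender():
        while (payload := send_q.get()) is not None:
            send_batch(payload)           # network send overlaps upstream reads

    threads = [threading.Thread(target=serializer), threading.Thread(target=sender)]
    for t in threads:
        t.start()
    for rng in batch_plan:                # the main thread is the reader stage
        read_q.put(read_range(rng))
    read_q.put(None)
    for t in threads:
        t.join()
```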

Storage Side: EMLIO Daemon and Planner

Global Planner (Algorithm 2, sender planning & dispatch):

Input: TFRecord directory D, node list N = {(id_i, ip_i, port_i)}, batch size B, epochs E, threads T
1. Load shard metadata (parse mapping_shard_*.json)
2. Build global label map from all shards
3. for epoch c = 1 to E:
4.   shuffle shard list
5.   assign shards to nodes round-robin
6.   for each node (id, ip, port):
7.     split its shard list into T subsets
8.     launch T SendWorker threads (ThreadPoolExecutor)
9. end for
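
A minimal sketch of this planning loop under stated assumptions: send_worker is a hypothetical callable standing in for the SendWorker logic, and shard metadata files follow the mapping_shard_*.json naming above. It illustrates the shuffle / round-robin / thread-fanout structure, not EMLIO's actual API.

```python
import glob
import random
from concurrent.futures import ThreadPoolExecutor

def plan_and_dispatch(shard_dir, nodes, batch_size, epochs, threads, send_worker):
    """nodes: list of (node_id, ip, port) tuples; send_worker is hypothetical."""
    shards = sorted(glob.glob(f"{shard_dir}/mapping_shard_*.json"))   # step 1
    for epoch in range(epochs):                                       # step 3
        random.shuffle(shards)                                        # step 4
        # Step 5: round-robin assignment (node i gets shards i, i+N, i+2N, ...).
        assignment = {nid: shards[i::len(nodes)]
                      for i, (nid, _, _) in enumerate(nodes)}
        with ThreadPoolExecutor(max_workers=threads * len(nodes)) as pool:
            for nid, ip, port in nodes:                               # steps 6-8
                subset = assignment[nid]
                for t in range(threads):                              # T workers per node
                    pool.submit(send_worker, subset[t::threads], ip, port, batch_size)
        # Exiting the 'with' block joins all SendWorkers before the next epoch.
```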

Each SendWorker mmap-reads its assigned shard subset, serializes and batches the raw samples, and streams the batches to its target compute node over a ZeroMQ PUSH socket.

TFRecord format rationale: records are stored sequentially, each prefixed by its length, so grabbing B examples costs a single sequential read with no per-sample read calls or extra copies. This gives far lower I/O overhead than small-file datasets.
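
A minimal sketch of such a sequential batch read, assuming the standard TFRecord frame layout (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC). The shard filename is hypothetical and CRC validation is omitted for brevity.

```python
import struct

def read_batch(f, batch_size):
    """Read up to batch_size serialized examples from an open TFRecord shard."""
    batch = []
    for _ in range(batch_size):
        header = f.read(12)            # 8-byte length + 4-byte length CRC
        if len(header) < 12:
            break                      # end of shard
        (length,) = struct.unpack("<Q", header[:8])
        batch.append(f.read(length))   # raw serialized example, sequential read
        f.read(4)                      # skip the payload CRC
    return batch

with open("shard_000.tfrecord", "rb") as f:    # hypothetical shard name
    while batch := read_batch(f, 256):
        pass                           # serialize and push the batch downstream
```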

Compute Side: EMLIO Receiver and DALI Integration

Algorithm 3 (EMLIO Receiver & DALI Integration):

Input: bind address ip, port p, prefetch depth Q, epochs E
1. start ZMQ PULL socket on (ip, p)
2. spawn zmq_receiver thread → pushes into shared Queue
3. build DALI pipeline with BatchProvider(Queue), prefetch=Q
4. warm up: manually run Q iterations
5. for epoch = 1 to E:
6.   while data is available:
7.     pipe.run()     # deserializes & preprocesses
8.     feed output to DDP training step
9.   end while
10. end for
11. teardown pipelines and sockets

The DALI pipeline uses BatchProvider as external_source. DALI's exec_async and exec_pipelined options run decode/augment and CPU preprocessing concurrently with GPU execution. GPU-accelerated preprocessing (JPEG decode, resize, crop, normalize) is performed on the compute node's GPU, so compute nodes work entirely on large, in-memory pre-batched data.
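
A minimal sketch of this wiring, assuming a hypothetical BatchProvider iterator that pops pre-batched encoded images from the receiver's shared queue; the batch size, image dimensions, and normalization constants are illustrative, not the paper's settings.

```python
from nvidia.dali import pipeline_def, fn

class BatchProvider:
    """Iterates over the ZMQ-fed queue; each next() returns one batch
    (a list of encoded-image byte buffers). Hypothetical helper."""
    def __init__(self, q):
        self.q = q
    def __iter__(self):
        return self
    def __next__(self):
        return self.q.get()

@pipeline_def(batch_size=256, num_threads=4, device_id=0,
              exec_async=True, exec_pipelined=True, prefetch_queue_depth=4)
def emlio_pipeline(provider):
    encoded = fn.external_source(source=provider)        # pulls from shared queue
    images = fn.decoders.image(encoded, device="mixed")  # GPU-accelerated JPEG decode
    images = fn.resize(images, resize_x=224, resize_y=224)
    return fn.crop_mirror_normalize(images,              # normalize on the GPU
                                    mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                    std=[0.229 * 255, 0.224 * 255, 0.225 * 255])

pipe = emlio_pipeline(BatchProvider(shared_queue))       # shared_queue: receiver's Queue
pipe.build()
(processed,) = pipe.run()                                # one preprocessed GPU batch
```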

ZeroMQ Backpressure

PUSH socket HWM set to 16 with blocking send: storage workers naturally back off when compute-side queues are full. This prevents memory overflow without explicit flow control logic.
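
A minimal sketch of that mechanism in pyzmq; the endpoint and the payload iterable are hypothetical.

```python
import zmq

ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
push.setsockopt(zmq.SNDHWM, 16)            # queue at most 16 batches per peer
push.connect("tcp://compute-node:5555")    # hypothetical compute-side endpoint

for payload in serialized_batches:         # hypothetical iterable of bytes
    push.send(payload)                     # blocks once the HWM is reached,
                                           # so the worker backs off naturally
```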

Concurrency Model

On the storage side, each EMLIO Daemon runs T SendWorker threads (via a ThreadPoolExecutor), each driving an independent read-serialize-send pipeline over its shard subset. On the compute side, a dedicated zmq_receiver thread fills the shared queue while DALI's exec_async and exec_pipelined options overlap deserialization and preprocessing with GPU training.


Energy Monitoring Framework

Distributed EnergyMonitor (Algorithm 1)

A fine-grained energy measurement system running on all compute and storage nodes.

Components per node: a GPU power sampler (NVML) and a CPU/DRAM energy sampler (RAPL counters read via perf stat).

Implementation: samplers poll at 100 ms intervals, integrate instantaneous power into cumulative energy, and timestamp readings so energy can be attributed per pipeline stage and aggregated across nodes.
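
A minimal sketch of the GPU-side sampler under these assumptions, using pynvml at the paper's 100 ms interval; the CPU/DRAM side would poll RAPL counters (e.g., via perf stat) with the same loop structure.

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # GPU 0 on this node

energy_j = 0.0
interval_s = 0.100                              # 100 ms sampling interval
t_end = time.time() + 60                        # illustrative: sample for one minute
while time.time() < t_end:
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
    energy_j += power_w * interval_s            # integrate power over time
    time.sleep(interval_s)

print(f"GPU energy over window: {energy_j / 1000:.2f} kJ")
pynvml.nvmlShutdown()
```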


Evaluation

Setup

Clusters (Chameleon Cloud):

| Node Type | Spec |
|-----------|------|
| UC Compute (gpu_rtx_6000) | 2x Intel Xeon Gold 6126 @ 2.6 GHz (12c/24t), 192 GiB DDR4, 240 GiB SAS SSD, NVIDIA Quadro RTX 6000 (24 GiB), 10 Gbps Ethernet |
| UC Storage (compute_skylake) | 2x Intel Xeon Gold 6126 @ 2.6 GHz (12c/48t), 192 GiB DDR4, 240 GiB SAS SSD, no GPU, 10 Gbps Ethernet |
| TACC Compute (gpu_p100) | 2x Intel Xeon E5-2650 v3 @ 2.3 GHz (12c/48t), 128 GiB DDR4, 1 TB SATA HDD, 2x NVIDIA Tesla P100 (16 GiB), 10 Gbps Ethernet |
| TACC Storage | 2x Intel Xeon E5-2650 v3, 64 GiB DDR4, 600 GiB SATA SSD, 10 Gbps Ethernet |

All nodes: Ubuntu 22.04 LTS, CUDA 12.6, NFSv4 mount for training data.

Baselines: PyTorch DataLoader (over NFSv4), NVIDIA DALI pipeline (over NFSv4).

Scenarios:

  1. Centralized repository (all data on single NFS server).
  2. Fully sharded (data distributed across compute nodes; 50% local + 50% remote).
  3. Training accuracy/convergence comparison.

Datasets: ImageNet (0.1 MB/sample), COCO (0.2 MB/sample), synthetic (2 MB/sample).

Models: ResNet-50, VGG-19.

Network conditions: Local disk (0.05ms RTT), LAN (0.1ms RTT), emulated LAN (1ms, 10ms, 30ms RTT via Linux tc/qdisc), WAN (30ms RTT to TACC).

Results: Centralized Repository (Fig. 5 — ResNet-50, ImageNet)

| Condition | Method | Duration (s) | CPU Energy (kJ) | GPU Energy (kJ) |
|-----------|--------|--------------|-----------------|-----------------|
| Local (0.05ms) | DALI | 151.7 | ~9.5 | 26.2 |
| Local (0.05ms) | EMLIO | 157.1 | ~10.0 | 26.2 |
| LAN (0.1ms) | DALI | 165.4 | | |
| LAN (0.1ms) | EMLIO | 156.6 | ~10.0 | 26.2 |
| LAN (10ms) | DALI | 552.5 | ~3x higher | ~3x higher |
| LAN (10ms) | EMLIO | 156.6 | ~9.9 | 26.2 |
| WAN (30ms) | DALI | 1699.3 | ~60x higher | ~60x higher |
| WAN (30ms) | EMLIO | 156.2 | 10.0 | 26.2 |

Key finding: EMLIO's epoch time varies less than 5% from 0.1ms to 30ms RTT; its I/O-related energy remains minimal. DALI and PyTorch incur 3x–27x longer runtimes and 4x–60x higher energy as RTT increases.

Local storage caveat: EMLIO is ~2% slower than DALI at 0.05ms RTT due to loopback network stack overhead (reads from disk, sends via loopback, re-ingests). This adds CPU load (9.9 vs. 9.5 kJ) but negligible GPU energy difference.

Results: COCO Dataset (Fig. 6)

Same trend: EMLIO maintains stable throughput and near-constant I/O energy from LAN to WAN. At 30ms RTT, EMLIO is roughly 6x faster and consumes 8x less I/O energy than DALI.

Results: Synthetic 2MB Samples (Figs. 7–8)

For large records, single-concurrency EMLIO Daemon incurs serialization overhead at 0.1ms and 1ms RTT. Increasing EMLIO Daemon concurrency to 2 (two parallel batch-serialize + send threads) amortizes the fixed per-batch setup cost. Across all RTTs, EMLIO achieves 2–3x higher throughput and 3–5x lower energy vs. DALI.

Results: VGG-19 (Fig. 9)

| Condition | Method | Duration (s) | CPU Energy (kJ) | GPU Energy (kJ) |
|-----------|--------|--------------|-----------------|-----------------|
| LAN (0.1ms) | DALI | 142.6 | 20.0 | 34.6 |
| LAN (0.1ms) | EMLIO | 141.1 | 20.1 | 34.5 |
| LAN (10ms) | DALI | 660.9 | 56.1 | 78.0 |
| LAN (10ms) | EMLIO | 140.0 | 19.8 | 34.2 |
| LAN (30ms) | DALI | 2096.8 | 156.3 | 163.6 |
| LAN (30ms) | EMLIO | 140.5 | 20.3 | 34.4 |

I/O efficiency generalizes across different vision backbones.

Results: Sharded Dataset (Fig. 10 — 50% local + 50% NFS)

| RTT | Method | Duration (s) | Speedup vs. DALI | CPU energy vs. DALI | GPU energy vs. DALI |
|-----|--------|--------------|------------------|---------------------|---------------------|
| 0.1ms | DALI | 230.9 | | | |
| 0.1ms | EMLIO | 222.5 | 2.5% faster | 11.5% less | 4.7% less |
| 10ms | DALI | 1422.5 | | | |
| 10ms | EMLIO | 221.6 | 6.4x faster | 13.5% less | 20.8% less |
| 30ms | DALI | 4154.7 | | | |
| 30ms | EMLIO | 221.8 | 18.7x faster | 41.1% less | 32% less |

Results: Training Convergence (Fig. 11)

At 10ms RTT on COCO with ResNet-50, EMLIO matches the baselines' convergence behavior and accuracy while completing epochs substantially faster: the optimized delivery path changes how batches arrive, not what they contain, so training dynamics are preserved.


Limitations

  1. Local storage overhead: ~2% slower than DALI at 0.05ms RTT due to loopback network stack. Not suitable as a replacement for local-disk I/O when data is already local.
  2. Single-concurrency bottleneck for large records: For 2MB samples, one EMLIO Daemon thread becomes the serialization bottleneck. Requires tuning daemon concurrency parameter.
  3. Vision workloads only: All experiments use ResNet-50 and VGG-19 on ImageNet/COCO. LLM training I/O (sequential text, tokenization, variable-length sequences) is not evaluated.
  4. TFRecord format required: Raw data must be pre-converted to TFRecord (one-time cost, amortized across training jobs). Multi-format support is future work.
  5. Cross-layer optimization not addressed: Co-scheduling data loading with DDP gradient synchronization (NCCL AllReduce) for cross-layer energy optimization is identified as future work.
  6. Heterogeneous transports: RDMA and NVMe-over-Fabric are not yet supported; future work item for HPC and cloud environments.


RL Formulation Table

This paper does not use reinforcement learning for its primary contribution (EMLIO). However, it cites reference [20] (Jamil et al., arXiv:2503.13662), a closely related multi-parameter RL framework for adaptive data transfer optimization. No RL formulation table applies to EMLIO itself.


Relevance to DynamICCL

Low-to-moderate indirect relevance. EMLIO addresses training data I/O; DynamICCL addresses collective communication configuration. The connections are:

  1. Shared Chameleon Cloud platform: EMLIO's evaluation cluster (Table 1: Intel Xeon Gold 6126, RTX 6000, 10 Gbps Ethernet, Chameleon UC and TACC) directly matches DynamICCL's deployment environment. EMLIO characterizes this hardware: 10 Gbps Ethernet creates a 0.1–30ms RTT range depending on inter-site distance. This RTT range defines the network conditions under which DynamICCL must tune NCCL.

  2. Network congestion interaction: EMLIO's storage-to-compute TCP/ZeroMQ streams (parallel multi-stream push) actively consume network bandwidth on the same 10 Gbps Ethernet fabric used by DynamICCL's NCCL collective operations. This cross-traffic is a direct source of the network congestion that Agent-1 (Trigger Agent) must detect. DynamICCL's evaluation should account for EMLIO-type data loading traffic as a realistic background congestion source.

  3. Energy-aware reward design: EMLIO demonstrates that I/O energy consumption is a tractable optimization target in HPC training. DynamICCL's reward currently minimizes collective completion time. A future extension could add energy consumption (GPU + CPU + memory during collective execution) as a secondary reward component, following EMLIO's measurement methodology (NVML + perf stat at 100ms intervals).

  4. Tennessee Tech research group: MD S Q Zulkar Nine (TTU) co-authors this paper. The EMLIO and DynamICCL projects share institutional context, making EMLIO's findings about Chameleon Cloud infrastructure directly applicable to DynamICCL's deployment.

  5. RL predecessor (ref. [20]): Jamil et al. (arXiv:2503.13662) — cited as a direct antecedent — uses deep RL to adaptively adjust application-layer data transfer settings for combined throughput, fairness, and energy efficiency. This is the closest published prior art to DynamICCL's RL-for-communication-optimization approach and merits separate review.

  6. Cross-layer optimization gap: EMLIO explicitly identifies as future work the co-scheduling of data loading with DDP gradient synchronization. DynamICCL's NCCL tuning occupies exactly this "collective communication" layer. A unified system combining EMLIO's data-I/O optimization with DynamICCL's collective communication tuning would address the complete training I/O + communication stack jointly.