Detailed Summary: EMLIO: Minimizing I/O Latency and Energy Consumption for Large-Scale AI Training

Citation: Hasibul Jamil (University at Buffalo, SUNY), MD S Q Zulkar Nine (Tennessee Technological University, Cookeville, TN), Tevfik Kosar (University at Buffalo, SUNY). SC2025 Sustainable Supercomputing Workshop, St. Louis, MO, USA, November 2025. arXiv:2508.11035v1, August 14, 2025. DOI: 10.1145/nnnnnnn.nnnnnnn


Abstract (Paraphrased)

Large-scale deep learning suffers from I/O bottlenecks as datasets grow beyond local storage capacities and GPU compute outpaces network and disk latencies. While recent systems optimize data-loading time, they overlook I/O energy cost — a critical factor at scale. EMLIO (Efficient Machine Learning I/O) is a service-based framework that jointly minimizes end-to-end data-loading latency (T) and I/O energy consumption (E) across variable-latency networked storage. EMLIO deploys a lightweight storage-side daemon that serializes and batches raw samples, streams them over TCP with out-of-order prefetching, and integrates with GPU-accelerated (NVIDIA DALI) preprocessing on the compute side. In evaluations across local disk, LAN (0.1ms and 10ms RTT), and WAN (30ms RTT) environments, EMLIO delivers up to 8.6x faster I/O and 10.9x lower energy use compared to state-of-the-art data loaders, while maintaining competitive performance and energy profiles regardless of network distance.


Motivation

The I/O Bottleneck at Scale

Training a Deep Neural Network involves three processes: (i) computation (forward/backward passes), (ii) communication (NCCL collectives for gradient synchronization), and (iii) I/O (delivering data and labels to workers). Research has focused heavily on (i) and (ii), but I/O has become the bottleneck as datasets grow to hundreds of millions or billions of samples.

At scale, up to 85% of per-epoch runtime can be spent on I/O (ResNet-50 on ImageNet). Shared filesystem contention in distributed settings further degrades performance.

Why Existing Solutions Fail at High Latency

Existing approaches — FFCV, NVIDIA DALI (local optimized), NoPFS, Lobster, Cassandra-DALI (network-aware) — use limited lookahead, double-buffering, data sharding, prestaging, or in-memory caching. These techniques assume low-latency access to storage: their shallow prefetch windows cannot hide LAN/WAN round-trip times, so throughput collapses as RTT grows, and none of them measures or optimizes I/O energy.

The Energy Problem

No prior ML I/O system quantifies or minimizes the energy consumed by data loading.

Key observation (Fig. 1): At local storage, I/O accounts for only ~15% of energy and ~20% of per-epoch time. At 10ms RTT (LAN), Read+Preprocess dominates at 60% of both metrics. At 30ms RTT (WAN), it dominates at 90%. This makes joint latency-energy optimization mandatory for geo-distributed training.


Background

Training I/O Pipeline Stages

Three stages per epoch:

  1. Read + Preprocess (R+P): fetch raw samples and transform them into GPU-ready batches.
  2. Compute: forward and backward passes.
  3. Communicate: gradient synchronization (NCCL collectives).

At local storage: CPU and GPU energy are modest; duration is compute-dominated. At LAN/WAN: The R+P stage becomes the wall-clock bottleneck and energy dominator.

Why Stochastic I/O Is Hard

Stochastic gradient descent randomly accesses small, independent samples. The resulting random file access patterns thrash the disk/NFS cache, generate many small read syscalls, and defeat the sequential prefetching that standard loaders rely on.

EMLIO addresses this by sharding data into large TFRecord files, enabling sequential batch reads rather than random small reads.


System Design

High-Level Architecture

Storage Cluster                            Compute Cluster
+------------------------------------+     +-----------------------------------+
| TFRecord Shards (mmap-read)        |     | ZMQ PULL + Deserializer           |
|         |                          |     |         |                         |
| Planner (shard -> batch plan)      |     | Batch Loader (EMLIO Plugin)       |
|         |                          |     |         |                         |
| EMLIO Daemon  Node 1 ... Node N    |     | DALI Pipeline (GPU Preprocess)    |
|  +---------+  +---------+          |     |  (decode, augment, normalize)     |
|  |Serialize|  |Serialize|          |     |         |                         |
|  |& Batch  |  |& Batch  |          |     | Training Loop (DDP)               |
|  +---------+  +---------+          |     +-----------------------------------+
|       | ZeroMQ PUSH                |                    ^
+-------+----------------------------+                    |
        +---- TCP/ZMQ (parallel streams) -----------------+

Three Core Design Principles

1. Network-Pipeline Concurrency. All I/O stages (read, serialize, network send) run on separate threads in a pipeline: while one batch is being sent over the network, the next is already being read and prepared. This fully utilizes available network bandwidth and hides per-batch latency (a bounded-queue sketch follows this list).

2. Batch-Aligned Data-Parallel Planning. A centralized Planner computes, for each epoch and compute node, exactly which TFRecord shard ranges form each fixed-size batch. This ensures correct data-parallel semantics (each compute node sees the full dataset exactly once per epoch) without client-side shard scans or random small reads.

3. Linear Horizontal Scalability. Each storage node runs its own EMLIO Daemon with a pool of worker threads. By sharding TFRecords and parallelizing the read, serialize, and send pipelines, aggregate throughput grows linearly with the number of storage nodes.
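
A minimal sketch of principle 1's bounded-queue pattern, with hypothetical stage callables (read_range, serialize_batch, send_batch) and a batch_plan iterable standing in for EMLIO's internals:

```python
import queue
import threading

def run_pipeline(batch_plan, read_range, serialize_batch, send_batch, depth=16):
    """Overlap read, serialize, and send; all stage callables are hypothetical."""
    read_q = queue.Queue(maxsize=depth)   # bounded queues provide backpressure
    send_q = queue.Queue(maxsize=depth)

    def serializer():
        while (batch := read_q.get()) is not None:
            send_q.put(serialize_batch(batch))
        send_q.put(None)                  # propagate shutdown downstream

    def sender():
        while (payload := send_q.get()) is not None:
            send_batch(payload)           # network send overlaps upstream reads

    threads = [threading.Thread(target=serializer), threading.Thread(target=sender)]
    for t in threads:
        t.start()
    for rng in batch_plan:                # the main thread is the reader stage
        read_q.put(read_range(rng))
    read_q.put(None)
    for t in threads:
        t.join()
```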

Storage Side: EMLIO Daemon and Planner

Global Planner (Algorithm 2, sender planning & dispatch):

Input: TFRecord directory D, node list N = {(id_i, ip_i, port_i)}, batch size B, epochs E, threads T
1. Load shard metadata (parse mapping_shard_*.json)
2. Build global label map from all shards
3. for epoch c = 1 to E:
4.   shuffle shard list
5.   assign shards to nodes round-robin
6.   for each node (id, ip, port):
7.     split its shard list into T subsets
8.     launch T SendWorker threads (ThreadPoolExecutor)
9. end for
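
A minimal sketch of this planning loop under stated assumptions: send_worker is a hypothetical callable standing in for the SendWorker logic, and shard metadata files follow the mapping_shard_*.json naming above. It illustrates the shuffle / round-robin / thread-fanout structure, not EMLIO's actual API.

```python
import glob
import random
from concurrent.futures import ThreadPoolExecutor

def plan_and_dispatch(shard_dir, nodes, batch_size, epochs, threads, send_worker):
    """nodes: list of (node_id, ip, port) tuples; send_worker is hypothetical."""
    shards = sorted(glob.glob(f"{shard_dir}/mapping_shard_*.json"))   # step 1
    for epoch in range(epochs):                                       # step 3
        random.shuffle(shards)                                        # step 4
        # Step 5: round-robin assignment (node i gets shards i, i+N, i+2N, ...).
        assignment = {nid: shards[i::len(nodes)]
                      for i, (nid, _, _) in enumerate(nodes)}
        with ThreadPoolExecutor(max_workers=threads * len(nodes)) as pool:
            for nid, ip, port in nodes:                               # steps 6-8
                subset = assignment[nid]
                for t in range(threads):                              # T workers per node
                    pool.submit(send_worker, subset[t::threads], ip, port, batch_size)
        # Exiting the 'with' block joins all SendWorkers before the next epoch.
```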

Each SendWorker mmap-reads its assigned shard subset, serializes and batches the raw samples, and streams the batches to its target compute node over a ZeroMQ PUSH socket.

TFRecord format rationale: records are stored sequentially, each prefixed by its length, so grabbing B examples costs a single sequential read with no per-sample read calls or extra copies. This gives far lower I/O overhead than small-file datasets.
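
A minimal sketch of such a sequential batch read, assuming the standard TFRecord frame layout (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC). The shard filename is hypothetical and CRC validation is omitted for brevity.

```python
import struct

def read_batch(f, batch_size):
    """Read up to batch_size serialized examples from an open TFRecord shard."""
    batch = []
    for _ in range(batch_size):
        header = f.read(12)            # 8-byte length + 4-byte length CRC
        if len(header) < 12:
            break                      # end of shard
        (length,) = struct.unpack("<Q", header[:8])
        batch.append(f.read(length))   # raw serialized example, sequential read
        f.read(4)                      # skip the payload CRC
    return batch

with open("shard_000.tfrecord", "rb") as f:    # hypothetical shard name
    while batch := read_batch(f, 256):
        pass                           # serialize and push the batch downstream
```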

Compute Side: EMLIO Receiver and DALI Integration

Algorithm 3 (EMLIO Receiver & DALI Integration):

Input: bind address ip, port p, prefetch depth Q, epochs E
1. start ZMQ PULL socket on (ip, p)
2. spawn zmq_receiver thread → pushes into shared Queue
3. build DALI pipeline with BatchProvider(Queue), prefetch=Q
4. warm up: manually run Q iterations
5. for epoch = 1 to E:
6.   while data is available:
7.     pipe.run()     # deserializes & preprocesses
8.     feed output to DDP training step
9.   end while
10. end for
11. teardown pipelines and sockets

The DALI pipeline uses BatchProvider as external_source. DALI's exec_async and exec_pipelined options run decode/augment and CPU preprocessing concurrently with GPU execution. GPU-accelerated preprocessing (JPEG decode, resize, crop, normalize) is performed on the compute node's GPU, so compute nodes work entirely on large, in-memory pre-batched data.
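
A minimal sketch of this wiring, assuming a hypothetical BatchProvider iterator that pops pre-batched encoded images from the receiver's shared queue; the batch size, image dimensions, and normalization constants are illustrative, not the paper's settings.

```python
from nvidia.dali import pipeline_def, fn

class BatchProvider:
    """Iterates over the ZMQ-fed queue; each next() returns one batch
    (a list of encoded-image byte buffers). Hypothetical helper."""
    def __init__(self, q):
        self.q = q
    def __iter__(self):
        return self
    def __next__(self):
        return self.q.get()

@pipeline_def(batch_size=256, num_threads=4, device_id=0,
              exec_async=True, exec_pipelined=True, prefetch_queue_depth=4)
def emlio_pipeline(provider):
    encoded = fn.external_source(source=provider)        # pulls from shared queue
    images = fn.decoders.image(encoded, device="mixed")  # GPU-accelerated JPEG decode
    images = fn.resize(images, resize_x=224, resize_y=224)
    return fn.crop_mirror_normalize(images,              # normalize on the GPU
                                    mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                    std=[0.229 * 255, 0.224 * 255, 0.225 * 255])

pipe = emlio_pipeline(BatchProvider(shared_queue))       # shared_queue: receiver's Queue
pipe.build()
(processed,) = pipe.run()                                # one preprocessed GPU batch
```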

ZeroMQ Backpressure

PUSH socket HWM set to 16 with blocking send: storage workers naturally back off when compute-side queues are full. This prevents memory overflow without explicit flow control logic.
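
A minimal sketch of that mechanism in pyzmq; the endpoint and the payload iterable are hypothetical.

```python
import zmq

ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
push.setsockopt(zmq.SNDHWM, 16)            # queue at most 16 batches per peer
push.connect("tcp://compute-node:5555")    # hypothetical compute-side endpoint

for payload in serialized_batches:         # hypothetical iterable of bytes
    push.send(payload)                     # blocks once the HWM is reached,
                                           # so the worker backs off naturally
```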

Concurrency Model

On the storage side, each EMLIO Daemon runs T SendWorker threads (via a ThreadPoolExecutor), each driving an independent read-serialize-send pipeline over its shard subset. On the compute side, a dedicated zmq_receiver thread fills the shared queue while DALI's exec_async and exec_pipelined options overlap deserialization and preprocessing with GPU training.


Energy Monitoring Framework

Distributed EnergyMonitor (Algorithm 1)

A fine-grained energy measurement system running on all compute and storage nodes.

Components per node: a GPU power sampler (NVML) and a CPU/DRAM energy sampler (RAPL counters read via perf stat).

Implementation: samplers poll at 100 ms intervals, integrate instantaneous power into cumulative energy, and timestamp readings so energy can be attributed per pipeline stage and aggregated across nodes.
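
A minimal sketch of the GPU-side sampler under these assumptions, using pynvml at the paper's 100 ms interval; the CPU/DRAM side would poll RAPL counters (e.g., via perf stat) with the same loop structure.

```python
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # GPU 0 on this node

energy_j = 0.0
interval_s = 0.100                              # 100 ms sampling interval
t_end = time.time() + 60                        # illustrative: sample for one minute
while time.time() < t_end:
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
    energy_j += power_w * interval_s            # integrate power over time
    time.sleep(interval_s)

print(f"GPU energy over window: {energy_j / 1000:.2f} kJ")
pynvml.nvmlShutdown()
```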


Evaluation

Setup

Clusters (Chameleon Cloud):

| Node Type | Spec |
|-----------|------|
| UC Compute (gpu_rtx_6000) | 2x Intel Xeon Gold 6126 @ 2.6 GHz (12c/24t), 192 GiB DDR4, 240 GiB SAS SSD, NVIDIA Quadro RTX 6000 (24 GiB), 10 Gbps Ethernet |
| UC Storage (compute_skylake) | 2x Intel Xeon Gold 6126 @ 2.6 GHz (12c/48t), 192 GiB DDR4, 240 GiB SAS SSD, no GPU, 10 Gbps Ethernet |
| TACC Compute (gpu_p100) | 2x Intel Xeon E5-2650 v3 @ 2.3 GHz (12c/48t), 128 GiB DDR4, 1 TB SATA HDD, 2x NVIDIA Tesla P100 (16 GiB), 10 Gbps Ethernet |
| TACC Storage | 2x Intel Xeon E5-2650 v3, 64 GiB DDR4, 600 GiB SATA SSD, 10 Gbps Ethernet |

All nodes: Ubuntu 22.04 LTS, CUDA 12.6, NFSv4 mount for training data.

Baselines: PyTorch DataLoader (over NFSv4), NVIDIA DALI pipeline (over NFSv4).

Scenarios:

  1. Centralized repository (all data on single NFS server).
  2. Fully sharded (data distributed across compute nodes; 50% local + 50% remote).
  3. Training accuracy/convergence comparison.

Datasets: ImageNet (0.1 MB/sample), COCO (0.2 MB/sample), synthetic (2 MB/sample).

Models: ResNet-50, VGG-19.

Network conditions: Local disk (0.05ms RTT), LAN (0.1ms RTT), emulated LAN (1ms, 10ms, 30ms RTT via Linux tc/qdisc), WAN (30ms RTT to TACC).

Results: Centralized Repository (Fig. 5 — ResNet-50, ImageNet)

| Condition | Method | Duration (s) | CPU Energy (kJ) | GPU Energy (kJ) |
|-----------|--------|--------------|-----------------|-----------------|
| Local (0.05ms) | DALI | 151.7 | ~9.5 | 26.2 |
| Local (0.05ms) | EMLIO | 157.1 | ~10.0 | 26.2 |
| LAN (0.1ms) | DALI | 165.4 | | |
| LAN (0.1ms) | EMLIO | 156.6 | ~10.0 | 26.2 |
| LAN (10ms) | DALI | 552.5 | ~3x higher | ~3x higher |
| LAN (10ms) | EMLIO | 156.6 | ~9.9 | 26.2 |
| WAN (30ms) | DALI | 1699.3 | ~60x higher | ~60x higher |
| WAN (30ms) | EMLIO | 156.2 | 10.0 | 26.2 |

Key finding: EMLIO's epoch time varies less than 5% from 0.1ms to 30ms RTT; its I/O-related energy remains minimal. DALI and PyTorch incur 3x–27x longer runtimes and 4x–60x higher energy as RTT increases.

Local storage caveat: EMLIO is ~2% slower than DALI at 0.05ms RTT due to loopback network stack overhead (reads from disk, sends via loopback, re-ingests). This adds CPU load (9.9 vs. 9.5 kJ) but negligible GPU energy difference.

Results: COCO Dataset (Fig. 6)

Same trend: EMLIO maintains stable throughput and near-constant I/O energy from LAN to WAN. At 30ms RTT, EMLIO is roughly 6x faster and consumes 8x less I/O energy than DALI.

Results: Synthetic 2MB Samples (Figs. 7–8)

For large records, single-concurrency EMLIO Daemon incurs serialization overhead at 0.1ms and 1ms RTT. Increasing EMLIO Daemon concurrency to 2 (two parallel batch-serialize + send threads) amortizes the fixed per-batch setup cost. Across all RTTs, EMLIO achieves 2–3x higher throughput and 3–5x lower energy vs. DALI.

Results: VGG-19 (Fig. 9)

| Condition | Method | Duration (s) | CPU Energy (kJ) | GPU Energy (kJ) |
|-----------|--------|--------------|-----------------|-----------------|
| LAN (0.1ms) | DALI | 142.6 | 20.0 | 34.6 |
| LAN (0.1ms) | EMLIO | 141.1 | 20.1 | 34.5 |
| LAN (10ms) | DALI | 660.9 | 56.1 | 78.0 |
| LAN (10ms) | EMLIO | 140.0 | 19.8 | 34.2 |
| LAN (30ms) | DALI | 2096.8 | 156.3 | 163.6 |
| LAN (30ms) | EMLIO | 140.5 | 20.3 | 34.4 |

I/O efficiency generalizes across different vision backbones.

Results: Sharded Dataset (Fig. 10 — 50% local + 50% NFS)

| RTT | Method | Duration (s) | Speedup vs. DALI | CPU energy vs. DALI | GPU energy vs. DALI |
|-----|--------|--------------|------------------|---------------------|---------------------|
| 0.1ms | DALI | 230.9 | | | |
| 0.1ms | EMLIO | 222.5 | 2.5% faster | 11.5% less | 4.7% less |
| 10ms | DALI | 1422.5 | | | |
| 10ms | EMLIO | 221.6 | 6.4x faster | 13.5% less | 20.8% less |
| 30ms | DALI | 4154.7 | | | |
| 30ms | EMLIO | 221.8 | 18.7x faster | 41.1% less | 32% less |

Results: Training Convergence (Fig. 11)

At 10ms RTT on COCO with ResNet-50, EMLIO matches the baselines' convergence behavior and accuracy while completing epochs substantially faster: the optimized delivery path changes how batches arrive, not what they contain, so training dynamics are preserved.


Limitations

  1. Local storage overhead: ~2% slower than DALI at 0.05ms RTT due to loopback network stack. Not suitable as a replacement for local-disk I/O when data is already local.
  2. Single-concurrency bottleneck for large records: For 2MB samples, one EMLIO Daemon thread becomes the serialization bottleneck. Requires tuning daemon concurrency parameter.
  3. Vision workloads only: All experiments use ResNet-50 and VGG-19 on ImageNet/COCO. LLM training I/O (sequential text, tokenization, variable-length sequences) is not evaluated.
  4. TFRecord format required: Raw data must be pre-converted to TFRecord (one-time cost, amortized across training jobs). Multi-format support is future work.
  5. Cross-layer optimization not addressed: Co-scheduling data loading with DDP gradient synchronization (NCCL AllReduce) for cross-layer energy optimization is identified as future work.
  6. Heterogeneous transports: RDMA and NVMe-over-Fabric are not yet supported; future work item for HPC and cloud environments.


RL Formulation Table

This paper does not use reinforcement learning for its primary contribution (EMLIO). However, it cites reference [20] (Jamil et al., arXiv:2503.13662), a closely related multi-parameter RL framework for adaptive data transfer optimization. No RL formulation table applies to EMLIO itself.


Relevance to DynamICCL

Low-to-moderate indirect relevance. EMLIO addresses training data I/O; DynamICCL addresses collective communication configuration. The connections are:

  1. Shared Chameleon Cloud platform: EMLIO's evaluation cluster (Table 1: Intel Xeon Gold 6126, RTX 6000, 10 Gbps Ethernet, Chameleon UC and TACC) directly matches DynamICCL's deployment environment. EMLIO characterizes this hardware: 10 Gbps Ethernet creates a 0.1–30ms RTT range depending on inter-site distance. This RTT range defines the network conditions under which DynamICCL must tune NCCL.

  2. Network congestion interaction: EMLIO's storage-to-compute TCP/ZeroMQ streams (parallel multi-stream push) actively consume network bandwidth on the same 10 Gbps Ethernet fabric used by DynamICCL's NCCL collective operations. This cross-traffic is a direct source of the network congestion that Agent-1 (Trigger Agent) must detect. DynamICCL's evaluation should account for EMLIO-type data loading traffic as a realistic background congestion source.

  3. Energy-aware reward design: EMLIO demonstrates that I/O energy consumption is a tractable optimization target in HPC training. DynamICCL's reward currently minimizes collective completion time. A future extension could add energy consumption (GPU + CPU + memory during collective execution) as a secondary reward component, following EMLIO's measurement methodology (NVML + perf stat at 100ms intervals).

  4. Tennessee Tech research group: MD S Q Zulkar Nine (TTU) co-authors this paper. The EMLIO and DynamICCL projects share institutional context, making EMLIO's findings about Chameleon Cloud infrastructure directly applicable to DynamICCL's deployment.

  5. RL predecessor (ref. [20]): Jamil et al. (arXiv:2503.13662) — cited as a direct antecedent — uses deep RL to adaptively adjust application-layer data transfer settings for combined throughput, fairness, and energy efficiency. This is the closest published prior art to DynamICCL's RL-for-communication-optimization approach and merits separate review.

  6. Cross-layer optimization gap: EMLIO explicitly identifies as future work the co-scheduling of data loading with DDP gradient synchronization. DynamICCL's NCCL tuning occupies exactly this "collective communication" layer. A unified system combining EMLIO's data-I/O optimization with DynamICCL's collective communication tuning would address the complete training I/O + communication stack jointly.