TL;DR: A GPU shows 97% utilization in nvidia-smi, but training throughput is a fraction of what benchmarks promise. The GPU is not computing; it is waiting. Data loading workers are starving the training loop because CPU contention, I/O bottlenecks, or scheduling delays prevent data from arriving fast enough. Tracing the full host-to-GPU pipeline via eBPF uprobes reveals exactly where the bubble is. We investigated a case where GPU utilization numbers looked healthy but training was slow.
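
Before reaching for eBPF, a rough first check for the starvation described above is to time how long the training loop blocks on its data iterator versus how long it spends in the step itself. The following is a minimal sketch, not the tracing method used in this investigation: `measure_data_wait`, `slow_loader`, and `fast_step` are hypothetical names, and the loader and step are simulated with `time.sleep` so the example runs without a GPU.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_data_wait(
    batches: Iterable[object],
    train_step: Callable[[object], None],
) -> Tuple[float, float]:
    """Return (seconds blocked waiting for batches, seconds spent in train_step)."""
    wait = 0.0
    compute = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks here if the data loader cannot keep up
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait, compute


def slow_loader(n: int, delay: float) -> Iterator[int]:
    """Hypothetical stand-in for a starved data loader: each batch takes `delay` seconds."""
    for i in range(n):
        time.sleep(delay)
        yield i


def fast_step(batch: object) -> None:
    """Hypothetical stand-in for a training step that finishes quickly."""
    time.sleep(0.001)


if __name__ == "__main__":
    wait, compute = measure_data_wait(slow_loader(20, 0.01), fast_step)
    total = wait + compute
    print(f"waiting on data: {wait / total:.0%} of loop time")
```

In a real training loop, GPU kernels are launched asynchronously, so the step would need to synchronize with the device before the second timestamp for this split to mean anything. This check only shows that the loop is waiting on data; it does not show why, which is the question the host-side tracing addresses.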

nvidia-smi Reports 97% Utilization While the GPU Sits Idle
Ingero Team
