PipeOrgan: Modeling Memory-Bandwidth-Bound Executions for AI and Beyond
Mark D. Hill
TL;DR: Latency-tolerant architectures such as GPUs increasingly use memory/storage hierarchies, e.g., for KV caches that speed large-language-model (LLM) AI inference. To aid codesign of such workloads and architectures, we develop PipeOrgan, a simple analytic model for bandwidth-bound workloads running on memory/storage hierarchies.
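As context for what "bandwidth-bound" means here, the sketch below applies the standard roofline-style test: a kernel is limited by memory bandwidth when its arithmetic intensity (FLOPs per byte moved) falls below the machine balance (peak FLOP/s divided by peak bytes/s). This is an illustrative check only, not the PipeOrgan model itself, and the hardware numbers in the example are hypothetical.

```python
def is_bandwidth_bound(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline-style check: a kernel is bandwidth-bound when its
    arithmetic intensity (FLOPs per byte) is below the machine
    balance (peak FLOP/s per byte/s of memory bandwidth)."""
    intensity = flops / bytes_moved   # FLOPs per byte the kernel performs
    balance = peak_flops / peak_bw    # FLOPs per byte the machine can sustain
    return intensity < balance

# Hypothetical GPU-like numbers: 1000 TFLOP/s peak compute and
# 3 TB/s memory bandwidth give a machine balance of ~333 FLOPs/byte.
# LLM decode streams each weight once per token: roughly 2 FLOPs per
# 2-byte (fp16) parameter, i.e., intensity ~1 FLOP/byte -> bandwidth-bound.
print(is_bandwidth_bound(flops=2e9, bytes_moved=2e9,
                         peak_flops=1000e12, peak_bw=3e12))  # True
```

By the same test, a compute-dense kernel such as a large matrix multiply (intensity well above the machine balance) would return False, which is why prefill and decode phases of LLM inference behave so differently.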
Background
For three reasons, memory bandwidth, more than latency, limits AI inference performance. First, AI inference uses latency-tolerant compute engines, such as...
