In-Depth-MoE: Predictive Sequence-Level Gating via Speculative Layer Allocation

Modern Large Language Models (LLMs) suffer from static computation depth, where trivial and highly complex prompts consume identical vertical computational resources (FLOPs). While Mixture-of-Experts (MoE) architectures provide horizontal sparsity, they fail to address the layer-wise redundancy and GPU thread divergence caused by token-level routing. In this paper, we propose In-Depth-MoE, a novel architectural paradigm that introduces predictive sequence-level layer gating. By utilizing an ultra-lig