TorchTPU: Running PyTorch Natively on TPUs at Google Scale

Claudio Basile; Kat Ko; Ben Wilson; Lee Howes; Bill Jia; Joe Pamer; Michael Voznesensky; Robert Hundt

The challenges of building modern AI infrastructure have fundamentally shifted. The frontier of machine learning now requires distributed systems spanning thousands of accelerators. As models scale to clusters of O(100,000) chips, the software that powers them must meet new demands for performance, hardware portability, and reliability. At Google, our Tensor Processing Units (TPUs) are foundational to our supercomputing infrastructure. These custom ASICs...