NPC Nano 0.5B: From-Scratch Pretraining and the Post-Training Capability Ceiling at Sub-1B Parameters
We present NPC Nano, a 501M-parameter decoder-only language model pretrained from random initialization on 8.93B tokens using a single NVIDIA A40 GPU. We document (i) the pretraining recipe; (ii) a label-shift bug encountered during training and the pre-launch sanity gate that prevents its recurrence; (iii) an identity-layer methodology with empirically recalibrated capability gates; and (iv) a four-experiment characterization of the post-training capacity bottleneck at 0.5B parameters, where baseline accuracy on GSM8K is below 2%.
