From Verification to Internalization: A Cognitive Science Perspective on Verifier-Free Reinforcement Learning

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning of Large Language Models (LLMs) by providing objective feedback, yet it remains fundamentally dependent on external verifiers, which limits self-regulated reasoning and generalization. We propose a shift toward internalization, relocating verification from external infrastructure into model-internal signals. We formalize this paradigm through a four-dimensional taxonomy: Probabilistic, Uncertainty, Process, and Interaction