Current evaluation benchmarks for large language models often conflate genuine reasoning ability with memorization and statistical pattern matching. As a result, high benchmark scores do not necessarily indicate the capacity to induce abstract rules or to reason systematically from limited data. In this paper, we propose a methodological framework for evaluating linguistic reasoning in large language models based on problems from international and national linguistics olympiads. These problems a
