Evaluating systematic linguistic reasoning in large language models via linguistics olympiad problems

Current evaluation benchmarks for large language models often conflate genuine reasoning ability with memorization and statistical pattern matching. As a result, high benchmark scores do not necessarily indicate a capacity to induce abstract rules or to reason systematically from limited data. In this paper, we propose a methodological framework, based on problems from international and national linguistics olympiads, for evaluating linguistic reasoning in large language models. These problems a