Benchmarking AI on Standard Chemistry Exams: LLMs Still Underperform Compared to High School Students

Abstract

As Large Language Models (LLMs) become increasingly prevalent in science education, it is important to understand how their capabilities compare with those of human learners on authentic learning tasks. Such understanding is crucial for designing AI-resilient assessments and for developing AI tutors that can guide students in problem solving. Using standardized assessments as benchmarks grounds these comparisons in widely accepted educational criteria. To date, most educational be