Evaluating Large Language Models for Educational Measurement: Insights from Automated and Human Scoring of Language Exams

This study investigates the use of large language models (LLMs), namely ChatGPT-5, Claude Opus 4.1, Gemini Advanced 2.5 Pro, DeepSeek Pro, Qwen-3 Max, and Mistral Le Chat Pro, together with a locally fine-tuned LLaMA 3.3 70B Instruct model, for automating assessment tasks in language education. Specifically, it examines how well these models automate the assessment of authentic midterm exam sheets from a “German as a Foreign Language” (GFL) course in three different scenarios: (1) general-purpose LLM