Limitations of large language models in clinical problem-solving arising from inflexible reasoning
Abstract
Large language models (LLMs) have attained human-level accuracy on medical question-answering (QA) benchmarks. However, recent work has shown that they struggle with clinical scenarios that require flexible reasoning, raising concerns about the robustness and generalizability of LLM reasoning across diverse, real-world medical tasks. To probe potential LLM failure modes in clinical problem-solving, we present the medical abstraction and reasoning corpus (mARC-QA). mARC-QA assesses
