Deficiencies in clinical reasoning of LLMs in low back pain management and remediation via prompt engineering: from performance evaluation to error diagnosis

Background: Large language models (LLMs) show promise in medical tasks, but their systematic error patterns in high-stakes clinical settings remain poorly understood, limiting safe deployment.

Methods: A three-phase simulation study was conducted. In phase 1, researchers selected 103 multiple-choice questions and 30 clinical scenario questions derived from a low back pain (LBP) examination question bank and clinical guidelines, and systematically evaluated five mainstream LLMs (GPT-5, GPT-4o, GPT-o3, Deepseek-V2.5, and Gr