Background: Severity assessment of thyroid eye disease (TED) using the EUGOGO classification is partly subjective and prone to interobserver variability. Although MRI-derived anatomical measurements, such as ocular protrusion and extraocular muscle thickness, offer objective features, they are underutilized in machine learning (ML) models, which often rely on non-interpretable radiomic features. Moreover, the inclusion of longitudinal scans from the same patient may artificially inflate model pe