Efficient Adaptation of Vision Foundation Model for High-Resolution Remote Sensing Image Segmentation via Spatial-Frequency Modeling and Sparse Refinement
Chenlong Ding · Xin Li · Daofang Liu · Zhihao Shi · Xin Lyu · Zhenyu Fang · Xue Liu · Lingqiang Meng · Yiwei Fang · C Z Zhang · Chengyi Shi
High-resolution remote-sensing semantic segmentation requires models to simultaneously capture global scene semantics and preserve fine-grained local structures. Although satellite-pretrained vision foundation models provide strong transferable representations, the features extracted by a frozen backbone remain insufficiently adapted to dense prediction, particularly for representing high-frequency details and multiscale local patterns. In addition, correcting residual prediction errors with den
