Deep learning algorithms have driven substantial progress in remote sensing semantic segmentation. However, conventional approaches typically rely on predefined semantic categories, necessitating costly data annotation and model retraining when new classes are introduced. While large-scale vision-language models, such as CLIP, enable segmentation of arbitrary class with natural language guidance, their limited localization capability poses challenges for dense prediction tasks. This study invest
