Symmetric Image-Text Tuning With Entropy-Guided Fusion for Online Continual Learning in Non-Stationary Visual Streams

Online continual learning studies how models learn from continuous, non-stationary data streams. In this paper, we observe that CLIP models exhibit an asymmetric image-text interaction under online continual learning. Specifically, the text features of previously seen classes can introduce unfavorable supervision when paired with the visual features of newly observed data, which leads to catastrophic forgetting. To alleviate this issue, we propose a simple yet effective symmetric image-text tuning (SIT)