CDFIT: A Transformer Using Cross-Modal Dual-Stream Feature Interaction for Multispectral Pedestrian Detection

Modality imbalance is a significant challenge for multi-modal interaction at various depths in multispectral pedestrian detection under varying illumination conditions. To overcome the limitations of current cross-attention mechanisms in addressing modality imbalance, we propose the Cross-Modal Dual-Stream Feature Interaction Transformer (CDFIT). CDFIT capitalizes on the Transformer’s ability to learn long-range dependencies, extracting global intra-modal and inter-modal correlations during the featu