The growing demand of smart surveillance systems necessitates the accurate and real-time detection of weapons and face recognition with robustness against occlusion, illumination changes, and complex backgrounds. Existing techniques based on standalone CNN or transformer architectures are less effective in capturing local fine-grained features as well as long-range dependencies. This paper presents ConViDeTR, a hybrid deep learning framework that integrates CNN, Vision Transformer (ViT), and Det