Do We Still Need Text Features for Video Retrieval in the Era of Vision-Language Models?