ViT-ConvGAN: a hybrid model for spatiotemporal action recognition using video transformer and 3D CNN