Do multimodal LLMs understand programming screenshots? Inferring questions and extracting relevant content