Auto-labelling 1.2M robotics frames with VLMs: a failover story

Marco Rinaldi
TL;DR: We needed to caption 1.2M reconstructed event-camera frames using vision-language models for auxiliary supervision. The first run died at 340K frames when it hit Anthropic rate limits. Putting Bifrost in front of three VLM providers cut the rerun cost by 22%, and the rerun finished in 9 hours.

When you work at a neuromorphic vision startup, your training data looks strange. At Prophesee we accumulate event streams into time-binned windows that we render into pseudo-frames. For a self-supervised