AsgardBench: A benchmark for visually grounded interactive planning

Andrea Tupini, Lars Liden, Reuben Tan, Yu Wang, Jianfeng Gao
At a glance

- To successfully complete tasks, embodied AI agents must ground and update their plans based on visual feedback.
- AsgardBench isolates whether agents can use visual observations to revise their plans as tasks unfold.
- Spanning 108 controlled task instances across 12 task types, the benchmark requires agents to adapt their plans based on what they observe.
- Because objects can be in different positions and states (e.g., clean or dirty), the same instruction can require different..
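
As a toy illustration of the point about object states, the sketch below shows how one fixed instruction can lead to different step sequences depending on what the agent observes. This is not AsgardBench's actual API or task format: the `Observation` class, the function name, and the step strings are all hypothetical and exist only to make the idea concrete.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical summary of what the agent sees (not an AsgardBench type)."""
    mug_is_dirty: bool
    mug_location: str  # e.g. "counter" or "sink"


def plan_put_clean_mug_in_cabinet(obs: Observation) -> list[str]:
    """Build the steps for one fixed instruction from the observed state."""
    steps = [f"go to {obs.mug_location}", "pick up mug"]
    if obs.mug_is_dirty:
        # A dirty mug needs washing first; walk to the sink only if
        # the agent is not already there.
        if obs.mug_location != "sink":
            steps.append("go to sink")
        steps.append("wash mug")
    steps += ["go to cabinet", "place mug in cabinet"]
    return steps


# Same instruction, two different observed states (both made up).
clean_plan = plan_put_clean_mug_in_cabinet(Observation(False, "counter"))
dirty_plan = plan_put_clean_mug_in_cabinet(Observation(True, "counter"))
```

Here the instruction never changes, yet `dirty_plan` contains washing steps that `clean_plan` does not. An agent that ignores the observation and always emits the same sequence would fail one of the two cases, which is the kind of behavior a visually grounded benchmark is meant to expose.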