Evaluating agents for scientific discovery

Two benchmarks developed at Ai2 – ScienceWorld and DiscoveryWorld – reveal that even incredibly strong AI science agents struggle with problems human scientists solve routinely.