Foundation models (FMs) are increasingly driving advances in a variety of tasks that fall under the purview of computer audition, i.e., the use of machines to understand sounds. They offer several advantages over traditional pipelines, among them the ability to consolidate multiple tasks in a single model, the option to leverage knowledge from other modalities, and readily available interaction with human users. Naturally, these promises have created substantial excitement in