Bridging the Depth Gap: Adaptive Scene-Instance Alignment for Training-Free Depth Refinement in Robotic Manipulation Scenes

Monocular depth estimation foundation models provide robust depth priors with exceptional generalization capabilities; however, in a zero-shot setting their predictions typically lack a reliable metric scale and contain local inconsistencies, which limits their deployment in unconstrained real-world environments. Meanwhile, sensor-based depth measurements are often sparse or noisy, especially in challenging scenarios such as scenes with transparent objects or cluttered robotic workspaces. To address this lim