ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation


Bibliographic Details
Published in: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 28285-28295
Main Authors: Patni, Suraj; Agarwal, Aradhye; Arora, Chetan
Format: Conference Proceeding
Language: English
Published: IEEE, 16.06.2024

Summary: In the absence of parallax cues, a learning-based single image depth estimation (SIDE) model relies heavily on shading and contextual cues in the image. While this simplicity is attractive, it is necessary to train such models on large and varied datasets, which are difficult to capture. It has been shown that using embeddings from pretrained foundation models, such as CLIP, improves zero-shot transfer in several applications. Taking inspiration from this, in our paper we explore the use of global image priors generated from a pretrained ViT model to provide more detailed contextual information. We argue that the embedding vector from a ViT model, pretrained on a large dataset, captures more relevant information for SIDE than the usual route of generating pseudo image captions followed by CLIP-based text embeddings. Based on this idea, we propose a new SIDE model using a diffusion backbone conditioned on ViT embeddings. Our proposed design establishes a new state-of-the-art (SOTA) for SIDE on the NYU Depth v2 dataset, achieving an Abs Rel error of 0.059 (14% improvement) compared to 0.069 by the current SOTA (VPD), and on the KITTI dataset a Sq Rel error of 0.139 (2% improvement) compared to 0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model trained on NYU Depth v2, we report a mean relative improvement of (20%, 23%, 81%, 25%) over NeWCRFs on the (Sun-RGBD, iBims-1, DIODE, HyperSim) datasets, compared to (16%, 18%, 45%, 9%) by ZoeDepth. The code is available on our project page.
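
To make the abstract's core idea concrete, below is a minimal PyTorch sketch of conditioning spatial features on a global ViT image embedding via cross-attention, in the spirit described above. It is an illustration under assumptions, not the authors' implementation: the module names (ViTConditioner, CrossAttentionBlock), the token count, and the dimensions are hypothetical, and the cross-attention block merely stands in for the conditioning pathway of a diffusion UNet.

import torch
import torch.nn as nn

class ViTConditioner(nn.Module):
    # Projects a global ViT image embedding into a small set of
    # conditioning tokens, in place of CLIP text embeddings derived
    # from pseudo-captions (illustrative stand-in, not the paper's code).
    def __init__(self, vit_dim=768, cond_dim=768, num_tokens=8):
        super().__init__()
        self.num_tokens = num_tokens
        self.cond_dim = cond_dim
        self.proj = nn.Linear(vit_dim, num_tokens * cond_dim)

    def forward(self, vit_embedding):           # (B, vit_dim)
        tokens = self.proj(vit_embedding)       # (B, num_tokens * cond_dim)
        return tokens.view(-1, self.num_tokens, self.cond_dim)

class CrossAttentionBlock(nn.Module):
    # Stand-in for one cross-attention layer inside a diffusion UNet:
    # flattened spatial features attend to the ViT conditioning tokens.
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats, cond):             # feats: (B, HW, dim)
        attended, _ = self.attn(self.norm(feats), cond, cond)
        return feats + attended                 # residual update

# Toy usage: a global image embedding conditions a spatial feature map.
vit_embedding = torch.randn(2, 768)             # e.g. a ViT [CLS] token
feats = torch.randn(2, 24 * 24, 768)            # flattened UNet feature map
cond = ViTConditioner()(vit_embedding)          # (2, 8, 768)
out = CrossAttentionBlock()(feats, cond)
print(out.shape)                                # torch.Size([2, 576, 768])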
ISSN: 2575-7075
DOI: 10.1109/CVPR52733.2024.02672