Deformable convolutions in multi-view stereo

Bibliographic Details
Published in	Image and vision computing Vol. 118; p. 104369
Main Authors	Masson, Juliano Emir Nunes, Petry, Marcelo Roberto, Coutinho, Daniel Ferreira, Honório, Leonardo de Mello
Format	Journal Article
Language	English
Published	Elsevier B.V. 01.02.2022
Subjects	Deep learning, Depth map, Multi-view stereo
Summary:	Multi-View Stereo (MVS) is a key process in the photogrammetry workflow. It takes the camera views and finds the maximum number of matches between the images, yielding a dense point cloud of the observed scene. Because this process is based on matching between images, it depends heavily on the ability to match features across different images. To improve matching performance, several researchers have proposed using Convolutional Neural Networks (CNNs) to solve the MVS problem. Despite the progress that CNNs have brought to MVS, the Video RAM (VRAM) consumption of these approaches is usually far greater than that of classical methods, which rely more on RAM, and RAM is cheaper to expand than VRAM. This work follows the progress made in CasMVSNet in reducing GPU memory usage and further studies changes to the feature extraction process. Average Group-wise Correlation (AGC) is used in the cost volume generation to reduce the number of channels in the cost volume, yielding a reduction in GPU memory usage without noticeable penalties in the result. Deformable convolutions are applied in the feature extraction network to augment the spatial sampling locations with learned offsets, without additional supervision, to further improve the network's ability to model transformations. The impact of these changes is measured with quantitative and qualitative tests on the DTU and the Tanks and Temples datasets. On the DTU dataset, the modifications reduced GPU memory usage by 32% and improved completeness by 9%, with a penalty of 6.6% in accuracy.
• The use of AGC in the cost volume yields a reduction in GPU memory usage without noticeable penalties in the result.
• The use of deformable convolutions in the feature extraction network improved the network's ability to model transformations.
ISSN:	0262-8856 1872-8138
DOI:	10.1016/j.imavis.2021.104369
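
The summary states that Average Group-wise Correlation reduces the number of channels in the cost volume. The sketch below illustrates the general idea for a single pixel at a single depth hypothesis. It is not taken from the article: the function names, the toy feature vectors, and the exact normalization (mean of products within each group, then a mean over source views) are assumptions based on the usual definition of group-wise correlation. The point it shows is that C feature channels collapse to G values per cost-volume entry.

```python
def groupwise_correlation(ref, src, num_groups):
    """Correlate two C-channel feature vectors group by group.

    Returns num_groups values; each is the mean of the element-wise
    products inside one group of C / num_groups channels.
    (Hypothetical helper, assumed normalization.)
    """
    channels = len(ref)
    if len(src) != channels:
        raise ValueError("feature vectors must have the same length")
    if channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = channels // num_groups
    result = []
    for g in range(num_groups):
        lo, hi = g * size, (g + 1) * size
        dot = sum(r * s for r, s in zip(ref[lo:hi], src[lo:hi]))
        result.append(dot / size)
    return result


def average_groupwise_correlation(ref, sources, num_groups):
    """Average the group-wise correlation of ref over all source views."""
    if not sources:
        raise ValueError("at least one source view is required")
    per_view = [groupwise_correlation(ref, s, num_groups) for s in sources]
    # zip(*per_view) walks the views group by group.
    return [sum(vals) / len(per_view) for vals in zip(*per_view)]


# Toy data: one reference feature and two warped source features, C = 4.
ref = [1.0, 2.0, 3.0, 4.0]
sources = [[1.0, 0.0, 1.0, 1.0], [3.0, 2.0, 1.0, 1.0]]
cost = average_groupwise_correlation(ref, sources, num_groups=2)
print(cost)  # 4 channels reduced to 2 cost-volume channels
```

In this toy case the cost entry has 2 channels instead of 4, so a cost volume built this way would need half the memory of one that keeps every feature channel. The actual group count and memory savings reported in the article depend on its network configuration, which the summary does not specify.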