Bridging Viewpoints in Cross-View Geo-Localization With Siamese Vision Transformer
Published in: IEEE Transactions on Geoscience and Remote Sensing, Vol. 62, pp. 1-12
Format: Journal Article
Language: English
Published: New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2024
Summary: Cross-view geo-localization (CVGL) aims to determine the locations of ground-view images using corresponding aerial views. However, the inherent differences in viewpoint and appearance between these cross-view images complicate accurate localization. Existing CVGL approaches have focused on aligning image perspectives, but they often overlook the information distortion introduced by the alignment process. To address this challenge, we propose a novel triple-vision transformers (TriViTs) framework that fully utilizes both the spatially aligned and the original images. TriViTs consists of a Siamese network and two independent networks. The Siamese network takes polar-transformed ground or aerial images to extract coarse features from the aligned images. Meanwhile, the two independent networks take the original images (i.e., aerial or ground images, respectively) to compensate for the distortion caused by the alignment process. Furthermore, we introduce geo-aligned triplet learning using a polar warping network (PWN) for the independent transformers to strengthen the correlation between image descriptors of cross-view images. Extensive experiments on the benchmark datasets CVUSA and CVACT show that our method outperforms existing methods. Moreover, the proposed simple yet effective approach scales across different backbones, including CNNs.
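The polar transform mentioned in the summary maps a square aerial image into a panorama-like view so that it roughly aligns with a ground-level image. As a minimal illustrative sketch (not the paper's implementation; output sizes, nearest-neighbor sampling, and the `polar_transform` name are assumptions), each row of the output corresponds to a radius from the aerial image center and each column to an azimuth angle:

```python
import numpy as np

def polar_transform(aerial, out_h=128, out_w=512):
    """Warp a square S x S aerial image into a panorama-like view.

    Rows of the output sweep from the image edge (bottom) toward the
    center (top); columns sweep the full 360-degree azimuth.
    Nearest-neighbor sampling keeps the sketch short.
    """
    S = aerial.shape[0]
    i = np.arange(out_h)[:, None]            # output rows -> radius
    j = np.arange(out_w)[None, :]            # output cols -> azimuth
    r = (S / 2.0) * (out_h - i) / out_h      # radius shrinks toward horizon
    theta = 2.0 * np.pi * j / out_w          # azimuth angle in [0, 2*pi)
    ya = np.clip((S / 2.0 + r * np.cos(theta)).astype(int), 0, S - 1)
    xa = np.clip((S / 2.0 + r * np.sin(theta)).astype(int), 0, S - 1)
    return aerial[ya, xa]                    # sample the aerial image

# toy usage: warp a 64 x 64 single-channel aerial patch
pano = polar_transform(np.random.rand(64, 64), out_h=32, out_w=128)
print(pano.shape)  # (32, 128)
```

The resulting image still contains radial distortion, which is why the abstract pairs the aligned (polar-transformed) branch with independent branches on the original images.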
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2024.3429570