Bridging Viewpoints in Cross-View Geo-Localization With Siamese Vision Transformer

Bibliographic Details
Published in: IEEE Transactions on Geoscience and Remote Sensing, Vol. 62, pp. 1-12
Main Authors: Ahn, Woo-Jin; Park, So-Yeon; Pae, Dong-Sung; Choi, Hyun-Duck; Lim, Myo-Taeg
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024
Summary: Cross-view geo-localization (CVGL) aims to determine the locations of ground-view images using corresponding aerial views. However, the inherent differences in viewpoint and appearance between these cross-view images complicate accurate localization. Existing CVGL approaches have focused on aligning image perspectives but often overlook the information distortion that occurs during the alignment process. To address this challenge, we propose a novel triple-vision transformers (TriViTs) framework that fully utilizes both the spatially aligned and the original images. The TriViTs consist of a Siamese network and two independent networks. The Siamese network takes polar-transformed ground or aerial images and extracts coarse features from the aligned images. Meanwhile, the two independent networks take the original aerial and ground images, respectively, to compensate for the distortion introduced by the alignment process. Furthermore, we introduce geo-aligned triplet learning using a polar warping network (PWN) for the independent transformers to strengthen the correlation between image descriptors across the cross-view images. Extensive experiments on the benchmark datasets CVUSA and CVACT show that our method outperforms existing methods. Moreover, the proposed simple yet effective approach scales across different backbones, including CNNs.
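The polar alignment and triplet learning steps mentioned in the summary admit a compact illustration. The NumPy sketch below follows the radial-to-vertical, azimuth-to-horizontal polar mapping and the soft-margin triplet objective that are standard in the CVGL literature; it is a minimal sketch under those assumptions, not the paper's learned polar warping network, and the helper names polar_transform and soft_margin_triplet are hypothetical.

```python
import numpy as np

def polar_transform(aerial: np.ndarray, height: int, width: int) -> np.ndarray:
    """Warp a square aerial image (S x S x C) into an H x W pseudo-panorama.

    Output rows sweep the radius from the aerial image edge (top row)
    toward its centre (bottom row); output columns sweep the azimuth.
    Nearest-neighbour sampling keeps the sketch short.
    """
    size = aerial.shape[0]
    out = np.zeros((height, width, aerial.shape[2]), dtype=aerial.dtype)
    for i in range(height):
        for j in range(width):
            r = (size / 2.0) * (height - i) / height   # radius for this row
            theta = 2.0 * np.pi * j / width            # azimuth for this column
            y = size / 2.0 - r * np.cos(theta)         # aerial row coordinate
            x = size / 2.0 + r * np.sin(theta)         # aerial column coordinate
            out[i, j] = aerial[int(np.clip(y, 0, size - 1)),
                               int(np.clip(x, 0, size - 1))]
    return out

def soft_margin_triplet(anchor, positive, negative, alpha=10.0):
    """Soft-margin triplet loss on descriptor batches (N x D), a common
    retrieval objective in CVGL training pipelines."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.log1p(np.exp(alpha * (d_pos - d_neg)))))

# Usage: warp a dummy 256x256 aerial image to a 128x512 panorama.
panorama = polar_transform(np.zeros((256, 256, 3), dtype=np.float32), 128, 512)
```

In practice the per-pixel loop would be replaced by a precomputed sampling grid with bilinear interpolation (e.g., torch.nn.functional.grid_sample) so the warp is differentiable and fast.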
ISSN: 0196-2892
EISSN: 1558-0644
DOI: 10.1109/TGRS.2024.3429570