Bridging Viewpoints in Cross-View Geo-Localization With Siamese Vision Transformer
Published in: IEEE Transactions on Geoscience and Remote Sensing, Vol. 62, pp. 1-12
Format: Journal Article
Language: English
Published: New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2024
Summary: Cross-view geo-localization (CVGL) aims to determine the locations of ground-view images using corresponding aerial views. However, the inherent differences in viewpoint and appearance between these cross-view images complicate accurate localization. Existing CVGL approaches have focused on aligning image perspectives, but they often overlook the information distortion introduced by the alignment process. To address this challenge, we propose a novel triple-vision transformers (TriViTs) framework that fully utilizes both the spatially aligned and the original images. TriViTs consists of a Siamese network and two independent networks. The Siamese network takes polar-transformed ground or aerial images to extract coarse features from the aligned images. Meanwhile, the two independent networks take the original images (i.e., aerial or ground images, respectively) to compensate for the distortion caused by the alignment process. Furthermore, we introduce geo-aligned triplet learning using a polar warping network (PWN) for the independent transformers to strengthen the correlation between image descriptors of cross-view images. Extensive experiments on the benchmark datasets CVUSA and CVACT show that our method outperforms existing methods. Moreover, the proposed simple yet effective approach scales across different backbones, including CNNs.
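The polar transform mentioned in the summary maps a square aerial image into a panorama-like view so that it roughly aligns with a ground-level image. As a minimal illustrative sketch (not the paper's implementation; output sizes, nearest-neighbor sampling, and the `polar_transform` name are assumptions), each row of the output corresponds to a radius from the aerial image center and each column to an azimuth angle:

```python
import numpy as np

def polar_transform(aerial, out_h=128, out_w=512):
    """Warp a square S x S aerial image into a panorama-like view.

    Rows of the output sweep from the image edge (bottom) toward the
    center (top); columns sweep the full 360-degree azimuth.
    Nearest-neighbor sampling keeps the sketch short.
    """
    S = aerial.shape[0]
    i = np.arange(out_h)[:, None]            # output rows -> radius
    j = np.arange(out_w)[None, :]            # output cols -> azimuth
    r = (S / 2.0) * (out_h - i) / out_h      # radius shrinks toward horizon
    theta = 2.0 * np.pi * j / out_w          # azimuth angle in [0, 2*pi)
    ya = np.clip((S / 2.0 + r * np.cos(theta)).astype(int), 0, S - 1)
    xa = np.clip((S / 2.0 + r * np.sin(theta)).astype(int), 0, S - 1)
    return aerial[ya, xa]                    # sample the aerial image

# toy usage: warp a 64 x 64 single-channel aerial patch
pano = polar_transform(np.random.rand(64, 64), out_h=32, out_w=128)
print(pano.shape)  # (32, 128)
```

The resulting image still contains radial distortion, which is why the abstract pairs the aligned (polar-transformed) branch with independent branches on the original images.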
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2024.3429570