One-to-Many Retrieval Between UAV Images and Satellite Images for UAV Self-Localization in Real-World Scenarios

Bibliographic Details
Published in: Remote Sensing (Basel, Switzerland), Vol. 17, No. 17, p. 3045
Main Authors: Li, Jiaqi; Sun, Yuli; Xiang, Yaobing; Lei, Lin
Format: Journal Article
Language: English
Published: Basel: MDPI AG, 01.09.2025
ISSN: 2072-4292
DOI: 10.3390/rs17173045

Summary: Matching drone images to satellite reference images is a critical step in UAV self-localization. Existing drone visual localization datasets mainly focus on target localization, where each drone image is paired with a corresponding satellite image slice, typically with identical coverage. However, this one-to-one setup does not reflect real-world UAV self-localization needs, as it can neither guarantee exact matches between drone images and satellite tiles nor reliably identify the correct satellite slice. To bridge this gap, we propose a one-to-many matching method between drone images and satellite reference tiles. First, we enhance the UAV-VisLoc dataset, making it the first in the field tailored to one-to-many imperfect matching in UAV self-localization. Second, we introduce a novel loss function, Incomp-NPair Loss, which reflects real-world imperfect matching scenarios better than traditional methods. Finally, to address challenges such as limited dataset size, training instability, and large scale differences between drone images and satellite tiles, we adopt a Vision Transformer (ViT) baseline and integrate CNN-extracted features into its patch embedding layer.
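
The record does not give the exact formulation of the Incomp-NPair Loss, so the following is only a minimal sketch of a generic N-pair-style retrieval objective extended to the one-to-many, possibly-unmatched setting the abstract describes. The function name, tensor shapes, positive-mask convention, and temperature value are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only: a generic N-pair-style loss with multiple (or zero)
# positive satellite tiles per drone query; not the paper's Incomp-NPair Loss.
import torch
import torch.nn.functional as F


def one_to_many_npair_loss(drone_emb, sat_emb, pos_mask, temperature=0.05):
    """drone_emb: (B, D) drone-image embeddings
    sat_emb:     (M, D) satellite-tile embeddings
    pos_mask:    (B, M) bool, True where tile m (partially) covers query b
    temperature: softmax temperature for the similarity logits
    """
    drone_emb = F.normalize(drone_emb, dim=-1)
    sat_emb = F.normalize(sat_emb, dim=-1)
    logits = drone_emb @ sat_emb.t() / temperature            # (B, M)

    # Log-softmax over all candidate tiles for each query.
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average log-likelihood over every positive tile of each query;
    # queries with no positive tile contribute zero loss.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    per_query = -(log_prob * pos_mask).sum(dim=1) / pos_count
    has_pos = pos_mask.any(dim=1).float()
    return (per_query * has_pos).sum() / has_pos.sum().clamp(min=1)
```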
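
Likewise, the abstract only states that CNN-extracted features are fed into the ViT patch embedding layer; the backbone choice (ResNet-18), the 1x1 projection, and the token layout below are assumptions showing one common hybrid-stem arrangement, not the authors' architecture.

```python
# Illustrative sketch only: a hybrid CNN stem feeding a ViT-style token sequence.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class CNNPatchEmbed(nn.Module):
    """Replace the ViT linear patch projection with CNN feature-map tokens."""

    def __init__(self, embed_dim=768):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep only the convolutional stages; a 224x224 input yields a
        # 7x7 feature map with 512 channels, i.e. 49 tokens.
        self.stem = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(512, embed_dim, kernel_size=1)

    def forward(self, x):                          # x: (B, 3, H, W)
        feats = self.stem(x)                       # (B, 512, H/32, W/32)
        tokens = self.proj(feats)                  # (B, embed_dim, h, w)
        return tokens.flatten(2).transpose(1, 2)   # (B, h*w, embed_dim)
```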