CSR-Net++: Rethinking Context Structure Representation Learning for Feature Matching

Bibliographic Details
Published in: IEEE Transactions on Geoscience and Remote Sensing, Vol. 62, pp. 1-12
Main Authors: Chen, Xiaoxian; Chen, Jiaxuan
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024

Summary: Seeking good feature correspondences between two remote sensing (RS) images is a fundamental problem in RS and photogrammetry. Traditional approaches often require a predefined geometric transformation model or additional handcrafted descriptors, which significantly constrains their versatility. In this work, we adopt the recent context structure representation network (CSR-Net), which has shown promising performance in general feature matching, and propose modifications, named CSR-Net++, to overcome its main limitations. Specifically, CSR-Net relies on a PointNet-like geometry estimator for global preregistration, which is sensitive to large deformations. In addition, CSR-Net learns local consensus representations over a fixed-size grid, so its space-aware capacity is limited by the grid's pixelwise max-pooling operations. To tackle these limitations, we first introduce a pruning layer for matching guided by global consensus, rather than relying on a geometric estimator. We then propose a modified context structure representation (CSR) learning module that learns consensus representations directly from points through an independent spatial location stream and a stand-alone visual stream (VS). This decomposition separates local consensus into positional consensus and visual consensus. The proposed dual-stream representation learning not only avoids the introduction of grid anchors but also provides visual contextual priors. To demonstrate the robustness and versatility of CSR-Net++, we conducted comprehensive experiments on diverse sets of real image pairs for general feature matching. The results demonstrate the superiority of CSR-Net++ in most matching scenarios, achieving a 0.47%-4.70% improvement in F-score for multimodal images over existing leading methods.
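To make the dual-stream idea in the abstract concrete, the sketch below shows one plausible way to score putative correspondences with an independent spatial location stream and a separate visual stream whose outputs are fused into a per-match consensus score. This is a minimal PyTorch sketch written from the abstract alone; all names (DualStreamCSR, the stream widths, the fusion head) are illustrative assumptions and do not reproduce the authors' implementation.

```python
# Hypothetical dual-stream consensus scoring module (not the authors' code).
import torch
import torch.nn as nn


class DualStreamCSR(nn.Module):
    """Scores putative matches with separate positional and visual streams."""

    def __init__(self, visual_dim: int = 128, hidden: int = 64):
        super().__init__()
        # Spatial location stream: consumes the 4-D coordinates of a putative
        # match (x1, y1, x2, y2), assumed normalized to the image frames.
        self.pos_stream = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Visual stream (VS): consumes the concatenated descriptors of the two
        # matched keypoints and supplies visual contextual priors.
        self.vis_stream = nn.Sequential(
            nn.Linear(2 * visual_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Fusion head: maps combined positional + visual consensus features
        # to one inlier logit per correspondence.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords: torch.Tensor, descs: torch.Tensor) -> torch.Tensor:
        # coords: (N, 4) matched coordinates; descs: (N, 2 * visual_dim) descriptors.
        pos = self.pos_stream(coords)
        vis = self.vis_stream(descs)
        return self.head(torch.cat([pos, vis], dim=-1)).squeeze(-1)


if __name__ == "__main__":
    model = DualStreamCSR()
    coords = torch.rand(500, 4)        # 500 putative matches
    descs = torch.rand(500, 256)       # concatenated 128-D descriptors
    print(model(coords, descs).shape)  # torch.Size([500]); higher logit = likelier inlier
```

Note that this sketch scores each correspondence independently; the paper's module presumably also aggregates context across points (the "consensus" aspect), which is omitted here for brevity.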
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2024.3431008