CSR-Net++: Rethinking Context Structure Representation Learning for Feature Matching
Published in: IEEE Transactions on Geoscience and Remote Sensing, Vol. 62, pp. 1-12
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024
Summary: Seeking good feature correspondences between two remote sensing (RS) images is an essential problem in RS and photogrammetry. Traditional approaches often require a predefined geometric transformation model or additional handcrafted descriptors, which significantly constrains their versatility. In this work, we adopt the recent context structure representation network (CSR-Net), which has shown promising performance on general feature matching problems, and propose modifications, named CSR-Net++, to overcome its main limitations. Specifically, CSR-Net relies on a PointNet-like geometry estimator for global preregistration, which is sensitive to large deformations. Moreover, CSR-Net learns local consensus representation through a fixed-size grid, which limits its space-aware capacity due to grid pixelwise max-pooling operations. To address these limitations, we first introduce a pruning layer for matching guided by global consensus, as opposed to relying on a geometric estimator. We then propose a modified context structure representation (CSR) learning module that learns consensus representation directly from points through an independent spatial location stream and a stand-alone visual stream (VS). This decomposition separates local consensus into positional consensus and visual consensus. The proposed dual-stream representation learning not only avoids introducing grid anchors but also provides visual contextual priors. To demonstrate the robustness and versatility of CSR-Net++, we conducted comprehensive experiments on diverse sets of real image pairs for general feature matching. The results demonstrate the superiority of CSR-Net++ in most matching scenarios, achieving a 0.47%-4.70% improvement in F-score on multimodal images over existing leading methods.
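The decomposition of local consensus into a positional stream and a visual stream, as described in the abstract, can be illustrated with a toy sketch. This is not the paper's actual implementation: the neighbour count `k`, the threshold `tau`, and all function names are illustrative assumptions. The idea shown is that a putative match is likely correct when its motion vector agrees with those of its spatial neighbours (positional consensus) and its two descriptors are similar (visual consensus).

```python
import math

def positional_consensus(matches, k=3):
    # matches: list of ((x1, y1), (x2, y2)) putative correspondences.
    # Score each match by how well the motion vectors of its k nearest
    # neighbours (in the first image) agree with its own motion vector.
    scores = []
    for i, ((x1, y1), (x2, y2)) in enumerate(matches):
        mv = (x2 - x1, y2 - y1)
        neigh = sorted(
            (j for j in range(len(matches)) if j != i),
            key=lambda j: (matches[j][0][0] - x1) ** 2
                        + (matches[j][0][1] - y1) ** 2,
        )[:k]
        agree = 0.0
        for j in neigh:
            (a1, b1), (a2, b2) = matches[j]
            nv = (a2 - a1, b2 - b1)
            # Gaussian-like agreement: 1 for identical motion, ~0 otherwise.
            agree += math.exp(-((mv[0] - nv[0]) ** 2 + (mv[1] - nv[1]) ** 2))
        scores.append(agree / max(len(neigh), 1))
    return scores

def visual_consensus(desc1, desc2):
    # Cosine similarity between the two descriptors of each match.
    sims = []
    for d1, d2 in zip(desc1, desc2):
        dot = sum(a * b for a, b in zip(d1, d2))
        n1 = math.sqrt(sum(a * a for a in d1))
        n2 = math.sqrt(sum(b * b for b in d2))
        sims.append(dot / (n1 * n2 + 1e-9))
    return sims

def prune(matches, desc1, desc2, tau=0.5):
    # Keep matches whose combined positional * visual score exceeds tau.
    pos = positional_consensus(matches)
    vis = visual_consensus(desc1, desc2)
    return [m for m, p, v in zip(matches, pos, vis) if p * v > tau]
```

On a small example with four matches sharing a common translation plus one outlier with a wildly different motion vector and a dissimilar descriptor, `prune` retains the four consistent matches and discards the outlier. The actual CSR-Net++ module learns both streams end to end rather than using hand-set scores like these.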
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2024.3431008