Favoring One Among Equals - Not a Good Idea: Many-to-one Matching for Robust Transformer based Pedestrian Detection

Bibliographic Details
Published in	2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 748-757
Main Authors	Shastry, K.N Ajay; Teja, K. Ravi Sri; Nigam, Aditya; Arora, Chetan
Format	Conference Proceeding
Language	English
Published	IEEE 03.01.2024
Subjects	Algorithms; Applications; Autonomous Driving; Codes; Computer vision; Costs; Image recognition and understanding; Pedestrians; Predictive models; Training; Transformers

More Information
Summary:	We investigate why transformer-based pedestrian detection models underperform convolutional neural network (CNN) based ones. CNN models generate dense pedestrian proposals, refine each proposal individually, and then apply non-maximum suppression (NMS) to produce sparse predictions. In contrast, transformer models select one proposal per ground-truth (GT) pedestrian box and backpropagate a positive gradient from it. All other proposals, many of them highly similar to the selected ones, receive a negative gradient. Although this yields sparse predictions and obviates the need for NMS, the arbitrary selection of one among many similar proposals hinders effective training and lowers pedestrian detection accuracy. To mitigate the problem, we replace the commonly used Kuhn-Munkres matching algorithm with a min-cost-flow based formulation and incorporate constraints such as: each ground-truth box is matched to at least one proposal, and many equally good proposals can be matched to a single ground-truth box. We propose the first transformer-based pedestrian detection model incorporating our matching algorithm. Extensive experiments show that our approach achieves a miss rate (lower is better) of 3.7 / 17.4 / 21.8 / 8.3 / 2.0 on the Eurocity / TJU-traffic / TJU-campus / Cityperson / Caltech datasets, compared to 4.7 / 18.7 / 24.8 / 8.5 / 3.1 for the current SOTA. Code is available at https://ajayshastry08.github.io/flow_matcher
ISSN:	2642-9381
DOI:	10.1109/WACV57701.2024.00081
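
The many-to-one matching described in the summary can be illustrated with a small min-cost-flow sketch. This is not the authors' implementation: the function names, the IoU-based cost, the background threshold `bg_iou`, and the per-box cap `max_per_gt` are all assumptions made for illustration. Each proposal sends one unit of flow either to a ground-truth box or to a background node, and a large negative cost on one slot per ground-truth box forces every box to receive at least one proposal.

```python
from collections import deque

def min_cost_flow(n, edges, s, t):
    """Min-cost max-flow by successive shortest paths (SPFA).
    edges: list of (u, v, capacity, integer cost). Returns the residual graph."""
    # Each arc is [to, residual capacity, cost, index of its reverse arc].
    graph = [[] for _ in range(n)]
    for u, v, cap, cost in edges:
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])
    while True:
        dist = [float("inf")] * n
        prev = [None] * n
        in_queue = [False] * n
        dist[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            in_queue[u] = False
            for k, (v, cap, cost, _) in enumerate(graph[u]):
                if cap > 0 and dist[u] + cost < dist[v]:
                    dist[v] = dist[u] + cost
                    prev[v] = (u, k)
                    if not in_queue[v]:
                        in_queue[v] = True
                        queue.append(v)
        if prev[t] is None:  # sink unreachable: flow is maximal
            return graph
        v = t
        while v != s:  # push one unit along the cheapest path
            u, k = prev[v]
            graph[u][k][1] -= 1
            graph[v][graph[u][k][3]][1] += 1
            v = u

def many_to_one_match(iou, max_per_gt=3, bg_iou=0.5):
    """Hypothetical matcher: iou[i][j] is the overlap of proposal i with GT box j.
    Returns {gt index: [proposal indices]}; unmatched proposals are background."""
    num_props, num_gt = len(iou), len(iou[0])
    src, bg, sink = 0, 1, 2
    prop = lambda i: 3 + i
    gt = lambda j: 3 + num_props + j
    big = 10**6  # assumed larger than any achievable total cost
    edges = [(bg, sink, num_props, 0)]
    for i in range(num_props):
        edges.append((src, prop(i), 1, 0))
        # Going to background costs as much as a match at IoU == bg_iou.
        edges.append((prop(i), bg, 1, round(1000 * (1 - bg_iou))))
        for j in range(num_gt):
            edges.append((prop(i), gt(j), 1, round(1000 * (1 - iou[i][j]))))
    for j in range(num_gt):
        edges.append((gt(j), sink, 1, -big))           # slot that must be filled
        edges.append((gt(j), sink, max_per_gt - 1, 0))  # optional extra slots
    graph = min_cost_flow(3 + num_props + num_gt, edges, src, sink)
    matches = {j: [] for j in range(num_gt)}
    for i in range(num_props):
        for v, cap, _, _ in graph[prop(i)]:
            if v >= 3 + num_props and cap == 0:  # saturated proposal-to-GT arc
                matches[v - 3 - num_props].append(i)
    return matches

# Toy example: proposals 0 and 1 are near-duplicates on GT box 0.
iou = [
    [0.90, 0.05],
    [0.88, 0.10],
    [0.10, 0.40],
    [0.05, 0.30],
    [0.20, 0.15],
]
matches = many_to_one_match(iou)
print(matches)
```

In this toy example both near-duplicate proposals are assigned to box 0, and box 1 still receives its best available proposal even though its overlap is below the background threshold. Setting `max_per_gt=1` reduces the sketch to a one-to-one assignment, in which proposal 1 would be treated as background despite being almost as good as proposal 0.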