Fast and Interpretable Face Identification for Out-Of-Distribution Data Using Vision Transformers


Bibliographic Details
Published in: Proceedings / IEEE Workshop on Applications of Computer Vision, pp. 6289 - 6299
Main Authors: Phan, Hai; Le, Cindy X.; Le, Vu; He, Yihui; Nguyen, Anh Totti
Format: Conference Proceeding
Language: English
Published: IEEE, 03.01.2024

Summary: Most face identification approaches employ a Siamese neural network to compare two images at the image-embedding level. Yet, this technique can be susceptible to occlusion (e.g., faces with masks or sunglasses) and out-of-distribution data. DeepFace-EMD [40] reaches state-of-the-art accuracy on out-of-distribution data by first comparing two images at the image level and then at the patch level. However, its patch-wise re-ranking stage incurs a large O(n^3 log n) time complexity (for n patches in an image) due to the optimal transport optimization. In this paper, we propose a novel 2-image Vision Transformer (ViT) that compares two images at the patch level using cross-attention. After training on 2M pairs of images from CASIA Webface [58], our model achieves accuracy comparable to DeepFace-EMD on out-of-distribution data, yet at an inference speed more than twice as fast as DeepFace-EMD [40]. In addition, via a human study, our model shows promising explainability through the visualization of cross-attention. We believe our work can inspire more explorations in using ViTs for face identification.
ISSN: 2642-9381
DOI: 10.1109/WACV57701.2024.00618
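
To give a flavor of the abstract's core idea (comparing two face images at the patch level via cross-attention rather than optimal transport), here is a minimal sketch. It is not the authors' architecture: the `CrossAttentionMatcher` name, layer sizes, mean pooling, and the linear scoring head are all illustrative assumptions layered on top of standard multi-head cross-attention over ViT patch embeddings.

```python
import torch
import torch.nn as nn

class CrossAttentionMatcher(nn.Module):
    """Illustrative sketch (not the paper's model): score a pair of face
    images by letting patch embeddings of image A attend to those of image B."""

    def __init__(self, dim: int = 384, num_heads: int = 6):
        super().__init__()
        # Standard multi-head cross-attention over patch tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Hypothetical scoring head: one logit for "same identity".
        self.head = nn.Linear(dim, 1)

    def forward(self, patches_a: torch.Tensor, patches_b: torch.Tensor) -> torch.Tensor:
        # patches_a, patches_b: (batch, n_patches, dim), e.g. from a shared ViT backbone.
        attended, _ = self.cross_attn(query=patches_a, key=patches_b, value=patches_b)
        pooled = attended.mean(dim=1)           # pool over patches
        return self.head(pooled).squeeze(-1)    # higher = more likely same identity


# Usage with random stand-in patch embeddings (2 pairs, 196 patches, 384-dim).
matcher = CrossAttentionMatcher()
a = torch.randn(2, 196, 384)
b = torch.randn(2, 196, 384)
print(matcher(a, b).shape)  # torch.Size([2])
```

In a setup like this, the attention weights (discarded above) would be the natural candidates for the kind of cross-attention visualization the abstract credits for the model's explainability.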