Concentrated Reasoning and Unified Reconstruction for Multi-Modal Media Manipulation

Bibliographic Details
Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8190-8194
Main Authors: Zhao, Weichen; Lu, Yuxing; Jiao, Ge; Yang, Yuan
Format: Conference Proceeding
Language: English
Published: IEEE, 14.04.2024
Summary: Detecting and Grounding Multi-Modal Media Manipulation (DGM4) is an emerging task that aims to identify and locate manipulated elements in both textual and visual media. Given the complexity of this task, the model requires more sophisticated reasoning capabilities to align multi-modal features and capture forgery traces. To this end, we propose a Concentrated reasoning and Unified reconstruction framework (CrUr) for DGM4. Instead of adhering to traditional hierarchical reasoning paradigms, we directly carry out all inference tasks using integrated multi-modal features. Specifically, we extract and align features at a finer granularity, capturing subtle differences that may indicate manipulation by leveraging advanced mask signal modeling. Moreover, to adapt to fine-grained reasoning tasks, we design a transformer-based Reconstruction Harmonizer to facilitate more complex interactions among the reconstructed features, ultimately obtaining integrated features. Experimental results on the DGM4 datasets show that our method achieves state-of-the-art performances.
ISSN: 2379-190X
DOI: 10.1109/ICASSP48485.2024.10447651