Concentrated Reasoning and Unified Reconstruction for Multi-Modal Media Manipulation
Published in | ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 8190 - 8194 |
---|---|
Main Authors | , , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 14.04.2024 |
Summary: | Detecting and Grounding Multi-Modal Media Manipulation (DGM⁴) is an emerging task that aims to identify and locate manipulated elements in both textual and visual media. Given the complexity of this task, the model requires more sophisticated reasoning capabilities to align multi-modal features and capture forgery traces. To this end, we propose a Concentrated reasoning and Unified reconstruction framework (CrUr) for DGM⁴. Instead of adhering to traditional hierarchical reasoning paradigms, we directly carry out all inference tasks using integrated multi-modal features. Specifically, we extract and align features at a finer granularity, leveraging advanced mask signal modeling to capture subtle differences that may indicate manipulation. Moreover, to adapt to fine-grained reasoning tasks, we design a transformer-based Reconstruction Harmonizer that facilitates more complex interactions among the reconstructed features and ultimately yields the integrated features. Experimental results on the DGM⁴ dataset show that our method achieves state-of-the-art performance. |
---|---|
ISSN: | 2379-190X |
DOI: | 10.1109/ICASSP48485.2024.10447651 |