Two-Phase Multidocument Summarization Through Content-Attention-Based Subtopic Detection

Bibliographic Details
Published in IEEE Transactions on Computational Social Systems, Vol. 8, no. 6, pp. 1379-1392
Main Authors Dong, Luobing; Satpute, Meghana N.; Wu, Weili; Du, Ding-Zhu
Format Journal Article
Language English
Published Piscataway IEEE 01.12.2021
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)

Summary: Multidocument summarization extracts the main information and ideas from a set of related documents. An extractive solution seeks a small subset of sentences that covers the most important information in the whole document set. Although a large number of machine-learning-based methods have shown great promise, the lack of high-quality training data poses an inherent obstacle to them. Furthermore, because of the proliferation of low-quality documents on the Internet, existing summarization strategies that rely merely on statistical features perform poorly. In this article, we propose a new two-phase multidocument summarization strategy using content-attention-based subtopic detection. First, inspired by the distance-dynamics-based community detection mechanism, we extract subtopics from the set of documents by examining their content attention and their underlying semantic relations. Instead of complicated neural attention mechanisms, we propose a simple iteration-based content attention method to complete the subtopic detection task. Second, we formulate summarization from different subtopics as a combinatorial optimization problem of minimizing sentence distance and maximizing topic diversity. We prove the submodularity of this optimization problem, which allows us to propose a new multidocument summarization algorithm based on the greedy mechanism. Finally, we experimentally validate our new algorithms on the BBC News Summary and wikiHow datasets. The results show that our new algorithms outperform the state-of-the-art methods.
ISSN: 2329-924X
2373-7476
DOI:10.1109/TCSS.2021.3079206
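
Illustrative sketch (not taken from the article): the summary describes a second phase that selects sentences greedily under a submodular objective. The record does not give the authors' objective, so the Python below swaps in a generic, well-known stand-in: a facility-location coverage term plus a square-root reward per subtopic. The function names, the `sim` similarity matrix, the `topic_of` subtopic labels, and the weight `lam` are all assumptions made for illustration only.

```python
import math


def objective(selected, sim, topic_of, lam=1.0):
    """Stand-in objective (an assumption, not the article's formulation).

    coverage : each sentence is credited with its best similarity to the summary
    diversity: sum over subtopics of sqrt(number of selected sentences in it),
               so a second pick from the same subtopic is worth less
    """
    if not selected:
        return 0.0
    coverage = sum(max(row[j] for j in selected) for row in sim)
    per_topic = {}
    for j in selected:
        per_topic[topic_of[j]] = per_topic.get(topic_of[j], 0) + 1
    diversity = sum(math.sqrt(count) for count in per_topic.values())
    return coverage + lam * diversity


def greedy_summary(sim, topic_of, budget, lam=1.0):
    """Repeatedly add the sentence with the largest marginal gain."""
    selected = []
    current = 0.0
    while len(selected) < budget:
        best_gain, best_j = 0.0, None
        for j in range(len(sim)):
            if j in selected:
                continue
            gain = objective(selected + [j], sim, topic_of, lam) - current
            if gain > best_gain:  # strict: ties keep the earlier sentence
                best_gain, best_j = gain, j
        if best_j is None:  # no candidate improves the objective
            break
        selected.append(best_j)
        current += best_gain
    return selected


# Hypothetical toy input: sentences 0 and 1 are near-duplicates in subtopic 0,
# sentences 2 and 3 belong to subtopic 1.
toy_sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
toy_topics = [0, 0, 1, 1]
toy_result = greedy_summary(toy_sim, toy_topics, budget=2)
```

On this toy input the second pick comes from the other subtopic, because a near-duplicate of the first sentence adds little coverage and earns a smaller diversity reward. For monotone submodular objectives under a cardinality budget, greedy selection of this kind is known to come within a factor of 1 - 1/e of the optimum, which is the usual reason for proving submodularity first.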