An Information Distillation Framework for Extractive Summarization

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 26, No. 1, pp. 161-170
Main Authors: Chen, Kuan-Yu; Liu, Shih-Hung; Chen, Berlin; Wang, Hsin-Min
Format: Journal Article
Language: English
Published: IEEE, 01.01.2018
Subjects: Context modeling; distilling; Neural networks; paragraph embedding; Predictive models; Representation learning; Speech; Speech processing; summarization; Training; unsupervised
Online Access: https://ieeexplore.ieee.org/document/8074745
ISSN: 2329-9290
EISSN: 2329-9304
DOI: 10.1109/TASLP.2017.2764545

Abstract: In the context of natural language processing, representation learning has emerged as an active research area because of its excellent performance in many applications. Learning representations of words was the pioneering work in this line of research. However, paragraph (i.e., sentence and document) embedding is better suited to realistic tasks such as document summarization. Nevertheless, classic paragraph embedding methods infer the representation of a given paragraph from all of the words occurring in it; consequently, frequently occurring stop and function words can mislead the embedding learning process and yield a blurred paragraph representation. Motivated by these observations, our major contributions in this paper are threefold. First, we propose a novel unsupervised paragraph embedding method, the essence vector (EV) model, which aims not only to distill the most representative information from a paragraph but also to exclude general background information, producing a more informative low-dimensional vector representation for the paragraph of interest. Second, in view of the increasing importance of spoken-content processing, we propose an extension of the EV model, the denoising essence vector (D-EV) model, which inherits the advantages of the EV model while inferring a representation for a spoken paragraph that is robust to imperfect speech recognition. Third, we introduce a new summarization framework that takes both relevance and redundancy information into account simultaneously. We evaluate the proposed embedding methods (EV and D-EV) and the summarization framework on two benchmark summarization corpora. The experimental results demonstrate the effectiveness and applicability of the proposed framework relative to several well-practiced and state-of-the-art summarization methods.
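
The abstract sketches two ideas concrete enough to illustrate: an embedding that suppresses general background information, and sentence selection that weighs relevance against redundancy. Below is a minimal Python sketch of both, assuming pretrained sentence/paragraph vectors are already available. The background-subtraction step and the greedy MMR-style selector are illustrative stand-ins chosen for brevity, not the authors' actual EV/D-EV training procedure or their summarization framework.

    import numpy as np

    def cosine(a, b):
        # Cosine similarity with a small epsilon to avoid division by zero.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def distill(paragraph_vec, background_vec):
        # Remove the component of a paragraph vector that lies along the
        # corpus-level background direction, keeping its distinctive part.
        # Illustrative assumption: the paper learns this separation with a
        # dedicated unsupervised model rather than by simple projection.
        b = background_vec / (np.linalg.norm(background_vec) + 1e-12)
        return paragraph_vec - np.dot(paragraph_vec, b) * b

    def select_summary(sentence_vecs, doc_vec, k=3, lam=0.7):
        # Greedy selection trading relevance to the document against
        # redundancy with already-chosen sentences (an MMR-style stand-in
        # for the paper's joint relevance/redundancy framework).
        selected = []
        remaining = list(range(len(sentence_vecs)))
        while remaining and len(selected) < k:
            def score(i):
                relevance = cosine(sentence_vecs[i], doc_vec)
                redundancy = max((cosine(sentence_vecs[i], sentence_vecs[j])
                                  for j in selected), default=0.0)
                return lam * relevance - (1.0 - lam) * redundancy
            best = max(remaining, key=score)
            selected.append(best)
            remaining.remove(best)
        return selected

    # Usage with toy vectors: the background is approximated here as the
    # mean of all sentence vectors (another illustrative assumption).
    rng = np.random.default_rng(0)
    sents = rng.normal(size=(10, 50))
    background = sents.mean(axis=0)
    sents_distilled = np.array([distill(v, background) for v in sents])
    doc_vec = sents_distilled.mean(axis=0)
    print(select_summary(sents_distilled, doc_vec, k=3))

The parameter lam controls the relevance/redundancy trade-off: at lam=1.0 the selector ignores redundancy entirely, while smaller values increasingly penalize sentences similar to ones already chosen.
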
CODEN: ITASD8
Discipline: Engineering
Peer Reviewed: true
Scholarly: true
License: https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
ORCID: 0000-0001-9656-7551 (Chen, Kuan-Yu); 0000-0003-3599-5071 (Wang, Hsin-Min)