An Information Distillation Framework for Extractive Summarization

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 26, No. 1, pp. 161-170
Main Authors: Chen, Kuan-Yu; Liu, Shih-Hung; Chen, Berlin; Wang, Hsin-Min
Format: Journal Article
Language: English
Published: IEEE, 01.01.2018
Subjects: Context modeling; distilling; Neural networks; paragraph embedding; Predictive models; Representation learning; Speech; Speech processing; summarization; Training; unsupervised
Online Access: https://ieeexplore.ieee.org/document/8074745
ISSN: 2329-9290
EISSN: 2329-9304
DOI: 10.1109/TASLP.2017.2764545

Abstract: In the context of natural language processing, representation learning has emerged as an active research area because of its excellent performance in many applications. Learning representations of words was the pioneering work in this line of research. However, paragraph (i.e., sentence and document) embedding is better suited to realistic tasks such as document summarization. Nevertheless, classic paragraph embedding methods infer the representation of a given paragraph from all of the words occurring in it; consequently, frequently occurring stop and function words can mislead the embedding learning process and yield a blurred paragraph representation. Motivated by these observations, our major contributions in this paper are threefold. First, we propose a novel unsupervised paragraph embedding method, the essence vector (EV) model, which aims not only to distill the most representative information from a paragraph but also to exclude general background information, producing a more informative low-dimensional vector representation for the paragraph of interest. Second, in view of the increasing importance of spoken-content processing, we propose an extension of the EV model, the denoising essence vector (D-EV) model, which inherits the advantages of the EV model while inferring a representation for a spoken paragraph that is robust to imperfect speech recognition. Third, we introduce a new summarization framework that takes both relevance and redundancy information into account simultaneously. We evaluate the proposed embedding methods (EV and D-EV) and the summarization framework on two benchmark summarization corpora. The experimental results demonstrate the effectiveness and applicability of the proposed framework relative to several well-practiced and state-of-the-art summarization methods.
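
The abstract sketches two ideas concrete enough to illustrate: an embedding that suppresses general background information, and sentence selection that weighs relevance against redundancy. Below is a minimal Python sketch of both, assuming pretrained sentence/paragraph vectors are already available. The background-subtraction step and the greedy MMR-style selector are illustrative stand-ins chosen for brevity, not the authors' actual EV/D-EV training procedure or their summarization framework.

    import numpy as np

    def cosine(a, b):
        # Cosine similarity with a small epsilon to avoid division by zero.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    def distill(paragraph_vec, background_vec):
        # Remove the component of a paragraph vector that lies along the
        # corpus-level background direction, keeping its distinctive part.
        # Illustrative assumption: the paper learns this separation with a
        # dedicated unsupervised model rather than by simple projection.
        b = background_vec / (np.linalg.norm(background_vec) + 1e-12)
        return paragraph_vec - np.dot(paragraph_vec, b) * b

    def select_summary(sentence_vecs, doc_vec, k=3, lam=0.7):
        # Greedy selection trading relevance to the document against
        # redundancy with already-chosen sentences (an MMR-style stand-in
        # for the paper's joint relevance/redundancy framework).
        selected = []
        remaining = list(range(len(sentence_vecs)))
        while remaining and len(selected) < k:
            def score(i):
                relevance = cosine(sentence_vecs[i], doc_vec)
                redundancy = max((cosine(sentence_vecs[i], sentence_vecs[j])
                                  for j in selected), default=0.0)
                return lam * relevance - (1.0 - lam) * redundancy
            best = max(remaining, key=score)
            selected.append(best)
            remaining.remove(best)
        return selected

    # Usage with toy vectors: the background is approximated here as the
    # mean of all sentence vectors (another illustrative assumption).
    rng = np.random.default_rng(0)
    sents = rng.normal(size=(10, 50))
    background = sents.mean(axis=0)
    sents_distilled = np.array([distill(v, background) for v in sents])
    doc_vec = sents_distilled.mean(axis=0)
    print(select_summary(sents_distilled, doc_vec, k=3))

The parameter lam controls the relevance/redundancy trade-off: at lam=1.0 the selector ignores redundancy entirely, while smaller values increasingly penalize sentences similar to ones already chosen.
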
CODEN: ITASD8
Discipline: Engineering
Peer Reviewed: true
Scholarly: true
License: https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
ORCID: 0000-0001-9656-7551 (Chen, Kuan-Yu); 0000-0003-3599-5071 (Wang, Hsin-Min)