CoCITe-Coordinating Changes in Text

Text streams are ubiquitous and contain a wealth of information, but are typically orders of magnitude too large in scale for comprehensive human inspection. There is a need for tools that can detect and group changes occurring within text streams and substreams, in order to find, structure, and sum...

Bibliographic Details
Published in IEEE Transactions on Knowledge and Data Engineering Vol. 24; no. 1; pp. 15-29
Main Authors Wright, J. H., Grothendieck, J.
Format Journal Article
Language English
Published New York IEEE 01.01.2012
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Abstract Text streams are ubiquitous and contain a wealth of information, but are typically orders of magnitude too large in scale for comprehensive human inspection. There is a need for tools that can detect and group changes occurring within text streams and substreams, in order to find, structure, and summarize these changes for presentation to human analysts. This paper describes a procedure for efficiently finding step changes, trends, bursts, and cyclic changes affecting frequencies of words, or more general lexical items, within streams of documents which may be optionally labeled with metadata. The common phenomenon of over-dispersion is accommodated using mixture distributions. A streaming implementation is described which can process data from a continuous feed. Anomalies can be detected, grouped, and rendered visually for human comprehension.
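Illustration (not part of the record): the abstract mentions finding step changes in word frequencies. The sketch below is a minimal, hypothetical example of that general idea, assuming plain Poisson counts and at most one step change. It is not the authors' procedure, which the abstract says also handles trends, bursts, cyclic changes, and over-dispersion via mixture distributions. The names `poisson_loglik` and `best_step_change` and the sample data are invented for this illustration.

```python
import math


def poisson_loglik(counts, exposures):
    """Maximized Poisson log-likelihood of one segment with a single rate.

    Terms that do not depend on the rate (log-factorials, log-exposures)
    are dropped, since they cancel when comparing segmentations.
    """
    total_count = sum(counts)
    total_exposure = sum(exposures)
    if total_count == 0:
        return 0.0  # limit of C*log(C/E) - C as C -> 0
    rate = total_count / total_exposure
    return total_count * math.log(rate) - rate * total_exposure


def best_step_change(counts, exposures):
    """Return (index, gain) for the single split that most improves the fit.

    `index` is the first position of the second segment; `gain` is the
    log-likelihood improvement over a constant rate. Returns (None, 0.0)
    when no split improves on the constant-rate model.
    """
    base = poisson_loglik(counts, exposures)
    best_index, best_gain = None, 0.0
    for k in range(1, len(counts)):
        gain = (poisson_loglik(counts[:k], exposures[:k])
                + poisson_loglik(counts[k:], exposures[k:])
                - base)
        if gain > best_gain:
            best_index, best_gain = k, gain
    return best_index, best_gain


# Hypothetical data: daily occurrences of one word, and total words per day.
word_counts = [2, 3, 2, 3, 2, 12, 11, 13, 12, 10]
total_words = [1000] * 10
index, gain = best_step_change(word_counts, total_words)
print(index, round(gain, 2))
```

In practice the gain would be compared against a threshold or a penalized criterion before reporting a change; that choice is left open here because the record does not describe it.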
Author Grothendieck, J.
Wright, J. H.
Author_xml – sequence: 1
  givenname: J. H.
  surname: Wright
  fullname: Wright, J. H.
  email: jwright@research.att.com
  organization: AT&T Labs.-Res., Florham Park, NJ, USA
– sequence: 2
  givenname: J.
  surname: Grothendieck
  fullname: Grothendieck, J.
  email: jgrothen@bbn.com
  organization: Raytheon BBN Technol., Columbia, MD, USA
CODEN ITKEEH
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Jan 2012
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Jan 2012
DOI 10.1109/TKDE.2010.250
Discipline Engineering
Computer Science
EISSN 1558-2191
EndPage 29
Genre orig-research
ISSN 1041-4347
IsPeerReviewed true
IsScholarly true
Issue 1
Language English
PageCount 15
PublicationDate 2012-Jan.
2012-01-00
20120101
PublicationDateYYYYMMDD 2012-01-01
PublicationDate_xml – month: 01
  year: 2012
  text: 2012-Jan.
PublicationDecade 2010
PublicationPlace New York
PublicationPlace_xml – name: New York
PublicationTitle IEEE transactions on knowledge and data engineering
PublicationTitleAbbrev TKDE
PublicationYear 2012
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
SubjectTerms Data models
Dynamic programming
Heuristic algorithms
modeling structured
Multimedia communication
Statistical analysis
Statistical software
Text mining
textual and multimedia data
Time frequency analysis
Title CoCITe-Coordinating Changes in Text
URI https://ieeexplore.ieee.org/document/5674040
https://www.proquest.com/docview/913728042/abstract/
Volume 24