RECOUNT: EXPECTATION MAXIMIZATION BASED ERROR CORRECTION TOOL FOR NEXT GENERATION SEQUENCING DATA

Next generation sequencing technologies enable rapid, large-scale production of sequence data sets. Unfortunately, these technologies also have a non-negligible sequencing error rate, which biases their outputs by introducing false reads and reducing the quantity of the real reads. Although methods dev...

Bibliographic Details
Published in	Genome Informatics 2009 Vol. 23; no. 1; pp. 189 - 201
Main Authors	WIJAYA, EDWARD; FRITH, MARTIN C.; SUZUKI, YUTAKA; HORTON, PAUL
Format	Book Chapter; Journal Article
Language	English
Published	Japan PUBLISHED BY IMPERIAL COLLEGE PRESS AND DISTRIBUTED BY WORLD SCIENTIFIC PUBLISHING CO 01.10.2009
Subjects	Genome; Models, Statistical; Part A Full Papers; Probability; Sequence Analysis, DNA - methods; transcriptomics; tag count correction; sequence analysis; next generation sequencing
ISBN	9781848165625 1848165633 9781848165632 9781908978011 1848165625 1908978015
ISSN	0919-9454
DOI	10.1142/9781848165632_0018

Abstract	Next generation sequencing technologies enable rapid, large-scale production of sequence data sets. Unfortunately, these technologies also have a non-negligible sequencing error rate, which biases their outputs by introducing false reads and reducing the quantity of the real reads. Although methods developed for SAGE data can reduce these false counts to a considerable degree, until now they have not been implemented in a scalable way. Recently, a program named FREC has been developed to address this problem for next generation sequencing data. In this paper, we introduce RECOUNT, our implementation of an Expectation Maximization algorithm for tag count correction, and compare it to FREC. Using both the reference genome and simulated data, we find that RECOUNT performs as well as or better than FREC, while using much less memory (e.g. 5GB vs. 75GB). Furthermore, we report the first analysis of tag count correction with real data in the context of gene expression analysis. Our results show that tag count correction not only increases the number of mappable tags, but can make a real difference in the biological interpretation of next generation sequencing data. RECOUNT is an open-source C++ program available at http://seq.cbrc.jp/recount.
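
The abstract names Expectation Maximization for tag count correction but gives no algorithmic detail. The following is a minimal illustrative sketch of a generic EM scheme of that kind, not RECOUNT's actual code or model. It assumes (these are assumptions, not statements from the record) that all tags have equal length, that every base is misread with one uniform probability `error_rate`, that only single-substitution errors are considered, and that the candidate true tags are exactly the observed tags. The function names are hypothetical.

```python
from collections import defaultdict


def hamming_neighbors(tag, alphabet="ACGT"):
    """Yield every sequence that differs from `tag` at exactly one position."""
    for pos, base in enumerate(tag):
        for alt in alphabet:
            if alt != base:
                yield tag[:pos] + alt + tag[pos + 1:]


def em_correct_counts(observed, error_rate=0.01, iterations=50):
    """Estimate corrected tag counts from observed counts with a simple EM loop.

    Assumed error model (illustrative only): a read of a true tag is either
    error-free or carries exactly one substitution, each base being misread
    with probability `error_rate`, uniformly over the three other bases.
    """
    tags = list(observed)
    length = len(tags[0])  # assumption: all tags share one length
    total = sum(observed.values())

    p_same = (1 - error_rate) ** length
    p_one = (error_rate / 3) * (1 - error_rate) ** (length - 1)

    # For each observed tag, list the tags it could have originated from:
    # itself, plus any observed tag at Hamming distance one.
    sources = {
        t: [(t, p_same)]
        + [(n, p_one) for n in hamming_neighbors(t) if n in observed]
        for t in tags
    }

    # Start from the observed proportions.
    theta = {t: observed[t] / total for t in tags}

    for _ in range(iterations):
        expected = defaultdict(float)
        # E-step: split each observed count among its possible source tags.
        for t, count in observed.items():
            weights = [(s, theta[s] * p) for s, p in sources[t]]
            norm = sum(w for _, w in weights)
            if norm == 0.0:  # no source has any support; nothing to assign
                continue
            for s, w in weights:
                expected[s] += count * w / norm
        # M-step: re-estimate the true tag proportions.
        theta = {t: expected[t] / total for t in tags}

    return {t: theta[t] * total for t in tags}


if __name__ == "__main__":
    counts = {"ACGT": 1000, "ACGA": 1, "TTTT": 5}
    for tag, value in em_correct_counts(counts).items():
        print(tag, round(value, 3))
```

In this toy input the single read of `ACGA` is reassigned almost entirely to its abundant neighbor `ACGT`, while `TTTT`, which has no observed neighbor at distance one, keeps its count. The total number of reads is preserved.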
Author	WIJAYA, EDWARD; FRITH, MARTIN C.; SUZUKI, YUTAKA; HORTON, PAUL
Author_xml	– sequence: 1 givenname: EDWARD surname: WIJAYA fullname: WIJAYA, EDWARD email: e-wijaya@aist.go.jp organization: AIST, Computational Biology Research Center, 2-42 Aomi, Koutou-Ku, Tokyo 135-0064
	– sequence: 2 givenname: MARTIN C. surname: FRITH fullname: FRITH, MARTIN C. email: m.frith@aist.go.jp organization: AIST, Computational Biology Research Center, 2-42 Aomi, Koutou-Ku, Tokyo 135-0064
	– sequence: 3 givenname: YUTAKA surname: SUZUKI fullname: SUZUKI, YUTAKA email: ysuzuki@k.u-tokyo.ac.jp organization: Department of Medical Genome Sciences, Graduate School of Frontier Sciences, the University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8562
	– sequence: 4 givenname: PAUL surname: HORTON fullname: HORTON, PAUL email: horton-p@aist.go.jp organization: AIST, Computational Biology Research Center, 2-42 Aomi, Koutou-Ku, Tokyo 135-0064
Copyright	Japanese Society for Bioinformatics
Discipline	Biology
EISBN	1848165633 9781848165632 9781908978011 1908978015
Editor	Lee, Sang Yup; Morishita, Shinichi; Sakakibara, Yasubumi
Editor_xml	– sequence: 1 givenname: Shinichi surname: Morishita fullname: Morishita, Shinichi organization: University of Tokyo
	– sequence: 2 givenname: Sang Yup surname: Lee fullname: Lee, Sang Yup organization: Korea Advanced Institute of Science & Technology
	– sequence: 3 givenname: Yasubumi surname: Sakakibara fullname: Sakakibara, Yasubumi organization: Keio University
EndPage	201
Genre	Research Support, Non-U.S. Gov't; Journal Article
IsPeerReviewed	false
IsScholarly	true
Issue	1
Keywords	transcriptomics; tag count correction; sequence analysis; next generation sequencing
MeetingName	Proceedings of the 20th International Conference
PMID	20180274
PageCount	13
PublicationCentury	2000
PublicationDate	20091000
PublicationDateYYYYMMDD	2009-10-01
PublicationDate_xml	– month: 10 year: 2009 text: 20091000
PublicationDecade	2000
PublicationPlace	Japan
PublicationPlace_xml	– name: Japan
PublicationSubtitle	Genome Informatics Series Vol. 23
PublicationTitle	Genome Informatics 2009
PublicationTitleAlternate	Genome Inform
PublicationYear	2009
Publisher	PUBLISHED BY IMPERIAL COLLEGE PRESS AND DISTRIBUTED BY WORLD SCIENTIFIC PUBLISHING CO
Publisher_xml	– name: PUBLISHED BY IMPERIAL COLLEGE PRESS AND DISTRIBUTED BY WORLD SCIENTIFIC PUBLISHING CO
StartPage	189
SubjectTerms	Genome; Models, Statistical; Part A Full Papers; Probability; Sequence Analysis, DNA - methods
Title	RECOUNT: EXPECTATION MAXIMIZATION BASED ERROR CORRECTION TOOL FOR NEXT GENERATION SEQUENCING DATA
URI	https://www.worldscientific.com/doi/10.1142/9781848165632_0018 https://www.ncbi.nlm.nih.gov/pubmed/20180274 https://www.proquest.com/docview/733115748
Volume	23