Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm
Published in | arXiv.org |
---|---|
Main Authors | Firtina, Can; Kim, Jeremie S; Alser, Mohammed; Cali, Damla Senol; Cicek, A Ercument; Alkan, Can; Mutlu, Onur |
Format | Paper Journal Article |
Language | English |
Published | Ithaca: Cornell University Library, arXiv.org, 07.03.2020 |
Subjects | |
Abstract | Long reads produced by third-generation sequencing technologies are used to construct an assembly (i.e., the subject's genome), which is further used in downstream genome analysis. Unfortunately, long reads have high sequencing error rates and a large proportion of bps in these long reads are incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e., read-to-assembly alignment information). However, assembly polishing algorithms can only polish an assembly using reads either from a certain sequencing technology or from a small assembly. Such technology-dependency and assembly-size dependency require researchers to 1) run multiple polishing algorithms and 2) use small chunks of a large genome to use all available read sets and polish large genomes. We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e., both large and small genomes) using reads from all sequencing technologies (i.e., second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo 1) models an assembly as a profile hidden Markov model (pHMM), 2) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm, and 3) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that Apollo is the only algorithm that 1) uses reads from any sequencing technology within a single run and 2) scales well to polish large assemblies without splitting the assembly into multiple parts. |
---|---|
Published paper | Bioinformatics. 2020 Jun 1;36(12):3669-3679 |
Author | Firtina, Can; Kim, Jeremie S; Alser, Mohammed; Cali, Damla Senol; Cicek, A Ercument; Alkan, Can; Mutlu, Onur |
BackLink | https://doi.org/10.48550/arXiv.1902.04341 (paper in arXiv); https://doi.org/10.1093/bioinformatics/btaa179 (published paper; access to full text may be restricted) |
ContentType | Paper Journal Article |
Copyright | 2020. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
DOI | 10.48550/arxiv.1902.04341 |
Discipline | Physics |
EISSN | 2331-8422 |
ExternalDocumentID | 1902_04341 |
Genre | Working Paper/Pre-Print |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
PublicationDateYYYYMMDD | 2020-03-07 |
PublicationPlace | Ithaca |
PublicationTitle | arXiv.org |
PublicationYear | 2020 |
Publisher | Cornell University Library, arXiv.org |
SecondaryResourceType | preprint |
SourceID | arxiv proquest |
SourceType | Open Access Repository Aggregation Database |
SubjectTerms | Algorithms; Alignment; Assemblies; Assembly; Computer Science - Computational Engineering, Finance, and Science; Computer Science - Learning; Dependence; Error analysis; Gene sequencing; Genomes; Markov chains; Polishes; Polishing; Quantitative Biology - Genomics; State of the art; Viterbi algorithm detectors |
Title | Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm |
URI | https://www.proquest.com/docview/2179187951 https://arxiv.org/abs/1902.04341 |
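
The abstract names two standard hidden Markov model routines: the Forward-Backward algorithm (used for training) and the Viterbi algorithm (used for decoding). The sketch below illustrates both on a generic two-state toy HMM. It is not Apollo's profile HMM and is not taken from the paper: the state names (`AT`, `GC`), all probabilities, and the function names `viterbi` and `posteriors` are assumptions chosen only for illustration, and the parameter re-estimation step of training is omitted.

```python
from math import log

# Hypothetical toy model (assumption): two hidden states that prefer different bases.
states = ("AT", "GC")
start_p = {"AT": 0.5, "GC": 0.5}
trans_p = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
emit_p = {
    "AT": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}
obs = "AATTGCGC"


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the single most likely hidden-state path (log-space Viterbi)."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] + log(trans_p[p][s]))
            best[t][s] = best[t - 1][prev] + log(trans_p[prev][s]) + log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


def posteriors(obs, states, start_p, trans_p, emit_p):
    """Forward-Backward: P(state at step t | whole observation) for every t."""
    # Forward pass: joint probability of the prefix and the current state.
    fwd = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for o in obs[1:]:
        fwd.append({s: emit_p[s][o] * sum(fwd[-1][p] * trans_p[p][s] for p in states)
                    for s in states})
    # Backward pass: probability of the suffix given the current state.
    bwd = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        bwd.insert(0, {p: sum(trans_p[p][s] * emit_p[s][o] * bwd[0][s] for s in states)
                       for p in states})
    total = sum(fwd[-1].values())  # likelihood of the whole observation
    return [{s: f[s] * b[s] / total for s in states} for f, b in zip(fwd, bwd)]


if __name__ == "__main__":
    print(viterbi(obs, states, start_p, trans_p, emit_p))
    print(posteriors(obs, states, start_p, trans_p, emit_p)[0])
```

In a training loop, posteriors like these would typically feed an expectation-maximization update of the transition and emission probabilities; this unscaled version is only suitable for short sequences, since the forward and backward products underflow on long ones.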