WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads
Published in | Journal of computational biology Vol. 22; no. 6; p. 498 |
---|---|
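Main Authors | Patterson, Murray; Marschall, Tobias; Pisanti, Nadia; van Iersel, Leo; Stougie, Leen; Klau, Gunnar W; Schönhuth, Alexander |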
Format | Journal Article |
Language | English |
Published | United States, 01.06.2015 |
ISSN | 1557-8666 |
DOI | 10.1089/cmb.2014.0157 |
Abstract | The human genome is diploid, which requires assigning heterozygous single nucleotide polymorphisms (SNPs) to the two copies of the genome. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which are oblivious to direct read information, constitute the state-of-the-art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes of genome-wide phasing. While future-technology sequencing reads will contain sufficient amounts of SNPs per read for phasing, they are also likely to suffer from higher sequencing error rates. Currently, no haplotype assembly approaches exist that allow for taking both increasing read length and sequencing error information into account. Here, we suggest WhatsHap, the first approach that yields provably optimal solutions to the weighted minimum error correction problem in runtime linear in the number of SNPs. WhatsHap is a fixed parameter tractable (FPT) approach with coverage as the parameter. We demonstrate that WhatsHap can handle datasets of coverage up to 20×, and that 15× are generally enough for reliably phasing long reads, even at significantly elevated sequencing error rates. We also find that the switch and flip error rates of the haplotypes we output are favorable when comparing them with state-of-the-art statistical phasers. |
---|---|
Author | Patterson, Murray (Laboratoire de Biométrie et Biologie Évolutive, LBBE: UMR CNRS 5558, Université de Lyon 1, Villeurbanne, France); Marschall, Tobias (Max Planck Institute for Informatics, Saarbrücken, Germany); Pisanti, Nadia (Erable Team, INRIA); van Iersel, Leo; Stougie, Leen (Erable Team, INRIA); Klau, Gunnar W (Erable Team, INRIA); Schönhuth, Alexander |
Discipline | Biology; Mathematics |
EISSN | 1557-8666 |
ExternalDocumentID | 25658651 |
Genre | Research Support, Non-U.S. Gov't; Journal Article |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 6 |
Keywords | dynamic programming; algorithms; combinatorial optimization; haplotypes; next generation sequencing |
OpenAccessLink | https://inria.hal.science/hal-01225988 |
PMID | 25658651 |
PublicationTitle | Journal of computational biology |
PublicationTitleAlternate | J Comput Biol |
StartPage | 498 |
SubjectTerms | Diploidy; Genetics, Population - methods; Genome, Human - genetics; Haplotypes - genetics; High-Throughput Nucleotide Sequencing - methods; Humans; Polymorphism, Single Nucleotide - genetics; Sequence Analysis, DNA - methods |
Title | WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads |
URI | https://www.ncbi.nlm.nih.gov/pubmed/25658651 |
Volume | 22 |
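
The weighted minimum error correction (wMEC) problem named in the abstract can be illustrated on a toy instance. The sketch below is not the WhatsHap algorithm: it is a plain brute-force search over all bipartitions of the reads, exponential in the number of reads, whereas the abstract describes an approach that is linear in the number of SNPs with coverage as the fixed parameter. The data layout (each read as a list of `(allele, weight)` pairs, `None` where the read does not cover a SNP), the function names, and the example weights are all assumptions made for illustration.

```python
from itertools import product


def wmec_cost(reads, assignment):
    """Total weight of entries that must be flipped so that the reads on
    each side of the bipartition agree at every SNP position."""
    total = 0
    n_snps = len(reads[0])
    for side in (0, 1):
        for j in range(n_snps):
            # cost[a]: weight to flip if this side's haplotype carries allele a
            cost = [0, 0]
            for read, s in zip(reads, assignment):
                if s != side or read[j] is None:
                    continue
                allele, weight = read[j]
                cost[1 - allele] += weight
            total += min(cost)
    return total


def solve_wmec_brute_force(reads):
    """Try every assignment of reads to the two haplotypes (toy sizes only)."""
    best = None
    for assignment in product((0, 1), repeat=len(reads)):
        c = wmec_cost(reads, assignment)
        if best is None or c < best[0]:
            best = (c, assignment)
    return best


# Hypothetical instance: 4 reads over 3 SNPs; weights stand in for
# confidence values (a low weight marks a likely sequencing error).
reads = [
    [(0, 10), (0, 10), None],
    [(1, 10), (1, 10), (1, 10)],
    [(0, 10), (1, 2), (0, 10)],
    [None, (1, 10), (1, 10)],
]

best_cost, best_assignment = solve_wmec_brute_force(reads)
print(best_cost, best_assignment)
```

In this instance the cheapest correction flips the single low-weight entry of the third read, which groups reads 1 and 3 on one haplotype and reads 2 and 4 on the other.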