WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads

Bibliographic Details
Published in	Journal of computational biology Vol. 22; no. 6; p. 498
Main Authors	Patterson, Murray; Marschall, Tobias; Pisanti, Nadia; van Iersel, Leo; Stougie, Leen; Klau, Gunnar W; Schönhuth, Alexander
Format	Journal Article
Language	English
Published	United States 01.06.2015
Subjects	Diploidy; Genetics, Population - methods; Genome, Human - genetics; Haplotypes - genetics; High-Throughput Nucleotide Sequencing - methods; Humans; Polymorphism, Single Nucleotide - genetics; Sequence Analysis, DNA - methods; dynamic programming; algorithms; combinatorial optimization; haplotypes; next generation sequencing
ISSN	1557-8666
DOI	10.1089/cmb.2014.0157

Abstract	The human genome is diploid, which requires assigning heterozygous single nucleotide polymorphisms (SNPs) to the two copies of the genome. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which are oblivious to direct read information, constitute the state-of-the-art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes of genome-wide phasing. While future-technology sequencing reads will contain sufficient amounts of SNPs per read for phasing, they are also likely to suffer from higher sequencing error rates. Currently, no haplotype assembly approaches exist that allow for taking both increasing read length and sequencing error information into account. Here, we suggest WhatsHap, the first approach that yields provably optimal solutions to the weighted minimum error correction problem in runtime linear in the number of SNPs. WhatsHap is a fixed parameter tractable (FPT) approach with coverage as the parameter. We demonstrate that WhatsHap can handle datasets of coverage up to 20×, and that 15× are generally enough for reliably phasing long reads, even at significantly elevated sequencing error rates. We also find that the switch and flip error rates of the haplotypes we output are favorable when comparing them with state-of-the-art statistical phasers.
Author affiliations	Patterson, Murray: Laboratoire de Biométrie et Biologie Évolutive (LBBE: UMR CNRS 5558), Université de Lyon 1, Villeurbanne, France; Marschall, Tobias: Max Planck Institute for Informatics, Saarbrücken, Germany; Pisanti, Nadia: Erable Team, INRIA; van Iersel, Leo: (none listed); Stougie, Leen: Erable Team, INRIA; Klau, Gunnar W: Erable Team, INRIA; Schönhuth, Alexander: (none listed)
Discipline	Biology Mathematics
Genre	Research Support, Non-U.S. Gov't Journal Article
IsDoiOpenAccess	false
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
OpenAccessLink	https://inria.hal.science/hal-01225988
PMID	25658651
URI	https://www.ncbi.nlm.nih.gov/pubmed/25658651