WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads

Bibliographic Details
Published in Journal of Computational Biology Vol. 22; no. 6; p. 498
Main Authors Patterson, Murray; Marschall, Tobias; Pisanti, Nadia; van Iersel, Leo; Stougie, Leen; Klau, Gunnar W; Schönhuth, Alexander
Format Journal Article
Language English
Published United States 01.06.2015
ISSN 1557-8666
DOI 10.1089/cmb.2014.0157


Abstract The human genome is diploid, which requires assigning heterozygous single nucleotide polymorphisms (SNPs) to the two copies of the genome. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which are oblivious to direct read information, constitute the state-of-the-art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes of genome-wide phasing. While future-technology sequencing reads will contain sufficient amounts of SNPs per read for phasing, they are also likely to suffer from higher sequencing error rates. Currently, no haplotype assembly approaches exist that allow for taking both increasing read length and sequencing error information into account. Here, we suggest WhatsHap, the first approach that yields provably optimal solutions to the weighted minimum error correction problem in runtime linear in the number of SNPs. WhatsHap is a fixed parameter tractable (FPT) approach with coverage as the parameter. We demonstrate that WhatsHap can handle datasets of coverage up to 20×, and that 15× are generally enough for reliably phasing long reads, even at significantly elevated sequencing error rates. We also find that the switch and flip error rates of the haplotypes we output are favorable when comparing them with state-of-the-art statistical phasers.
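The weighted minimum error correction (wMEC) objective named in the abstract can be illustrated with a small sketch. This is not the WhatsHap algorithm: it is a brute-force search over all read bipartitions, exponential in the number of reads, whereas the abstract describes an approach that is linear in the number of SNPs with coverage as the parameter. The read data, the weights, and the names `partition_cost` and `brute_force_wmec` are all assumptions made up for illustration, and the sketch does not require the two haplotypes to be complementary.

```python
from itertools import product

# Toy input (invented for illustration): each read maps a SNP position to
# (allele, weight). The allele is 0 or 1; the weight is the assumed cost of
# flipping that allele, e.g. derived from a base quality score.
reads = [
    {0: (0, 10), 1: (0, 10), 2: (1, 10)},
    {0: (0, 10), 1: (0, 10)},
    {1: (1, 10), 2: (0, 10)},
    {0: (1, 10), 1: (1, 10), 2: (0, 10)},
    {0: (0, 2), 1: (1, 10), 2: (0, 10)},
]


def partition_cost(reads, assignment):
    """Weighted cost of making both read sets conflict-free.

    assignment[i] is 0 or 1 and names the side that read i is placed on.
    At each position, each side keeps its heavier allele and pays the
    total weight of the lighter one.
    """
    positions = {pos for read in reads for pos in read}
    total = 0
    for pos in positions:
        for side in (0, 1):
            allele_weight = [0, 0]
            for read, read_side in zip(reads, assignment):
                if read_side == side and pos in read:
                    allele, weight = read[pos]
                    allele_weight[allele] += weight
            total += min(allele_weight)
    return total


def brute_force_wmec(reads):
    """Try every bipartition of the reads and return (cost, assignment)."""
    best = None
    for assignment in product((0, 1), repeat=len(reads)):
        cost = partition_cost(reads, assignment)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


best_cost, best_assignment = brute_force_wmec(reads)
print(best_cost, best_assignment)
```

In this toy input the last read disagrees with the fourth read only at a low-weight entry, so the cheapest correction flips that single entry instead of moving the read to the other side.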
Author Affiliations
1. Patterson, Murray: Laboratoire de Biométrie et Biologie Évolutive (LBBE: UMR CNRS 5558), Université de Lyon 1, Villeurbanne, France
2. Marschall, Tobias: Max Planck Institute for Informatics, Saarbrücken, Germany
3. Pisanti, Nadia: Erable Team, INRIA
4. van Iersel, Leo
5. Stougie, Leen: Erable Team, INRIA
6. Klau, Gunnar W: Erable Team, INRIA
7. Schönhuth, Alexander
Discipline Biology
Mathematics
EISSN 1557-8666
ExternalDocumentID 25658651
Genre Research Support, Non-U.S. Gov't
Journal Article
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 6
Keywords dynamic programming
algorithms
combinatorial optimization
haplotypes
next generation sequencing
OpenAccessLink https://inria.hal.science/hal-01225988
PMID 25658651
SubjectTerms Diploidy
Genetics, Population - methods
Genome, Human - genetics
Haplotypes - genetics
High-Throughput Nucleotide Sequencing - methods
Humans
Polymorphism, Single Nucleotide - genetics
Sequence Analysis, DNA - methods
URI https://www.ncbi.nlm.nih.gov/pubmed/25658651