Motto: Representing Motifs in Consensus Sequences with Minimum Information Loss

Bibliographic Details
Published in	Genetics (Austin) Vol. 216; no. 2; pp. 353 - 358
Main Authors	Wang, Mengchi; Wang, David; Zhang, Kai; Ngo, Vu; Fan, Shicai; Wang, Wei
Format	Journal Article
Language	English
Published	United States: Oxford University Press; Genetics Society of America, 01.10.2020
Subjects	Algorithms; Amino acids; Binding sites; Conserved sequence; Divergence; DNA methylation; Genetics; Genome, Human; Genomes; Humans; Information theory; Investigations; Logos; Mathematical analysis; Matrix methods; Methods; Nucleotide sequence; Nucleotides; Position-Specific Scoring Matrices; Representations; Sequence analysis; Sequence Analysis, DNA - methods; Sequence Analysis, DNA - standards; Sequences; Transcription factors; Transcription Factors - genetics; Weight; consensus; transcription factor binding; information theory; motif; sequence logo
Summary:	Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, when interpreting motif information or searching for motif matches, it is compact and sufficient to represent motifs by wildcard-style consensus sequences (such as [GC][AT]GATAAG[GAC]). Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We name this representation sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs into Motto from nucleotides, amino acids, and customized characters. We show that this representation provides a simple and efficient way to identify the binding sites of 1156 common transcription factors (TFs) in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves a 0.81 area under the precision-recall curve, significantly (P-value < 0.01) outperforming all existing methods, including maximal positional weight, Cavener’s method, and minimal mean square error. We believe this representation provides a distilled summary of a motif, together with its statistical justification.
Bibliography:	These authors contributed equally to this work.
ISSN:	1943-2631, 0016-6731
DOI:	10.1534/genetics.120.303597
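
Illustration: the summary describes converting each PWM position into a wildcard-style character set while losing as little information as possible, measured with Jensen-Shannon divergence. The sketch below is one simple reading of that idea and is not the authors' Motto implementation. For every position it tries each non-empty subset of the alphabet, treats the subset as a uniform distribution, and keeps the subset closest to the PWM column. The function names (`js_divergence`, `column_to_symbol`, `pwm_to_consensus`), the exhaustive subset search, and the ordering of characters inside brackets are all assumptions made for illustration; the published tool's options and any additional penalty terms are not modeled here.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence (in bits) between two discrete distributions."""
    def kl(a, b):
        # Terms with a[i] == 0 contribute nothing to the KL sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def column_to_symbol(column, alphabet="ACGT"):
    """Pick the character subset whose uniform distribution is closest to the column.

    Hypothetical helper: exhaustive search is only practical for small
    alphabets such as nucleotides (15 non-empty subsets).
    """
    n = len(alphabet)
    best_set, best_jsd = None, float("inf")
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            q = [1 / k if i in subset else 0.0 for i in range(n)]
            d = js_divergence(column, q)
            if d < best_jsd - 1e-12:
                best_set, best_jsd = subset, d
    # Assumed convention: list characters from most to least probable.
    ordered = sorted(best_set, key=lambda i: -column[i])
    letters = "".join(alphabet[i] for i in ordered)
    return letters if len(letters) == 1 else f"[{letters}]"


def pwm_to_consensus(pwm, alphabet="ACGT"):
    """Convert a PWM (list of probability columns) into a consensus string."""
    return "".join(column_to_symbol(col, alphabet) for col in pwm)


example_pwm = [
    [1.0, 0.0, 0.0, 0.0],      # only A
    [0.0, 0.5, 0.5, 0.0],      # C or G, equally likely
    [0.25, 0.25, 0.25, 0.25],  # no preference
    [0.97, 0.01, 0.01, 0.01],  # almost always A
]
consensus = pwm_to_consensus(example_pwm)
```

A consensus string in this form can be used directly as a regular-expression pattern for searching sequences, which is the practical appeal of the wildcard-style representation described in the summary.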