Advancing DNA Language Models through Motif-Oriented Pre-Training with MoDNA

Bibliographic Details
Published in BioMedInformatics Vol. 4; no. 2; pp. 1556-1571
Main Authors An, Weizhi; Guo, Yuzhi; Bian, Yatao; Ma, Hehuan; Yang, Jinyu; Li, Chunyuan; Huang, Junzhou
Format Journal Article
Language English
Published MDPI AG 01.06.2024
ISSN 2673-7426
DOI 10.3390/biomedinformatics4020085

Abstract Acquiring meaningful representations of gene expression is essential for the accurate prediction of downstream regulatory tasks, such as identifying promoters and transcription factor binding sites. However, the current dependency on supervised learning, constrained by the limited availability of labeled genomic data, impedes the ability to develop robust predictive models with broad generalization capabilities. In response, recent advancements have pivoted towards the application of self-supervised training for DNA sequence modeling, enabling the adaptation of pre-trained genomic representations to a variety of downstream tasks. Departing from the straightforward application of masked language learning techniques to DNA sequences, approaches such as MoDNA enrich genome language modeling with prior biological knowledge. In this study, we advance DNA language models by utilizing the Motif-oriented DNA (MoDNA) pre-training framework, which is established for self-supervised learning at the pre-training stage and is flexible enough for application across different downstream tasks. MoDNA distinguishes itself by efficiently learning semantic-level genomic representations from an extensive corpus of unlabeled genome data, offering a significant improvement in computational efficiency over previous approaches. The framework is pre-trained on a comprehensive human genome dataset and fine-tuned for targeted downstream tasks. Our enhanced analysis and evaluation in promoter prediction and transcription factor binding site prediction have further validated MoDNA’s exceptional capabilities, emphasizing its contribution to advancements in genomic predictive modeling.
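
The masked language learning objective mentioned in the abstract can be illustrated with a minimal Python sketch. This is a generic illustration only, not MoDNA's implementation: the overlapping k-mer tokenization, the choice of k = 6, the 15% masking rate, the [MASK] token, and the function names kmer_tokenize and mask_tokens are all assumptions made for this example, and the motif-oriented objectives of MoDNA are not reproduced here.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, chosen for this sketch


def kmer_tokenize(seq, k=6):
    """Split a DNA string into overlapping k-mers with stride 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the masked token list and a label list that holds the
    original token at masked positions and None everywhere else.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked[i] = MASK
            labels[i] = tok
    return masked, labels


tokens = kmer_tokenize("ACGTACGTAGCTAGCTAACG", k=6)
masked, labels = mask_tokens(tokens)
```

A model trained with this kind of objective would be asked to predict each entry of labels that is not None from the surrounding unmasked tokens in masked.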
Authors (in order)
An, Weizhi
Guo, Yuzhi (ORCID 0000-0002-8993-1818)
Bian, Yatao
Ma, Hehuan (ORCID 0000-0002-5971-0053)
Yang, Jinyu
Li, Chunyuan
Huang, Junzhou (ORCID 0000-0002-9548-1227)
Open Access true
Peer Reviewed true
License https://creativecommons.org/licenses/by/4.0
Open Access Link https://doaj.org/article/5a108b4ad4014218a7e3e0c7d67365ff
Page Count 16
Subject Terms genomic predictive modeling; self-supervised learning; transformer