Advancing DNA Language Models through Motif-Oriented Pre-Training with MoDNA
Published in | BioMedInformatics Vol. 4; no. 2; pp. 1556 - 1571 |
---|---|
Main Authors | An, Weizhi; Guo, Yuzhi; Bian, Yatao; Ma, Hehuan; Yang, Jinyu; Li, Chunyuan; Huang, Junzhou |
Format | Journal Article |
Language | English |
Published | MDPI AG, 01.06.2024 |
Subjects | |
Online Access | https://doaj.org/article/5a108b4ad4014218a7e3e0c7d67365ff |
ISSN | 2673-7426 |
DOI | 10.3390/biomedinformatics4020085 |
Abstract | Acquiring meaningful representations of gene expression is essential for the accurate prediction of downstream regulatory tasks, such as identifying promoters and transcription factor binding sites. However, the current dependency on supervised learning, constrained by the limited availability of labeled genomic data, impedes the ability to develop robust predictive models with broad generalization capabilities. In response, recent advancements have pivoted towards the application of self-supervised training for DNA sequence modeling, enabling the adaptation of pre-trained genomic representations to a variety of downstream tasks. Departing from the straightforward application of masked language learning techniques to DNA sequences, approaches such as MoDNA enrich genome language modeling with prior biological knowledge. In this study, we advance DNA language models by utilizing the Motif-oriented DNA (MoDNA) pre-training framework, which is established for self-supervised learning at the pre-training stage and is flexible enough for application across different downstream tasks. MoDNA distinguishes itself by efficiently learning semantic-level genomic representations from an extensive corpus of unlabeled genome data, offering a significant improvement in computational efficiency over previous approaches. The framework is pre-trained on a comprehensive human genome dataset and fine-tuned for targeted downstream tasks. Our enhanced analysis and evaluation in promoter prediction and transcription factor binding site prediction have further validated MoDNA’s exceptional capabilities, emphasizing its contribution to advancements in genomic predictive modeling. |
---|---|
Author | An, Weizhi; Guo, Yuzhi (ORCID 0000-0002-8993-1818); Bian, Yatao; Ma, Hehuan (ORCID 0000-0002-5971-0053); Yang, Jinyu; Li, Chunyuan; Huang, Junzhou (ORCID 0000-0002-9548-1227) |
Open Access | Yes |
Peer Reviewed | Yes |
License | CC BY 4.0 (https://creativecommons.org/licenses/by/4.0) |
Subject Terms | genomic predictive modeling; self-supervised learning; transformer |