DNABERT-based explainable lncRNA identification in plant genome assemblies

Bibliographic Details
Published in Computational and structural biotechnology journal, Vol. 21, pp. 5676-5685
Main Authors Danilevicz, Monica F.; Gill, Mitchell; Fernandez, Cassandria G. Tay; Petereit, Jakob; Upadhyaya, Shriprabha R.; Batley, Jacqueline; Bennamoun, Mohammed; Edwards, David; Bayer, Philipp E.
Format Journal Article
Language English
Published Netherlands Elsevier B.V 01.01.2023
Research Network of Computational and Structural Biotechnology
Elsevier
Summary: Long non-coding ribonucleic acids (lncRNAs) have been shown to play an important role in plant gene regulation, involving both epigenetic and transcript regulation. LncRNAs are transcripts longer than 200 nucleotides that are not translated into functional proteins but can be translated into small peptides. Machine learning models have predominantly used transcriptome data with manually defined features to detect lncRNAs; however, they often underrepresent the abundance of lncRNAs and can be biased in their detection. Here we present a study using Natural Language Processing (NLP) models to identify plant lncRNAs from genomic sequences rather than transcriptomic data. The NLP models were trained to predict lncRNAs for seven model and crop species (Zea mays, Arabidopsis thaliana, Brassica napus, Brassica oleracea, Brassica rapa, Glycine max and Oryza sativa) using publicly available genomic references. We demonstrated that lncRNAs can be accurately predicted from genomic sequences, with the highest accuracy of 83.4% for Z. mays and the lowest accuracy of 57.9% for B. rapa, revealing that genome assembly quality might affect the accuracy of lncRNA identification. Furthermore, we demonstrated the potential of using NLP models for cross-species prediction, with an average of 63.1% accuracy on target species not previously seen by the model. As more species are incorporated into the training datasets, we expect the accuracy to increase, making the approach a more reliable tool for uncovering novel lncRNAs. Finally, we show that the models can be interpreted using explainable artificial intelligence to identify motifs important to lncRNA prediction, and that these motifs frequently flanked the lncRNA sequence.
• Pioneering identification of lncRNAs from genomic sequences, allowing the identification of lowly expressed lncRNAs.
• A deep learning model (natural language processing) was used to predict lncRNAs in two monocot and five dicot plant species.
• Explainable AI was used for extracting genomic motifs associated with lncRNA detection and potentially conserved structures.
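The summary describes treating genomic sequence as text for NLP models, and the title names DNABERT, a model family that represents DNA as overlapping k-mer "words". As a minimal sketch of that preprocessing idea only: the function name `kmer_tokenize` and the choice `k=6` are illustrative assumptions, not details taken from this record, and the authors' actual pipeline may differ.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Hypothetical helper for illustration; k=6 is an assumed value.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    tokens = kmer_tokenize("ATGCGTAC", k=6)
    print(" ".join(tokens))  # ATGCGT TGCGTA GCGTAC
```

The space-joined k-mers form the "sentence" that a BERT-style tokenizer would then map to vocabulary IDs; a classifier head on top could label each window as lncRNA or not.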
These authors contributed equally to this work
ISSN:2001-0370
DOI:10.1016/j.csbj.2023.11.025