RepeatModeler2 for automated genomic discovery of transposable element families

Bibliographic Details
Published in	Proceedings of the National Academy of Sciences - PNAS Vol. 117; no. 17; pp. 9451 - 9457
Main Authors	Flynn, Jullien M, Hubley, Robert, Goubert, Clément, Rosen, Jeb, Clark, Andrew G, Feschotte, Cédric, Smit, Arian F
Format	Journal Article
Language	English
Published	United States National Academy of Sciences 28.04.2020
Subjects	Animals; Annotations; Automation; Biological Sciences; Conserved sequence; Consortia; Danio rerio; DNA Transposable Elements - genetics; Drosophila melanogaster; Drosophila melanogaster - genetics; Fruit flies; Gene sequencing; Genome; Genomes; Genomics - methods; Long terminal repeat; Oryza - genetics; Oryza sativa; Software; Source code; Species; Toolkits; Transposons; Zebrafish; Zebrafish - genetics; genome annotation; mobile genetic elements; transposon families
Summary:	The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately 3 times more consensus sequences matching the manually curated sequences with >95% sequence identity and sequence coverage than the original RepeatModeler did. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/).
Bibliography:	Author contributions: J.M.F., R.H., C.G., C.F., and A.F.S. designed research; J.M.F., R.H., and J.R. performed research; A.G.C., C.F., and A.F.S. supervised the research; and J.M.F., R.H., C.G., J.R., A.G.C., C.F., and A.F.S. wrote the paper. Reviewers: I.R.A., Marine Biological Laboratory; and M.C.G.H., Cold Spring Harbor Laboratory. J.M.F. and R.H. contributed equally to this work. Contributed by Andrew G. Clark, March 5, 2020 (sent for review December 2, 2019; reviewed by Irina R. Arkhipova and Molly C. Gale Hammell)
ISSN:	0027-8424 1091-6490
DOI:	10.1073/pnas.1921046117