Automated Integration of Genomic Metadata with Sequence-to-Sequence Models

Bibliographic Details
Published in	Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track, Vol. 12461, pp. 187-203
Main Authors	Cannizzaro, Giuseppe; Leone, Michele; Bernasconi, Anna; Canakoglu, Arif; Carman, Mark J.
Format	Book Chapter
Language	English
Published	Switzerland: Springer International Publishing AG, 2021
Series	Lecture Notes in Computer Science
Subjects	Deep Learning; Genomics; High-throughput sequencing; Metadata integration; Natural language processing; Translation models
Summary:	While exponential growth in public genomic data can afford great insights into biological processes underlying diseases, a lack of structured metadata often impedes its timely discovery for analysis. In the Gene Expression Omnibus, for example, descriptions of genomic samples lack structure, with different terminology (such as “breast cancer”, “breast tumor”, and “malignant neoplasm of breast”) used to express the same concept. To remedy this, we learn models to extract salient information from this textual metadata. Rather than treating the problem as classification or named entity recognition, we model it as machine translation, leveraging state-of-the-art sequence-to-sequence (seq2seq) models to directly map unstructured input into a structured text format. The application of such models greatly simplifies training and allows for imputation of output fields that are implied but never explicitly mentioned in the input text. We experiment with two types of seq2seq models: an LSTM with attention and a transformer (in particular GPT-2), noting that the latter outperforms both the former and a multi-label classification approach based on a similar transformer architecture (RoBERTa). The GPT-2 model showed a surprising ability to predict attributes with a large set of possible values, often inferring the correct value for unmentioned attributes. The models were evaluated in both homogeneous and heterogeneous training/testing environments, indicating the efficacy of the transformer-based seq2seq approach for real data integration applications.
ISBN:	9783030676698 3030676692
ISSN:	0302-9743 1611-3349
DOI:	10.1007/978-3-030-67670-4_12
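
Illustration:	The summary frames metadata extraction as translation from free text into a structured text format. The sketch below shows one way such source/target pairs might be built and how a generated target string could be parsed back into fields. The field names, the separators, and the sample description are assumptions made for illustration only; this record does not state the chapter's actual output format.

```python
# Hypothetical target format: "key: value | key: value | ...".
# The separators and field names are illustrative assumptions,
# not the format used in the chapter.
FIELD_SEP = " | "
KV_SEP = ": "


def serialize(record: dict[str, str]) -> str:
    """Flatten a structured record into one target string for a seq2seq model."""
    return FIELD_SEP.join(f"{key}{KV_SEP}{value}" for key, value in record.items())


def parse(target: str) -> dict[str, str]:
    """Recover key/value pairs from a generated string, skipping malformed parts."""
    record = {}
    for part in target.split(FIELD_SEP):
        key, sep, value = part.partition(KV_SEP)
        if sep and key.strip():
            record[key.strip()] = value.strip()
    return record


def make_training_pair(description: str, record: dict[str, str]) -> tuple[str, str]:
    """Pair the unstructured description (source) with the serialized record (target)."""
    return description, serialize(record)


# Invented sample description; "breast tumor" is normalized to "breast cancer"
# in the target, mirroring the synonym problem the summary describes.
source, target = make_training_pair(
    "MCF-7 cells, breast tumor, untreated",
    {"cell line": "MCF-7", "disease": "breast cancer", "tissue": "breast"},
)
```

Because the target is plain text, a model trained on such pairs can emit a field even when the source never names it explicitly, which is the imputation behavior the summary mentions. A value that itself contains the field separator would break this simple parser, so a real format would need escaping or a separator that cannot occur in the data.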