Using text mining techniques to gather gene-specific information from the biomedical literature

Bibliographic Details
Main Author: Tudor, Catalina O
Format: Dissertation
Language: English
Published: ProQuest Dissertations & Theses 01.01.2011
Summary: Life science researchers need to find descriptions of genes quickly in order to understand and interpret the results of their experiments. For this reason, life scientists constantly consult the biomedical literature for articles describing genes they might not be familiar with. Learning facts about genes by reading these documents can be an arduous and time-consuming task. In addition, searching millions of documents can return many irrelevant results, as gene names can be highly ambiguous. In this dissertation, we seek to help biologists quickly find information about genes. We start by finding article abstracts that mention a gene's names and synonyms, and automatically filtering out irrelevant abstracts that are introduced by gene name ambiguities or that mention the gene only in passing. We then mine informative terms about the gene by identifying terms that have a disproportionately higher frequency when mentioned with the gene than alone. Since some of these terms are meaningful only in context, we automatically identify sentences that succinctly and clearly describe their relations to the gene. Put together, a gene's abstracts, informative terms, and descriptive sentences could serve as an overview of the gene, as well as a gateway to the literature for further exploration. Our evaluations show that the retrieval of gene-centric abstracts is accurate and has high recall, that the terms mined from these documents are relevant to their corresponding genes, and that the sentences describing the relations between genes and their informative terms are rated highly by biologists. The system presented in this dissertation is available online and has already been integrated into a gene annotation pipeline.
ISBN: 9781124883359, 1124883355
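
The term-mining step mentioned in the summary (terms that are disproportionately more frequent with the gene than alone) can be illustrated with a minimal sketch. This is not the dissertation's scoring method, which the record does not specify. The function name `informative_terms`, the `min_count` threshold, the add-one smoothing, and the use of a plain document-frequency ratio are all assumptions made for illustration. Each abstract is assumed to be already tokenized into a list of terms.

```python
from collections import Counter

def informative_terms(gene_docs, background_docs, min_count=2):
    """Rank terms by how much more often they occur in abstracts that
    mention the gene than in a background collection (illustrative only)."""
    # Document frequency: count each term at most once per abstract.
    gene_df = Counter(t for doc in gene_docs for t in set(doc))
    back_df = Counter(t for doc in background_docs for t in set(doc))
    n_gene, n_back = len(gene_docs), len(background_docs)

    scores = {}
    for term, count in gene_df.items():
        if count < min_count:  # hypothetical cutoff to drop rare terms
            continue
        p_gene = count / n_gene
        # Add-one smoothing (an assumption) avoids division by zero
        # for terms never seen in the background collection.
        p_back = (back_df[term] + 1) / (n_back + 1)
        scores[term] = p_gene / p_back
    return sorted(scores, key=scores.get, reverse=True)

# Toy, made-up data: "kinase" appears only alongside the gene.
gene_abstracts = [["kinase", "cell"], ["kinase", "protein"], ["kinase", "cell"]]
background = [["cell", "protein"], ["cell"], ["protein", "cell"], ["cell"]]
ranked = informative_terms(gene_abstracts, background)
```

In this toy data, a term that is common everywhere (such as "cell") scores low because its background frequency is high, while a term concentrated in the gene's abstracts scores high. A real system would likely use a more robust statistic and a much larger background collection.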