SPECTRa-T: Machine-Based Data Extraction and Semantic Searching of Chemistry e-Theses

Bibliographic Details
Published in	Journal of Chemical Information and Modeling, Vol. 50, no. 2, pp. 251-261
Main Authors	Downing, Jim; Harvey, Matt J; Morgan, Peter B; Murray-Rust, Peter; Rzepa, Henry S; Stewart, Diana C; Tonge, Alan P; Townsend, Joe A
Format	Journal Article
Language	English
Published	Washington, DC: American Chemical Society, 22.02.2010
Subjects	Academic Dissertations as Topic | Applied sciences | Artificial intelligence | Automatic Data Processing | Chemical Information | Chemistry - education | Computer science; control theory; systems | Data Mining - methods | Databases, Factual | Exact sciences and technology | False Positive Reactions | Information systems. Data bases | Memory organisation. Data processing | Physical chemistry | Physical properties | Semantics | Software | Speech and sound recognition and synthesis. Linguistics | Electronic properties | Semantic analysis | Spectral properties | Document structure | Metadata | Resource description framework | Text | Experimental study | Data mining

Summary:	The SPECTRa-T project has developed text-mining tools to extract named chemical entities (NCEs), such as chemical names and terms, and chemical objects (COs), e.g., experimental spectral assignments and physical chemistry properties, from electronic theses (e-theses). Although NCEs were readily identified within the two major document formats studied, only the use of structured documents enabled identification of chemical objects and their association with the relevant chemical entity (e.g., systematic chemical name). A corpus of theses was analyzed, and it is shown that a high degree of semantic information can be extracted from structured documents. This integrated information has been deposited in a persistent Resource Description Framework (RDF) triple-store that allows users to conduct semantic searches. The strengths and weaknesses of several document formats are reviewed.
ISSN:	1549-9596, 1549-960X
DOI:	10.1021/ci9003688