Standardizing free-text data exemplified by two fields from the Immune Epitope Database

Bibliographic Details
Published in	Journal of biomedical semantics Vol. 16; no. 1; pp. 5 - 18
Main Authors	Duesing, Sebastian, Bennett, Jason, Overton, James A., Vita, Randi, Peters, Bjoern
Format	Journal Article
Language	English
Published	England: BioMed Central Ltd (BMC), 22.03.2025
Subjects	Age; Antigenic determinants; Automation; Biological Ontologies; Biomedical data; Data normalization; Data standardization; Databases, Factual; Datasets; Epitopes; Epitopes - immunology; Free-text data; Humans; Immune epitope database; Information management; Medical research; Medicine, Experimental; Ontology; Python; Social networks; Standardization; Unstructured data; Words (language); United States

Summary:	While unstructured data, such as free text, constitutes a large amount of publicly available biomedical data, it is underutilized in automated analyses because extracting meaning from it is difficult. Normalizing free-text data, i.e., removing inessential variance, enables the use of structured vocabularies like ontologies to represent the data and allows for harmonized queries over it. This paper presents an adaptable tool for free-text normalization and evaluates its application to two fields curated from the literature in the Immune Epitope Database (IEDB): "age" and "data-location" (the part of a paper in which data was found). Free-text entries for the database fields for subject age (4095 distinct values) and publication data-location (251,810 distinct values) in the IEDB were analyzed. Normalization was performed in three steps, namely character normalization, word normalization, and phrase normalization, using generalizable rules developed and applied with the tool presented in this manuscript. For the age dataset, in the character stage, the application of 21 rules resulted in 99.97% output validity; in the word stage, the application of 94 rules resulted in 98.06% output validity; and in the phrase stage, the application of 16 rules resulted in 83.81% output validity. For the data-location dataset, in the character stage, the application of 39 rules resulted in 99.99% output validity; in the word stage, the application of 187 rules resulted in 98.46% output validity; and in the phrase stage, the application of 12 rules resulted in 97.95% output validity. We developed a generalizable approach for normalization of free text as found in database fields with content on a specific topic. Creating and testing the rules was a one-time effort for a given field, and the rules can now be applied to data as it is being curated. The standardization achieved in the two datasets tested significantly reduces variance in the content, which enhances the findability and usability of that data, chiefly by improving search functionality and enabling linkages with formal ontologies.
ISSN:	2041-1480
DOI:	10.1186/s13326-025-00324-7
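
Illustration:	The three-stage normalization described in the summary (character, then word, then phrase rules) could be sketched roughly as follows. This is a minimal sketch and not the authors' tool: the rule lists, the `VALID` pattern, and the function names `apply_rules`, `normalize`, and `validity` are all hypothetical, and "output validity" is assumed here to mean the fraction of normalized values that match an expected form.

```python
import re
import unicodedata

# Hypothetical rules for an "age" field; the paper's actual rules are not shown here.
CHARACTER_RULES = [
    (r"[\u2010-\u2015]", "-"),  # unify Unicode dash variants to an ASCII hyphen
    (r"\s+", " "),              # collapse runs of whitespace
]
WORD_RULES = [
    (r"\b(yrs?|years?)\b", "year"),
    (r"\b(wks?|weeks?)\b", "week"),
    (r"\b(mos?|months?)\b", "month"),
]
PHRASE_RULES = [
    # "6 to 8 week old" or "6-8 week" becomes "6-8 week"
    (r"^(\d+) ?(?:-|to) ?(\d+) (year|month|week)(?: old)?$", r"\1-\2 \3"),
    # "10 year old" becomes "10 year"
    (r"^(\d+) (year|month|week)(?: old)?$", r"\1 \2"),
]

# Assumed definition of a "valid" output for this toy field.
VALID = re.compile(r"^\d+(-\d+)? (year|month|week)$")


def apply_rules(text, rules):
    """Apply each (pattern, replacement) rule in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


def normalize(value):
    """Run the character, word, and phrase stages in sequence."""
    text = unicodedata.normalize("NFKC", value).lower()
    text = apply_rules(text, CHARACTER_RULES).strip()
    text = apply_rules(text, WORD_RULES)
    return apply_rules(text, PHRASE_RULES)


def validity(values):
    """Fraction of values matching the assumed valid form."""
    return sum(bool(VALID.match(v)) for v in values) / len(values)
```

Under these assumed rules, `normalize("6 to 8 weeks")` and `normalize("6\u20138  Wks old")` both yield `"6-8 week"`, while an entry such as `"adult"` passes through unchanged and counts as invalid in `validity`. Ordering the stages from characters to phrases lets later rules rely on the cleanup done by earlier ones, which keeps each individual rule simple.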