ELECTRONIC DOCUMENT SOURCE INGESTION FOR NATURAL LANGUAGE PROCESSING SYSTEMS

Bibliographic Details
Main Author: DUBBELS JOEL C
Format: Patent
Language: English
Published: 12.06.2014

Summary: The data store for a natural-language computing system may include information that originates from a plurality of different data sources, e.g., journals, websites, magazines, reference books, and the like. In one embodiment, the information or text from the data sources is converted into a single, shared format and stored as objects in a data store. To ingest the different documents with their respective formats, a natural language processing system may perform preprocessing to change the different formats into a normalized format. When a new text document is received, the text may be correlated to a particular properties file, which includes instructions specifying how the preprocessor should interpret the received text. Based on these instructions, a preprocessor identifies relevant portions of the text document and assigns these portions to formatting elements in the normalized format. The text may then be stored in the objects based on this assignment.
Bibliography: Application Number: US201213709413
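
The preprocessing flow described in the Summary can be illustrated with a minimal Python sketch. This is not the patented implementation. Every name here (PROPERTIES, NormalizedDocument, select_properties, preprocess, data_store) is hypothetical, and so are the regex-based "properties files" and the sample input formats. The sketch only shows the general idea: correlate incoming text to a set of instructions, use them to find the relevant portions, assign those portions to elements of one shared format, and store the resulting object.

```python
import re
from dataclasses import dataclass

# Hypothetical "properties files", one per source format. Each maps a
# formatting element of the normalized format to a regex whose first group
# captures that element in the raw text. These rules are assumptions made
# for illustration; the record does not describe the file contents.
PROPERTIES = {
    "journal": {
        "title": r"^TITLE:\s*(.+)$",
        "author": r"^AUTHOR:\s*(.+)$",
        "body": r"^ABSTRACT:\s*(.+)$",
    },
    "website": {
        "title": r"<h1>(.+?)</h1>",
        "author": r'<span class="by">(.+?)</span>',
        "body": r"<p>(.+?)</p>",
    },
}


@dataclass
class NormalizedDocument:
    """Single shared format that every source is converted into."""
    source: str
    title: str = ""
    author: str = ""
    body: str = ""


def select_properties(text: str) -> str:
    """Correlate the text to a properties file.

    Assumed heuristic: text containing an <h1> tag is treated as a website,
    anything else as a journal record.
    """
    return "website" if "<h1>" in text else "journal"


def preprocess(text: str) -> NormalizedDocument:
    """Identify relevant portions and assign them to normalized elements."""
    source = select_properties(text)
    doc = NormalizedDocument(source=source)
    for element, pattern in PROPERTIES[source].items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            setattr(doc, element, match.group(1).strip())
    return doc


# Stand-in for the data store of objects.
data_store = []

journal_text = (
    "TITLE: Sleep and Memory\n"
    "AUTHOR: A. Researcher\n"
    "ABSTRACT: Sleep aids recall."
)
web_text = (
    '<h1>Sleep Tips</h1><span class="by">B. Writer</span>'
    "<p>Keep a schedule.</p>"
)

for raw in (journal_text, web_text):
    data_store.append(preprocess(raw))
```

Keeping the extraction rules in data rather than in code means a new source format could, under these assumptions, be supported by adding one more entry to PROPERTIES without changing preprocess.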