ELECTRONIC DOCUMENT SOURCE INGESTION FOR NATURAL LANGUAGE PROCESSING SYSTEMS



Bibliographic Details
Main Author	DUBBELS JOEL C
Format	Patent
Language	English
Published	12.06.2014
Subjects	CALCULATING; COMPUTING; COUNTING / ELECTRIC DIGITAL DATA PROCESSING / PHYSICS

Summary:	The data store for a natural-language computing system may include information that originates from a plurality of different data sources, e.g., journals, websites, magazines, reference books, and the like. In one embodiment, the information or text from the data sources is converted into a single, shared format and stored as objects in a data store. To ingest documents that arrive in different formats, a natural language processing system may perform preprocessing that converts those formats into a normalized format. When a new text document is received, the text may be correlated to a particular properties file, which includes instructions specifying how the preprocessor should interpret the received text. Based on these instructions, the preprocessor identifies relevant portions of the text document and assigns these portions to formatting elements in the normalized format. The text may then be stored in the objects based on this assignment.
Bibliography:	Application Number: US201213709413
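
The ingestion flow described in the summary can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name here (PROPERTIES, NormalizedDocument, select_properties, preprocess, ingest) is hypothetical, the "properties files" are modeled as in-memory dictionaries of regular expressions, and the rule that correlates a text with its properties entry is a deliberately naive marker check chosen only for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical stand-in for per-source "properties files": each entry maps a
# formatting element of the normalized format to a regex that locates the
# matching portion of the raw text. The element names are assumptions.
PROPERTIES = {
    "journal": {
        "title": r"^TITLE:\s*(.+)$",
        "body": r"^ABSTRACT:\s*(.+)$",
    },
    "website": {
        "title": r"<h1>(.+?)</h1>",
        "body": r"<p>(.+?)</p>",
    },
}


@dataclass
class NormalizedDocument:
    """Object holding text in the single, shared format."""
    source_type: str
    elements: dict = field(default_factory=dict)


def select_properties(text: str) -> str:
    """Correlate received text with a properties entry (naive assumption:
    anything containing an <h1> tag is a website, everything else a journal)."""
    return "website" if "<h1>" in text else "journal"


def preprocess(text: str) -> NormalizedDocument:
    """Identify relevant portions of the text and assign them to the
    formatting elements named in the selected properties entry."""
    source_type = select_properties(text)
    doc = NormalizedDocument(source_type)
    for element, pattern in PROPERTIES[source_type].items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            doc.elements[element] = match.group(1).strip()
    return doc


def ingest(text: str, store: list) -> NormalizedDocument:
    """Preprocess a document and store the resulting object."""
    doc = preprocess(text)
    store.append(doc)
    return doc


if __name__ == "__main__":
    data_store = []
    ingest("TITLE: Parsing Text\nABSTRACT: A short study.", data_store)
    ingest("<h1>Hello</h1><p>World text.</p>", data_store)
    for stored in data_store:
        print(stored.source_type, stored.elements)
```

Keeping the extraction rules in data rather than in code mirrors the role the summary gives to the properties file: supporting a new source format would mean adding a new entry, not changing the preprocessor.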