ELECTRONIC DOCUMENT SOURCE INGESTION FOR NATURAL LANGUAGE PROCESSING SYSTEMS
Main Author | |
---|---|
Format | Patent |
Language | English |
Published | 12.06.2014 |
Subjects | |
Summary: | The data store for a natural-language computing system may include information that originates from a plurality of different data sources, e.g., journals, websites, magazines, reference books, and the like. In one embodiment, the information or text from the data sources is converted into a single, shared format and stored as objects in a data store. To ingest the different documents with their respective formats, a natural language processing system may perform preprocessing to change the different formats into a normalized format. When a new text document is received, the text may be correlated to a particular properties file that includes instructions specifying how the preprocessor should interpret the received text. Based on these instructions, a preprocessor identifies relevant portions of the text document and assigns these portions to formatting elements in the normalized format. The text may then be stored in the objects based on this assignment. |
---|---|
Bibliography: | Application Number: US201213709413 |
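
The ingestion flow described in the summary might be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `PROPERTIES` table, the `NormalizedDocument` class, the `preprocess` function, and the regular expressions are all hypothetical names and rules chosen for the example. Each "properties file" is modeled as a dictionary that maps a formatting element of the normalized format to a pattern locating the matching portion of the source text.

```python
import re
from dataclasses import dataclass, field

# Hypothetical stand-ins for per-source "properties files": each maps a
# formatting element of the normalized format to a regex whose first
# capture group is the relevant portion of the raw text.
PROPERTIES = {
    "journal": {
        "title": r"^TITLE:\s*(.+)$",
        "body": r"^ABSTRACT:\s*(.+)$",
    },
    "website": {
        "title": r"<h1>(.+?)</h1>",
        "body": r"<p>(.+?)</p>",
    },
}


@dataclass
class NormalizedDocument:
    """Object in the shared format; elements holds the assigned text portions."""
    source_type: str
    elements: dict = field(default_factory=dict)


def preprocess(raw_text: str, source_type: str) -> NormalizedDocument:
    """Correlate the text with its properties and fill the normalized object."""
    if source_type not in PROPERTIES:
        raise ValueError(f"no properties defined for source type {source_type!r}")
    doc = NormalizedDocument(source_type)
    for element, pattern in PROPERTIES[source_type].items():
        match = re.search(pattern, raw_text, flags=re.MULTILINE)
        if match:  # elements with no matching portion are simply left out
            doc.elements[element] = match.group(1).strip()
    return doc


# Assumed in-memory data store; a real system would persist these objects.
data_store = []
journal_text = "TITLE: Parsing at Scale\nABSTRACT: We describe a parser."
web_text = "<h1>Hello</h1><p>World text</p>"
data_store.append(preprocess(journal_text, "journal"))
data_store.append(preprocess(web_text, "website"))
```

Both inputs end up as objects with the same `title` and `body` elements even though their source formats differ, which is the point of the normalization step. Supporting a new source format in this sketch would only require adding another entry to `PROPERTIES`.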