Analyzing deduplicated data blocks associated with unstructured documents


Bibliographic Details
Main Authors: Hampp-Bahnmueller, Thomas; Saillet, Yannick; Baessler, Michael
Format: Patent
Language: English
Published: 05.03.2024

Summary: Techniques are described relating to unstructured document processing. An associated computer-implemented method includes identifying a plurality of deduplicated data blocks associated with a collection of unstructured documents. The method further includes sorting the plurality of deduplicated data blocks in descending order based upon at least one block frequency metric, selecting a highest sorted unprocessed deduplicated data block, applying text analytics to the selected deduplicated data block, and applying at least one result of the text analytics to any document among the collection of unstructured documents including the selected deduplicated data block. The method is terminated responsive to satisfaction of at least one stopping condition.
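
The loop described in the summary can be illustrated with a minimal Python sketch. It is not the patented implementation, and it rests on several assumptions that the record does not state: the block frequency metric is taken to be the number of documents that contain a block, the stopping condition is taken to be a cap on the number of blocks analyzed (`max_blocks`), and `analyze` is a hypothetical stand-in for whatever text analytics are applied. All names are illustrative.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


def analyze_by_block_frequency(
    doc_blocks: Dict[str, List[str]],
    blocks: Dict[str, str],
    analyze: Callable[[str], List[str]],
    max_blocks: int,
) -> Dict[str, List[str]]:
    """Analyze the most widely shared blocks first and propagate the results.

    doc_blocks maps a document id to the ids of its deduplicated blocks.
    blocks maps a block id to the block's text.
    analyze is a hypothetical text-analytics callable (an assumption).
    max_blocks is the assumed stopping condition: a cap on analyzed blocks.
    """
    # Invert the mapping: which documents contain each block?
    block_docs: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, block_ids in doc_blocks.items():
        for block_id in block_ids:
            block_docs[block_id].add(doc_id)

    # Assumed frequency metric: number of documents containing the block.
    # Sort in descending order; ties are broken by block id for determinism.
    ordered = sorted(block_docs, key=lambda b: (-len(block_docs[b]), b))

    results: Dict[str, List[str]] = defaultdict(list)
    processed = 0
    for block_id in ordered:
        # Terminate once the assumed stopping condition is satisfied.
        if processed >= max_blocks:
            break
        # Run the analytics once per block, not once per document.
        findings = analyze(blocks[block_id])
        # Apply the result to every document that includes this block.
        for doc_id in sorted(block_docs[block_id]):
            results[doc_id].extend(findings)
        processed += 1
    return dict(results)


# Toy data: block "b1" is shared by all three documents.
doc_blocks = {
    "a.txt": ["b1", "b2"],
    "b.txt": ["b1", "b3"],
    "c.txt": ["b1", "b2"],
}
blocks = {
    "b1": "Confidential notice",
    "b2": "Quarterly report",
    "b3": "Meeting notes",
}


def first_word(text: str) -> List[str]:
    # Toy "analytics": return the lowercased first word of the block.
    return [text.split()[0].lower()]


results = analyze_by_block_frequency(doc_blocks, blocks, first_word, max_blocks=2)
```

With `max_blocks=2`, only the two most widely shared blocks (`b1`, then `b2`) are analyzed, so `b.txt` receives only the result for `b1`. Under these assumptions, the analytics cost depends on the number of distinct blocks processed rather than on the number of documents.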
Bibliography: Application Number: US202117537470