METHOD FOR EXTRACTING AND STRUCTURING INFORMATION

Bibliographic Details
Main Authors: RODRIGUES, Max de Castro; MENDOZA, Leonardo Alfredo Forero; VILLALLOBOS, Cristian Enrique Munoz; ALEXANDRE, Antonio Marcelo Azevedo; BATISTA, Vitor Alcantara; CORDEIRO, Fabio Correa; GOMES, Diogo da Silva Magalhães; ROMEU, Régis Kruel; PACHECO, Marco Aurélio Cavalcanti; DA ROCHA, Renato Sayão Crystallino; ROSERO, Jose Eduardo Ruiz
Format: Patent
Language: English, French, German
Published: 02.10.2024
Summary: The invention proposes a method that receives an unstructured document as input, extracts its information, reorganizes it, and makes it available in files that can be consumed by other systems. The method for extracting and structuring information comprises (1) a document page separator model, (2) a block detection and segmentation model, (3) a table extractor, (4) an image extractor, (5) an image classification model, (6) a text extractor, (7) a computer vision model for improving the image quality of the texts, (8) an optical character recognition model, (9) a model for spelling correction, (10) models for semantic enrichment of the text, (11) an output file organizer and (12) a metadata aggregator for information enrichment. The invention also includes a synthetic document generator that creates a training base of millions of synthetic documents, which emulate real documents commonly used by the oil and gas (O&G) industry in different layout variations. These synthetic documents are used to train and update the artificial intelligence models used in the main information extraction process. Accordingly, the invention comprises the following steps: (1) generation of synthetic documents in different layout configurations; (2) training/tuning of computer vision and classification models; (3) quality control of the models on synthetic and real sets; (4) assessment of extraction results in the O&G domain; (5) identification of new formats or alterations to existing formats; (6) adjustment of parameters and configuration of new synthetic formats.
Bibliography: Application Number: EP20220896898
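
The twelve components enumerated in the summary can be pictured as an ordered chain of stages applied to one document. The Python sketch below is only an illustration under assumptions: the component names come from the summary, but the strictly sequential order, the `Record` type, and the `make_placeholder` and `run_pipeline` helpers are hypothetical and are not taken from the patent. Each stage is a placeholder that records its own name instead of running a real model.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# Component names as listed in the summary. Running them strictly in this
# order is an assumption made for illustration only.
COMPONENTS = [
    "document page separator model",
    "block detection and segmentation model",
    "table extractor",
    "image extractor",
    "image classification model",
    "text extractor",
    "computer vision model for improving the image quality of the texts",
    "optical character recognition model",
    "model for spelling correction",
    "models for semantic enrichment of the text",
    "output file organizer",
    "metadata aggregator for information enrichment",
]


@dataclass
class Record:
    """Hypothetical container passed from stage to stage."""
    source: str
    trace: list[str] = field(default_factory=list)


Stage = Callable[[Record], Record]


def make_placeholder(name: str) -> Stage:
    """Build a stand-in stage that only logs its name (no real model)."""
    def stage(record: Record) -> Record:
        record.trace.append(name)
        return record
    return stage


def run_pipeline(source: str, stages: list[Stage]) -> Record:
    """Apply every stage in order to a fresh record for `source`."""
    record = Record(source=source)
    for stage in stages:
        record = stage(record)
    return record


PIPELINE = [make_placeholder(name) for name in COMPONENTS]
```

Calling `run_pipeline("report.pdf", PIPELINE)` with a made-up file name returns a `Record` whose `trace` lists the twelve component names in order. Swapping a placeholder for a real model would only require a callable with the same `Record -> Record` shape.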