METHOD FOR EXTRACTING AND STRUCTURING INFORMATION

The invention proposes a method that receives an unstructured document as input, extracts its information, reorganizes it, and makes it available in files that other systems can consume. The method for extracting and structuring information comprises a (1) document page separato...

Bibliographic Details
Main Authors RODRIGUES, Max de Castro; MENDOZA, Leonardo Alfredo Forero; VILLALLOBOS, Cristian Enrique Munoz; ALEXANDRE, Antonio Marcelo Azevedo; BATISTA, Vitor Alcantara; CORDEIRO, Fabio Correa; GOMES, Diogo da Silva Magalhães; ROMEU, Régis Kruel; PACHECO, Marco Aurélio Cavalcanti; DA ROCHA, Renato Sayão Crystallino; ROSERO, Jose Eduardo Ruiz
Format Patent
Language English, French, German
Published 02.10.2024
Abstract The invention proposes a method that receives an unstructured document as input, extracts its information, reorganizes it, and makes it available in files that other systems can consume. The method for extracting and structuring information comprises (1) a document page separator model, (2) a block detection and segmentation model, (3) a table extractor, (4) an image extractor, (5) an image classification model, (6) a text extractor, (7) a computer vision model for improving the image quality of the texts, (8) an optical character recognition model, (9) a model for spelling correction, (10) models for semantic enrichment of the text, (11) an output file organizer and (12) a metadata aggregator for information enrichment. The invention also includes a synthetic document generator that creates a training base of millions of synthetic documents, which emulate, in different layout variations, real documents commonly used by the oil and gas (O&G) industry. These synthetic documents are used to train and update the artificial intelligence models used in the main information extraction process. Accordingly, this part of the invention comprises the following steps: (1) generation of synthetic documents in different layout configurations; (2) training/tuning of computer vision and classification models; (3) quality control of the models on synthetic and real sets; (4) assessment of extraction results in the O&G domain; (5) identification of new formats or alterations to existing formats; (6) adjustment of parameters and configuration of new synthetic formats.
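The abstract names the components of the extraction method but gives no implementation details. As a rough illustration only, the sketch below shows one way such stages could be chained: pages are separated, blocks are detected and routed by type, and metadata is attached before output. Every name here (Block, separate_pages, detect_blocks, the handlers, run_pipeline) is hypothetical, and each trained model is replaced by a trivial stand-in; nothing in it is taken from the patent itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Block:
    """One detected region of a page (hypothetical structure)."""
    kind: str      # "text", "table" or "image"
    content: str   # a plain string stands in for pixels or cells
    metadata: Dict[str, str] = field(default_factory=dict)


def separate_pages(document: str) -> List[str]:
    # Stand-in for component (1): a form feed marks a page break here.
    return [page for page in document.split("\f") if page.strip()]


def detect_blocks(page: str) -> List[Block]:
    # Stand-in for component (2): each non-empty line becomes one block,
    # tagged by a toy prefix instead of a trained segmentation model.
    blocks = []
    for line in page.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("TABLE:"):
            blocks.append(Block("table", line[len("TABLE:"):].strip()))
        elif line.startswith("IMAGE:"):
            blocks.append(Block("image", line[len("IMAGE:"):].strip()))
        else:
            blocks.append(Block("text", line))
    return blocks


def handle_table(block: Block) -> Block:
    # Stand-in for component (3): count pipe-separated cells.
    block.metadata["cells"] = str(len(block.content.split("|")))
    return block


def handle_image(block: Block) -> Block:
    # Stand-in for components (4) and (5): no real classifier is run.
    block.metadata["image_class"] = "unclassified"
    return block


def handle_text(block: Block) -> Block:
    # Stand-in for components (6) to (10): only whitespace is normalized;
    # enhancement, OCR, spelling correction and enrichment are omitted.
    block.content = " ".join(block.content.split())
    return block


HANDLERS: Dict[str, Callable[[Block], Block]] = {
    "table": handle_table,
    "image": handle_image,
    "text": handle_text,
}


def run_pipeline(document: str, source_name: str) -> List[Dict[str, str]]:
    """Route every block through its handler, then attach metadata."""
    records = []
    for page_number, page in enumerate(separate_pages(document), start=1):
        for block in detect_blocks(page):
            block = HANDLERS[block.kind](block)
            # Stand-in for component (12): aggregate provenance metadata.
            block.metadata.update({"source": source_name, "page": str(page_number)})
            # Stand-in for component (11): one flat record per block.
            records.append({"kind": block.kind, "content": block.content, **block.metadata})
    return records


if __name__ == "__main__":
    sample = "Well  report\nTABLE: depth | pressure\fIMAGE: core photo"
    for record in run_pipeline(sample, "sample.pdf"):
        print(record)
```

Routing through a dictionary of handlers keeps each block type independent, so a stand-in could be swapped for a real model without touching the loop; that is a design choice of this sketch, not something the abstract states.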
DatabaseName esp@cenet
Discipline Medicine
Chemistry
Sciences
Physics
DocumentTitleAlternate VERFAHREN ZUR EXTRAKTION UND STRUKTURIERUNG VON INFORMATIONEN
PROCÉDÉ POUR L'EXTRACTION ET LA STRUCTURATION D'INFORMATIONS
ExternalDocumentID EP4439494A1
Notes Application Number: EP20220896898
OpenAccessLink https://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20241002&DB=EPODOC&CC=EP&NR=4439494A1
PublicationDate 2024-10-02
RelatedCompanies Petroleo Brasileiro S.A. - PETROBRAS
Faculdades Católicas
SubjectTerms CALCULATING
COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
COMPUTING
COUNTING
PHYSICS