METHOD FOR EXTRACTING AND STRUCTURING INFORMATION
The invention proposes a method that receives an unstructured document as input, extracts its information, reorganizes it, and makes it available in files that other systems can consume. The method for extracting and structuring information comprises (1) a document page separato...
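The twelve numbered components named in the abstract can be pictured as stages applied to one working document. The following is a minimal sketch only, assuming for illustration that the stages run in their numbered order (the abstract does not specify the control flow). Every name in it (`STAGE_NAMES`, `Document`, `placeholder_stage`, `run_pipeline`) is hypothetical and not taken from the patent, and the stages are placeholders rather than real models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stage names, one per numbered component in the abstract.
STAGE_NAMES = [
    "page_separator",                # (1)
    "block_detection_segmentation",  # (2)
    "table_extractor",               # (3)
    "image_extractor",               # (4)
    "image_classifier",              # (5)
    "text_extractor",                # (6)
    "image_quality_enhancer",        # (7)
    "ocr",                           # (8)
    "spelling_corrector",            # (9)
    "semantic_enricher",             # (10)
    "output_organizer",              # (11)
    "metadata_aggregator",           # (12)
]


@dataclass
class Document:
    """Working state handed from stage to stage (assumed structure)."""
    source: str
    trace: List[str] = field(default_factory=list)
    artifacts: Dict[str, object] = field(default_factory=dict)


Stage = Callable[[Document], Document]


def placeholder_stage(name: str) -> Stage:
    """Return a stand-in stage that only records that it ran."""
    def run(doc: Document) -> Document:
        doc.trace.append(name)
        doc.artifacts[name] = None  # a real stage would store its result here
        return doc
    return run


def build_pipeline(
    overrides: Optional[Dict[str, Stage]] = None,
) -> List[Tuple[str, Stage]]:
    """Pair each stage name with an override, or a placeholder if none is given."""
    overrides = overrides or {}
    return [(n, overrides.get(n, placeholder_stage(n))) for n in STAGE_NAMES]


def run_pipeline(
    source: str,
    overrides: Optional[Dict[str, Stage]] = None,
) -> Document:
    """Apply the stages in numbered order (an assumption, see above)."""
    doc = Document(source=source)
    for _name, stage in build_pipeline(overrides):
        doc = stage(doc)
    return doc
```

Calling `run_pipeline("report.pdf")` returns a `Document` whose `trace` lists the stage names in order. Passing `overrides={"ocr": my_ocr_stage}` swaps a single placeholder for a real implementation without touching the others, which is one simple way to develop such stages independently.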
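Main Authors | RODRIGUES, Max de Castro; MENDOZA, Leonardo Alfredo Forero; ALEXANDRE, Antonio Marcelo Azevedo; DA ROCHA, Renato Sayão Crystallino; VILLALLOBOS, Cristian Enrique Munoz; BATISTA, Vitor Alcantara; PACHECO, Marco Aurélio Cavalcanti; CORDEIRO, Fabio Correa; ROSERO, Jose Eduardo Ruiz; ROMEU, Régis Kruel; GOMES, Diogo da Silva Magalhães |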
---|---|
Format | Patent |
Language | English, French, German |
Published | 02.10.2024 |
Subjects | CALCULATING; COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS; COMPUTING; COUNTING; PHYSICS |
Online Access | https://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20241002&DB=EPODOC&CC=EP&NR=4439494A1 |
Abstract | The invention proposes a method that receives an unstructured document as input, extracts its information, reorganizes it, and makes it available in files that other systems can consume. The method for extracting and structuring information comprises (1) a document page separator model, (2) a block detection and segmentation model, (3) a table extractor, (4) an image extractor, (5) an image classification model, (6) a text extractor, (7) a computer vision model for improving the image quality of the texts, (8) an optical character recognition model, (9) a model for spelling correction, (10) models for semantic enrichment of the text, (11) an output file organizer and (12) a metadata aggregator for information enrichment. The invention also includes a synthetic document generator, which creates a training base of millions of synthetic documents that emulate, in different layout variations, real documents commonly used by the oil and gas (O&G) industry. These synthetic documents are used to train and update the artificial intelligence models used in the main information extraction process. Accordingly, the invention comprises the following steps: (1) generation of synthetic documents in different layout configurations; (2) training and tuning of computer vision and classification models; (3) quality control of the models on synthetic and real data sets; (4) assessment of extraction results in the O&G domain; (5) identification of new formats or alterations to existing formats; (6) adjustment of parameters and configuration of new synthetic formats. |
---|---|
Author | RODRIGUES, Max de Castro; MENDOZA, Leonardo Alfredo Forero; ALEXANDRE, Antonio Marcelo Azevedo; DA ROCHA, Renato Sayão Crystallino; VILLALLOBOS, Cristian Enrique Munoz; BATISTA, Vitor Alcantara; PACHECO, Marco Aurélio Cavalcanti; CORDEIRO, Fabio Correa; ROSERO, Jose Eduardo Ruiz; ROMEU, Régis Kruel; GOMES, Diogo da Silva Magalhães |
ContentType | Patent |
DBID | EVB |
DatabaseName | esp@cenet |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: EVB name: esp@cenet url: http://worldwide.espacenet.com/singleLineSearch?locale=en_EP sourceTypes: Open Access Repository |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Medicine Chemistry Sciences Physics |
DocumentTitleAlternate | VERFAHREN ZUR EXTRAKTION UND STRUKTURIERUNG VON INFORMATIONEN; PROCÉDÉ POUR L'EXTRACTION ET LA STRUCTURATION D'INFORMATIONS |
ExternalDocumentID | EP4439494A1 |
GroupedDBID | EVB |
ID | FETCH-epo_espacenet_EP4439494A13 |
IEDL.DBID | EVB |
IngestDate | Fri Oct 11 05:30:58 EDT 2024 |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Language | English, French, German |
LinkModel | DirectLink |
MergedId | FETCHMERGED-epo_espacenet_EP4439494A13 |
Notes | Application Number: EP20220896898 |
OpenAccessLink | https://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20241002&DB=EPODOC&CC=EP&NR=4439494A1 |
ParticipantIDs | epo_espacenet_EP4439494A1 |
PublicationCentury | 2000 |
PublicationDate | 20241002 |
PublicationDateYYYYMMDD | 2024-10-02 |
PublicationDecade | 2020 |
PublicationYear | 2024 |
RelatedCompanies | Petroleo Brasileiro S.A. - PETROBRAS; Faculdades Católicas |
Score | 3.5649898 |
SourceID | epo |
SourceType | Open Access Repository |
SubjectTerms | CALCULATING; COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS; COMPUTING; COUNTING; PHYSICS |
Title | METHOD FOR EXTRACTING AND STRUCTURING INFORMATION |
linkProvider | European Patent Office |