System and method for extracting text from portable document format data

Bibliographic Details
Main Authors: YANCHENA VADIM, SCHWIBERT STEFAN, IGUARO, HELENE
Format: Patent
Language: Chinese, English
Published: 02.09.2022

Summary: Described herein is a computer-implemented method. The method includes accessing, by a computer system including a processing unit, portable document format (PDF) data defining a plurality of glyphs, classifying the plurality of glyphs into one or more glyph sets, and calculating an extended glyph bounding box for each glyph. Each glyph set is processed to determine one or more text regions, each text region associated with one or more glyphs from the glyph set, the one or more glyphs having commonly overlapping extended bounding boxes.
Bibliography: Application Number: CN202210195184
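
The steps named in the summary (classify glyphs into sets, compute an extended bounding box per glyph, group glyphs whose extended boxes overlap into text regions) can be illustrated with a minimal Python sketch. This is not the patented implementation. The summary does not say how glyphs are classified or how far each box is extended, so the sketch assumes classification by font size and an extension margin proportional to glyph height. It also interprets "commonly overlapping" as membership in the same chain of overlapping boxes. The names `Glyph`, `extended_box`, `classify`, `text_regions`, and `extract_regions` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    char: str
    x0: float
    y0: float
    x1: float
    y1: float
    font_size: float  # assumed classification key; not stated in the summary


def extended_box(g, margin=0.3):
    """Grow the glyph box on all sides by a fraction of its height (assumed rule)."""
    pad = margin * (g.y1 - g.y0)
    return (g.x0 - pad, g.y0 - pad, g.x1 + pad, g.y1 + pad)


def boxes_overlap(a, b):
    """True when two (x0, y0, x1, y1) rectangles share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def classify(glyphs):
    """Split glyphs into sets; grouping by font size is an assumption."""
    sets = defaultdict(list)
    for g in glyphs:
        sets[g.font_size].append(g)
    return list(sets.values())


def text_regions(glyph_set, margin=0.3):
    """Group glyphs of one set whose extended boxes overlap, using union-find."""
    boxes = [extended_box(g, margin) for g in glyph_set]
    parent = list(range(len(glyph_set)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_overlap(boxes[i], boxes[j]):
                parent[find(i)] = find(j)

    regions = defaultdict(list)
    for i, g in enumerate(glyph_set):
        regions[find(i)].append(g)
    return list(regions.values())


def extract_regions(glyphs, margin=0.3):
    """Classify glyphs, then find text regions within each set separately."""
    return [r for s in classify(glyphs) for r in text_regions(s, margin)]
```

Under these assumptions, two adjacent glyphs of the same size land in one region, while a distant glyph or a glyph of a different size forms its own region. The pairwise overlap test is quadratic in the size of each set; a spatial index would be a natural replacement for large pages.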