System and method for extracting text from portable document format data
Main Authors | , , |
---|---|
Format | Patent |
Language | Chinese, English |
Published | 02.09.2022 |
Subjects | |
Summary: | Described herein is a computer-implemented method. The method includes accessing, by a computer system including a processing unit, portable document format (PDF) data defining a plurality of glyphs; classifying the plurality of glyphs into one or more glyph sets; and calculating an extended glyph bounding box for each glyph. Each glyph set is processed to determine one or more text regions, each text region being associated with one or more glyphs from that glyph set whose extended bounding boxes commonly overlap. |
---|---|
Bibliography: | Application Number: CN202210195184 |
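
The steps named in the summary (classify glyphs into sets, extend each glyph's bounding box, group glyphs whose extended boxes overlap) can be illustrated with a minimal Python sketch. It is not the patented implementation. The record does not say how glyphs are classified, how far a box is extended, or how "commonly overlapping" is resolved, so the sketch assumes: classification by font name and size, a padding of `pad_ratio` times the font size on every side, and text regions taken as connected components of pairwise-overlapping extended boxes. The `Glyph` type and all function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Hypothetical glyph record: character, bounding box, font attributes."""
    char: str
    x0: float
    y0: float
    x1: float
    y1: float
    font: str
    size: float


def classify(glyphs):
    """Assumption: glyphs sharing font name and size form one glyph set."""
    sets = defaultdict(list)
    for g in glyphs:
        sets[(g.font, g.size)].append(g)
    return list(sets.values())


def extended_box(g, pad_ratio=0.3):
    """Assumption: extend the box on every side by pad_ratio * font size."""
    pad = pad_ratio * g.size
    return (g.x0 - pad, g.y0 - pad, g.x1 + pad, g.y1 + pad)


def boxes_overlap(a, b):
    """True if two (x0, y0, x1, y1) rectangles share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def text_regions(glyph_set, pad_ratio=0.3):
    """Group one glyph set into regions of overlapping extended boxes.

    Uses union-find, so a region is a connected component: glyphs linked
    through a chain of pairwise overlaps end up in the same region.
    """
    boxes = [extended_box(g, pad_ratio) for g in glyph_set]
    parent = list(range(len(glyph_set)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_overlap(boxes[i], boxes[j]):
                parent[find(i)] = find(j)

    regions = defaultdict(list)
    for i, g in enumerate(glyph_set):
        regions[find(i)].append(g)
    return list(regions.values())


def extract_regions(glyphs, pad_ratio=0.3):
    """Classify glyphs, then determine text regions within each glyph set."""
    return [region
            for glyph_set in classify(glyphs)
            for region in text_regions(glyph_set, pad_ratio)]
```

The pairwise comparison is quadratic in the size of each glyph set, which is acceptable for a sketch; a spatial index would be the usual choice for large pages.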