PDF Entity Annotation Tool (PEAT)

Bibliographic Details
Published in	Journal of Open Source Software, Vol. 10, no. 108, p. 5336
Main Authors	Stahl, Christopher G.; Markey, Kristan J.; Jewell, Brian C.; Shams, Dahnish; Taylor, Michele M.; Wilkins, A. Amina; Watford, Sean; Shapiro, Andy; Angrish, Michelle
Format	Journal Article
Language	English
Published	United States: Open Source Initiative - NumFOCUS, 2025-04-08
Subjects	annotation; pdf; Python; text extraction
Abstract	While text mining approaches, including Artificial Intelligence (AI) and other machine-based methods, continue to expand at a rapid pace, the tools researchers use to create the labeled datasets required for training, modeling, and evaluation remain rudimentary. Labeled datasets contain the target attributes the machine is going to learn; for example, training an algorithm to distinguish between images of a car and a truck would generally require a set of images with a quantitative description of the underlying features of each vehicle type. Development of labeled textual data that can be used to build natural language machine learning (ML) models for scientific literature is not currently integrated into the existing manual workflows used by domain experts. Published literature is rich with important information, such as different types of embedded text, plots, and tables, all of which can be used as inputs to train ML/natural language processing (NLP) models once extracted and prepared in machine-readable formats. Currently, both normalized data extraction of use to domain experts and extraction to support development of ML/NLP models are labor-intensive and cumbersome manual processes. Automatic extraction of data and information is difficult for formats such as PDF, which are optimized for layout and human readability, not machine readability. The PDF (Portable Document Format) Entity Annotation Tool (PEAT) was developed with the goal of allowing users to annotate publications within their current print format, while also allowing those annotations to be captured in a machine-readable format. One of the main issues with traditional annotation tools is that they require transforming the PDF into plain text to facilitate the annotation process. While doing so lessens the technical challenges of annotating data, the user loses the structure and provenance inherent in the underlying PDF. Also, textual data extraction from PDFs can be an error-prone process. Challenges include identifying sequential blocks of text and handling a multitude of document formats (multiple columns, font encodings, etc.). As a result of these challenges, using existing tools for development of NLP/ML models directly from PDFs is difficult because the generated outputs are not interoperable. We created a system that allows annotations to be completed on the original PDF document structure, with no plain text extraction. The result is an application that allows for easier and more accurate annotations. In addition, by including a feature that lets the user easily create a schema, we have developed a system that can be used to annotate text for different domain-centric schemas of relevance to subject matter experts. Different knowledge domains require distinct schemas and annotation tags to support machine learning.
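Illustration	The abstract describes two ideas: annotations anchored to the original PDF layout rather than to extracted plain text, and a user-defined schema of tags. The Python sketch below is one possible way to represent such annotations in a machine-readable form. It is not PEAT's actual data model or export format; the names Schema, Annotation, BoundingBox, and export_annotations, the JSON layout, and the coordinate convention are all assumptions made for illustration.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical rectangle in PDF page coordinates (points)."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Annotation:
    """One tagged region; the page and box preserve layout provenance."""
    page: int
    box: BoundingBox
    tag: str
    text: str


@dataclass
class Schema:
    """A user-defined, domain-specific set of allowed annotation tags."""
    name: str
    tags: list[str]

    def annotate(self, page: int, box: BoundingBox, tag: str, text: str) -> Annotation:
        # Reject tags that the domain expert did not define in this schema.
        if tag not in self.tags:
            raise ValueError(f"tag {tag!r} is not part of schema {self.name!r}")
        return Annotation(page=page, box=box, tag=tag, text=text)


def export_annotations(document: str, schema: Schema, annotations: list[Annotation]) -> str:
    """Serialize annotations to JSON (an assumed layout, not PEAT's format)."""
    payload = {
        "document": document,
        "schema": schema.name,
        "annotations": [asdict(a) for a in annotations],
    }
    return json.dumps(payload, indent=2)


if __name__ == "__main__":
    # Hypothetical schema and annotation values, for demonstration only.
    schema = Schema(name="toxicology", tags=["chemical", "dose"])
    ann = schema.annotate(3, BoundingBox(72.0, 640.5, 210.0, 652.5), "chemical", "bisphenol A")
    print(export_annotations("paper.pdf", schema, [ann]))
```

Keeping the page number and bounding box with each tag is what lets a downstream consumer trace an annotation back to its location in the source PDF, which is the provenance the abstract says is lost under plain-text conversion.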
Author	Shams, Dahnish; Wilkins, A. Amina; Watford, Sean; Taylor, Michele M.; Angrish, Michelle; Stahl, Christopher G.; Jewell, Brian C.; Shapiro, Andy; Markey, Kristan J.
Author_xml	1. Stahl, Christopher G. (ORCID 0000-0002-2070-1555); 2. Markey, Kristan J. (ORCID 0000-0003-3911-2969); 3. Jewell, Brian C. (ORCID 0000-0003-3712-6523); 4. Shams, Dahnish (ORCID 0000-0002-9859-0859); 5. Taylor, Michele M. (ORCID 0000-0001-5049-0499); 6. Wilkins, A. Amina (ORCID 0000-0001-9292-833X); 7. Watford, Sean (ORCID 0000-0003-0888-5029); 8. Shapiro, Andy (ORCID 0000-0002-5233-8092); 9. Angrish, Michelle (ORCID 0000-0002-4956-4806)
BackLink	https://www.ncbi.nlm.nih.gov/pubmed/40547228 (MEDLINE/PubMed record); https://www.osti.gov/servlets/purl/2573694 (OSTI.gov record)
ContentType	Journal Article
CorporateAuthor	Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
DOI	10.21105/joss.05336
DatabaseName	CrossRef PubMed MEDLINE - Academic OSTI.GOV - Hybrid OSTI.GOV
DatabaseTitle	CrossRef PubMed MEDLINE - Academic
DatabaseTitleList	MEDLINE - Academic PubMed
Discipline	Computer Science
EISSN	2475-9066
ExternalDocumentID	2573694 40547228 10_21105_joss_05336
Genre	Journal Article
GrantInformation_xml	funder name: Intramural EPA; grant ID: EPA999999
ISSN	2475-9066
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	false
IsScholarly	true
Issue	108
Language	English
License	http://creativecommons.org/licenses/by/4.0
Notes	USDOE AC05-00OR22725
OpenAccessLink	http://dx.doi.org/10.21105/joss.05336
PMID	40547228
PublicationDate	2025-04-08
PublicationPlace	United States
PublicationTitle	Journal of open source software
PublicationTitleAlternate	J Open Source Softw
PublicationYear	2025
Publisher	Open Source Initiative - NumFOCUS
StartPage	5336
URI	https://www.ncbi.nlm.nih.gov/pubmed/40547228 https://www.proquest.com/docview/3223630459 https://www.osti.gov/servlets/purl/2573694
Volume	10