PDF Entity Annotation Tool (PEAT)

Bibliographic Details
Published in Journal of Open Source Software Vol. 10; no. 108; p. 5336
Main Authors Stahl, Christopher G., Markey, Kristan J., Jewell, Brian C., Shams, Dahnish, Taylor, Michele M., Wilkins, A. Amina, Watford, Sean, Shapiro, Andy, Angrish, Michelle
Format Journal Article
Language English
Published United States Open Source Initiative - NumFOCUS 08.04.2025

Abstract While text mining approaches, including Artificial Intelligence (AI) and other machine-based methods, continue to expand at a rapid pace, the tools researchers use to create the labeled datasets required for training, modeling, and evaluation remain rudimentary. Labeled datasets contain the target attributes the machine is going to learn; for example, training an algorithm to distinguish between images of cars and trucks would generally require a set of images with a quantitative description of the underlying features of each vehicle type. Development of labeled textual data that can be used to build natural language machine learning (ML) models for scientific literature is not currently integrated into the manual workflows used by domain experts. Published literature is rich with important information, such as different types of embedded text, plots, and tables, all of which can be used as inputs to train ML/natural language processing (NLP) models once extracted and prepared in machine-readable formats. Currently, both normalized data extraction of use to domain experts and extraction to support development of ML/NLP models are labor-intensive and cumbersome manual processes. Automatic extraction of data and information is difficult for formats such as PDF, which are optimized for layout and human readability, not machine readability. The PDF (Portable Document Format) Entity Annotation Tool (PEAT) was developed with the goal of allowing users to annotate publications in their current print format while also capturing those annotations in a machine-readable format. One of the main issues with traditional annotation tools is that they require transforming the PDF into plain text to facilitate the annotation process. While doing so lessens the technical challenges of annotating data, the user loses all structure and provenance inherent in the underlying PDF. Textual data extraction from PDFs can also be an error-prone process. Challenges include identifying sequential blocks of text and handling a multitude of document formats (multiple columns, font encodings, etc.). As a result, using existing tools to develop NLP/ML models directly from PDFs is difficult because the generated outputs are not interoperable. We created a system that allows annotations to be completed on the original PDF document structure, with no plain text extraction. The result is an application that allows for easier and more accurate annotations. In addition, by including a feature that lets the user easily create a schema, we have developed a system that can be used to annotate text for different domain-centric schemas of relevance to subject matter experts. Different knowledge domains require distinct schemas and annotation tags to support machine learning.
Author ORCID
Stahl, Christopher G.: 0000-0002-2070-1555
Markey, Kristan J.: 0000-0003-3911-2969
Jewell, Brian C.: 0000-0003-3712-6523
Shams, Dahnish: 0000-0002-9859-0859
Taylor, Michele M.: 0000-0001-5049-0499
Wilkins, A. Amina: 0000-0001-9292-833X
Watford, Sean: 0000-0003-0888-5029
Shapiro, Andy: 0000-0002-5233-8092
Angrish, Michelle: 0000-0002-4956-4806
CorporateAuthor Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
DOI 10.21105/joss.05336
Discipline Computer Science
EISSN 2475-9066
GrantInformation Intramural EPA (grant ID EPA999999)
ISSN 2475-9066
License http://creativecommons.org/licenses/by/4.0
Notes USDOE
AC05-00OR22725
OpenAccessLink http://dx.doi.org/10.21105/joss.05336
PMID 40547228
PublicationDate 2025-04-08
PublicationTitle Journal of open source software
PublicationTitleAlternate J Open Source Softw
SubjectTerms annotation
pdf
Python
text extraction
URI https://www.ncbi.nlm.nih.gov/pubmed/40547228
https://www.proquest.com/docview/3223630459
https://www.osti.gov/servlets/purl/2573694